clm-v1-ref-pytorch-cuda: Lane-G-ref PyTorch+CUDA BASELINE REFERENCE

substrate = PyTorch-CUDA · lane = Lane-G-ref

THIS IS A BASELINE REFERENCE PROBE, NOT the production artifact. The production / PUBLIC-grade Lane-G CLM MUST be the hexa-native flame+forge stack (compiler-only NN, NO PyTorch / ATen / Python in the trained binary) per anima governance a_train_flame_forge. This PyTorch+CUDA model exists ONLY to set a throughput / GPU-utilization reference number: what a well-fed H100 trivially achieves on this byte-level char-LM workload, and therefore the bar the forge line's util-GREEN endgame is implicitly chasing. It does NOT satisfy or replace the forge PUBLIC artifact (a_completeness_over_cheap: an optional baseline probe, never the primary). It is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).

What this is

A clean byte-level (V=256) decoder-only GPT trained with PyTorch AMP/bf16 on the same 5-lang c4 backbone corpus the forge line uses (dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY), so the H100 utilization it reaches is a roughly apples-to-apples reference for the forge util-GREEN goal.

Config

| field | value |
|---|---|
| arch | byte-level decoder-only GPT (tied embeddings) |
| vocab | 256 (byte-level, matches the forge int4-envelope corpus) |
| d_model | 768 |
| n_layer | 12 |
| n_head | 12 |
| block (ctx) | 512 |
| batch | 32 |
| params | 85,645,824 (~85.6M) |
| precision | bf16 AMP, TF32 matmul |
| steps | 3000 |
| optimizer | AdamW (cosine LR, warmup 100) |
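
The optimizer row names a cosine schedule with a 100-step warmup over 3000 steps. A minimal sketch of such a schedule follows. The peak learning rate (`peak_lr=3e-4`), the floor (`min_lr=0.0`), and the linear warmup shape are assumptions for illustration only; the card does not state them, and the actual trainer may differ.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=100, total_steps=3000, min_lr=0.0):
    """Linear warmup followed by cosine decay.

    peak_lr and min_lr are hypothetical values, not taken from the run.
    warmup and total_steps match the Config table.
    """
    if step < warmup:
        # Ramp linearly from peak_lr / warmup up to peak_lr.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total_steps - warmup), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 50, 99, 100, 1550, 3000):
        print(s, lr_at(s))
```

In a PyTorch trainer the same curve would typically be applied by setting the learning rate of each optimizer parameter group before every step.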

Scale scoped honestly (a_scale_honest_scope): ~85.6M-param baseline, broadly comparable to the forge d768/12L rung (44.68M) in width/depth; the byte vocab matches the forge int4-envelope corpus.
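
The parameter count can be cross-checked from the Config values alone. The sketch below assumes a GPT-2-style layout: learned absolute position embeddings, biases on every linear layer, a 4x MLP expansion, two LayerNorms per block plus a final LayerNorm, and an output head tied to the token embedding. The card does not spell out this layout, so treat it as an assumption; under it, the arithmetic reproduces the stated figure.

```python
def gpt_param_count(vocab=256, d_model=768, n_layer=12, block=512):
    """Count parameters for an assumed GPT-2-style decoder with tied embeddings."""
    d = d_model
    tok_emb = vocab * d                      # also used as the output head (tied)
    pos_emb = block * d                      # assumed learned absolute positions
    attn = (3 * d * d + 3 * d) + (d * d + d)          # QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)       # up-projection + down-projection
    norms = 2 * (2 * d)                      # two LayerNorms (weight + bias) per block
    per_layer = attn + mlp + norms
    final_norm = 2 * d
    return tok_emb + pos_emb + n_layer * per_layer + final_norm

if __name__ == "__main__":
    print(gpt_param_count())  # 85645824
```

Almost all of the total sits in the 12 transformer blocks; with a 256-entry vocabulary the embedding table contributes under 0.2M parameters.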

Reference numbers (verbatim, this run)

  • GPU utilization: PEAK = 100.0 % · MEAN = 98.85 % (n=89 nvidia-smi samples, H100 80GB HBM3), mem_peak = 8529 MiB, mean power 587.8 W.
  • Throughput: 272,622 tok/s (49.15M tokens in 180.3 s wall).
  • CE descent: PASS, val CE 5.58041 → 1.56885 (F-CLM-REF-DESCENT = 1).
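
The throughput and CE figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes every step processes a full batch of 32 sequences of 512 bytes, and that the CE values are in nats (the PyTorch cross-entropy default); neither is stated explicitly in the card. The helper name `nats_to_bits_per_byte` is illustrative.

```python
import math

# Values from the Config table and the reference numbers.
steps, batch, block = 3000, 32, 512
wall_s = 180.3

tokens = steps * batch * block            # bytes seen during training
tok_per_s = tokens / wall_s               # close to the reported 272,622 (wall time is rounded)

def nats_to_bits_per_byte(ce_nats):
    """Convert a per-byte cross-entropy in nats to bits per byte."""
    return ce_nats / math.log(2)

uniform_ce = math.log(256)                # CE of a uniform guess over 256 byte values
final_bpb = nats_to_bits_per_byte(1.56885)

if __name__ == "__main__":
    print(tokens)        # 49152000
    print(round(tok_per_s))
    print(uniform_ce, final_bpb)
```

Under these assumptions the token total matches the reported 49.15M, and the initial val CE of 5.58041 sits close to the uniform-guess value ln 256 ≈ 5.545, which is what an untrained byte-level model would be expected to score.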

Reference vs the forge line (Lane-G, hexa-native flame+forge)

The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN ≈ 0.78 % (PEAK 5 %), and the d1536/T512 lever-2 rung hit MEAN ≈ 0.50 % (PEAK 19 %). This PyTorch+CUDA baseline reaches ~99 % MEAN util on the equivalent byte-LM workload, i.e. a well-fed H100 trivially saturates on this task. That ~99 % is the reference bar the forge util-GREEN endgame (target ≥20 %) is chasing. This model does NOT replace the forge artifact; the forge util-GREEN + the forge PUBLIC CLM remain the production target, unchanged and primary.
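
PEAK and MEAN utilization figures like those quoted here are summary statistics over periodic nvidia-smi samples. A minimal sketch of that reduction is below, assuming the samples were captured with something like `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits -l 2`. The card does not say how this run actually sampled, the function name `summarize_util` is hypothetical, and the sample rows are made up for illustration; they are not data from the run.

```python
def summarize_util(csv_text):
    """Reduce 'util_percent, mem_mib, power_w' lines to peak/mean statistics."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, mem, power = (float(x) for x in line.split(","))
        rows.append((util, mem, power))
    n = len(rows)
    return {
        "n": n,
        "util_peak": max(r[0] for r in rows),
        "util_mean": sum(r[0] for r in rows) / n,
        "mem_peak_mib": max(r[1] for r in rows),
        "power_mean_w": sum(r[2] for r in rows) / n,
    }

# Made-up sample rows, NOT measurements from this run.
sample = """100, 8529, 590.1
98, 8529, 585.0
99, 8500, 588.2"""

if __name__ == "__main__":
    print(summarize_util(sample))
```

Applying the same reduction to both the baseline and the forge runs keeps the MEAN and PEAK comparison consistent, provided the sampling interval and window are matched.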

Files

  • clm_ref_pytorch_cuda.pt: PyTorch state_dict + config (sha256 9882f5cb…371d321).
  • clm_ref_train.log.json: full training curve + util/throughput/descent summary.
  • clm_ref_pytorch_cuda.py: the trainer (BASELINE tool, not the production trainer).
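
A downloaded checkpoint can be checked against the digest listed above. Because the card shows the sha256 only in truncated form, this sketch compares just the visible prefix and suffix; it is a partial check, not a full verification. The helper names `sha256_of` and `matches_truncated` are hypothetical.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_truncated(digest, prefix="9882f5cb", suffix="371d321"):
    """Compare only the digest fragments that the card displays."""
    return digest.startswith(prefix) and digest.endswith(suffix)

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        d = sha256_of(sys.argv[1])
        print(d, "OK" if matches_truncated(d) else "MISMATCH")
```

Run it as `python check.py clm_ref_pytorch_cuda.pt` (the script name is arbitrary). For a full check, compare against the complete digest from the repository's file listing.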

Provenance

  • Trained 2026-06-02, vast.ai H100 80GB HBM3, image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel.
  • Corpus: dancinlab/clm-backbone-5lang-sample (c4 mC4 5-lang backbone, ODC-BY).
  • anima domain: CLM+KOSMOS, Lane-G-ref line.