clm-v1-ref-pytorch-cuda: Lane-G-ref PyTorch+CUDA BASELINE REFERENCE
substrate = PyTorch-CUDA · lane = Lane-G-ref
THIS IS A BASELINE REFERENCE PROBE, NOT the production artifact. The production / PUBLIC-grade Lane-G CLM MUST be the hexa-native flame+forge stack (compiler-only NN, NO PyTorch / ATen / Python in the trained binary) per anima governance a_train_flame_forge. This PyTorch+CUDA model exists ONLY to set a throughput / GPU-utilization reference number: what a well-fed H100 trivially achieves on this byte-level char-LM workload, which is the bar the forge line's util-GREEN endgame is implicitly chasing. It does NOT satisfy or replace the forge PUBLIC artifact (a_completeness_over_cheap: an optional baseline probe, never the primary). It is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).
What this is
A clean byte-level (V=256) decoder-only GPT trained with PyTorch AMP/bf16 on the
same 5-lang c4 backbone corpus the forge line uses
(dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY), so the H100 utilization
it reaches is an apples-ish reference for the forge util-GREEN goal.
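As a minimal sketch of what "byte-level (V=256)" means in practice: each UTF-8 byte of the text is its own token id, so no learned tokenizer is needed. The helper names `encode_bytes` and `decode_bytes` are illustrative and are not taken from the trainer.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to token ids in [0, 255] via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse of encode_bytes; malformed byte runs are replaced, not raised."""
    return bytes(ids).decode("utf-8", errors="replace")


sample = "héllo 안녕"
ids = encode_bytes(sample)

# Every id fits the 256-entry vocabulary, whatever the language.
assert all(0 <= i < 256 for i in ids)
# Non-ASCII characters expand to several tokens each.
assert len(ids) > len(sample)
assert decode_bytes(ids) == sample
```

Because multi-byte characters cost several tokens, tok/s figures for a byte-level model are bytes per second, not characters or subwords per second.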
Config
| field | value |
|---|---|
| arch | byte-level decoder-only GPT (tied embeddings) |
| vocab | 256 (byte-level; matches the forge int4-envelope corpus) |
| d_model | 768 |
| n_layer | 12 |
| n_head | 12 |
| block (ctx) | 512 |
| batch | 32 |
| params | 85,645,824 (~85.6M) |
| precision | bf16 AMP, TF32 matmul |
| steps | 3000 |
| optimizer | AdamW (cosine LR, warmup 100) |
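
The "cosine LR, warmup 100" entry can be sketched as a plain function of the step index. Only `warmup=100` and `max_steps=3000` come from the table; `peak_lr` and `min_lr` are hypothetical placeholders, since the card does not state the learning rates, and the linear warmup shape is likewise an assumption.

```python
import math


def lr_at(step: int, max_steps: int = 3000, warmup: int = 100,
          peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    peak_lr and min_lr are illustrative values, not the trainer's.
    """
    if step < warmup:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup) / max(1, max_steps - warmup), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 99, 100, 1550, 3000):
    print(s, f"{lr_at(s):.2e}")
```

With PyTorch, a function like this would typically be applied by writing the returned value into each `param_group["lr"]` of the AdamW optimizer before the step.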
Scale scoped honestly (a_scale_honest_scope): ~85.6M-param baseline, broadly
comparable to the forge d768/12L rung (44.68M) in width/depth; the byte vocab
matches the forge int4-envelope corpus.
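
The parameter count in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes a GPT-2-style layout (learned absolute position embeddings, biases on every linear layer, a 4x MLP, two LayerNorms per block plus a final one). That layout is an assumption rather than something the card states, but it reproduces the 85,645,824 figure exactly.

```python
def gpt_param_count(d_model: int = 768, n_layer: int = 12,
                    vocab: int = 256, block: int = 512) -> int:
    """Parameter count for an assumed GPT-2-style decoder with tied embeddings."""
    tok_emb = vocab * d_model                    # shared with the output head (tied)
    pos_emb = block * d_model                    # learned positions (assumption)
    attn = 4 * d_model * d_model + 4 * d_model   # qkv + output projection, with biases
    mlp = 8 * d_model * d_model + 5 * d_model    # d -> 4d -> d, with biases
    norms = 4 * d_model                          # two LayerNorms (weight + bias) per block
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt_param_count()
print(f"{total:,}")
assert total == 85_645_824
```

The embeddings contribute under 0.6M parameters here because the vocabulary is only 256 entries, so nearly all of the count sits in the 12 transformer blocks.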
Reference numbers (verbatim, this run)
- GPU utilization: PEAK = 100.0 % · MEAN = 98.85 % (n=89 nvidia-smi samples, H100 80GB HBM3), mem_peak = 8529 MiB, mean power 587.8 W.
- Throughput: 272,622 tok/s (49.15M tokens in 180.3 s wall).
- CE descent: PASS, val CE 5.58041 → 1.56885 (F-CLM-REF-DESCENT = 1).
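
The throughput and CE figures above can be sanity-checked against the config table. The sketch assumes every step consumed one full 32 x 512 batch with no gradient accumulation, and that CE is reported in nats (the default for PyTorch cross-entropy); neither is stated explicitly in this card.

```python
import math

steps, batch, block = 3000, 32, 512
tokens = steps * batch * block          # 49,152,000, i.e. the "49.15M tokens"
wall_s = 180.3                          # rounded wall time from the card
tok_per_s = tokens / wall_s             # close to the reported 272,622 tok/s

# A uniform guess over 256 byte values has CE = ln(256) nats.
uniform_ce = math.log(256)
# Final val CE converted from nats to bits per byte.
final_bits_per_byte = 1.56885 / math.log(2)

print(f"tokens        = {tokens:,}")
print(f"tok/s         = {tok_per_s:,.0f}")
print(f"uniform CE    = {uniform_ce:.3f} nats")
print(f"final val CE  = {final_bits_per_byte:.2f} bits/byte")
```

Under those assumptions the recomputed throughput differs from the reported value only by the rounding of the wall time, and the initial val CE of 5.58041 sits just above the uniform-guess level, which is consistent with an untrained model.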
Reference vs the forge line (Lane-G, hexa-native flame+forge)
The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN ≈ 0.78 % (PEAK 5 %), and the d1536/T512 lever-2 rung MEAN ≈ 0.50 % (PEAK 19 %). This PyTorch+CUDA baseline reaches ~99 % MEAN util on the equivalent byte-LM workload, i.e. a well-fed H100 trivially saturates on this task. That ~99 % is the reference bar the forge util-GREEN endgame is chasing (target ≥20 %). This model does NOT replace the forge artifact; the forge util-GREEN + the forge PUBLIC CLM remain the production target, unchanged and primary.
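
A rough sketch of how PEAK / MEAN util and a RED / GREEN grade could be derived from nvidia-smi samples follows. It assumes samples were logged with something like `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits -l 2`, and that the grade is taken on MEAN util against the ≥20 % target. The function name, the sampling interval, and the grading rule are all assumptions, not the project's actual tooling, and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_util(csv_text: str, green_threshold: float = 20.0) -> dict:
    """Summarize GPU util from CSV rows of 'util, mem_MiB, power_W'.

    Grading on MEAN util is an assumption; the card only names the target.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    util = [float(r[0]) for r in rows]      # first column: utilization.gpu (%)
    mean = sum(util) / len(util)
    return {
        "n": len(util),
        "peak": max(util),
        "mean": mean,
        "grade": "GREEN" if mean >= green_threshold else "RED",
    }


# Illustrative rows only, not taken from the run's log.
example = "100, 8529, 590.0\n98, 8529, 586.0\n99, 8529, 588.0\n"
print(summarize_util(example))
```

Grading on MEAN rather than PEAK matters for the forge rungs described above, where a short burst can lift PEAK while the GPU sits idle for most of the run.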
Files
- clm_ref_pytorch_cuda.pt: PyTorch state_dict + config (sha256 9882f5cb…371d321).
- clm_ref_train.log.json: full training curve + util/throughput/descent summary.
- clm_ref_pytorch_cuda.py: the trainer (BASELINE tool, not the production trainer).
Provenance
- Trained 2026-06-02, vast.ai H100 80GB HBM3, image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel.
- Corpus: dancinlab/clm-backbone-5lang-sample (c4 mC4 5-lang backbone, ODC-BY).
- anima domain: CLM+KOSMOS, Lane-G-ref line.