clm-v1-ref-pytorch-cuda-3b β Lane-G-ref PyTorch+CUDA 3B-scale REFERENCE rung
substrate = PyTorch-CUDA Β· lane = Lane-G-ref Β· rung = 3B reference
PyTorch+CUDA 3B-scale REFERENCE rung β NOT forge production, bounded-budget not converged. This is a bounded-budget 3B-scale reference, NOT a converged production model, and NOT the hexa-native flame+forge PUBLIC-grade production artifact (anima governance
a_train_flame_forgeβ the production / PUBLIC-grade Lane-G CLM MUST be the compiler-only flame+forge stack, NO PyTorch / ATen / Python in the trained binary). This torch model exists ONLY to demonstrate, at ~3B params on a bounded N steps, that the same ByteGPT/Transformer architecture (a) trains (CE descends) and (b) saturates the GPU (util β« 20 %) at 3B scale β a throughput-justified 3B reference (a_completeness_over_cheap: an optional baseline/reference, never the primary). It does NOT satisfy or replace the forge PUBLIC artifact, and is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).
What this is
The 3B rung of the Lane-G-ref ladder (85.6M β 3B). Same clean byte-level
(V=256) decoder-only GPT as the 85.6M PUBLIC reference
(dancinlab/clm-v1-ref-pytorch-cuda), scaled to ~3.15B params, trained with
PyTorch AMP/bf16 + gradient checkpointing on the same 5-lang c4 backbone
corpus (dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY).
Scale honesty (a_scale_honest_scope): 3B-scale reference rung, bounded
N=400 steps, descent + util demonstrated, NOT converged.
Config
| field | value |
|---|---|
| arch | byte-level decoder-only GPT (tied embeddings) |
| vocab | 256 (byte-level β matches the forge int4-envelope corpus) |
| d_model | 2560 |
| n_layer | 40 |
| n_head | 20 (head_dim 128) |
| block (ctx) | 512 |
| batch | 12 |
| params | 3,149,030,400 (~3.149B) |
| precision | bf16 AMP, TF32 matmul |
| grad checkpointing | on (fits 80 GB at modest batch) |
| steps | 400 (bounded β NOT converged) |
| optimizer | AdamW (cosine LR, warmup 20) |
Reference numbers (verbatim, this run)
- GPU utilization: PEAK = 100.0 % Β· MEAN = 99.15 % (n=108 nvidia-smi samples, H100 80GB HBM3), mem_peak = 63921 MiB (β 62.4 GB of 80 GB), mean power 653.0 W.
- Throughput: 11,183 tok/s (2.46M tokens in 219.8 s wall).
- CE descent: PASS β val CE 7.16861 β 2.45871 (F-CLM-REF-3B-DESCENT = 1). (NOT converged β bounded 400-step reference; descent is monotone-ish over the run.)
Reference vs the forge line (Lane-G, hexa-native flame+forge)
The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN β 0.78 % (PEAK 5 %), the d1536/T512 lever-2 rung MEAN β 0.50 % (PEAK 19 %). This PyTorch+CUDA reference reaches ~99 % MEAN util at 3B scale β i.e. a well-fed H100 trivially saturates on this byte-LM workload even at 3B params. That ~99 % is the reference bar the forge util-GREEN endgame is chasing (target β₯20 %). This model does NOT replace the forge artifact; the forge util-GREEN + the forge PUBLIC CLM remain the production target, unchanged and primary.
Files
clm_ref_pytorch_cuda_3b.ptβ PyTorch state_dict + config (sha256ebe56db7β¦33c4d24c9, 12,596,300,742 B).clm_ref_3b_train.log.jsonβ full training curve + util/throughput/descent summary.clm_ref_pytorch_cuda_3b.pyβ the trainer (BASELINE/reference tool, not the production trainer).
Provenance
- Trained 2026-06-02, vast.ai H100 80GB HBM3, image
pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel. - Corpus:
dancinlab/clm-backbone-5lang-sample(c4 mC4 5-lang backbone, ODC-BY). - anima domain: CLM+KOSMOS, Lane-G-ref line, 3B rung.