clm-v1-ref-pytorch-cuda-3b — Lane-G-ref PyTorch+CUDA 3B-scale REFERENCE rung

substrate = PyTorch-CUDA · lane = Lane-G-ref · rung = 3B reference

PyTorch+CUDA 3B-scale REFERENCE rung: NOT forge production, bounded-budget, not converged. This is NOT a converged production model and NOT the hexa-native flame+forge PUBLIC-grade production artifact. Under anima governance a_train_flame_forge, the production / PUBLIC-grade Lane-G CLM MUST be the compiler-only flame+forge stack, with NO PyTorch / ATen / Python in the trained binary.

This torch model exists ONLY to demonstrate, at ~3B params over a bounded N steps, that the same ByteGPT/Transformer architecture (a) trains (CE descends) and (b) saturates the GPU (util ≫ 20 %) at 3B scale. It is a throughput-justified 3B reference (a_completeness_over_cheap: an optional baseline/reference, never the primary). It does NOT satisfy or replace the forge PUBLIC artifact, and is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).

What this is

The 3B rung of the Lane-G-ref ladder (85.6M → 3B). Same clean byte-level (V=256) decoder-only GPT as the 85.6M PUBLIC reference (dancinlab/clm-v1-ref-pytorch-cuda), scaled to ~3.15B params, trained with PyTorch AMP/bf16 + gradient checkpointing on the same 5-lang c4 backbone corpus (dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY).
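The ~3.15B figure can be sanity-checked from the Config values (d_model 2560, n_layer 40, vocab 256, block 512). The sketch below assumes a GPT-2-style layout: biased linear layers, a 4x MLP, two LayerNorms per block, learned position embeddings, and a tied output head. That layout is an assumption, not something the card states; it is used here because it reproduces the listed parameter count exactly.

```python
def gpt_param_count(d_model, n_layer, vocab, block, tied=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, see lead-in)."""
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: 4x expansion, then projection back to d_model, both with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * (2 * d_model)
    per_layer = attn + mlp + norms

    tok_emb = vocab * d_model        # byte embedding table
    pos_emb = block * d_model        # learned position embeddings (assumed)
    final_norm = 2 * d_model         # LayerNorm before the output head
    lm_head = 0 if tied else vocab * d_model  # tied head adds no parameters

    return n_layer * per_layer + tok_emb + pos_emb + final_norm + lm_head


total = gpt_param_count(d_model=2560, n_layer=40, vocab=256, block=512)
head_dim = 2560 // 20  # n_head = 20 from the Config table
print(f"params = {total:,}, head_dim = {head_dim}")
```

Almost all of the budget sits in the 40 transformer blocks; with V=256 the embedding table contributes well under 0.1 % of the total.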

Scale honesty (a_scale_honest_scope): 3B-scale reference rung, bounded N=400 steps, descent + util demonstrated, NOT converged.

Config

field	value
arch	byte-level decoder-only GPT (tied embeddings)
vocab	256 (byte-level — matches the forge int4-envelope corpus)
d_model	2560
n_layer	40
n_head	20 (head_dim 128)
block (ctx)	512
batch	12
params	3,149,030,400 (~3.149B)
precision	bf16 AMP, TF32 matmul
grad checkpointing	on (fits 80 GB at modest batch)
steps	400 (bounded — NOT converged)
optimizer	AdamW (cosine LR, warmup 20)
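
The optimizer row can be read as a warmup-then-cosine schedule over the 400 bounded steps. The sketch below is one common form of that schedule, not the trainer's actual code: it assumes "warmup 20" means 20 steps of linear warmup, and `peak_lr` and `min_lr` are hypothetical values, since the card does not list a learning rate.

```python
import math


def lr_at(step, total_steps=400, warmup=20, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at a given step (peak_lr and min_lr are hypothetical)."""
    if step < warmup:
        # Linear warmup (assumed shape): reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in range(400)]
```

With PyTorch, the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to `peak_lr` instead of the absolute value.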

Reference numbers (verbatim, this run)

GPU utilization: PEAK = 100.0 % · MEAN = 99.15 % (n=108 nvidia-smi samples, H100 80GB HBM3), mem_peak = 63921 MiB (≈ 62.4 GiB of the card's 80 GB), mean power 653.0 W.
Throughput: 11,183 tok/s (2.46M tokens = 400 steps × batch 12 × block 512, in 219.8 s wall).
CE descent: PASS, val CE 7.16861 → 2.45871 (F-CLM-REF-3B-DESCENT = 1). NOT converged: this is a bounded 400-step reference, and the descent is roughly monotone over the run.
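
The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers from this card. The one assumption is that the CE values are in nats (the default for PyTorch's cross-entropy loss), which allows converting them to bits per byte for a byte-level model.

```python
import math

# Values copied from the Config table and the reference numbers.
STEPS, BATCH, BLOCK = 400, 12, 512
WALL_SECONDS = 219.8
MEM_PEAK_MIB = 63921
CE_START, CE_END = 7.16861, 2.45871

tokens = STEPS * BATCH * BLOCK            # tokens (bytes) seen during the run
tok_per_s = tokens / WALL_SECONDS         # close to the reported 11,183 tok/s
mem_peak_gib = MEM_PEAK_MIB / 1024        # MiB -> GiB


def nats_to_bits(ce_nats):
    """Convert cross-entropy from nats to bits (assumes the input is in nats)."""
    return ce_nats / math.log(2)


bits_start = nats_to_bits(CE_START)
bits_end = nats_to_bits(CE_END)
uniform_bits = math.log2(256)             # a uniform guess over 256 byte values

print(f"tokens={tokens:,}  tok/s={tok_per_s:,.0f}  mem={mem_peak_gib:.1f} GiB")
print(f"bits/byte: {bits_start:.2f} -> {bits_end:.2f} (uniform = {uniform_bits:.0f})")
```

The recomputed throughput differs from the reported 11,183 tok/s by a fraction of a percent, which is consistent with the wall time being rounded to 219.8 s.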

Reference vs the forge line (Lane-G, hexa-native flame+forge)

The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN ≈ 0.78 % (PEAK 5 %), and the d1536/T512 lever-2 rung hit MEAN ≈ 0.50 % (PEAK 19 %). This PyTorch+CUDA reference reaches ~99 % MEAN util at 3B scale, which shows that a well-fed H100 saturates on this byte-LM workload even at 3B params. That ~99 % is the reference bar for the forge util-GREEN endgame, whose target is ≥20 %. This model does NOT replace the forge artifact; forge util-GREEN and the forge PUBLIC CLM remain the production target, unchanged and primary.
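
A util summary of this kind can be produced from periodic nvidia-smi samples. The sketch below is an illustration under stated assumptions, not the tooling used for this run: it assumes samples were collected as CSV with `nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw --format=csv,noheader,nounits -l 2`, and that the GREEN/RED verdict is taken on MEAN util against the 20 % target. The card does not say whether the target applies to MEAN or PEAK, and the sample rows are made up.

```python
import csv
import io
import statistics


def summarize_util(csv_text, green_threshold=20.0):
    """Summarize nvidia-smi CSV rows of (util %, memory MiB, power W)."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    util = [float(r[0]) for r in rows]
    mem = [float(r[1]) for r in rows]
    power = [float(r[2]) for r in rows]
    mean_util = statistics.fmean(util)
    return {
        "n": len(rows),
        "util_peak": max(util),
        "util_mean": mean_util,
        "mem_peak_mib": max(mem),
        "power_mean_w": statistics.fmean(power),
        # Assumption: the verdict is judged on MEAN util.
        "verdict": "GREEN" if mean_util >= green_threshold else "RED",
    }


# Hypothetical sample rows, for illustration only.
well_fed = "100, 63000, 650.0\n98, 63921, 656.0\n"
host_bound = "0, 1000, 90.0\n5, 1000, 95.0\n1, 1000, 91.0\n"

print(summarize_util(well_fed))
print(summarize_util(host_bound))
```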

Files

clm_ref_pytorch_cuda_3b.pt — PyTorch state_dict + config (sha256 ebe56db7…33c4d24c9, 12,596,300,742 B).
clm_ref_3b_train.log.json — full training curve + util/throughput/descent summary.
clm_ref_pytorch_cuda_3b.py — the trainer (BASELINE/reference tool, not the production trainer).
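
A downloaded checkpoint can be checked against the listing above. Because the sha256 is elided in this card, the sketch compares only the file size and the visible head and tail of the digest. The helper names are hypothetical and the local path is assumed to be the working directory.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, size_bytes, digest_head, digest_tail):
    """Check size and the visible ends of the digest (the middle is elided)."""
    if os.path.getsize(path) != size_bytes:
        return False
    hexdigest = sha256_file(path)
    return hexdigest.startswith(digest_head) and hexdigest.endswith(digest_tail)


if __name__ == "__main__":
    path = "clm_ref_pytorch_cuda_3b.pt"  # assumed local download location
    if os.path.exists(path):
        ok = matches_listing(path, 12_596_300_742, "ebe56db7", "33c4d24c9")
        print("checkpoint matches listing:", ok)
```

As a cross-check, 3,149,030,400 parameters × 4 bytes = 12,596,121,600 B, so the listed size would be consistent with fp32 weights plus a small amount of config and serialization overhead. That reading is an inference from the numbers, not something the card states.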

Provenance

Trained 2026-06-02, vast.ai H100 80GB HBM3, image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel.
Corpus: dancinlab/clm-backbone-5lang-sample (c4 mC4 5-lang backbone, ODC-BY).
anima domain: CLM+KOSMOS, Lane-G-ref line, 3B rung.
