clm-v1-ref-pytorch-cuda — Lane-G-ref PyTorch+CUDA BASELINE REFERENCE

substrate = PyTorch-CUDA · lane = Lane-G-ref

THIS IS A BASELINE REFERENCE PROBE, NOT the production artifact. The production / PUBLIC-grade Lane-G CLM MUST be the hexa-native flame+forge stack (compiler-only NN, with NO PyTorch / ATen / Python in the trained binary) per anima governance a_train_flame_forge. This PyTorch+CUDA model exists ONLY to set a throughput / GPU-utilization reference number: what a well-fed H100 trivially achieves on this byte-level char-LM workload, and therefore the bar the forge line's util-GREEN endgame is implicitly chasing. It does NOT satisfy or replace the forge PUBLIC artifact (a_completeness_over_cheap: it is an optional baseline probe, never the primary), and it is NOT merged with Lane A / AKIDA (a_lane_akida_gpu_split).

What this is

A clean byte-level (V=256) decoder-only GPT trained with PyTorch AMP/bf16 on the same 5-lang c4 backbone corpus the forge line uses (dancinlab/clm-backbone-5lang-sample, 67.7 MB, ODC-BY), so the H100 utilization it reaches is a roughly like-for-like (not exact) reference for the forge util-GREEN goal.

Config

field	value
arch	byte-level decoder-only GPT (tied embeddings)
vocab	256 (byte-level — matches the forge int4-envelope corpus)
d_model	768
n_layer	12
n_head	12
block (ctx)	512
batch	32
params	85,645,824 (~85.6M)
precision	bf16 AMP, TF32 matmul
steps	3000
optimizer	AdamW (cosine LR, warmup 100)
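
The optimizer row (cosine LR, warmup 100, over 3000 steps) can be sketched as a plain function of the step index. This is a minimal illustration, not the schedule code from clm_ref_pytorch_cuda.py: the peak and floor learning rates (`max_lr`, `min_lr`) are hypothetical placeholders because the card does not state them, and linear warmup is an assumption.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=3000):
    """Linear warmup followed by cosine decay.

    max_lr and min_lr are hypothetical values chosen for illustration;
    warmup=100 and total=3000 come from the Config table.
    """
    if step < warmup:
        # Ramp linearly so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 50, 100, 1550, 3000):
        print(s, lr_at(s))
```

With PyTorch, the same function could be applied per step by writing its result into each optimizer parameter group's `lr` entry before calling `optimizer.step()`.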

Scale scoped honestly (a_scale_honest_scope): this is a ~85.6M-param baseline. It is broadly comparable to the forge d768/12L rung (44.68M params) in width and depth, though not in parameter count, and its byte vocab matches the forge int4-envelope corpus.
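
The 85,645,824 figure can be reproduced from the Config table under a GPT-2-style layout. The layout used here is an assumption rather than something read from clm_ref_pytorch_cuda.py: learned absolute positional embeddings, a 4x MLP expansion, biases on every linear layer, and LayerNorm with weight and bias. It is shown because it reproduces the stated count exactly.

```python
def gpt_param_count(vocab, d_model, n_layer, block, tied=True, bias=True):
    """Parameter count for an assumed GPT-2-style decoder-only model."""
    b = 1 if bias else 0
    tok_emb = vocab * d_model
    pos_emb = block * d_model                      # assumed: learned positions
    ln = d_model * (1 + b)                         # LayerNorm weight (+ bias)
    # Attention: fused QKV projection plus output projection.
    attn = (3 * d_model * d_model + b * 3 * d_model) + (d_model * d_model + b * d_model)
    # MLP: assumed 4x expansion, then projection back to d_model.
    mlp = (4 * d_model * d_model + b * 4 * d_model) + (4 * d_model * d_model + b * d_model)
    per_layer = 2 * ln + attn + mlp
    head = 0 if tied else vocab * d_model          # tied head adds no parameters
    return tok_emb + pos_emb + n_layer * per_layer + ln + head

total = gpt_param_count(vocab=256, d_model=768, n_layer=12, block=512)
print(f"{total:,}")  # 85,645,824
```

Because the vocabulary is only 256 entries, the embeddings contribute well under 1M parameters; almost all of the count sits in the 12 transformer blocks.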

Reference numbers (verbatim, this run)

GPU utilization: PEAK = 100.0 % · MEAN = 98.85 % (n=89 nvidia-smi samples, H100 80GB HBM3), mem_peak = 8529 MiB, mean power 587.8 W.
Throughput: 272,622 tok/s (49.15M tokens in 180.3 s wall).
CE descent: PASS — val CE 5.58041 → 1.56885 (F-CLM-REF-DESCENT = 1).
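
The throughput and CE figures above can be cross-checked against the Config table with a few lines of arithmetic. This sketch assumes every step consumes one full batch of 32 x 512 tokens with no gradient accumulation, and that CE is reported in nats (the natural-log convention of PyTorch's cross-entropy loss); neither is stated explicitly in the card.

```python
import math

steps, batch, block = 3000, 32, 512   # from the Config table
wall_s = 180.3                        # rounded wall time from the card

tokens = steps * batch * block        # 49,152,000, i.e. the card's 49.15M
tok_per_s = tokens / wall_s           # close to the reported 272,622 tok/s;
                                      # the small gap comes from rounding wall_s

ce_start, ce_end = 5.58041, 1.56885   # val CE, assumed to be in nats
uniform_ce = math.log(256)            # CE of a uniform guess over 256 bytes
bits_per_byte = ce_end / math.log(2)  # convert nats/byte to bits/byte

print(f"tokens={tokens:,} tok/s={tok_per_s:,.0f}")
print(f"uniform CE={uniform_ce:.3f} nats, final={bits_per_byte:.3f} bits/byte")
```

The starting CE sits just above ln 256, which is what an untrained byte-level model is expected to produce, so the descent is measured from a sensible baseline.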

Reference vs the forge line (Lane-G, hexa-native flame+forge)

The forge production line's MEASURED util on the same corpus family is RED (host-feed-bound): the d768 forge rung hit util MEAN ≈ 0.78 % (PEAK 5 %), and the d1536/T512 lever-2 rung hit MEAN ≈ 0.50 % (PEAK 19 %). This PyTorch+CUDA baseline reaches ~99 % MEAN util on the equivalent byte-LM workload; that is, a well-fed H100 trivially saturates on this task. That ~99 % is the reference bar the forge util-GREEN endgame (target ≥20 %) is chasing. This model does NOT replace the forge artifact; the forge util-GREEN and the forge PUBLIC CLM remain the production target, unchanged and primary.
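
PEAK / MEAN utilization figures of this kind can be derived from periodic nvidia-smi samples. The sketch below is one way to do that and is not the trainer's own sampling code: the helper names are hypothetical, and the two sample lines in the demo are made-up illustrative values, not data from this run.

```python
import statistics
import subprocess

# Real nvidia-smi query flags; output is one CSV line per GPU, e.g. "100, 8529, 587.80".
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,power.draw",
         "--format=csv,noheader,nounits"]

def sample_once():
    """Return the CSV line for the first GPU (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return out.splitlines()[0]

def parse_sample(line):
    """Parse 'util, mem_mib, power_w' into three floats."""
    util, mem, power = (float(x) for x in line.split(","))
    return util, mem, power

def summarize(lines):
    """Aggregate samples into peak/mean statistics."""
    samples = [parse_sample(l) for l in lines if l.strip()]
    utils = [s[0] for s in samples]
    return {
        "n": len(samples),
        "util_peak": max(utils),
        "util_mean": statistics.fmean(utils),
        "mem_peak_mib": max(s[1] for s in samples),
        "power_mean_w": statistics.fmean(s[2] for s in samples),
    }

if __name__ == "__main__":
    # Illustrative values only; a real run would collect sample_once() in a loop.
    print(summarize(["100, 8529, 590.0", "98, 8500, 585.0"]))
```

Sampling roughly every two seconds over a ~180 s run would yield a sample count on the order of the n=89 reported above.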

Files

clm_ref_pytorch_cuda.pt — PyTorch state_dict + config (sha256 9882f5cb…371d321).
clm_ref_train.log.json — full training curve + util/throughput/descent summary.
clm_ref_pytorch_cuda.py — the trainer (BASELINE tool, not the production trainer).
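
A downloaded checkpoint can be checked against the digest listed above. Since the card shows the sha256 only in elided form, this sketch compares just the visible prefix and suffix; it is a sanity check under that limitation, not a full integrity verification. The helper names are hypothetical.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_elided(digest, prefix="9882f5cb", suffix="371d321"):
    """Compare only the parts of the digest that the card actually shows."""
    return digest.startswith(prefix) and digest.endswith(suffix)

if __name__ == "__main__":
    import sys
    digest = sha256_file(sys.argv[1])  # e.g. path to clm_ref_pytorch_cuda.pt
    print(digest, "OK" if matches_elided(digest) else "MISMATCH")
```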

Provenance

Trained 2026-06-02, vast.ai H100 80GB HBM3, image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel.
Corpus: dancinlab/clm-backbone-5lang-sample (c4 mC4 5-lang backbone, ODC-BY).
anima domain: CLM+KOSMOS, Lane-G-ref line.
