clm-v1-ref-pytorch-cuda-7b

One-line summary: Lane-G-ref · substrate = PyTorch-CUDA · 7.25B-param byte-level GPT REFERENCE rung (bounded-budget, NOT converged).

Family: clm (byte-level causal LM, V=256)
Stage: ref (from-scratch PyTorch-CUDA reference ladder)
Step: step-400 (bounded reference run, NOT converged)
Substrate: PyTorch-CUDA (single H100 80GB), NOT the hexa-native flame+forge production stack

Origin

What this checkpoint is and how it was produced.

This is the 7.25B-parameter rung of the anima Lane-G reference ladder (85.6M → 3.149B → 7.25B). It is a byte-level (V=256) decoder-only GPT (Llama-7B-like width: d_model 4096 with 32 heads, at 36 layers, adapted to a byte vocabulary) trained from scratch with PyTorch + CUDA AMP/bf16 + gradient checkpointing + 8-bit AdamW on a single H100 80GB.

Base model: none (from-scratch byte-level GPT)
Training data: dancinlab/clm-backbone-5lang-sample (same 5-lang c4 backbone as the 85.6M PUBLIC ref and the 3.149B ref), flattened to a UTF-8 byte stream; 6,553,600 byte tokens seen (400 steps × batch 32 × block 512)
Training recipe: 400 steps (warmup 20, cosine LR, base lr 1.6e-4), batch 32, block 512, grad_accum 1
Compute: 1× NVIDIA H100 80GB HBM3 (vast.ai pod 39115197), AMP bf16 + grad-ckpt, wall 884.9 s
Trainer: clm_ref_pytorch_cuda_7b.py (PyTorch-CUDA, CUDA-required) — included in this repo
Final metric: val_CE 5.36063 → 2.41208 (F_CLM_REF_7B_DESCENT=1, descent PASS)
Lineage: anima ENGINE+CLM+KOSMOS · branch lane-g/d768-cuda-fire · Lane-G-ref ladder 85.6M (dancinlab/clm-v1-ref-pytorch-cuda) → 3.149B (dancinlab/clm-v1-ref-pytorch-cuda-3b) → 7.25B (this repo)
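
The recipe above (warmup 20, cosine LR, base lr 1.6e-4, 400 steps) can be sketched as a plain function of the step index. This is a minimal sketch, not the code from clm_ref_pytorch_cuda_7b.py: the function name `lr_at`, the linear warmup shape, and the floor `min_lr=0.0` are assumptions, since the card only states the warmup length, the schedule family, and the base rate.

```python
import math

def lr_at(step, base_lr=1.6e-4, warmup=20, total=400, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    min_lr and the exact warmup shape are assumptions; the card only says
    "warmup 20, cosine LR, base lr 1.6e-4".
    """
    if step < warmup:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 19, 20, 200, 399):
        print(s, lr_at(s))
```

If the actual trainer decays to a nonzero floor, only `min_lr` changes; the warmup and cosine structure stay the same.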

Result (verbatim from `clm_ref_7b_train.log.json`)

metric	value	verdict
descent	val_CE 5.360630989 → 2.412078857 (F_CLM_REF_7B_DESCENT=1)	🟢 PASS
GPU util	PEAK 100.0% · MEAN 99.1788990825688% (n=436)	🟢 PASS (≫20%)
throughput	7406.1 tok/s final · 6,553,600 tok seen	—
mem peak	46,025 MiB	—
power mean	651.3842201834855 W	—
wall	884.9 s	—

Closure = PASS (descent 🟢 AND util 🟢) → this reference rung is PUBLIC. It is still NOT converged (bounded 400 steps); do not deploy.
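
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below recomputes the token count from the recipe, derives a run-average throughput from the wall time, and converts the final val_CE to bits per byte. It assumes val_CE is reported in nats (natural-log cross-entropy); the card does not state the unit. The run-average throughput is a derived number and need not equal the "final" throughput logged at the last step.

```python
import math

# Values copied from the card.
steps, batch, block = 400, 32, 512
wall_s = 884.9
first_ce, last_ce = 5.360630989, 2.412078857

# Tokens (bytes) seen: one block of `block` bytes per sequence, `batch` sequences per step.
tokens_seen = steps * batch * block

# Average throughput over the whole run (derived, not a logged value).
mean_tok_per_s = tokens_seen / wall_s

# Cross-entropy of a uniform guess over 256 byte values, in nats (assumed unit).
uniform_ce = math.log(256)

# Final loss expressed as bits per byte, again assuming nats.
bits_per_byte = last_ce / math.log(2)

if __name__ == "__main__":
    print(f"tokens seen      : {tokens_seen:,}")
    print(f"mean tok/s       : {mean_tok_per_s:.1f}")
    print(f"uniform CE (nats): {uniform_ce:.4f}")
    print(f"final bits/byte  : {bits_per_byte:.3f}")
```

Under the nats assumption, the first val_CE sits slightly below the uniform baseline ln 256 ≈ 5.545, and the final value is about 3.48 bits per byte.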

Falsifiers

Concrete reproducible tests this checkpoint passes.

F-CLM-REF-7B-DESCENT: from-scratch byte GPT at 7.25B params trains (CE descends) under the reference recipe.
- Pass criterion: last_val_CE < first_val_CE.
- Last result: PASS — 5.360630989 → 2.412078857 (F_CLM_REF_7B_DESCENT=1), run pod 39115197.
F-CLM-REF-7B-UTIL: the PyTorch-CUDA reference trainer saturates the GPU at 7B scale (util ≫ 20%).
- Pass criterion: util_mean > 20% over the nvidia-smi sample window.
- Last result: PASS — PEAK 100.0% · MEAN 99.1788990825688% (n=436), run pod 39115197.
F-CLM-REF-7B-NOT-CONVERGED (deliberate honesty falsifier): this is NOT a converged production artifact.
- Pass criterion: the run is bounded (N=400 steps) and makes no convergence claim.
- Last result: HELD — bounded 400-step reference rung only (a_scale_honest_scope).
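
The pass criteria above are simple enough to state as code. The following is an illustrative sketch only: the function name `evaluate_falsifiers` and its arguments are hypothetical, and it takes plain Python lists rather than reading clm_ref_7b_train.log.json, whose field names are not given in this card. The "makes no convergence claim" half of the third falsifier is a documentation property and is not machine-checked here; only boundedness is.

```python
from statistics import mean

def evaluate_falsifiers(val_ce_curve, util_samples, steps_run,
                        step_budget=400, util_floor=20.0):
    """Apply the card's pass criteria to already-extracted measurements.

    val_ce_curve : validation CE values in evaluation order
    util_samples : GPU utilisation samples in percent (e.g. from nvidia-smi)
    steps_run    : number of optimizer steps actually executed
    """
    descent = val_ce_curve[-1] < val_ce_curve[0]   # last_val_CE < first_val_CE
    util = mean(util_samples) > util_floor         # util_mean > 20%
    bounded = steps_run <= step_budget             # run stayed within N=400 steps
    return {
        "F-CLM-REF-7B-DESCENT": descent,
        "F-CLM-REF-7B-UTIL": util,
        "F-CLM-REF-7B-NOT-CONVERGED": bounded,
        "closure": descent and util,               # closure = descent AND util
    }

if __name__ == "__main__":
    # Endpoints of the reported curve; the utilisation list is a made-up stand-in.
    print(evaluate_falsifiers([5.360630989, 2.412078857], [100.0, 98.4], 400))
```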

Substrate

Hardware / software / data dependencies.

Inference VRAM (bf16): ~14.5 GB for weights (state_dict) plus activation memory; training memory peak measured at 46,025 MiB on H100 80GB
Min Python: 3.10
Required: torch (CUDA build) ≥ 2.4, bitsandbytes (AdamW8bit, training only)
Input format: raw UTF-8 byte stream (V=256, no tokenizer)
Context window: 512 bytes (block)
Tokenizer: none — byte-level (byte == token id, 0..255)
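
Because byte == token id, input preparation is just UTF-8 encoding plus windowing to the 512-byte context. A minimal sketch follows; the helper names `encode`, `decode`, and `windows` are illustrative and are not taken from the trainer or from prep_corpus_7b.py.

```python
def encode(text: str) -> list[int]:
    """Map text to token ids: each UTF-8 byte is its own id in 0..255."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    """Map ids back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")

def windows(ids: list[int], block: int = 512) -> list[list[int]]:
    """Split ids into consecutive windows of at most `block` bytes."""
    return [ids[i:i + block] for i in range(0, len(ids), block)]

if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)          # 'é' occupies two bytes in UTF-8
    print(decode(ids))
```

Note that a multi-byte character can be split across a window boundary, which is why `decode` uses `errors="replace"` rather than raising.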

Architecture

field	value
vocab	256 (byte-level)
d_model	4096
n_layer	36
n_head	32 (head_dim 128)
block	512
params	7,252,828,160 (7.25B)
dtype	bf16 (master weights + grads)
optimizer	bitsandbytes AdamW8bit (8-bit states)
grad_ckpt	true
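
The parameter total in the table can be reproduced from the other fields. The sketch below uses a GPT-2-style layout: learned position embeddings, a 4× MLP, biases on every linear layer, LayerNorm with weight and bias, and an output head tied to the token embedding. That layout is an assumption, chosen because it matches the reported total exactly; the module definitions in clm_ref_pytorch_cuda_7b.py are not shown in this card. The memory function is likewise a rough static estimate (2 bytes per parameter for bf16 weights, 2 for bf16 grads, 2 for the two 8-bit AdamW states) and ignores activations, quantization statistics, and allocator overhead.

```python
def gpt2_style_param_count(vocab=256, d_model=4096, n_layer=36, block=512):
    """Parameter count under an assumed GPT-2-style layout with a tied head."""
    tok_emb = vocab * d_model                      # shared with the output head (assumed)
    pos_emb = block * d_model                      # learned absolute positions (assumed)
    attn = 4 * d_model * d_model + 4 * d_model     # qkv + output projection, with biases
    mlp = 8 * d_model * d_model + 5 * d_model      # d->4d and 4d->d, with biases
    norms = 2 * 2 * d_model                        # two LayerNorms, weight + bias each
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm

def static_train_mib(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Static training memory in MiB: weights + grads + optimizer states only."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 2**20

if __name__ == "__main__":
    n = gpt2_style_param_count()
    print(f"params          : {n:,}")               # 7,252,828,160
    print(f"bf16 weights    : {2 * n:,} bytes")
    print(f"static train mem: {static_train_mib(n):,.0f} MiB")
```

Under these assumptions the bf16 weights come to 14,505,656,320 bytes, close to the listed checkpoint size, and the static training estimate is roughly 41,500 MiB, leaving a few GiB of the measured 46,025 MiB peak for activations and overhead.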

Caveats

Honest limitations (raw#10).

NOT a converged production artifact. Bounded N=400 steps. It demonstrates only that (a) the same ByteGPT/Transformer arch trains (CE descends) and (b) it saturates the GPU (util ≫ 20%) at 7B scale. Do NOT deploy.
NOT the production / PUBLIC-grade model. The production Lane-G CLM must be the hexa-native flame+forge stack (compiler-only NN; no PyTorch/ATen/Python in the trained binary) per governance a_train_flame_forge. This torch trainer is an a_completeness_over_cheap optional baseline/reference, never the primary, and never claimed as the forge artifact.
NEVER merged with Lane A / AKIDA (a_lane_akida_gpu_split). This is a pure GPU (Lane-G) reference; AKIDA on-chip (Lane A) results are tracked separately and must never be blended into one verdict.
Scale-honest scope (a_scale_honest_scope): a single 400-step rung is not a convergence or generalization claim; it is the 7.25B point of the 85.6M→3.149B→7.25B reference ladder.

Composability

Same arch/recipe as the 85.6M (dancinlab/clm-v1-ref-pytorch-cuda) and 3.149B (dancinlab/clm-v1-ref-pytorch-cuda-3b) ladder rungs — directly comparable scale points (only d_model / n_layer / n_head differ).
The trainer (clm_ref_pytorch_cuda_7b.py) + corpus-prep (prep_corpus_7b.py) are included so the rung is fully reproducible.
This reference is composable as a baseline/control against the forge production CLM (Lane-G flame+forge) — it is NOT a substitute for it.

Files

clm_ref_pytorch_cuda_7b.pt — model state_dict + config (bf16); sha256 38ef2ed55b47b670fa915bba0c2827782799a9070ba087210cd44db1fddb4d41, 14,505,817,922 bytes
clm_ref_7b_train.log.json — full training curve + util/throughput
clm_ref_pytorch_cuda_7b.py — the trainer (PyTorch-CUDA, CUDA-required)
prep_corpus_7b.py — corpus prep (5-lang c4 backbone → byte stream)
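
Before loading a 14.5 GB checkpoint, it is worth confirming the download against the size and sha256 listed above. The sketch below streams the file in chunks with the standard-library `hashlib`; the function names `sha256_of` and `verify` are illustrative, and the local path is whatever location the file was downloaded to.

```python
import hashlib
import os

# Values copied from the Files list above.
EXPECTED_SHA256 = "38ef2ed55b47b670fa915bba0c2827782799a9070ba087210cd44db1fddb4d41"
EXPECTED_SIZE = 14_505_817_922

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so it never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the sha256 digest match."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap size check first, skips hashing a truncated file
    return sha256_of(path) == expected_sha256

if __name__ == "__main__":
    path = "clm_ref_pytorch_cuda_7b.pt"  # assumed local download location
    if os.path.exists(path):
        print("checkpoint OK" if verify(path) else "checkpoint MISMATCH")
```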
