clm-v1-ref-pytorch-cuda-7b

One-line summary: Lane-G-ref · substrate = PyTorch-CUDA · 7.25B-param byte-level GPT REFERENCE rung (bounded-budget, NOT converged).

  • Family: clm (byte-level causal LM, V=256)
  • Stage: ref (from-scratch PyTorch-CUDA reference ladder)
  • Step: step-400 (bounded reference run, NOT converged)
  • Substrate: PyTorch-CUDA (single H100 80GB), NOT the hexa-native flame+forge production stack

Origin

What this checkpoint is and how it was produced.

This is the 7.25B-parameter rung of the anima Lane-G reference ladder (85.6M → 3.149B → 7.25B). It is a byte-level (V=256) decoder-only GPT (Llama-7B-ish shape adapted to byte vocab) trained from scratch with PyTorch + CUDA AMP/bf16 + gradient checkpointing + 8-bit AdamW on a single H100 80GB.

  • Base model: none (from-scratch byte-level GPT)
  • Training data: dancinlab/clm-backbone-5lang-sample (same 5-lang c4 backbone as the 85.6M PUBLIC ref and the 3.149B ref), flattened to a UTF-8 byte stream; 6,553,600 tokens seen
  • Training recipe: 400 steps (warmup 20, cosine LR, base lr 1.6e-4), batch 32, block 512, grad_accum 1
  • Compute: 1× NVIDIA H100 80GB HBM3 (vast.ai pod 39115197), AMP bf16 + grad-ckpt, wall 884.9 s
  • Trainer: clm_ref_pytorch_cuda_7b.py (PyTorch-CUDA, CUDA-required), included in this repo
  • Final metric: val_CE 5.36063 → 2.41208 (F_CLM_REF_7B_DESCENT=1, descent PASS)
  • Lineage: anima ENGINE+CLM+KOSMOS · branch lane-g/d768-cuda-fire · Lane-G-ref ladder 85.6M (dancinlab/clm-v1-ref-pytorch-cuda) → 3.149B (dancinlab/clm-v1-ref-pytorch-cuda-3b) → 7.25B (this repo)
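The recipe numbers above can be cross-checked with a little arithmetic. The sketch below recomputes the token count from steps, batch, block, and grad_accum, and shows one common shape for a warmup-plus-cosine schedule. The schedule function is an assumption: the card states only "warmup 20, cosine LR, base lr 1.6e-4", so the linear warmup, the decay-to-zero floor, and the names `tokens_seen` and `lr_at` are illustrative and not taken from clm_ref_pytorch_cuda_7b.py.

```python
import math

# Values taken from the training recipe in this card.
STEPS, WARMUP, BASE_LR = 400, 20, 1.6e-4
BATCH, BLOCK, GRAD_ACCUM = 32, 512, 1
WALL_S = 884.9


def tokens_seen(steps=STEPS, batch=BATCH, block=BLOCK, grad_accum=GRAD_ACCUM):
    """Bytes consumed by the run: one token is one byte at V=256."""
    return steps * batch * block * grad_accum


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP, total=STEPS, min_lr=0.0):
    """ASSUMED schedule: linear warmup, then cosine decay to min_lr.

    The real trainer may use a different warmup shape or a non-zero floor.
    """
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(tokens_seen())                      # 6553600, matching "tokens seen"
print(round(tokens_seen() / WALL_S, 1))   # run-average tokens per second
```

The run-average throughput computed this way is about 7,406 tok/s, in line with the final-step figure reported in the results.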

Result (verbatim from clm_ref_7b_train.log.json)

| metric | value | verdict |
|---|---|---|
| descent | val_CE 5.360630989 → 2.412078857 (F_CLM_REF_7B_DESCENT=1) | 🟢 PASS |
| GPU util | PEAK 100.0% · MEAN 99.1788990825688% (n=436) | 🟢 PASS (≫20%) |
| throughput | 7406.1 tok/s final · 6,553,600 tok seen | - |
| mem peak | 46,025 MiB | - |
| power mean | 651.3842201834855 W | - |
| wall | 884.9 s | - |

Closure = PASS (descent 🟢 AND util 🟢) → this reference rung is PUBLIC. It is still NOT converged (bounded 400 steps); do not deploy.
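To put the val_CE figures in context, it helps to convert them to bits per byte and compare against a uniform guess over 256 byte values. The sketch below does that conversion. It assumes val_CE is a mean cross-entropy in nats per byte (the PyTorch default for cross-entropy loss); the card does not state the unit, and the function names are illustrative.

```python
import math

# Reported values from the results table.
FIRST_VAL_CE = 5.360630989
LAST_VAL_CE = 2.412078857


def bits_per_byte(ce_nats):
    """Convert cross-entropy in nats per byte (ASSUMED unit) to bits per byte."""
    return ce_nats / math.log(2)


def perplexity(ce_nats):
    """Per-byte perplexity for a cross-entropy given in nats."""
    return math.exp(ce_nats)


uniform_ce = math.log(256)  # cross-entropy of a uniform guess over 256 bytes

print(f"uniform baseline: {uniform_ce:.3f} nats = {bits_per_byte(uniform_ce):.1f} bits/byte")
print(f"first val_CE:     {bits_per_byte(FIRST_VAL_CE):.2f} bits/byte")
print(f"last val_CE:      {bits_per_byte(LAST_VAL_CE):.2f} bits/byte")
print(f"last perplexity:  {perplexity(LAST_VAL_CE):.2f}")
```

Under that assumption the first evaluation sits just below the uniform baseline of 8 bits per byte, and the last one is roughly 3.5 bits per byte, which is descent but far from a converged byte-level model.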

Falsifiers

Concrete reproducible tests this checkpoint passes.

  • F-CLM-REF-7B-DESCENT: from-scratch byte GPT at 7.25B params trains (CE descends) under the reference recipe.
    • Pass criterion: last_val_CE < first_val_CE.
    • Last result: PASS (5.360630989 → 2.412078857, F_CLM_REF_7B_DESCENT=1), run pod 39115197.
  • F-CLM-REF-7B-UTIL: the PyTorch-CUDA reference trainer saturates the GPU at 7B scale (util ≫ 20%).
    • Pass criterion: util_mean > 20% over the nvidia-smi sample window.
    • Last result: PASS (PEAK 100.0% · MEAN 99.1788990825688%, n=436), run pod 39115197.
  • F-CLM-REF-7B-NOT-CONVERGED (deliberate honesty falsifier): this is NOT a converged production artifact.
    • Pass criterion: the run is bounded (N=400 steps) and makes no convergence claim.
    • Last result: HELD (bounded 400-step reference rung only; a_scale_honest_scope).
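The three pass criteria above are simple predicates, so they can be written down as a small checker. This is a minimal sketch: the `RunSummary` fields and `check_falsifiers` are hypothetical names, and the values are typed in from this card rather than read from clm_ref_7b_train.log.json, whose schema is not documented here. The closure rule (descent AND util) follows the Result section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    # Hypothetical container; field names are not taken from the training log.
    first_val_ce: float
    last_val_ce: float
    util_mean_pct: float
    steps: int
    claims_convergence: bool


def check_falsifiers(run, util_floor_pct=20.0, step_bound=400):
    """Evaluate the card's pass criteria for one run summary."""
    descent = run.last_val_ce < run.first_val_ce
    util = run.util_mean_pct > util_floor_pct
    not_converged = run.steps <= step_bound and not run.claims_convergence
    return {
        "F-CLM-REF-7B-DESCENT": descent,
        "F-CLM-REF-7B-UTIL": util,
        "F-CLM-REF-7B-NOT-CONVERGED": not_converged,
        "closure": descent and util,  # closure = descent AND util
    }


run = RunSummary(
    first_val_ce=5.360630989,
    last_val_ce=2.412078857,
    util_mean_pct=99.1788990825688,
    steps=400,
    claims_convergence=False,
)
for name, passed in check_falsifiers(run).items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```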

Substrate

Hardware / software / data dependencies.

  • Inference VRAM (bf16): ~14.5 GB weights (state_dict) + activation memory; train mem peak measured 46,025 MiB on H100 80GB
  • Min Python: 3.10
  • Required: torch (CUDA build) ≥ 2.4, bitsandbytes (AdamW8bit, training only)
  • Input format: raw UTF-8 byte stream (V=256, no tokenizer)
  • Context window: 512 bytes (block)
  • Tokenizer: none; byte-level (byte == token id, 0..255)
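Because the input format is a raw UTF-8 byte stream with byte == token id, preparing a prompt needs no tokenizer files. The sketch below shows the encoding, the matching decode, and trimming to the 512-byte context window. The helper names are illustrative, and keeping the most recent bytes when trimming is an assumption about how a caller might handle long prompts, not documented trainer behaviour.

```python
BLOCK = 512   # context window in bytes
VOCAB = 256   # one token id per byte value


def encode(text):
    """Text -> token ids: each UTF-8 byte is its own id in 0..255."""
    return list(text.encode("utf-8"))


def decode(ids):
    """Token ids -> text. errors='replace' tolerates a multi-byte character
    that was cut in half by windowing or by sampling."""
    return bytes(ids).decode("utf-8", errors="replace")


def last_window(ids, block=BLOCK):
    """ASSUMED policy: keep only the most recent `block` bytes of a long prompt."""
    return ids[-block:]


ids = encode("héllo")
print(ids)          # [104, 195, 169, 108, 108, 111]  ('é' is two bytes)
print(decode(ids))  # héllo
```

Note that the context is counted in bytes, so non-ASCII text uses up the 512-byte window faster than its character count suggests.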

Architecture

| field | value |
|---|---|
| vocab | 256 (byte-level) |
| d_model | 4096 |
| n_layer | 36 |
| n_head | 32 (head_dim 128) |
| block | 512 |
| params | 7,252,828,160 (7.25B) |
| dtype | bf16 (master weights + grads) |
| optimizer | bitsandbytes AdamW8bit (8-bit states) |
| grad_ckpt | true |
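The parameter count in the table can be reproduced from the other fields under one plausible layout. The sketch below assumes a GPT-2-style block: fused QKV and output projection with biases, a 4x MLP with biases, two LayerNorms per block with weight and bias, a learned positional embedding, a final LayerNorm, and an output head tied to the token embedding. That layout is an assumption, not read from clm_ref_pytorch_cuda_7b.py; it is shown because it yields exactly the listed 7,252,828,160.

```python
def count_params(d_model=4096, n_layer=36, vocab=256, block=512):
    """Parameter count for an ASSUMED GPT-2-style layout (see lead-in)."""
    attn = 4 * d_model * d_model + 4 * d_model   # QKV (3d*d + 3d) + out proj (d*d + d)
    mlp = 8 * d_model * d_model + 5 * d_model    # up (d*4d + 4d) + down (4d*d + d)
    norms = 2 * 2 * d_model                      # two LayerNorms, weight + bias each
    per_layer = attn + mlp + norms
    embeddings = vocab * d_model + block * d_model   # token + learned positional
    final_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_norm  # tied head adds nothing


params = count_params()
print(f"{params:,}")       # 7,252,828,160
print(f"{params * 2:,}")   # 14,505,656,320 bytes at 2 bytes per bf16 weight
print(4096 // 32)          # 128, the head_dim in the table
```

At 2 bytes per weight this gives 14,505,656,320 bytes, about 160 KB less than the listed checkpoint size of 14,505,817,922 bytes, which is consistent with a bf16 state_dict plus a small config and serialization overhead.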

Caveats

Honest limitations (raw#10).

  • NOT a converged production artifact. Bounded N=400 steps. It demonstrates only that (a) the same ByteGPT/Transformer arch trains (CE descends) and (b) it saturates the GPU (util ≫ 20%) at 7B scale. Do NOT deploy.
  • NOT the production / PUBLIC-grade model. The production Lane-G CLM must be the hexa-native flame+forge stack (compiler-only NN; no PyTorch/ATen/Python in the trained binary) per governance a_train_flame_forge. This torch trainer is an a_completeness_over_cheap optional baseline/reference, never the primary, and never claimed as the forge artifact.
  • NEVER merged with Lane A / AKIDA (a_lane_akida_gpu_split). This is a pure GPU (Lane-G) reference; AKIDA on-chip (Lane A) results are tracked separately and must never be blended into one verdict.
  • Scale-honest scope (a_scale_honest_scope): a single 400-step rung is not a convergence or generalization claim; it is the 7.25B point of the 85.6M→3.149B→7.25B reference ladder.

Composability

  • Same arch/recipe as the 85.6M (dancinlab/clm-v1-ref-pytorch-cuda) and 3.149B (dancinlab/clm-v1-ref-pytorch-cuda-3b) ladder rungs, so the three are directly comparable scale points (only d_model / n_layer / n_head differ).
  • The trainer (clm_ref_pytorch_cuda_7b.py) + corpus-prep (prep_corpus_7b.py) are included so the rung is fully reproducible.
  • This reference is composable as a baseline/control against the forge production CLM (Lane-G flame+forge); it is NOT a substitute for it.

Files

  • clm_ref_pytorch_cuda_7b.pt: model state_dict + config (bf16); sha256 38ef2ed55b47b670fa915bba0c2827782799a9070ba087210cd44db1fddb4d41, 14,505,817,922 bytes
  • clm_ref_7b_train.log.json: full training curve + util/throughput
  • clm_ref_pytorch_cuda_7b.py: the trainer (PyTorch-CUDA, CUDA-required)
  • prep_corpus_7b.py: corpus prep (5-lang c4 backbone → byte stream)
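Since the checkpoint entry lists both a sha256 and a byte size, a download can be verified before loading it. The sketch below uses only the Python standard library and streams the file in chunks so the 14.5 GB checkpoint never has to fit in memory. The function names and the local path in the usage comment are illustrative.

```python
import hashlib
from pathlib import Path

# Values listed for clm_ref_pytorch_cuda_7b.pt in this card.
EXPECTED_SHA256 = "38ef2ed55b47b670fa915bba0c2827782799a9070ba087210cd44db1fddb4d41"
EXPECTED_BYTES = 14_505_817_922


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256=EXPECTED_SHA256, expected_bytes=EXPECTED_BYTES):
    """Cheap size check first, then the full hash."""
    p = Path(path)
    if p.stat().st_size != expected_bytes:
        return False
    return sha256_of(p) == expected_sha256


# Usage, assuming the checkpoint sits in the current directory:
#   ok = verify("clm_ref_pytorch_cuda_7b.pt")
```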