language:
  - en
license: apache-2.0
tags:
  - pytorch
  - text-generation
  - 1.58bit
  - ternary
  - byte-level
  - mlgru
  - ablation
library_name: pytorch
pipeline_tag: text-generation
model_type: custom

CPU-1 Ablation Study — Ready-to-use Checkpoints (fp32 unpacked)

Repo: Cukinator/cpu1-ablations-final
Source: Cukinator/cpu1-ablation-checkpoints (compact 2-bit, raw training output)
Code: github.com/Cukinator/1.58bits
Dataset: Cukinator/cpu1-ablation-dataset

Each checkpoint is a standard PyTorch .pt file with float32 weights — no unpacking needed. Compatible with train_ablation.py from the 1.58bits repo.

Currently uploaded: 20 trained runs. The 39M *_r2 runs have not been unpacked to fp32 here, and the four *_r3 configs are queued but not yet trained.

Quick start

import sys

import torch

# Make train_ablation.py from the 1.58bits repo importable.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate

# Each checkpoint stores the run config next to the fp32 state dict.
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()

text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)

Ablation chain — 50M runs (round 1, 2 tok/param)

Run	Architecture	Params	Val Loss	Perplexity	Throughput (P/s, CPU)
run_01	Transformer + BPE (16K vocab) + FP16	54.7M	4.66	106.1	72.4
run_02a_byte_only_heads	Transformer + Byte + 4 indep. heads (no LBD)	38.5M	2.31	10.1	84.8
run_02	Transformer + Byte + LocalByteDecoder	38.8M	1.72	5.56	75.4
run_03	MLGRU + Byte + FP16	38.8M	1.87	6.49	58.0
run_04	MLGRU + Byte + Ternary	38.9M	5.57	261.7	9.1
run_05	+ FPResidual	39.0M	5.55	257.7	9.7
run_05b_kernel_strict	MLGRU kernel-strict (no W_o)	35.8M	5.59	268.8	10.2
run_06	+ Bolmo patch embedding	39.0M	5.56	258.8	9.1
run_07	+ DeleteGate (CPU-1 complete)	39.0M	5.56	258.8	7.3
run_08	Folded Transformer + Byte + Ternary	38.9M	5.56	260.4	9.7
run_09	+ PFNet	39.4M	5.53	252.9	8.5
run_10	+ learned per-channel decay	39.4M	5.53	253.1	8.7

Small runs — 10M (round 1, 2 tok/param)

Run	Architecture	Params	Val Loss	Perplexity
run_13	Small CPU-1 + BPE (4K vocab)	12.5M	30.54	4.85×10⁸
run_14	Small CPU-1 byte-level (Qwen logprob distillation)	10.7M	5.58	263.9
run_15	Small CPU-1 byte + hidden distillation (EmbeddingAligner)	10.7M	5.58	263.9
run_16	Small CPU-1 raw bytes, no teacher (FineWeb)	10.7M	5.58	263.9
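
The perplexity columns follow from the validation loss as exp(loss), with the loss in nats. A minimal sketch of that conversion is below. The `cap` argument is hypothetical and is not taken from train_ablation.py; it is included only because the run_13 figure, 4.85×10⁸, equals exp(20) rather than exp(30.54) ≈ 1.8×10¹³, which would be consistent with the exponent having been clamped at 20 when the table was produced.

```python
import math


def perplexity(val_loss, cap=None):
    """Convert a loss in nats to perplexity, optionally clamping the exponent.

    `cap` is a hypothetical parameter used here for illustration only.
    """
    if cap is not None:
        val_loss = min(val_loss, cap)
    return math.exp(val_loss)


# Validation losses copied from the tables above (rounded to two decimals).
for run, loss in [("run_02", 1.72), ("run_03", 1.87), ("run_04", 5.57)]:
    print(f"{run}: loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")

# run_13: unclamped vs. clamped at 20 nats.
print(f"run_13 unclamped: {perplexity(30.54):.3g}")
print(f"run_13 cap=20:    {perplexity(30.54, cap=20):.3g}")
```

Small differences from the tabulated perplexities are expected, since the table was presumably computed from unrounded losses.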

Round 2 — re-runs at 15 tok/param (7.5× the original budget)

These re-runs test whether the ~5.55 nat floor on the ternary runs was caused by under-training.

10M re-runs (1336 steps each, all uploaded)

Run	Architecture	Val Loss (r1)	Val Loss (r2)	Δ
run_13_r2	Small CPU-1 + BPE	30.54	25.43	−5.1
run_14_r2	Small CPU-1 byte-level	5.5754	5.5755	+0.0001
run_15_r2	Small CPU-1 byte + hidden distill	5.5755	5.5755	0
run_16_r2	Small CPU-1 raw bytes	5.5754	5.5754	0
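
The 7.5× figure is the ratio of the two tokens-per-parameter targets, and the absolute token budget follows by multiplying by the parameter count. The sketch below works this out for the 10.7M byte-level runs. The tokens-per-step value is only what the reported 1336 steps would imply if the full budget was consumed; it is not a setting read from the training config.

```python
def token_budget(params, tokens_per_param):
    """Total training tokens for a parameter count and a tok/param target."""
    return params * tokens_per_param


PARAMS_10M = 10.7e6   # run_14 / run_15 / run_16 parameter count
R1_TPP, R2_TPP = 2, 15
R2_STEPS = 1336       # steps reported for the 10M re-runs

r1_tokens = token_budget(PARAMS_10M, R1_TPP)
r2_tokens = token_budget(PARAMS_10M, R2_TPP)

print(f"budget ratio r2/r1: {r2_tokens / r1_tokens:.1f}x")
print(f"r1 budget: {r1_tokens / 1e6:.1f}M tokens")
print(f"r2 budget: {r2_tokens / 1e6:.1f}M tokens")
# Implied tokens per optimizer step, assuming the full budget was consumed.
print(f"implied tokens/step in r2: {r2_tokens / R2_STEPS:,.0f}")
```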

39M re-runs (in progress)

run_04_r2, run_05_r2, run_07_r2, run_08_r2, run_09_r2 and run_10_r2 were started at 15 tok/param. As of 2026-05, only run_04_r2 and run_07_r2 have intermediate step checkpoints (at ≈97% of training). None of the six has been unpacked to fp32 in this repo, because the audit below showed convergence to the same uniform floor as the r1 counterparts. The intermediate checkpoints remain available in the source compact 2-bit repo, Cukinator/cpu1-ablation-checkpoints.

Byte-level cross-entropy loss (nats), measured manually on a sample of English narrative using the latest run_04_r2 and run_07_r2 step checkpoints, with the FP16 and r1 runs included for reference:

Run	Loss (sample)	Δ vs uniform (5.545)
run_02 (FP16 Transformer, reference)	4.37	−1.18
run_03 (FP16 MLGRU, reference)	3.97	−1.58
run_04 (Ternary r1, 2 tok/p)	5.55	+0.01
run_04_r2 (Ternary r2, step 4840 / ~5000)	5.57	+0.02
run_07 (CPU-1 complete r1)	5.56	+0.02
run_07_r2 (CPU-1 complete r2, step 4860)	5.56	+0.02
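
The loss column is a per-byte cross-entropy in nats, and Δ is its distance from ln(256). A minimal sketch of such a measurement follows. It assumes a hypothetical `predict` callable that returns a 256-entry probability list for the next byte given the preceding bytes; this stands in for a real model and is not part of train_ablation.py.

```python
import math

UNIFORM_NATS = math.log(256)  # about 5.545


def byte_cross_entropy(data, predict):
    """Mean negative log-probability (nats) of each byte given its prefix."""
    total = 0.0
    for i in range(1, len(data)):
        probs = predict(data[:i])          # 256 probabilities summing to 1
        total -= math.log(probs[data[i]])
    return total / (len(data) - 1)


def uniform_predict(prefix):
    """Stand-in for a collapsed model: every byte value equally likely."""
    return [1.0 / 256] * 256


sample = "Once upon a time there was a fox.".encode("utf-8")
loss = byte_cross_entropy(sample, uniform_predict)
print(f"loss {loss:.3f} nats, delta vs uniform {loss - UNIFORM_NATS:+.3f}")
```

A model that has learned anything about English text should land well below zero in the Δ column, as the two FP16 reference rows do.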

Round 3 — cold-start rescue (queued, not uploaded)

After the audit, four configurations were added to RUN_CONFIGS / SMALL_RUN_CONFIGS. Each applies the following four corrections together:

bf16 AMP instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
lr_scale=2.0 on BitLinear (BitNet b1.58 §3.1 prescription)
CE-only training signal (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
50 tok/param (3.3× r2, close to BitNet's lower bound at 3B scale)

Configs: run_04_r3, run_07_r3, run_14_r3, run_15_r3. Not yet trained. If they reach val_loss < ~5.0 they will be uploaded here.
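
As a rough picture of what such an override set might look like, here is a sketch with hypothetical key names; the real entries live in RUN_CONFIGS / SMALL_RUN_CONFIGS in train_ablation.py and may be structured differently. The helper only encodes the stated upload criterion.

```python
# Hypothetical key names for illustration; not the actual config schema.
R2_TOKENS_PER_PARAM = 15

r3_overrides = {
    "amp_dtype": "bfloat16",      # bf16 autocast; loss scaling is typically unnecessary
    "bitlinear_lr_scale": 2.0,    # learning-rate multiplier on BitLinear parameters
    "loss": "ce_only",            # no KL distillation or auxiliary losses
    "tokens_per_param": 50,
}

UPLOAD_THRESHOLD = 5.0  # nats; the stated "val_loss < ~5.0" criterion


def should_upload(val_loss, threshold=UPLOAD_THRESHOLD):
    """True if a run clears the stated validation-loss criterion."""
    return val_loss < threshold


ratio = r3_overrides["tokens_per_param"] / R2_TOKENS_PER_PARAM
print(f"r3 budget relative to r2: {ratio:.1f}x")
print(should_upload(5.56), should_upload(4.2))
```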

Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline

The reference value is ln(256) = 5.545 nats, the entropy of a uniform distribution over the 256 possible byte values. All byte-level ternary runs (round 1 and round 2, both scales) plateau within 0.05 nats of this floor, which means the models are effectively producing uniform predictions over bytes.
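
For reference, the uniform value follows directly from the entropy definition. The short derivation below also converts it to bits per byte, a unit not otherwise used in this card.

```latex
H_{\text{uniform}}
  = -\sum_{b=0}^{255} \frac{1}{256}\,\ln\frac{1}{256}
  = \ln 256
  = 8\ln 2
  \approx 5.545\ \text{nats}
  = 8\ \text{bits per byte}
```

On the same scale, the 1.72 nat validation loss of run_02 corresponds to about 2.48 bits per byte.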

More training does not help. For the 10M byte-level runs, a 7.5× increase in tokens per parameter moves the loss by at most 0.0001 nats. A weight-evolution audit on run_14_r2 showed that >99.9% of BitLinear weights are in the same ternary state at step 10 and at step 1326, that is, across essentially the whole run. The bottleneck is the cold-start dynamics of straight-through-estimator (STE) ternary quantisation, not the data budget.
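
A weight-evolution check of this kind can be sketched as follows: ternarise the same weight tensor at two training steps and count the entries whose ternary state is unchanged. The absmean quantiser here follows the BitNet b1.58 formulation (scale by the mean absolute weight, round, clip to {-1, 0, +1}); whether the BitLinear layers in this repo use exactly this rule is an assumption, and the weight values are made up for illustration.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarisation: scale by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def unchanged_fraction(weights_a, weights_b):
    """Fraction of positions whose ternary state is identical in both snapshots."""
    q_a, q_b = ternarize(weights_a), ternarize(weights_b)
    same = sum(a == b for a, b in zip(q_a, q_b))
    return same / len(q_a)


# Made-up latent weights at an early and a late step.
early = [0.9, -0.8, 0.05, 0.4, -0.02]
late_small_drift = [0.91, -0.79, 0.06, 0.41, -0.03]
late_one_flip = [0.91, -0.79, 0.6, 0.41, -0.03]

print(unchanged_fraction(early, late_small_drift))
print(unchanged_fraction(early, late_one_flip))
```

Small drifts in the latent weights leave every ternary state in place; only a change large enough to cross a rounding boundary registers as a flip.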

Compare against the FP16 sibling architectures:

run_02 (Transformer + byte + FP16, 39M): 1.72
run_03 (MLGRU + byte + FP16, 39M): 1.87
run_04 (MLGRU + byte + ternary, same 39M): 5.57 ← collapsed

Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param; Slender-Mamba with FP16 warm-start) reach functional performance, but the cold-start regime at <50 tok/param and <100M parameters that this study operates in is not covered by their results.

Throughput is also worth flagging: even with ideal BitNet.cpp / T-MAC-class kernels (4–6× speedup on the matmul fraction), the projected end-to-end throughput of the 39M ternary models is 0.30–0.50× that of their FP16 Transformer sibling at the same scale. The 1.58-bit speed advantage only materialises above ~700M parameters, where memory bandwidth for the weights becomes the bottleneck.
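
A projection of this kind can be reproduced in outline with Amdahl's law: if a fraction f of the runtime is matmul and only that fraction is sped up by a factor s, the end-to-end speedup is 1 / ((1 - f) + f / s). In the sketch below the matmul fraction of 0.9 is an illustrative assumption, not a measured value, and the throughputs are the run_04 and run_02 entries from the round 1 table. The full projection in the main repository may use different inputs.

```python
def amdahl_speedup(matmul_fraction, matmul_speedup):
    """End-to-end speedup when only the matmul share of runtime gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


TERNARY_THROUGHPUT = 9.1   # run_04, table units (P/s, CPU)
FP16_THROUGHPUT = 75.4     # run_02, same units
MATMUL_FRACTION = 0.9      # assumed share of runtime spent in matmuls

for kernel_speedup in (4.0, 6.0):
    overall = amdahl_speedup(MATMUL_FRACTION, kernel_speedup)
    projected = TERNARY_THROUGHPUT * overall
    print(f"{kernel_speedup:.0f}x kernels: {overall:.2f}x end-to-end, "
          f"{projected / FP16_THROUGHPUT:.2f}x of the FP16 sibling")
```

With these inputs the projected ternary throughput stays below that of the FP16 Transformer for both kernel speedups, because the non-matmul share of the runtime is untouched by faster kernels.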

Practical implication. The byte-level ternary chain (runs 04–10, 14–16, and their r2 counterparts) cannot distinguish between architectural variants because all of them are stuck at the same numerical floor. The architectures themselves may be sound; the training recipe needs to either (a) warm-start from FP weights, (b) anneal the quantisation, or (c) use a much larger token budget.
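
Option (b) could take several forms. One simple sketch, which is not the recipe used in this study, blends each latent weight with its ternarised value and ramps the blend factor from 0 (full precision) to 1 (fully ternary) over a ramp period. The function names, the linear schedule and the absmean quantiser are all illustrative assumptions.

```python
def anneal_factor(step, ramp_steps):
    """Linear ramp from 0.0 at step 0 to 1.0 at `ramp_steps` and beyond."""
    return min(1.0, max(0.0, step / ramp_steps))


def ternarize_scaled(weights, eps=1e-8):
    """Absmean ternarisation, rescaled back to the weights' magnitude."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) * scale for w in weights]


def annealed_weights(weights, lam):
    """Blend latent weights with their ternarised version by factor `lam`."""
    quantised = ternarize_scaled(weights)
    return [(1.0 - lam) * w + lam * q for w, q in zip(weights, quantised)]


latent = [0.9, -0.8, 0.05, 0.4, -0.02]  # made-up values
for step in (0, 500, 1000):
    lam = anneal_factor(step, ramp_steps=1000)
    print(step, lam, [round(w, 3) for w in annealed_weights(latent, lam)])
```

At lam = 0 the effective weights are the latent weights; at lam = 1 they take only the values -scale, 0 and +scale, so the model is eased into the ternary regime rather than starting in it.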

Full mechanistic analysis, weight-flip diagnostics and throughput projections are in the main repository README.

License

Apache-2.0, the same license as the source code at github.com/Cukinator/1.58bits.