	---
	language:
	- en
	license: apache-2.0
	tags:
	- pytorch
	- text-generation
	- 1.58bit
	- ternary
	- byte-level
	- mlgru
	- ablation
	library_name: pytorch
	pipeline_tag: text-generation
	model_type: custom
	---

	# CPU-1 Ablation Study — Ready-to-use Checkpoints (fp32 unpacked)

	Repo: `Cukinator/cpu1-ablations-final`
	Source: [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
	Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
	Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)

	Each checkpoint is a standard PyTorch `.pt` file with float32 weights —
	no unpacking needed. Compatible with `train_ablation.py` from the
	[1.58bits repo](https://github.com/Cukinator/1.58bits).

	Currently uploaded: 20 trained runs. The 39M `*_r2` runs have been started
	but are not unpacked to fp32 here, and the four `*_r3` configs are queued
	but not yet trained.

	## Quick start

	```python
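	import sys

	import torch

	# Make train_ablation.py importable; point this at your clone of the 1.58bits repo.
	sys.path.insert(0, "/path/to/1.58bits")
	from train_ablation import build_ablation_model, generate

	# Each checkpoint bundles the run config with the fp32 state dict.
	ckpt = torch.load("run_02/model.pt", map_location="cpu")
	model = build_ablation_model(ckpt["config"])
	model.load_state_dict(ckpt["state_dict"])
	model.eval()  # switch to inference mode

	# Sample a continuation of the prompt on CPU.
	text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
	print(text)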
	```

	## Ablation chain — 50M runs (round 1, 2 tok/param)

	| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
	|-----|-------------|-------:|---------:|-----------:|----------------------:|
	| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | 4.66 | 106.1 | 72.4 |
	| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | 2.31 | 10.1 | 84.8 |
	| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | 1.72 | 5.56 | 75.4 |
	| run_03 | MLGRU + Byte + FP16 | 38.8M | 1.87 | 6.49 | 58.0 |
	| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
	| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
	| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
	| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
	| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
	| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
	| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
	| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |

	## Small runs — 10M (round 1, 2 tok/param)

	| Run | Architecture | Params | Val Loss | Perplexity |
	|-----|-------------|-------:|---------:|-----------:|
	| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
	| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
	| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
	| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
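
The perplexity columns above can be cross-checked against the validation losses, since perplexity is the exponential of a mean cross-entropy in nats. The sketch below uses only values copied from the tables. It flags `run_13`, whose listed perplexity is close to exp(20) ≈ 4.85×10⁸ rather than exp(30.54). One possible explanation is a cap applied in the evaluation code, but that is a guess and has not been checked against the repository.

```python
import math

# (run, reported val loss in nats, reported perplexity), copied from the tables above.
REPORTED = [
    ("run_02", 1.72, 5.56),
    ("run_03", 1.87, 6.49),
    ("run_04", 5.57, 261.7),
    ("run_14", 5.58, 263.9),
    ("run_13", 30.54, 4.85e8),
]


def perplexity(val_loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy given in nats."""
    return math.exp(val_loss_nats)


for run, loss, reported in REPORTED:
    implied = perplexity(loss)
    rel_err = abs(implied - reported) / reported
    # Losses are rounded to two decimals, so allow a small tolerance.
    flag = "" if rel_err < 0.02 else "  <-- does not match exp(loss)"
    print(f"{run}: exp({loss}) = {implied:.4g}, reported {reported:.4g}{flag}")
```

Small mismatches in the last digit are expected because the table losses are rounded before exponentiation.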

	## Round 2 — re-runs at 15 tok/param (7.5× the original budget)

	These runs repeat round 1 at 15 tok/param, 7.5× the original 2 tok/param
	budget, to test whether the ~5.55 nat floor on the ternary runs was caused
	by under-training.

	### 10M re-runs (1336 steps each, all uploaded)

	| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
	|-----|-------------|--------------:|--------------:|--:|
	| run_13_r2 | Small CPU-1 + BPE | 30.54 | 25.43 | −5.1 |
	| run_14_r2 | Small CPU-1 byte-level | 5.5754 | 5.5755 | +0.0001 |
	| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | 5.5755 | 0 |
	| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | 5.5754 | 0 |

	### 39M re-runs (in progress)

	`run_04_r2`, `run_05_r2`, `run_07_r2`, `run_08_r2`, `run_09_r2` and
	`run_10_r2` were started at 15 tok/param. As of 2026-05, only `run_04_r2`
	and `run_07_r2` have intermediate step checkpoints (≈97% of training).
	None of the six has been unpacked to fp32 in this repo, because the audit
	below showed them converging to the same uniform floor as their r1
	counterparts. The existing step checkpoints remain available in the source
	compact_2bit repo
	[`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).

	Manually measured byte-level cross-entropy loss (nats) on the latest
	`run_04_r2` and `run_07_r2` step checkpoints, with the FP16 and round 1
	runs measured on the same sample of English narrative for reference:

	| Run | Loss (sample) | Δ vs uniform (5.545) |
	|-----|--------------:|--------------------:|
	| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
	| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
	| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
	| run_04_r2 (Ternary r2, step 4840 / ~5000) | 5.57 | +0.02 |
	| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
	| run_07_r2 (CPU-1 complete r2, step 4860) | 5.56 | +0.02 |

	## Round 3 — cold-start rescue (queued, not uploaded)

	After the audit, four configurations were added to `RUN_CONFIGS` /
	`SMALL_RUN_CONFIGS` that apply all four corrections in concert:

	1. bf16 AMP instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
	2. `lr_scale=2.0` on BitLinear (BitNet b1.58 §3.1 prescription)
	3. CE-only training signal (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
	4. 50 tok/param (3.3× r2, close to BitNet's lower bound at 3B scale)

	Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. Not yet
	trained. If they reach val_loss < ~5.0 they will be uploaded here.
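
For orientation, the tokens-per-parameter ratios of the three rounds translate into absolute budgets as sketched below. This assumes that the budget is computed from the parameter counts listed in the tables and that one token is one byte for the byte-level runs. Neither is stated explicitly here, so treat the absolute numbers as approximate.

```python
# Tokens-per-parameter ratios named in this README for each round.
TOK_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}


def token_budget(n_params: float, tok_per_param: float) -> float:
    """Total training tokens implied by a tokens-per-parameter ratio."""
    return n_params * tok_per_param


def budget_ratio(round_a: str, round_b: str) -> float:
    """How many times larger round_a's budget is than round_b's."""
    return TOK_PER_PARAM[round_a] / TOK_PER_PARAM[round_b]


# Parameter counts taken from the round 1 tables (assumed to drive the budget).
for name, n_params in [("run_14 (10.7M)", 10.7e6), ("run_04 (38.9M)", 38.9e6)]:
    for rnd, tpp in TOK_PER_PARAM.items():
        print(f"{name} {rnd}: {token_budget(n_params, tpp) / 1e6:.0f}M tokens")

print(f"r2 / r1 = {budget_ratio('r2', 'r1'):.1f}x")
print(f"r3 / r2 = {budget_ratio('r3', 'r2'):.1f}x")
```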

	## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline

	The reference value is ln(256) = 5.545 nats, the entropy of a uniform
	distribution over the 256 byte vocabulary. All byte-level ternary runs
	(round 1 and round 2, both scales) plateau within 0.05 nats of this
	floor — the models are effectively producing uniform predictions over bytes.
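
The uniform baseline is easy to reproduce: a predictor that assigns probability 1/256 to every byte scores ln(256) nats on any text. The sketch below checks this and prints the gap between that floor and a few validation losses copied from the tables above. The helper name `cross_entropy_nats` is illustrative and is not taken from the repository.

```python
import math

VOCAB_SIZE = 256  # one class per possible byte value


def cross_entropy_nats(probs_for_targets):
    """Mean negative log-likelihood in nats, given the probability that the
    model assigned to each correct next byte."""
    return -sum(math.log(p) for p in probs_for_targets) / len(probs_for_targets)


# A uniform predictor gives every byte the same probability, whatever the text.
text = "The quick brown fox".encode("utf-8")
uniform_probs = [1.0 / VOCAB_SIZE] * len(text)
uniform_floor = cross_entropy_nats(uniform_probs)
print(f"uniform floor: {uniform_floor:.3f} nats")

# Validation losses copied from the round 1 tables.
VAL_LOSS = {
    "run_02 (FP16)": 1.72,
    "run_04 (ternary)": 5.57,
    "run_05b_kernel_strict (ternary)": 5.59,
    "run_09 (ternary)": 5.53,
}
for run, loss in VAL_LOSS.items():
    print(f"{run}: {loss - uniform_floor:+.3f} nats vs uniform")
```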

	More training does not help. A 7.5× increase in tokens per parameter
	moves the byte-level loss by at most 0.0001 nats. A weight-evolution audit
	on `run_14_r2` showed that **>99.9% of BitLinear weights are in the same
	ternary state at step 10 and at step 1326**, which spans essentially the
	whole 1336-step run. The bottleneck is the cold-start dynamics of
	straight-through-estimator (STE) ternary quantisation, not the data budget.
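
A weight-evolution check of this kind can be sketched as follows. This is a minimal illustration, not the audit script from the repository: it assumes the absmean ternarisation described for BitNet b1.58 (divide by the mean absolute weight, round, clip to {−1, 0, +1}), and it uses short toy lists in place of flattened BitLinear weight tensors from two checkpoints. The names `ternarize` and `same_state_fraction` are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarisation (assumed): scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def same_state_fraction(weights_a, weights_b):
    """Fraction of positions whose ternary state is identical in both snapshots."""
    if len(weights_a) != len(weights_b):
        raise ValueError("snapshots must have the same number of weights")
    qa, qb = ternarize(weights_a), ternarize(weights_b)
    same = sum(1 for a, b in zip(qa, qb) if a == b)
    return same / len(qa)


# Toy stand-ins for one layer's latent fp32 weights at an early and a late step.
early = [0.30, -0.02, -0.41, 0.05, 0.22, -0.18]
late = [0.31, -0.01, -0.40, 0.06, 0.05, -0.19]

print(ternarize(early))
print(ternarize(late))
print(f"same ternary state: {same_state_fraction(early, late):.1%}")
```

Applied to real checkpoints, the same comparison would run per BitLinear layer and aggregate over all layers. A value above 99.9% means the latent weights almost never cross a rounding boundary during training.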

	Compare against the FP16 sibling architectures:

	- run_02 (Transformer + byte + FP16, 38.8M): val loss 1.72
	- run_03 (MLGRU + byte + FP16, 38.8M): val loss 1.87
	- run_04 (MLGRU + byte + ternary, 38.9M): val loss 5.57 ← collapsed

	Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
	Slender-Mamba with FP16 warm-start) reach functional performance, but the
	cold-start regime at <50 tok/param and <100M parameters that this study
	operates in is not covered by their results.

	Throughput is also worth flagging: even with ideal BitNet.cpp /
	T-MAC-class kernels (4–6× speedup on the matmul fraction), the projected
	end-to-end throughput of the 39M ternary models is 0.30–0.50× that of
	their FP16 Transformer sibling at the same scale. The 1.58-bit speed
	advantage only materialises above ~700M parameters, where weight RAM
	bandwidth becomes the bottleneck.
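
The reason a 4–6× kernel speedup does not carry through to end-to-end throughput can be illustrated with an Amdahl-style estimate: only the matmul share of the runtime gets faster. The sketch below is not the projection method used in the main repository. The matmul fraction is a hypothetical placeholder rather than a measured value, and `run_02` is assumed to be the FP16 Transformer sibling.

```python
def amdahl_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only the matmul share of runtime is accelerated."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


# Measured CPU throughput (P/s) copied from the round 1 table.
FP16_TRANSFORMER = 75.4  # run_02, assumed to be the "FP16 Transformer sibling"
TERNARY_MLGRU = 9.1      # run_04

# HYPOTHETICAL: share of run_04's runtime spent in matmuls. Not measured here.
MATMUL_FRACTION = 0.85

for kernel_speedup in (4.0, 6.0):
    projected = TERNARY_MLGRU * amdahl_speedup(MATMUL_FRACTION, kernel_speedup)
    ratio = projected / FP16_TRANSFORMER
    print(f"{kernel_speedup:.0f}x kernels: {projected:.1f} P/s, {ratio:.2f}x of run_02")
```

The result is sensitive to the assumed matmul fraction, so the placeholder should be replaced with a profiled value before drawing conclusions.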

	> Practical implication. The byte-level ternary chain (runs 04–10, 14–16,
	> and their r2 counterparts) cannot distinguish between architectural
	> variants because all of them are stuck at the same numerical floor. The
	> architectures themselves may be sound; the training recipe needs to
	> either (a) warm-start from FP weights, (b) anneal the quantisation, or
	> (c) use a much larger token budget.

	Full mechanistic analysis, weight-flip diagnostics and throughput
	projections are in the
	[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).

	## License

	Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).