@@ -1,121 +1,168 @@
----
-language:
-  - en
-license: apache-2.0
-tags:
-  - pytorch
-  - text-generation
-  - 1.58bit
-  - ternary
-  - byte-level
-  - mlgru
-  - ablation
-library_name: pytorch
-pipeline_tag: text-generation
-model_type: custom
----
-# CPU-1 Ablation Study — Ready-to-use Checkpoints
-Repo: `Cukinator/cpu1-ablations-final`
-Source (compact 2-bit): `Cukinator/cpu1-ablation-checkpoints`
-Each checkpoint is a standard PyTorch `.pt` file with **float32 weights** —
-no unpacking needed. Compatible with `train_ablation.py` from the
-[1.58bits repo](https://github.com/Cukinator/1.58bits).
-20 progressive ablation runs across two scales (50M and 10M parameters),
-with a round-2 re-run at 15 tok/param for the runs that failed to converge.
-## Quick start
-```python
-import torch, sys
-sys.path.insert(0, "/path/to/1.58bits")
-from train_ablation import build_ablation_model, generate
-ckpt = torch.load("run_02/model.pt", map_location="cpu")
-model = build_ablation_model(ckpt["config"])
-model.load_state_dict(ckpt["state_dict"])
-model.eval()
-text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
-print(text)
-```
-## Ablation chain — 50M runs (round 1: 2 tok/param)
-| Run | Architecture | Params | Val Loss | Perplexity |
-|-----|-------------|--------|----------|-----------|
-| run_01 | Transformer + BPE + FP16 (baseline) | 54.7M | 4.66 | 106.1 |
-| run_02a | Transformer + Byte + 4 heads (no LBD) | 38.5M | 2.31 | 10.1 |
-| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** |
-| run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | **6.49** |
-| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 |
-| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 |
-| run_05b | MLGRU kernel strict | 35.8M | 5.59 | 268.8 |
-| run_06 | + BolmoPatchEmbedding | 39.0M | 5.56 | 258.8 |
-| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 |
-| run_08 | Folded Transformer + Byte + Ternary (sibling) | 38.9M | 5.56 | 260.4 |
-| run_09 | + PFNet (CPU-1 + nonlinear residual) | 39.4M | 5.53 | 252.9 |
-| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 |
-## Small runs — 10M (round 1: 2 tok/param)
-| Run | Architecture | Params | Val Loss | Perplexity |
-|-----|-------------|--------|----------|-----------|
-| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
-| run_14 | Small CPU-1 byte-level | 10.7M | 5.58 | 263.9 |
-| run_15 | Small CPU-1 byte + hidden distillation | 10.7M | 5.58 | 263.9 |
-| run_16 | Small CPU-1 raw bytes, no teacher | 10.7M | 5.58 | 263.9 |
-## Round 2 — re-runs at 15 tok/param (7.5× the original budget)
-To test whether the ~5.55 floor on ternary runs was caused by
-under-training, the runs below were re-trained with the same architecture
-but **7.5× more tokens per parameter** (15 tok/param vs the original 2).
-### Small runs (10M, 1336 steps each)
-| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
-|-----|-------------|---------------|---------------|---|
-| run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
-| run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
-| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
-| run_16_r2 | Small CPU-1 raw bytes, no teacher | 5.5754 | **5.5754** | 0 |
-### 50M runs (15 tok/param, still in progress)
-run_04_r2, run_05_r2, run_07_r2, run_08_r2, run_09_r2, run_10_r2 — to be
-uploaded as they finish training on T4.
-## Key finding: ternary cold-start collapses to the uniform baseline
-The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
-distribution over the 256 byte vocabulary. All byte-level ternary runs
-(round 1 *and* round 2) plateau within 0.05 nats of this floor — i.e. the
-models are effectively producing uniform predictions over bytes.
-**More training does not help.** A 7.5× increase in tokens-per-parameter
-moves the loss by 0.0001 nats. The bottleneck is the cold-start dynamics
-of straight-through estimator (STE) ternary quantization, not the data
-budget. Compare against the FP16 sibling architectures:
-- run_02 (Transformer + byte + FP16, 38M): **1.72**
-- run_03 (MLGRU + byte + FP16, 38M): **1.87**
-- run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** ← collapsed
-The FP16 runs converge with 2 tok/param. The ternary versions don't, even
-at 15 tok/param. Published 1.58-bit models (BitNet b1.58, Slender-Mamba)
-report functional results at ≥50 tok/param — and even those typically
-warm-start from FP checkpoints or apply quantization annealing.
-**Practical implication**: the ablation chain (runs 04–10, 14–16) cannot
-distinguish between architectural variants because all of them are stuck
-at the same numerical floor. The architectures themselves may be sound;
-the training recipe needs to either (a) warm-start from FP weights,
-(b) anneal the quantization, or (c) use a much larger token budget.
-> **References to be aware of**: FP16 runs (01–03) are valid trained
-> models. Ternary runs (04–10, 14–16, and their r2 counterparts) reach
-> the uniform-output floor and are useful only as negative controls.

+---
+language:
+  - en
+license: apache-2.0
+tags:
+  - pytorch
+  - text-generation
+  - 1.58bit
+  - ternary
+  - byte-level
+  - mlgru
+  - ablation
+library_name: pytorch
+pipeline_tag: text-generation
+model_type: custom
+---
+# CPU-1 Ablation Study — Ready-to-use Checkpoints (fp32 unpacked)
+Repo:        `Cukinator/cpu1-ablations-final`
+Source:      [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
+Code:        [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
+Dataset:     [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)
+Each checkpoint is a standard PyTorch `.pt` file with **float32 weights** —
+no unpacking needed. Compatible with `train_ablation.py` from the
+[1.58bits repo](https://github.com/Cukinator/1.58bits).
+Currently uploaded: **20 trained runs**. The 39M `*_r2` runs have not been
+unpacked to fp32 yet, and the four `*_r3` configs are queued but not yet trained.
+## Quick start
+```python
+import torch, sys
+sys.path.insert(0, "/path/to/1.58bits")
+from train_ablation import build_ablation_model, generate
+ckpt = torch.load("run_02/model.pt", map_location="cpu")
+model = build_ablation_model(ckpt["config"])
+model.load_state_dict(ckpt["state_dict"])
+model.eval()
+text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
+print(text)
+```
+## Ablation chain — 50M runs (round 1, 2 tok/param)
+| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
+|-----|-------------|-------:|---------:|-----------:|----------------------:|
+| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | **4.66** | 106.1 | 72.4 |
+| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | **2.31** | 10.1 | 84.8 |
+| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** | 75.4 |
+| run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | 6.49 | 58.0 |
+| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
+| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
+| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
+| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
+| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
+| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
+| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
+| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |
+## Small runs — 10M (round 1, 2 tok/param)
+| Run | Architecture | Params | Val Loss | Perplexity |
+|-----|-------------|-------:|---------:|-----------:|
+| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
+| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
+| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
+| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
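The Perplexity columns appear to follow from the validation loss via perplexity = exp(loss), with the loss in nats. The sketch below recomputes a few rows under that assumption; the helper name `perplexity` is illustrative and not taken from `train_ablation.py`. Small mismatches (for example about 5.58 here against 5.56 in the table for run_02) are expected because the tabulated losses are rounded to two decimals. For run_13, exp(30.54) is roughly 1.8×10¹³, whereas the tabulated 4.85×10⁸ equals exp(20). That would be consistent with the loss being capped at 20 before exponentiation, which is a guess rather than something the source states.

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)

# Validation losses copied from the tables above (rounded to two decimals).
val_loss = {"run_02": 1.72, "run_03": 1.87, "run_04": 5.57, "run_13": 30.54}

for run, loss in val_loss.items():
    print(f"{run}: loss={loss:.2f} nats -> perplexity={perplexity(loss):.4g}")

# Hypothetical cap (an assumption): exp(20) lands on the value reported for run_13.
print(f"exp(20) = {perplexity(20.0):.3e}")
```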
+## Round 2 — re-runs at 15 tok/param (7.5× the original budget)
+These runs were re-trained at **15 tok/param**, up from 2 in round 1, to test
+whether the ~5.55 nat floor on the ternary runs was caused by under-training.
+### 10M re-runs (1336 steps each, all uploaded)
+| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
+|-----|-------------|---------------|---------------|---|
+| run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
+| run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
+| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
+| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | **5.5754** | 0 |
+### 39M re-runs (in progress)
+`run_04_r2`, `run_05_r2`, `run_07_r2`, `run_08_r2`, `run_09_r2`, `run_10_r2`
+were started at 15 tok/param. As of 2026-05, only `run_04_r2` and `run_07_r2`
+have intermediate step checkpoints (≈97% of training). None of these runs
+has been unpacked to fp32 in this repo, because the audit below showed
+they were converging to the same uniform floor as their r1 counterparts.
+The checkpoints remain available in the compact 2-bit source repo
+[`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).
+Manually measured byte-level cross-entropy (CE) loss, in nats, on a sample of
+English narrative, using the latest `run_04_r2` and `run_07_r2` step checkpoints:
+| Run | Sample loss (nats) | Δ vs uniform (5.545) |
+|-----|--------------:|--------------------:|
+| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
+| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
+| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
+| **run_04_r2 (Ternary r2, step 4840 / ~5000)** | **5.57** | **+0.02** |
+| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
+| **run_07_r2 (CPU-1 complete r2, step 4860)** | **5.56** | **+0.02** |
+## Round 3 — cold-start rescue (queued, not uploaded)
+After the audit, four configurations were added to `RUN_CONFIGS` /
+`SMALL_RUN_CONFIGS`. Each applies the following four corrections together:
+1. **bf16 AMP** instead of fp16 (avoids GradScaler underflow in the BitLinear rescale chain)
+2. **`lr_scale=2.0` on BitLinear** (BitNet b1.58 §3.1 prescription)
+3. **CE-only training signal** (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
+4. **50 tok/param** (3.3× r2, close to BitNet's lower bound at 3B scale)
+Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. Not yet
+trained. If they reach val_loss < ~5.0 they will be uploaded here.
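The budgets behind each round can be made concrete with a little arithmetic. This sketch assumes the training budget is simply tok/param multiplied by the parameter count, and that one token is one byte for the byte-level runs. Both are assumptions, and `token_budget` is an illustrative name rather than a function from the repository.

```python
def token_budget(params: float, tok_per_param: float) -> float:
    """Total training tokens implied by a tokens-per-parameter ratio."""
    return params * tok_per_param

# tok/param per round, as stated in this README.
ROUNDS = {"r1": 2, "r2": 15, "r3": 50}
# Parameter counts taken from the tables above.
SCALES = {"10.7M (run_14)": 10.7e6, "38.9M (run_04)": 38.9e6}

for name, params in SCALES.items():
    for rnd, tpp in ROUNDS.items():
        print(f"{name} {rnd}: {token_budget(params, tpp) / 1e6:.1f}M tokens")

print(f"r2 / r1 = {ROUNDS['r2'] / ROUNDS['r1']:.1f}x, "
      f"r3 / r2 = {ROUNDS['r3'] / ROUNDS['r2']:.1f}x")
```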
+## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline
+The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
+distribution over the 256 byte vocabulary. All byte-level ternary runs
+(round 1 *and* round 2, both scales) plateau within 0.05 nats of this
+floor — the models are effectively producing uniform predictions over bytes.
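The ln(256) floor can be checked directly: a model that assigns probability 1/256 to every byte incurs exactly ln(256) nats per byte, whatever the text. A minimal, self-contained sketch follows; `mean_byte_nll` and `uniform_probs` are illustrative names, not functions from the repository.

```python
import math

VOCAB = 256  # byte vocabulary size

def uniform_probs(_context: bytes) -> list[float]:
    """A 'collapsed' predictor: the same probability for every byte value."""
    return [1.0 / VOCAB] * VOCAB

def mean_byte_nll(text: bytes, predict) -> float:
    """Mean negative log-likelihood (nats per byte) of `text` under `predict`."""
    total = 0.0
    for i, b in enumerate(text):
        probs = predict(text[:i])   # distribution over the next byte
        total -= math.log(probs[b])
    return total / len(text)

floor = math.log(VOCAB)
sample = "The quick brown fox".encode("utf-8")
print(f"ln(256) = {floor:.4f} nats = {floor / math.log(2):.1f} bits per byte")
print(f"uniform model on sample: {mean_byte_nll(sample, uniform_probs):.4f} nats")

# Gap to the floor for validation losses quoted in this README.
for run, loss in {"run_04": 5.57, "run_05b": 5.59, "run_09": 5.53, "run_14": 5.58}.items():
    print(f"{run}: {loss - floor:+.3f} nats vs uniform")
```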
+**More training does not help.** A 7.5× increase in tokens-per-parameter
+moves the loss by 0.0001 nats. A weight-evolution audit on `run_14_r2`
+showed that **>99.9% of BitLinear weights are in the same ternary state
+at step 10 and at step 1326**, i.e. across essentially the whole
+1336-step run. The bottleneck is the cold-start dynamics of
+straight-through-estimator (STE) ternary quantisation, not the data budget.
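The audit compares the ternary state of each BitLinear weight between two checkpoints. A minimal sketch of that comparison is below. It assumes absmean ternarisation in the style of BitNet b1.58 (divide by the mean absolute weight, round, clip to {-1, 0, +1}). Whether the repository's BitLinear quantises in exactly this way is an assumption, and the function names and toy weights are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarisation: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]

def unchanged_fraction(early, late):
    """Fraction of weights whose ternary state is identical in both snapshots."""
    q_early, q_late = ternarize(early), ternarize(late)
    same = sum(a == b for a, b in zip(q_early, q_late))
    return same / len(q_early)

# Toy latent (full-precision) weights standing in for two checkpoints.
step_10 = [0.90, -0.80, 0.05, 0.40, -0.02, -1.10]
step_late = [0.91, -0.79, 0.06, 0.41, -0.03, -1.09]  # small drift, no state changes
flipped = [0.90, -0.80, 0.50, 0.40, -0.02, -1.10]    # third weight crosses the rounding threshold

print("ternary state at step 10:", ternarize(step_10))
print("unchanged after small drift:", unchanged_fraction(step_10, step_late))
print("unchanged after one flip:  ", unchanged_fraction(step_10, flipped))
```

In a real audit the lists would be replaced by the flattened latent weights of each BitLinear layer loaded from two step checkpoints. An unchanged fraction near 1.0 over a whole run means the latent weights never moved far enough to cross a rounding threshold.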
+Compare against the FP16 sibling architectures:
+- run_02 (Transformer + byte + FP16, 38M): **1.72**
+- run_03 (MLGRU + byte + FP16, 38M): **1.87**
+- run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** ← collapsed
+Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
+Slender-Mamba with FP16 warm-start) reach functional performance, but the
+cold-start regime at <50 tok/param and <100M parameters that this study
+operates in is not covered by their results.
+**Throughput** is also worth flagging. Even with ideal BitNet.cpp /
+T-MAC-class kernels (4–6× speedup on the matmul fraction), the projected
+end-to-end throughput of the 39M ternary models is **0.30–0.50×** that of
+their FP16 Transformer sibling at the same scale. The 1.58-bit speed
+advantage only materialises above ~700M parameters, where weight-memory
+bandwidth becomes the bottleneck.
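One way to reason about a projection of this kind is Amdahl's law: a kernel that speeds up only the matmul fraction of the runtime yields a smaller end-to-end gain. The sketch below applies it to the measured run_04 / run_02 throughput ratio from the table above. The matmul fractions (0.8 and 0.9) are placeholders chosen for illustration, not measurements from the audit, and this is not necessarily how the projection in the main repository was derived.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

# Measured CPU throughput from the round-1 table: run_04 (ternary) vs run_02 (FP16).
measured_ratio = 9.1 / 75.4

# (hypothetical matmul fraction, kernel speedup on that fraction)
for frac, kernel in [(0.8, 4.0), (0.9, 6.0)]:
    projected = measured_ratio * amdahl_speedup(frac, kernel)
    print(f"matmul fraction {frac:.1f}, kernel {kernel:.0f}x -> "
          f"{projected:.2f}x of FP16 throughput")
```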
+> **Practical implication.** The byte-level ternary chain (runs 04–10, 14–16,
+> and their r2 counterparts) cannot distinguish between architectural
+> variants because all of them are stuck at the same numerical floor. The
+> architectures themselves may be sound; the training recipe needs to
+> either (a) warm-start from FP weights, (b) anneal the quantisation, or
+> (c) use a much larger token budget.
+Full mechanistic analysis, weight-flip diagnostics and throughput
+projections are in the
+[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).
+## License
+Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).