---
language:
- en
license: apache-2.0
tags:
- pytorch
- text-generation
- 1.58bit
- ternary
- byte-level
- mlgru
- ablation
library_name: pytorch
pipeline_tag: text-generation
model_type: custom
---

# CPU-1 Ablation Study: Ready-to-use Checkpoints (fp32 unpacked)

- Repo: `Cukinator/cpu1-ablations-final`
- Source: [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
- Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
- Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)

Each checkpoint is a standard PyTorch `.pt` file with **float32 weights**,
so no unpacking is needed. The checkpoints are compatible with
`train_ablation.py` from the
[1.58bits repo](https://github.com/Cukinator/1.58bits).

Currently uploaded: **20 trained runs**. Not included: the 39M `*_r2` runs
(started, not yet unpacked to fp32) and the four `*_r3` configs (queued,
not yet trained).

## Quick start

```python
import sys

import torch

# Make train_ablation.py from the 1.58bits repo importable.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate

# Each checkpoint stores the run config next to the fp32 state dict.
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()

text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)
```

## Ablation chain: 50M runs (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
|-----|-------------|-------:|---------:|-----------:|----------------------:|
| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | **4.66** | 106.1 | 72.4 |
| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | **2.31** | 10.1 | 84.8 |
| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** | 75.4 |
| run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | 6.49 | 58.0 |
| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |

## Small runs: 10M (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity |
|-----|-------------|-------:|---------:|-----------:|
| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |

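
The perplexity columns appear to be the exponential of the validation loss in nats. The sketch below checks that relationship against a few rows from the tables above; small mismatches are expected because the tabulated losses are rounded to two decimals. The `run_13` entry is worth a second look: exp(30.54) is about 1.8×10¹³, whereas the tabulated 4.85×10⁸ equals exp(20), which would be consistent with the loss being clamped at 20 before exponentiation. That clamp is an inference from the numbers, not something this card states.

```python
import math

# (val_loss, tabulated perplexity) pairs copied from the result tables.
rows = {
    "run_02a_byte_only_heads": (2.31, 10.1),
    "run_03": (1.87, 6.49),
    "run_04": (5.57, 261.7),
}


def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(cross-entropy) when the loss is measured in nats."""
    return math.exp(loss_nats)


for name, (loss, table_ppl) in rows.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (table: {table_ppl})")

# run_13: the raw exponential is far larger than the tabulated value, while
# exp(20) matches it. A clamp at 20 is an assumption, not a documented fact.
print(f"exp(30.54) = {perplexity(30.54):.3e}")
print(f"exp(20)    = {perplexity(20.0):.3e}")
```
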
## Round 2: re-runs at 15 tok/param (7.5× the original budget)

These runs were re-trained at **15 tok/param**, 7.5× the round-1 budget of
2 tok/param, to test whether the ~5.55 nat floor on the ternary runs was
caused by under-training.

### 10M re-runs (1336 steps each, all uploaded)

| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
|-----|-------------|---------------|---------------|---|
| run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
| run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | **5.5754** | 0 |

### 39M re-runs (in progress)

`run_04_r2`, `run_07_r2`, `run_05_r2`, `run_08_r2`, `run_09_r2` and
`run_10_r2` were started at 15 tok/param. As of 2026-05, only `run_04_r2`
and `run_07_r2` have intermediate step checkpoints (≈97% of training).
None of these runs has been unpacked to fp32 in this repo, because the
audit below showed them converging to the same uniform floor as their r1
counterparts. They remain available in the compact 2-bit source repo
[`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).

Byte-level cross-entropy (CE) loss, measured manually on the latest
`run_04_r2` and `run_07_r2` step checkpoints with a sample of English
narrative text:

| Run | Loss (sample) | Δ vs uniform (5.545) |
|-----|--------------:|--------------------:|
| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
| **run_04_r2 (Ternary r2, step 4840 / ~5000)** | **5.57** | **+0.02** |
| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
| **run_07_r2 (CPU-1 complete r2, step 4860)** | **5.56** | **+0.02** |

## Round 3: cold-start rescue (queued, not uploaded)

After the audit, four configurations were added to `RUN_CONFIGS` /
`SMALL_RUN_CONFIGS`. Each applies all four corrections together:

1. **bf16 AMP** instead of fp16 (avoids GradScaler underflow on the BitLinear rescale chain)
2. **`lr_scale=2.0` on BitLinear** (BitNet b1.58 §3.1 prescription)
3. **CE-only training signal** (drops the 90/10 KL distillation and the boundary/emb_align auxiliary losses)
4. **50 tok/param** (3.3× r2, close to BitNet's lower bound at 3B scale)

Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. None has been
trained yet. Any that reaches val_loss < ~5.0 will be uploaded here.

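
For scale, the token budget implied by each round is tokens-per-parameter times parameter count. The sketch below uses the parameter counts from the result tables (39.0M for `run_07`, 10.7M for `run_14`). The resulting token totals are derived here for illustration and are not figures reported by this card.

```python
# Tokens-per-parameter budget for each round, as stated in this card.
TOK_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}

# Parameter counts taken from the result tables.
PARAMS = {"run_07": 39.0e6, "run_14": 10.7e6}


def token_budget(params: float, tok_per_param: float) -> float:
    """Total training tokens implied by a tokens-per-parameter budget."""
    return params * tok_per_param


for run, params in PARAMS.items():
    for rnd, tpp in TOK_PER_PARAM.items():
        print(f"{run} ({rnd}): {token_budget(params, tpp) / 1e6:,.0f}M tokens")

# Ratios between rounds quoted in the text.
print(f"r2 / r1 = {TOK_PER_PARAM['r2'] / TOK_PER_PARAM['r1']:.1f}x")
print(f"r3 / r2 = {TOK_PER_PARAM['r3'] / TOK_PER_PARAM['r2']:.1f}x")
```
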
## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline

The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
distribution over the 256-value byte vocabulary. All byte-level ternary runs
(round 1 *and* round 2, both scales) plateau within 0.05 nats of this
floor, which means the models are effectively producing uniform
predictions over bytes.

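
As a quick check of the reference value, the cross-entropy of a uniform predictor over 256 byte values can be computed directly. The sketch below uses only the standard library; `byte_cross_entropy` is an illustrative helper, not a function from the 1.58bits repo.

```python
import math

VOCAB = 256  # byte-level vocabulary size


def byte_cross_entropy(probs_per_step, targets):
    """Mean cross-entropy in nats; probs_per_step[t][b] is P(byte b) at step t."""
    total = -sum(math.log(p[b]) for p, b in zip(probs_per_step, targets))
    return total / len(targets)


text = b"The quick brown fox"
# A model that has collapsed to uniform assigns 1/256 to every byte value.
uniform = [[1.0 / VOCAB] * VOCAB for _ in text]

loss = byte_cross_entropy(uniform, list(text))
print(f"uniform baseline: {loss:.4f} nats, ln(256) = {math.log(VOCAB):.4f}")

# Gap between a measured loss and the baseline, e.g. run_04's val loss of 5.57.
print(f"run_04 gap: {5.57 - math.log(VOCAB):+.3f} nats")
```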
**More training does not help.** A 7.5× increase in tokens per parameter
moves the byte-level loss by at most 0.0001 nats. A weight-evolution audit
on `run_14_r2` showed that **>99.9% of BitLinear weights are in the same
ternary state at step 10 and at step 1326**, that is, across the entire
training run. The bottleneck is the cold-start dynamics of
straight-through-estimator (STE) ternary quantisation, not the data budget.

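
One way such a weight-evolution audit could be run is to ternarise the latent weights at two training steps and count how many entries keep the same state. The sketch below assumes the absmean quantiser described in the BitNet b1.58 paper (divide by the mean absolute value, round, clip to {−1, 0, +1}); the quantiser in `train_ablation.py` and the actual audit script may differ. The weight values are made up for illustration.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarisation (assumed): round(w / mean|w|), clipped to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def unchanged_fraction(weights_a, weights_b):
    """Fraction of entries whose ternary state is identical in both snapshots."""
    ta, tb = ternarize(weights_a), ternarize(weights_b)
    same = sum(1 for a, b in zip(ta, tb) if a == b)
    return same / len(ta)


# Toy latent weights at an early and a late step (hypothetical values).
early = [0.90, -0.75, 0.05, 0.40, -0.02, -1.10]
late = [0.88, -0.80, 0.35, 0.45, -0.01, -1.05]

print("early:", ternarize(early))
print("late: ", ternarize(late))
print(f"unchanged: {unchanged_fraction(early, late):.1%}")
```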
Compare against the FP16 sibling architectures (round-1 validation loss):

- run_02 (Transformer + byte + FP16, 38M): **1.72**
- run_03 (MLGRU + byte + FP16, 38M): **1.87**
- run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** (collapsed)

Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
Slender-Mamba with FP16 warm-start) reach functional performance, but
their results do not cover the cold-start regime of <50 tok/param and
<100M parameters in which this study operates.

**Throughput** is also worth flagging. Even with ideal BitNet.cpp /
T-MAC class kernels (4–6× speedup on the matmul fraction), the projected
end-to-end throughput of the 39M ternary models is **0.30–0.50×** that of
their FP16 Transformer sibling at the same scale. The 1.58-bit speed
advantage only materialises above ~700M parameters, where weight RAM
bandwidth becomes the bottleneck.

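
A projection of this kind can be approximated with Amdahl's law, since only the matmul share of the runtime is accelerated by the kernel. The sketch below starts from the measured CPU throughputs in the round-1 table (`run_04`: 9.1, `run_02`: 75.4) and applies a 4–6× kernel speedup. The matmul fractions of 0.8 and 0.9 are assumptions chosen for illustration; the inputs behind the projection quoted above are in the main repository README.

```python
def amdahl_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only the matmul share of runtime gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


TERNARY_MEASURED = 9.1  # run_04 throughput from the round-1 table
FP16_REFERENCE = 75.4   # run_02 throughput from the round-1 table

# (matmul fraction, kernel speedup): the fractions are assumed, not measured.
scenarios = [(0.8, 4.0), (0.9, 6.0)]

ratios = []
for fraction, speedup in scenarios:
    projected = TERNARY_MEASURED * amdahl_speedup(fraction, speedup)
    ratios.append(projected / FP16_REFERENCE)
    print(f"f={fraction}, kernel {speedup}x: {projected:.1f} "
          f"({ratios[-1]:.2f}x of run_02)")
```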
> **Practical implication.** The byte-level ternary chain (runs 04–10, 14–16,
> and their r2 counterparts) cannot distinguish between architectural
> variants because all of them are stuck at the same numerical floor. The
> architectures themselves may be sound; the training recipe needs to
> either (a) warm-start from FP weights, (b) anneal the quantisation, or
> (c) use a much larger token budget.

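
Option (b) could be read as blending full-precision and ternary weights with a coefficient that ramps from 0 to 1 during training. The sketch below is one hypothetical schedule, using a linear ramp and the absmean quantiser assumed from the BitNet b1.58 paper. It is not taken from the 1.58bits repo and has not been validated on these runs.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarisation (assumed): scale * clip(round(w / scale), -1, 1)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [scale * max(-1, min(1, round(w / scale))) for w in weights]


def anneal_coeff(step: int, ramp_steps: int) -> float:
    """Linear ramp from 0 (pure FP weights) to 1 (pure ternary weights)."""
    return min(1.0, step / ramp_steps)


def effective_weights(weights, step, ramp_steps):
    """Blend of latent and quantised weights used in the forward pass."""
    lam = anneal_coeff(step, ramp_steps)
    quant = ternarize(weights)
    return [(1.0 - lam) * w + lam * q for w, q in zip(weights, quant)]


latent = [0.90, -0.75, 0.05, 0.40]  # hypothetical latent weights
for step in (0, 500, 1000):
    blended = effective_weights(latent, step, ramp_steps=1000)
    print(step, [round(w, 3) for w in blended])
```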
Full mechanistic analysis, weight-flip diagnostics and throughput
projections are in the
[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).

## License

Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).