Update README: reflect actual upload state (20 runs), r2/r3 status, full audit table
README.md
CHANGED
|
@@ -1,121 +1,168 @@
|
|
| 1 |
-
---
|
| 2 |
-
language:
|
| 3 |
-
- en
|
| 4 |
-
license: apache-2.0
|
| 5 |
-
tags:
|
| 6 |
-
- pytorch
|
| 7 |
-
- text-generation
|
| 8 |
-
- 1.58bit
|
| 9 |
-
- ternary
|
| 10 |
-
- byte-level
|
| 11 |
-
- mlgru
|
| 12 |
-
- ablation
|
| 13 |
-
library_name: pytorch
|
| 14 |
-
pipeline_tag: text-generation
|
| 15 |
-
model_type: custom
|
| 16 |
-
---
|
| 17 |
-
|
| 18 |
-
# CPU-1 Ablation Study: Ready-to-use Checkpoints
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
tags:
|
| 6 |
+
- pytorch
|
| 7 |
+
- text-generation
|
| 8 |
+
- 1.58bit
|
| 9 |
+
- ternary
|
| 10 |
+
- byte-level
|
| 11 |
+
- mlgru
|
| 12 |
+
- ablation
|
| 13 |
+
library_name: pytorch
|
| 14 |
+
pipeline_tag: text-generation
|
| 15 |
+
model_type: custom
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
# CPU-1 Ablation Study: Ready-to-use Checkpoints (fp32 unpacked)
|
| 19 |
+
|
| 20 |
+
Repo: `Cukinator/cpu1-ablations-final`
|
| 21 |
+
Source: [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
|
| 22 |
+
Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
|
| 23 |
+
Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)
|
| 24 |
+
|
| 25 |
+
Each checkpoint is a standard PyTorch `.pt` file with **float32 weights**;
|
| 26 |
+
no unpacking needed. Compatible with `train_ablation.py` from the
|
| 27 |
+
[1.58bits repo](https://github.com/Cukinator/1.58bits).
|
| 28 |
+
|
| 29 |
+
Currently uploaded: **20 trained runs**. The `*_r2` 39M runs are still in progress and
|
| 30 |
+
have not been unpacked; the four `*_r3` configs are queued but not yet trained.
|
| 31 |
+
|
| 32 |
+
## Quick start
|
| 33 |
+
|
| 34 |
+
```python
|
| 35 |
+
import torch, sys
|
| 36 |
+
sys.path.insert(0, "/path/to/1.58bits")
|
| 37 |
+
from train_ablation import build_ablation_model, generate
|
| 38 |
+
|
| 39 |
+
ckpt = torch.load("run_02/model.pt", map_location="cpu")
|
| 40 |
+
model = build_ablation_model(ckpt["config"])
|
| 41 |
+
model.load_state_dict(ckpt["state_dict"])
|
| 42 |
+
model.eval()
|
| 43 |
+
|
| 44 |
+
text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
|
| 45 |
+
print(text)
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
## Ablation chain: 50M runs (round 1, 2 tok/param)
|
| 49 |
+
|
| 50 |
+
| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
|
| 51 |
+
|-----|-------------|-------:|---------:|-----------:|----------------------:|
|
| 52 |
+
| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | **4.66** | 106.1 | 72.4 |
|
| 53 |
+
| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | **2.31** | 10.1 | 84.8 |
|
| 54 |
+
| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** | 75.4 |
|
| 55 |
+
| run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | 6.49 | 58.0 |
|
| 56 |
+
| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
|
| 57 |
+
| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
|
| 58 |
+
| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
|
| 59 |
+
| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
|
| 60 |
+
| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
|
| 61 |
+
| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
|
| 62 |
+
| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
|
| 63 |
+
| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |
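The Perplexity column follows from the Val Loss column by exponentiation, since the losses are reported in nats. A minimal sketch of that conversion, using rounded losses copied from the table (so the results can differ from the table in the last digit); the helper name `perplexity` is introduced here for illustration and is not part of `train_ablation.py`:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(cross-entropy) when the loss is measured in nats."""
    return math.exp(loss_nats)


# Rounded validation losses copied from the table above.
val_loss = {"run_02": 1.72, "run_03": 1.87, "run_04": 5.57}

ppl = {run: perplexity(loss) for run, loss in val_loss.items()}
for run, value in ppl.items():
    print(f"{run}: loss={val_loss[run]:.2f} nats -> perplexity={value:.2f}")
```

Note that run_01 is measured per BPE token and the other runs per byte, so its perplexity is not directly comparable with the byte-level rows.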
|
| 64 |
+
|
| 65 |
+
## Small runs: 10M (round 1, 2 tok/param)
|
| 66 |
+
|
| 67 |
+
| Run | Architecture | Params | Val Loss | Perplexity |
|
| 68 |
+
|-----|-------------|-------:|---------:|-----------:|
|
| 69 |
+
| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
|
| 70 |
+
| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
|
| 71 |
+
| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
|
| 72 |
+
| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
|
| 73 |
+
|
| 74 |
+
## Round 2: re-runs at 15 tok/param (7.5× the original budget)
|
| 75 |
+
|
| 76 |
+
Re-trained at **15 tok/param** (7.5× the original budget) to test whether
|
| 77 |
+
the ~5.55 nat floor on ternary runs was caused by under-training.
|
| 78 |
+
|
| 79 |
+
### 10M re-runs (1336 steps each, all uploaded)
|
| 80 |
+
|
| 81 |
+
| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
|
| 82 |
+
|-----|-------------|---------------|---------------|---|
|
| 83 |
+
| run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
|
| 84 |
+
| run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
|
| 85 |
+
| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
|
| 86 |
+
| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | **5.5754** | 0 |
|
| 87 |
+
|
| 88 |
+
### 39M re-runs (in progress)
|
| 89 |
+
|
| 90 |
+
`run_04_r2`, `run_07_r2`, `run_05_r2`, `run_08_r2`, `run_09_r2`, `run_10_r2`
|
| 91 |
+
were started at 15 tok/param. As of 2026-05 only `run_04_r2` and `run_07_r2`
|
| 92 |
+
have intermediate step checkpoints (≈97% of training); none of them
|
| 93 |
+
have been unpacked to fp32 in this repo yet because the audit below
|
| 94 |
+
showed they were converging to the same uniform floor as their r1
|
| 95 |
+
counterparts. They remain available in the source compact_2bit repo
|
| 96 |
+
[`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).
|
| 97 |
+
|
| 98 |
+
Manually-measured byte CE loss on the latest `run_04_r2` and `run_07_r2`
|
| 99 |
+
step checkpoints (sample of English narrative):
|
| 100 |
+
|
| 101 |
+
| Run | Loss (sample) | Δ vs uniform (5.545) |
|
| 102 |
+
|-----|--------------:|--------------------:|
|
| 103 |
+
| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
|
| 104 |
+
| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
|
| 105 |
+
| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
|
| 106 |
+
| **run_04_r2 (Ternary r2, step 4840 / ~5000)** | **5.57** | **+0.02** |
|
| 107 |
+
| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
|
| 108 |
+
| **run_07_r2 (CPU-1 complete r2, step 4860)** | **5.56** | **+0.02** |
|
| 109 |
+
|
| 110 |
+
## Round 3: cold-start rescue (queued, not uploaded)
|
| 111 |
+
|
| 112 |
+
After the audit, four configurations were added to `RUN_CONFIGS` /
|
| 113 |
+
`SMALL_RUN_CONFIGS` that apply all four corrections in concert:
|
| 114 |
+
|
| 115 |
+
1. **bf16 AMP** instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
|
| 116 |
+
2. **`lr_scale=2.0` on BitLinear** (BitNet b1.58 §3.1 prescription)
|
| 117 |
+
3. **CE-only training signal** (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
|
| 118 |
+
4. **50 tok/param** (3.3× r2, close to BitNet's lower bound at 3B scale)
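The budget multipliers quoted for rounds 1 to 3 can be checked with a few lines of arithmetic. This is a rough sketch only: it assumes the training budget is simply parameter count times tokens-per-parameter, and the helper name `token_budget` is introduced here for illustration. Whether the repo counts embedding parameters, or counts bytes versus BPE tokens, is not stated here.

```python
def token_budget(params: float, tokens_per_param: float) -> float:
    """Total training tokens under the assumption budget = params * tok/param."""
    return params * tokens_per_param


# Tokens-per-parameter for each round, as stated in this README.
ROUNDS = {"r1": 2, "r2": 15, "r3": 50}

# Parameter counts copied from the tables above.
PARAMS = {"run_14": 10.7e6, "run_07": 39.0e6}

budgets = {
    (run, rnd): token_budget(n_params, tpp)
    for run, n_params in PARAMS.items()
    for rnd, tpp in ROUNDS.items()
}

r2_over_r1 = ROUNDS["r2"] / ROUNDS["r1"]
r3_over_r2 = ROUNDS["r3"] / ROUNDS["r2"]

for (run, rnd), tokens in budgets.items():
    print(f"{run} {rnd}: {tokens / 1e6:.1f}M tokens")
print(f"r2 / r1 = {r2_over_r1:.1f}x, r3 / r2 = {r3_over_r2:.1f}x")
```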
|
| 119 |
+
|
| 120 |
+
Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. Not yet
|
| 121 |
+
trained. If they reach val_loss < ~5.0 they will be uploaded here.
|
| 122 |
+
|
| 123 |
+
## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline
|
| 124 |
+
|
| 125 |
+
The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
|
| 126 |
+
distribution over the 256 byte vocabulary. All byte-level ternary runs
|
| 127 |
+
(round 1 *and* round 2, both scales) plateau within 0.05 nats of this
|
| 128 |
+
floor; the models are effectively producing uniform predictions over bytes.
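The uniform baseline and the distance of each run from it can be reproduced directly. A minimal sketch, using rounded validation losses copied from the round 1 table; the 0.05 nat tolerance is the one quoted above, and the names `gap_to_uniform` and `is_collapsed` are introduced here for illustration:

```python
import math

VOCAB_SIZE = 256  # byte-level vocabulary
uniform_loss = math.log(VOCAB_SIZE)  # cross-entropy of a uniform predictor, in nats


def gap_to_uniform(loss_nats: float) -> float:
    """Signed distance from the uniform baseline (negative means better)."""
    return loss_nats - uniform_loss


def is_collapsed(loss_nats: float, tolerance: float = 0.05) -> bool:
    """True when the loss sits within `tolerance` nats of the uniform baseline."""
    return abs(gap_to_uniform(loss_nats)) <= tolerance


# Rounded validation losses copied from the round 1 table.
losses = {"run_02": 1.72, "run_03": 1.87, "run_04": 5.57,
          "run_05b_kernel_strict": 5.59, "run_09": 5.53}

for run, loss in losses.items():
    print(f"{run}: gap={gap_to_uniform(loss):+.3f} collapsed={is_collapsed(loss)}")
```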
|
| 129 |
+
|
| 130 |
+
**More training does not help.** A 7.5× increase in tokens-per-parameter
|
| 131 |
+
moves the loss by 0.0001 nats. A weight-evolution audit on `run_14_r2`
|
| 132 |
+
showed that **>99.9% of BitLinear weights are in the same ternary state
|
| 133 |
+
at step 10 and at step 1326** (the entire training). The bottleneck is the
|
| 134 |
+
cold-start dynamics of straight-through-estimator (STE) ternary
|
| 135 |
+
quantisation, not the data budget.
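A weight-state comparison of this kind can be sketched as follows. This is not the audit script from the main repository: it assumes the absmean ternarisation described in the BitNet b1.58 paper (scale by the mean absolute weight, round, clip to {-1, 0, 1}), and whether the repo's BitLinear quantises exactly this way is an assumption. It works on flat lists of floats so it runs without PyTorch; with real checkpoints the latent weights would first be flattened, for example with `tensor.flatten().tolist()`.

```python
def ternarize(weights):
    """Absmean ternarisation: divide by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, 1e-8)  # guard against an all-zero tensor
    return [max(-1, min(1, round(w / scale))) for w in weights]


def state_agreement(early, late):
    """Fraction of weights whose ternary state is identical in both snapshots."""
    q_early, q_late = ternarize(early), ternarize(late)
    same = sum(a == b for a, b in zip(q_early, q_late))
    return same / len(q_early)


# Toy latent weights standing in for two checkpoints of one layer (made-up values).
early = [0.40, -0.35, 0.02, 0.51, -0.01, -0.48]
late = [0.41, -0.33, 0.03, 0.49, -0.02, -0.47]

agreement = state_agreement(early, late)
print(f"ternary states unchanged: {agreement:.1%}")
```

The toy values drift slightly in full precision but never cross a rounding threshold, which is the pattern the audit reports: the latent weights move, the quantised states do not.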
|
| 136 |
+
|
| 137 |
+
Compare against the FP16 sibling architectures:
|
| 138 |
+
|
| 139 |
+
- run_02 (Transformer + byte + FP16, 38M): **1.72**
|
| 140 |
+
- run_03 (MLGRU + byte + FP16, 38M): **1.87**
|
| 141 |
+
- run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** → collapsed
|
| 142 |
+
|
| 143 |
+
Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
|
| 144 |
+
Slender-Mamba with FP16 warm-start) reach functional performance, but the
|
| 145 |
+
cold-start regime at <50 tok/param and <100M parameters that this study
|
| 146 |
+
operates in is not covered by their results.
|
| 147 |
+
|
| 148 |
+
**Throughput** is also worth flagging: even with the ideal BitNet.cpp /
|
| 149 |
+
T-MAC class kernels (4–6× speedup on the matmul fraction), the projected
|
| 150 |
+
end-to-end throughput of the 39M ternary models is **0.30–0.50×** of
|
| 151 |
+
their FP16 Transformer sibling at the same scale. The 1.58-bit speed
|
| 152 |
+
advantage only materialises above ~700M parameters where weight RAM
|
| 153 |
+
bandwidth becomes the bottleneck.
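One way to see why a kernel speedup on the matmul fraction does not close the gap is an Amdahl's-law estimate. The sketch below is illustrative and is not the projection method used in the audit: the matmul fractions (0.8 and 0.9) are hypothetical values chosen for the example, while the measured throughputs are copied from the round 1 table (run_04 and run_02).

```python
def end_to_end_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only the matmul share of the runtime gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


TERNARY_PPS = 9.1  # run_04, measured CPU throughput from the table
FP16_PPS = 75.4    # run_02, measured CPU throughput from the table

# (hypothetical matmul fraction, kernel speedup from the 4-6x range quoted above)
scenarios = [(0.8, 4.0), (0.9, 6.0)]

ratios = []
for fraction, speedup in scenarios:
    projected = TERNARY_PPS * end_to_end_speedup(fraction, speedup)
    ratios.append(projected / FP16_PPS)
    print(f"matmul share {fraction:.0%}, kernel {speedup:.0f}x: "
          f"{projected:.1f} P/s, {ratios[-1]:.2f}x of run_02")
```

Under these assumed fractions the projected ratio stays below 0.5× of the FP16 Transformer, in line with the range quoted above.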
|
| 154 |
+
|
| 155 |
+
> **Practical implication.** The byte-level ternary chain (runs 04–10, 14–16,
|
| 156 |
+
> and their r2 counterparts) cannot distinguish between architectural
|
| 157 |
+
> variants because all of them are stuck at the same numerical floor. The
|
| 158 |
+
> architectures themselves may be sound; the training recipe needs to
|
| 159 |
+
> either (a) warm-start from FP weights, (b) anneal the quantisation, or
|
| 160 |
+
> (c) use a much larger token budget.
|
| 161 |
+
|
| 162 |
+
Full mechanistic analysis, weight-flip diagnostics and throughput
|
| 163 |
+
projections are in the
|
| 164 |
+
[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).
|
| 165 |
+
|
| 166 |
+
## License
|
| 167 |
+
|
| 168 |
+
Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).
|