Commit 77d5726 by Cukinator (verified) · 1 parent: 6ff046a

Update README: reflect actual upload state (20 runs), r2/r3 status, full audit table

Files changed (1): README.md (+168 -121)
@@ -1,121 +1,168 @@
- ---
- language:
- - en
- license: apache-2.0
- tags:
- - pytorch
- - text-generation
- - 1.58bit
- - ternary
- - byte-level
- - mlgru
- - ablation
- library_name: pytorch
- pipeline_tag: text-generation
- model_type: custom
- ---
-
- # CPU-1 Ablation Study - Ready-to-use Checkpoints
-
- Repo: `Cukinator/cpu1-ablations-final`
- Source (compact 2-bit): `Cukinator/cpu1-ablation-checkpoints`
-
- Each checkpoint is a standard PyTorch `.pt` file with **float32 weights**;
- no unpacking needed. Compatible with `train_ablation.py` from the
- [1.58bits repo](https://github.com/Cukinator/1.58bits).
-
- 20 progressive ablation runs across two scales (50M and 10M parameters),
- with a round-2 re-run at 15 tok/param for the runs that failed to converge.
-
- ## Quick start
-
- ```python
- import torch, sys
- sys.path.insert(0, "/path/to/1.58bits")
- from train_ablation import build_ablation_model, generate
-
- ckpt = torch.load("run_02/model.pt", map_location="cpu")
- model = build_ablation_model(ckpt["config"])
- model.load_state_dict(ckpt["state_dict"])
- model.eval()
-
- text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
- print(text)
- ```
-
- ## Ablation chain - 50M runs (round 1: 2 tok/param)
-
- | Run | Architecture | Params | Val Loss | Perplexity |
- |-----|-------------|--------|----------|-----------|
- | run_01 | Transformer + BPE + FP16 (baseline) | 54.7M | 4.66 | 106.1 |
- | run_02a | Transformer + Byte + 4 heads (no LBD) | 38.5M | 2.31 | 10.1 |
- | run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** |
- | run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | **6.49** |
- | run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 |
- | run_05 | + FPResidual | 39.0M | 5.55 | 257.7 |
- | run_05b | MLGRU kernel strict | 35.8M | 5.59 | 268.8 |
- | run_06 | + BolmoPatchEmbedding | 39.0M | 5.56 | 258.8 |
- | run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 |
- | run_08 | Folded Transformer + Byte + Ternary (sibling) | 38.9M | 5.56 | 260.4 |
- | run_09 | + PFNet (CPU-1 + nonlinear residual) | 39.4M | 5.53 | 252.9 |
- | run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 |
-
- ## Small runs - 10M (round 1: 2 tok/param)
-
- | Run | Architecture | Params | Val Loss | Perplexity |
- |-----|-------------|--------|----------|-----------|
- | run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
- | run_14 | Small CPU-1 byte-level | 10.7M | 5.58 | 263.9 |
- | run_15 | Small CPU-1 byte + hidden distillation | 10.7M | 5.58 | 263.9 |
- | run_16 | Small CPU-1 raw bytes, no teacher | 10.7M | 5.58 | 263.9 |
-
- ## Round 2 - re-runs at 15 tok/param (7.5× the original budget)
-
- To test whether the ~5.55 floor on ternary runs was caused by
- under-training, the runs below were re-trained with the same architecture
- but **7.5× more tokens per parameter** (15 tok/param vs the original 2).
-
- ### Small runs (10M, 1336 steps each)
-
- | Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
- |-----|-------------|---------------|---------------|---|
- | run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
- | run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
- | run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
- | run_16_r2 | Small CPU-1 raw bytes, no teacher | 5.5754 | **5.5754** | 0 |
-
- ### 50M runs (15 tok/param, still in progress)
-
- run_04_r2, run_05_r2, run_07_r2, run_08_r2, run_09_r2, run_10_r2: to be
- uploaded as they finish training on T4.
-
- ## Key finding: ternary cold-start collapses to the uniform baseline
-
- The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
- distribution over the 256 byte vocabulary. All byte-level ternary runs
- (round 1 *and* round 2) plateau within 0.05 nats of this floor, i.e. the
- models are effectively producing uniform predictions over bytes.
-
- **More training does not help.** A 7.5× increase in tokens-per-parameter
- moves the loss by 0.0001 nats. The bottleneck is the cold-start dynamics
- of straight-through estimator (STE) ternary quantization, not the data
- budget. Compare against the FP16 sibling architectures:
-
- - run_02 (Transformer + byte + FP16, 38M): **1.72**
- - run_03 (MLGRU + byte + FP16, 38M): **1.87**
- - run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** ← collapsed
-
- The FP16 runs converge with 2 tok/param. The ternary versions don't, even
- at 15 tok/param. Published 1.58-bit models (BitNet b1.58, Slender-Mamba)
- report functional results at ≥50 tok/param, and even those typically
- warm-start from FP checkpoints or apply quantization annealing.
-
- **Practical implication**: the ablation chain (runs 04–10, 14–16) cannot
- distinguish between architectural variants because all of them are stuck
- at the same numerical floor. The architectures themselves may be sound;
- the training recipe needs to either (a) warm-start from FP weights,
- (b) anneal the quantization, or (c) use a much larger token budget.
-
- > **References to be aware of**: FP16 runs (01–03) are valid trained
- > models. Ternary runs (04–10, 14–16, and their r2 counterparts) reach
- > the uniform-output floor and are useful only as negative controls.
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - pytorch
+ - text-generation
+ - 1.58bit
+ - ternary
+ - byte-level
+ - mlgru
+ - ablation
+ library_name: pytorch
+ pipeline_tag: text-generation
+ model_type: custom
+ ---
+
+ # CPU-1 Ablation Study - Ready-to-use Checkpoints (fp32 unpacked)
+
+ Repo: `Cukinator/cpu1-ablations-final`
+ Source: [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
+ Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
+ Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)
+
+ Each checkpoint is a standard PyTorch `.pt` file with **float32 weights**;
+ no unpacking needed. Compatible with `train_ablation.py` from the
+ [1.58bits repo](https://github.com/Cukinator/1.58bits).
+
+ Currently uploaded: **20 trained runs**. The remaining `*_r2` 39M runs and
+ the four `*_r3` configs are queued but not yet trained / not yet unpacked.
+
+ ## Quick start
+
+ ```python
+ import torch, sys
+ sys.path.insert(0, "/path/to/1.58bits")
+ from train_ablation import build_ablation_model, generate
+
+ ckpt = torch.load("run_02/model.pt", map_location="cpu")
+ model = build_ablation_model(ckpt["config"])
+ model.load_state_dict(ckpt["state_dict"])
+ model.eval()
+
+ text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
+ print(text)
+ ```
+
+ ## Ablation chain - 50M runs (round 1, 2 tok/param)
+
+ | Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
+ |-----|-------------|-------:|---------:|-----------:|----------------------:|
+ | run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | **4.66** | 106.1 | 72.4 |
+ | run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | **2.31** | 10.1 | 84.8 |
+ | run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** | 75.4 |
+ | run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | 6.49 | 58.0 |
+ | run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
+ | run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
+ | run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
+ | run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
+ | run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
+ | run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
+ | run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
+ | run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |
+
+ ## Small runs - 10M (round 1, 2 tok/param)
+
+ | Run | Architecture | Params | Val Loss | Perplexity |
+ |-----|-------------|-------:|---------:|-----------:|
+ | run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
+ | run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
+ | run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
+ | run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
+
+ ## Round 2 - re-runs at 15 tok/param (7.5× the original budget)
+
+ Re-trained at **15 tok/param** (7.5× the original budget) to test whether
+ the ~5.55 nat floor on ternary runs was caused by under-training.
+
+ ### 10M re-runs (1336 steps each, all uploaded)
+
+ | Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
+ |-----|-------------|---------------|---------------|---|
+ | run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
+ | run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
+ | run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
+ | run_16_r2 | Small CPU-1 raw bytes | 5.5754 | **5.5754** | 0 |
+
+ ### 39M re-runs (in progress)
+
+ `run_04_r2`, `run_07_r2`, `run_05_r2`, `run_08_r2`, `run_09_r2`, `run_10_r2`
+ were started at 15 tok/param. As of 2026-05 only `run_04_r2` and `run_07_r2`
+ have intermediate step checkpoints (≈97% of training); none of them
+ have been unpacked to fp32 in this repo yet because the audit below
+ showed they were converging to the same uniform floor as their r1
+ counterparts. They remain available in the source compact_2bit repo
+ [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).
+
+ Manually-measured byte CE loss on the latest `run_04_r2` and `run_07_r2`
+ step checkpoints (sample of English narrative):
+
+ | Run | Loss (sample) | Δ vs uniform (5.545) |
+ |-----|--------------:|--------------------:|
+ | run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
+ | run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
+ | run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
+ | **run_04_r2 (Ternary r2, step 4840 / ~5000)** | **5.57** | **+0.02** |
+ | run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
+ | **run_07_r2 (CPU-1 complete r2, step 4860)** | **5.56** | **+0.02** |
+
+ ## Round 3 - cold-start rescue (queued, not uploaded)
+
+ After the audit, four configurations were added to `RUN_CONFIGS` /
+ `SMALL_RUN_CONFIGS` that apply all four corrections in concert:
+
+ 1. **bf16 AMP** instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
+ 2. **`lr_scale=2.0` on BitLinear** (BitNet b1.58 §3.1 prescription)
+ 3. **CE-only training signal** (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
+ 4. **50 tok/param** (3.3× r2, close to BitNet's lower bound at 3B scale)
+
+ Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. Not yet
+ trained. If they reach val_loss < ~5.0 they will be uploaded here.
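
For orientation, the token budgets implied by the three rounds can be worked out from the tok/param figures. This is only a sketch: it assumes that "tok/param" multiplies the total parameter count listed in the tables (the README does not spell this out), and the helper name `token_budget` is made up for illustration.

```python
def token_budget(params: float, tok_per_param: float) -> float:
    """Total training tokens, assuming budget = params * tok/param."""
    return params * tok_per_param

# Parameter counts taken from the tables (run_04: 38.9M, run_14: 10.7M).
RUNS = {"run_04": 38.9e6, "run_14": 10.7e6}
ROUNDS = {"r1": 2, "r2": 15, "r3": 50}  # tok/param per round

for run, params in RUNS.items():
    for rnd, tpp in ROUNDS.items():
        print(f"{run}_{rnd}: {token_budget(params, tpp) / 1e6:.1f}M tokens")

# Ratios quoted in the text: r2 is 7.5x r1, r3 is about 3.3x r2.
r2_over_r1 = ROUNDS["r2"] / ROUNDS["r1"]
r3_over_r2 = ROUNDS["r3"] / ROUNDS["r2"]
print(f"r2/r1 = {r2_over_r1:.1f}, r3/r2 = {r3_over_r2:.2f}")
```

Under that assumption a 38.9M-parameter r3 run would see roughly 1.9B tokens, against about 78M in round 1.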
+
+ ## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline
+
+ The reference value is **ln(256) = 5.545 nats**, the entropy of a uniform
+ distribution over the 256 byte vocabulary. All byte-level ternary runs
+ (round 1 *and* round 2, both scales) plateau within 0.05 nats of this
+ floor: the models are effectively producing uniform predictions over bytes.
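
The floor and the gap of each run to it can be checked with a few lines of arithmetic. The sketch below uses the rounded validation losses from the tables, so the perplexities it prints differ slightly from the tabulated ones; the helper names are illustrative and not part of `train_ablation.py`.

```python
import math

VOCAB_SIZE = 256  # one class per possible byte value

# Cross-entropy (in nats) of a model that assigns p = 1/256 to every byte.
uniform_loss = math.log(VOCAB_SIZE)

def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the cross-entropy in nats."""
    return math.exp(loss_nats)

def gap_to_uniform(loss_nats: float) -> float:
    """Negative means better than uniform guessing, near zero means collapsed."""
    return loss_nats - uniform_loss

print(f"uniform floor: {uniform_loss:.3f} nats, perplexity {perplexity(uniform_loss):.0f}")
for name, loss in [("run_02", 1.72), ("run_03", 1.87), ("run_04", 5.57)]:
    print(f"{name}: gap {gap_to_uniform(loss):+.3f} nats, perplexity {perplexity(loss):.1f}")
```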
+
+ **More training does not help.** A 7.5× increase in tokens-per-parameter
+ moves the loss by 0.0001 nats. A weight-evolution audit on `run_14_r2`
+ showed that **>99.9% of BitLinear weights are in the same ternary state
+ at step 10 and at step 1326** (the entire training). The bottleneck is the
+ cold-start dynamics of straight-through-estimator (STE) ternary
+ quantisation, not the data budget.
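
A weight-flip measurement of this kind could look roughly like the sketch below. It assumes a BitNet b1.58 style absmean quantiser (divide by the mean absolute weight, round, clip to {-1, 0, +1}); the repo's `BitLinear` may quantise differently, and the function names and toy weights are invented for illustration. Plain Python lists are used to keep the example dependency-free.

```python
def ternarize(weights: list[float], eps: float = 1e-8) -> list[int]:
    """Absmean ternary quantisation: round(w / mean|w|), clipped to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]

def unchanged_fraction(before: list[float], after: list[float]) -> float:
    """Share of weights whose ternary state is identical in both snapshots."""
    q_before, q_after = ternarize(before), ternarize(after)
    same = sum(a == b for a, b in zip(q_before, q_after))
    return same / len(q_before)

# Toy latent (full-precision) weights at an early and a late step; made-up values.
early = [0.80, -0.05, 0.02, -0.90, 0.40, -0.30]
late = [0.82, -0.04, 0.03, -0.88, 0.10, -0.31]

print(ternarize(early))  # ternary states at the early step
print(ternarize(late))   # ternary states at the late step
print(f"unchanged: {unchanged_fraction(early, late):.1%}")
```

In the toy data only the fifth weight changes state (from +1 to 0), so five of six entries are unchanged. A value above 99.9% across real checkpoints would mean the latent weights almost never cross a rounding threshold during training.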
+
+ Compare against the FP16 sibling architectures:
+
+ - run_02 (Transformer + byte + FP16, 38M): **1.72**
+ - run_03 (MLGRU + byte + FP16, 38M): **1.87**
+ - run_04 (MLGRU + byte + **ternary**, same 38M): **5.57** ← collapsed
+
+ Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
+ Slender-Mamba with FP16 warm-start) reach functional performance, but the
+ cold-start regime at <50 tok/param and <100M parameters that this study
+ operates in is not covered by their results.
+
+ **Throughput** is also worth flagging: even with the ideal BitNet.cpp /
+ T-MAC class kernels (4–6× speedup on the matmul fraction), the projected
+ end-to-end throughput of the 39M ternary models is **0.30–0.50×** of
+ their FP16 Transformer sibling at the same scale. The 1.58-bit speed
+ advantage only materialises above ~700M parameters where weight RAM
+ bandwidth becomes the bottleneck.
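
One way to reason about a projection like this is Amdahl's law: only the matmul share of the runtime benefits from a faster kernel. The sketch below is not the author's method (that is documented in the main repository README); the matmul fractions of 0.8 and 0.9 are hypothetical values chosen for illustration, while the two throughput numbers come from the round-1 table.

```python
def amdahl_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only `matmul_fraction` of runtime gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)

# Measured throughput from the round-1 table (P/s, CPU).
FP16_TRANSFORMER = 75.4  # run_02
TERNARY_MLGRU = 9.1      # run_04

measured_ratio = TERNARY_MLGRU / FP16_TRANSFORMER
print(f"measured ternary / FP16 ratio: {measured_ratio:.2f}")

# Hypothetical matmul fractions paired with the quoted 4x and 6x kernel speedups.
for fraction, speedup in [(0.8, 4.0), (0.9, 6.0)]:
    projected = measured_ratio * amdahl_speedup(fraction, speedup)
    print(f"matmul share {fraction:.0%}, kernel {speedup:.0f}x -> {projected:.2f}x of FP16")
```

With these assumed fractions the projected ratio stays below 0.5× of the FP16 sibling, because the non-matmul part of the runtime is untouched by the kernel.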
+
+ > **Practical implication.** The byte-level ternary chain (runs 04–10, 14–16,
+ > and their r2 counterparts) cannot distinguish between architectural
+ > variants because all of them are stuck at the same numerical floor. The
+ > architectures themselves may be sound; the training recipe needs to
+ > either (a) warm-start from FP weights, (b) anneal the quantisation, or
+ > (c) use a much larger token budget.
+
+ Full mechanistic analysis, weight-flip diagnostics and throughput
+ projections are in the
+ [main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).
+
+ ## License
+
+ Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).