- CPU-1 Ablation Study: Ready-to-use Checkpoints (fp32 unpacked)
- Quick start
- Ablation chain: 50M runs (round 1, 2 tok/param)
- Small runs: 10M (round 1, 2 tok/param)
- Round 2: re-runs at 15 tok/param (7.5× the original budget)
- Round 3: cold-start rescue (queued, not uploaded)
- Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline
- License
CPU-1 Ablation Study: Ready-to-use Checkpoints (fp32 unpacked)
Repo: Cukinator/cpu1-ablations-final
Source: Cukinator/cpu1-ablation-checkpoints (compact 2-bit, raw training output)
Code: github.com/Cukinator/1.58bits
Dataset: Cukinator/cpu1-ablation-dataset
Each checkpoint is a standard PyTorch .pt file with float32 weights, so
no unpacking is needed. Compatible with train_ablation.py from the
1.58bits repo.
Currently uploaded: 20 trained runs. The 39M *_r2 runs have not been
unpacked to fp32 yet, and the four *_r3 configs are queued but not yet trained.
Quick start
import torch, sys
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()
text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)
Ablation chain: 50M runs (round 1, 2 tok/param)
| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
|---|---|---|---|---|---|
| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | 4.66 | 106.1 | 72.4 |
| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | 2.31 | 10.1 | 84.8 |
| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | 1.72 | 5.56 | 75.4 |
| run_03 | MLGRU + Byte + FP16 | 38.8M | 1.87 | 6.49 | 58.0 |
| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |
Small runs: 10M (round 1, 2 tok/param)
| Run | Architecture | Params | Val Loss | Perplexity |
|---|---|---|---|---|
| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
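The perplexity columns in the two tables above appear to be the exponential of the validation loss in nats. The following is a minimal check using values copied from those tables; small mismatches are expected because the listed losses are rounded to two decimals.

```python
import math

# (run, val_loss in nats, perplexity as listed), copied from the tables above.
rows = [
    ("run_01", 4.66, 106.1),
    ("run_02", 1.72, 5.56),
    ("run_04", 5.57, 261.7),
    ("run_14", 5.58, 263.9),
]


def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


for name, loss, listed in rows:
    implied = perplexity(loss)
    rel_diff = abs(implied - listed) / listed
    print(f"{name}: implied {implied:.1f}, listed {listed} (rel. diff {rel_diff:.1%})")

# run_13 does not follow the same pattern: compare exp(30.54) with exp(20).
print(f"exp(30.54) = {perplexity(30.54):.3g}, exp(20) = {perplexity(20.0):.3g}")
```

run_13 is the exception: exp(30.54) is about 1.8×10¹³, whereas the listed 4.85×10⁸ matches exp(20) to three significant figures. One possible explanation is that the logging code clamps the loss before exponentiating, but that is a guess and not something this README states.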
Round 2: re-runs at 15 tok/param (7.5× the original budget)
These runs were re-trained at 15 tok/param (7.5× the original budget) to test whether the ~5.55 nat floor on the ternary runs was caused by under-training.
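To make the budgets concrete, the number of training tokens implied by a tokens-per-parameter ratio is the product of that ratio and the parameter count. The sketch below uses parameter counts from the round 1 tables and assumes the budget is computed from the total parameter count, which this README does not spell out; for the byte-level runs a token is presumably one byte.

```python
def token_budget(params: float, tok_per_param: float) -> float:
    """Training tokens implied by a tokens-per-parameter ratio."""
    return params * tok_per_param


# Parameter counts taken from the round 1 tables.
runs = {"run_04 (38.9M)": 38.9e6, "run_14 (10.7M)": 10.7e6}

for name, params in runs.items():
    r1 = token_budget(params, 2)    # round 1 budget
    r2 = token_budget(params, 15)   # round 2 budget
    print(
        f"{name}: r1 = {r1 / 1e6:.1f}M tokens, "
        f"r2 = {r2 / 1e6:.1f}M tokens, ratio = {r2 / r1:.1f}x"
    )
```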
10M re-runs (1336 steps each, all uploaded)
| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
|---|---|---|---|---|
| run_13_r2 | Small CPU-1 + BPE | 30.54 | 25.43 | −5.1 |
| run_14_r2 | Small CPU-1 byte-level | 5.5754 | 5.5755 | +0.0001 |
| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | 5.5755 | 0 |
| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | 5.5754 | 0 |
39M re-runs (in progress)
run_04_r2, run_07_r2, run_05_r2, run_08_r2, run_09_r2, run_10_r2
were started at 15 tok/param. As of 2026-05, only run_04_r2 and run_07_r2
have intermediate step checkpoints (≈97% of training). None of the six
runs has been unpacked to fp32 in this repo yet, because the audit below
showed they were converging to the same uniform floor as their r1
counterparts. They remain available in the source compact 2-bit repo,
Cukinator/cpu1-ablation-checkpoints.
Manually-measured byte CE loss on the latest run_04_r2 and run_07_r2
step checkpoints (sample of English narrative):
| Run | Loss (sample) | Δ vs uniform (5.545) |
|---|---|---|
| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
| run_04_r2 (Ternary r2, step 4840 / ~5000) | 5.57 | +0.02 |
| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
| run_07_r2 (CPU-1 complete r2, step 4860) | 5.56 | +0.02 |
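For reference, a byte-level cross-entropy of this kind can be measured in a few lines. The sketch below is not the script used for the table: `predict` is a hypothetical callable that returns a probability for each of the 256 possible next bytes given the preceding bytes, and the sample text is arbitrary. A uniform predictor reproduces the 5.545-nat baseline used in the last column.

```python
import math
from typing import Callable, Sequence


def byte_cross_entropy(data: bytes, predict: Callable[[bytes], Sequence[float]]) -> float:
    """Mean negative log-likelihood (nats per byte) of `data` under `predict`.

    `predict(prefix)` is a hypothetical interface: it must return 256
    probabilities for the byte that follows `prefix`.
    """
    total = 0.0
    for i, b in enumerate(data):  # iterating over bytes yields ints in 0..255
        probs = predict(data[:i])
        total += -math.log(probs[b])
    return total / len(data)


def uniform_predict(prefix: bytes) -> Sequence[float]:
    """Baseline that ignores context and spreads mass evenly over all bytes."""
    return [1.0 / 256] * 256


sample = b"The quick brown fox jumps over the lazy dog."
loss = byte_cross_entropy(sample, uniform_predict)
print(f"uniform baseline: {loss:.3f} nats/byte")  # ln(256)
print(f"delta vs ln(256): {loss - math.log(256):+.3f}")
```

Wrapping a real checkpoint would mean replacing `uniform_predict` with a function that runs the model on the prefix and returns its softmax over bytes.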
Round 3: cold-start rescue (queued, not uploaded)
After the audit, four configurations were added to RUN_CONFIGS /
SMALL_RUN_CONFIGS that apply all four corrections in concert:
- bf16 AMP instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
- lr_scale=2.0 on BitLinear (BitNet b1.58 §3.1 prescription)
- CE-only training signal (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
- 50 tok/param (3.3× r2, close to BitNet's lower bound at 3B scale)
Configs: run_04_r3, run_07_r3, run_14_r3, run_15_r3. Not yet
trained. If they reach val_loss < ~5.0 they will be uploaded here.
Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline
The reference value is ln(256) = 5.545 nats, the entropy of a uniform distribution over the 256-entry byte vocabulary. All byte-level ternary runs (round 1 and round 2, both scales) plateau within 0.05 nats of this floor: the models are effectively producing uniform predictions over bytes.
More training does not help. A 7.5× increase in tokens-per-parameter
moves the loss of the byte-level 10M runs by at most 0.0001 nats. A
weight-evolution audit on run_14_r2 showed that >99.9% of BitLinear
weights are in the same ternary state at step 10 and at step 1326, i.e.
across essentially the whole run. The bottleneck is the
cold-start dynamics of straight-through-estimator (STE) ternary
quantisation, not the data budget.
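A weight-evolution audit of this kind can be sketched as follows. This is an illustration, not the audit script from the repository: it assumes the absmean ternarisation described for BitNet b1.58 (divide by the mean absolute value, round, clip to {−1, 0, +1}) and works on flat lists of floats. With real checkpoints the same comparison would be applied to each BitLinear weight tensor in the two state dicts.

```python
from typing import List, Sequence


def ternarize(weights: Sequence[float], eps: float = 1e-5) -> List[int]:
    """Absmean ternarisation: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def unchanged_fraction(w_early: Sequence[float], w_late: Sequence[float]) -> float:
    """Fraction of weights whose ternary state is identical in both snapshots."""
    q_early, q_late = ternarize(w_early), ternarize(w_late)
    same = sum(a == b for a, b in zip(q_early, q_late))
    return same / len(q_early)


# Toy latent weights standing in for one layer at an early and a late step.
early = [0.90, -0.05, 0.40, -0.80, 0.02, 0.30]
late = [0.88, -0.06, 0.41, -0.79, 0.30, 0.31]  # only the fifth weight moved far

print(f"unchanged ternary states: {unchanged_fraction(early, late):.1%}")
```

A value close to 100% between an early and a late step means the latent weights never moved far enough to cross a rounding threshold, which is the behaviour the audit reports.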
Compare against the FP16 sibling architectures:
- run_02 (Transformer + byte + FP16, 38M): 1.72
- run_03 (MLGRU + byte + FP16, 38M): 1.87
- run_04 (MLGRU + byte + ternary, same 38M): 5.57 (collapsed)
Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param; Slender-Mamba with FP16 warm-start) reach functional performance, but the cold-start regime at <50 tok/param and <100M parameters that this study operates in is not covered by their results.
Throughput is also worth flagging: even with ideal BitNet.cpp / T-MAC class kernels (4–6× speedup on the matmul fraction), the projected end-to-end throughput of the 39M ternary models is 0.30–0.50× that of their FP16 Transformer sibling at the same scale. The 1.58-bit speed advantage only materialises above ~700M parameters, where weight RAM bandwidth becomes the bottleneck.
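The shape of such a projection can be illustrated with Amdahl's law. The sketch below is not the projection from the main repository: the matmul fractions are hypothetical placeholders chosen for illustration, while the measured throughputs (9.1 for run_04, 75.4 for run_02) come from the round 1 table.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Measured CPU throughput from the round 1 table.
ternary_tput = 9.1   # run_04, MLGRU + Byte + Ternary
fp16_tput = 75.4     # run_02, Transformer + Byte + LocalByteDecoder

# Hypothetical share of runtime spent in matmuls; not given in this README.
for matmul_fraction in (0.80, 0.90):
    for kernel_speedup in (4.0, 6.0):
        projected = ternary_tput * amdahl_speedup(matmul_fraction, kernel_speedup)
        print(
            f"matmul fraction {matmul_fraction:.0%}, kernel {kernel_speedup:.0f}x: "
            f"{projected / fp16_tput:.2f}x of the FP16 Transformer"
        )
```

Under these assumed fractions the projected ratios stay well below 1×, because a faster kernel cannot shrink the non-matmul share of the runtime.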
Practical implication. The byte-level ternary chain (runs 04–10, 14–16, and their r2 counterparts) cannot distinguish between architectural variants because all of them are stuck at the same numerical floor. The architectures themselves may be sound; the training recipe needs to either (a) warm-start from FP weights, (b) anneal the quantisation, or (c) use a much larger token budget.
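As an illustration of option (b), one common way to anneal quantisation is to blend the latent full-precision weight with its ternarised value and ramp the blend factor from 0 to 1 over training. The sketch below is a generic pure-Python illustration under that assumption; the linear schedule, the function names and the absmean ternariser are choices made for this example, not something the 1.58bits repository is known to implement.

```python
from typing import List, Sequence


def ternarize(weights: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Absmean ternarisation, rescaled back to the original weight magnitude."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [scale * max(-1, min(1, round(w / scale))) for w in weights]


def anneal_factor(step: int, anneal_steps: int) -> float:
    """Linear ramp from 0 (full precision) to 1 (fully ternary)."""
    return min(1.0, max(0.0, step / anneal_steps))


def effective_weights(weights: Sequence[float], step: int, anneal_steps: int) -> List[float]:
    """Forward-pass weights: (1 - lam) * w + lam * ternarize(w)."""
    lam = anneal_factor(step, anneal_steps)
    quantised = ternarize(weights)
    return [(1.0 - lam) * w + lam * q for w, q in zip(weights, quantised)]


latent = [0.90, -0.05, 0.40, -0.80]
for step in (0, 500, 1000):
    blended = effective_weights(latent, step, anneal_steps=1000)
    print(step, [round(w, 3) for w in blended])
```

At step 0 the forward pass sees the unmodified latent weights, and at the end of the ramp it sees purely ternary ones, so the network gets a chance to learn useful structure before the rounding thresholds take full effect.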
Full mechanistic analysis, weight-flip diagnostics and throughput projections are in the main repository README.
License
Apache-2.0. Same as the source code at github.com/Cukinator/1.58bits.