---
language:
  - en
license: apache-2.0
tags:
  - pytorch
  - text-generation
  - 1.58bit
  - ternary
  - byte-level
  - mlgru
  - ablation
library_name: pytorch
pipeline_tag: text-generation
model_type: custom
---

# CPU-1 Ablation Study — Ready-to-use Checkpoints (fp32 unpacked)

- Repo: `Cukinator/cpu1-ablations-final`
- Source: [`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints) (compact 2-bit, raw training output)
- Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
- Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)

Each checkpoint is a standard PyTorch `.pt` file with **float32 weights** —
no unpacking needed. Compatible with `train_ablation.py` from the
[1.58bits repo](https://github.com/Cukinator/1.58bits).

Currently uploaded: **20 trained runs**. The 39M `*_r2` runs have not been
unpacked to fp32, and the four `*_r3` configs are queued but not yet trained.

## Quick start

```python
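import sys

import torch

# Make the 1.58bits checkout importable (adjust the path to your clone).
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate

# Each checkpoint is a dict holding the model config and the fp32 weights.
# On PyTorch >= 2.6, torch.load defaults to weights_only=True; if loading
# fails on the config entry, pass weights_only=False (trusted files only).
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()  # switch to inference mode

# Generate a continuation of the prompt on CPU.
text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)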
```
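If the checkpoint is not already on disk, it can be fetched from the Hub first. This is a minimal sketch, assuming that `huggingface_hub` is installed and that the repo stores each run as `<run>/model.pt` (inferred from the path used in the quick start, not verified). `checkpoint_filename` and `fetch_checkpoint` are hypothetical helper names, not part of the 1.58bits code.

```python
REPO_ID = "Cukinator/cpu1-ablations-final"


def checkpoint_filename(run: str) -> str:
    """Path of a run's checkpoint inside the repo (assumed layout: <run>/model.pt)."""
    return f"{run}/model.pt"


def fetch_checkpoint(run: str) -> str:
    """Download one checkpoint and return the path of the cached local copy."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(run))


# Usage (needs network access):
#   local_path = fetch_checkpoint("run_02")
#   ckpt = torch.load(local_path, map_location="cpu")
```

The returned path can be passed to `torch.load` in place of `"run_02/model.pt"`.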

## Ablation chain — 50M runs (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
|-----|-------------|-------:|---------:|-----------:|----------------------:|
| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | **4.66** | 106.1 | 72.4 |
| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | **2.31** | 10.1 | 84.8 |
| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | **1.72** | **5.56** | 75.4 |
| run_03 | MLGRU + Byte + FP16 | 38.8M | **1.87** | 6.49 | 58.0 |
| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |

## Small runs — 10M (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity |
|-----|-------------|-------:|---------:|-----------:|
| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |
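
The perplexity columns in both tables appear to be `exp(val_loss)` with the loss in nats. The check below, using values copied from the tables, reproduces the byte-level entries to within rounding of the loss, but not the `run_13` entry: `exp(30.54)` is about 1.8×10¹³, whereas the listed 4.85×10⁸ equals `exp(20)`. One possible explanation (an assumption, not confirmed by this card) is that the logging code clamps the loss at 20 before exponentiating.

```python
import math

# (val_loss, listed perplexity) copied from the tables above.
REPORTED = {
    "run_02": (1.72, 5.56),
    "run_03": (1.87, 6.49),
    "run_04": (5.57, 261.7),
    "run_14": (5.58, 263.9),
    "run_13": (30.54, 4.85e8),
}


def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


def relative_gap(loss_nats: float, listed: float) -> float:
    """Relative difference between exp(loss) and the listed perplexity."""
    return abs(perplexity(loss_nats) - listed) / listed


for run, (loss, listed) in REPORTED.items():
    print(f"{run}: exp(loss)={perplexity(loss):.4g}  listed={listed:.4g}  "
          f"gap={relative_gap(loss, listed):.2%}")

# Hypothetical clamp at 20 nats, for comparison with the run_13 entry.
print(f"exp(min(30.54, 20)) = {math.exp(min(30.54, 20.0)):.3g}")
```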

## Round 2 — re-runs at 15 tok/param (7.5× the original budget)

These runs repeat round 1 with **15 tok/param** instead of 2 to test whether
the ~5.55 nat floor on ternary runs was caused by under-training.

### 10M re-runs (1336 steps each, all uploaded)

| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ (r2 − r1) |
|-----|-------------|--------------:|--------------:|------------:|
| run_13_r2 | Small CPU-1 + BPE | 30.54 | **25.43** | −5.1 |
| run_14_r2 | Small CPU-1 byte-level | 5.5754 | **5.5755** | +0.0001 |
| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | **5.5755** | 0 |
| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | **5.5754** | 0 |

### 39M re-runs (in progress)

`run_04_r2`, `run_05_r2`, `run_07_r2`, `run_08_r2`, `run_09_r2` and `run_10_r2`
were started at 15 tok/param. As of 2026-05, only `run_04_r2` and `run_07_r2`
have intermediate step checkpoints (≈97% of training). None of the six has
been unpacked to fp32 in this repo, because the audit below showed them
converging to the same uniform floor as their r1 counterparts. The raw
checkpoints remain available in the compact 2-bit source repo
[`Cukinator/cpu1-ablation-checkpoints`](https://huggingface.co/Cukinator/cpu1-ablation-checkpoints).

Manually measured byte-level cross-entropy (nats per byte) on the latest
`run_04_r2` and `run_07_r2` step checkpoints, using a sample of English
narrative and compared with the uniform baseline ln(256) ≈ 5.545:

| Run | Loss (sample) | Δ vs uniform (5.545) |
|-----|--------------:|--------------------:|
| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
| **run_04_r2 (Ternary r2, step 4840 / ~5000)** | **5.57** | **+0.02** |
| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
| **run_07_r2 (CPU-1 complete r2, step 4860)** | **5.56** | **+0.02** |

## Round 3 — cold-start rescue (queued, not uploaded)

After the audit, four configurations were added to `RUN_CONFIGS` /
`SMALL_RUN_CONFIGS`. Each applies the following four corrections together:

1. **bf16 AMP** instead of fp16 (no GradScaler underflow on the BitLinear rescale chain)
2. **`lr_scale=2.0` on BitLinear** (BitNet b1.58 §3.1 prescription)
3. **CE-only training signal** (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
4. **50 tok/param** (3.3× r2, close to BitNet's lower bound at 3B scale)

Configs: `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3`. Not yet
trained. If they reach val_loss < ~5.0 they will be uploaded here.
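
To put the three budgets in absolute terms, the sketch below multiplies parameter counts from the round-1 tables by the tokens-per-parameter setting of each round. It assumes that "tok/param" means training tokens (bytes, for the byte-level runs) divided by total parameters, which this card does not define explicitly.

```python
# Tokens-per-parameter setting of each round, as stated in this card.
TOK_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}

# Parameter counts taken from the round-1 tables.
PARAMS = {"run_04": 38.9e6, "run_07": 39.0e6, "run_14": 10.7e6, "run_15": 10.7e6}


def token_budget(n_params: float, tok_per_param: float) -> float:
    """Total training tokens implied by a tokens-per-parameter setting."""
    return n_params * tok_per_param


for run, n_params in PARAMS.items():
    budgets = ", ".join(
        f"{rnd}: {token_budget(n_params, tpp) / 1e6:,.0f}M"
        for rnd, tpp in TOK_PER_PARAM.items()
    )
    print(f"{run}: {budgets}")

print(f"r2 / r1 = {TOK_PER_PARAM['r2'] / TOK_PER_PARAM['r1']:.1f}x")
print(f"r3 / r2 = {TOK_PER_PARAM['r3'] / TOK_PER_PARAM['r2']:.1f}x")
```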

## Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline

The reference value is **ln(256) ≈ 5.545 nats**, the entropy of a uniform
distribution over the 256 possible byte values. All byte-level ternary runs
(round 1 *and* round 2, both scales) plateau within 0.05 nats of this
value: the models are effectively producing uniform predictions over bytes.
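
The baseline is easy to verify: a predictor that assigns probability 1/256 to every byte incurs a cross-entropy of exactly ln(256) nats per byte (8 bits per byte), whatever the text. A minimal sketch in plain Python follows; the helper name `uniform_byte_ce` is illustrative, and the losses in the final loop are copied from the round-1 tables.

```python
import math

VOCAB = 256  # number of possible byte values


def uniform_byte_ce(data: bytes) -> float:
    """Mean cross-entropy (nats per byte) of a uniform predictor on `data`."""
    p = 1.0 / VOCAB
    return sum(-math.log(p) for _ in data) / len(data)


sample = "The quick brown fox".encode("utf-8")
ce = uniform_byte_ce(sample)
print(f"{ce:.4f} nats/byte = {ce / math.log(2):.2f} bits/byte")

# Gap between reported ternary validation losses and the uniform baseline.
for run, loss in {"run_04": 5.57, "run_07": 5.56, "run_14": 5.58}.items():
    print(f"{run}: {loss - math.log(VOCAB):+.3f} nats vs uniform")
```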

**More training does not help.** In the byte-level 10M runs, a 7.5× increase
in tokens per parameter moves the loss by at most 0.0001 nats. A
weight-evolution audit on `run_14_r2` showed that **>99.9% of BitLinear
weights are in the same ternary state at step 10 and at step 1326**, that is,
across essentially the whole 1336-step run. The bottleneck is the
cold-start dynamics of straight-through-estimator (STE) ternary
quantisation, not the data budget.
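
A weight-evolution audit of this kind can be sketched as follows. The example assumes BitNet b1.58-style absmean quantisation (divide by the mean absolute weight, round, clip to {-1, 0, +1}); whether `train_ablation.py` quantises exactly this way is not stated here, and the function names are hypothetical. It uses plain Python lists and toy values to stay dependency-free; with real checkpoints the same comparison would be run per BitLinear weight tensor.

```python
from typing import List, Sequence


def ternarize(weights: Sequence[float], eps: float = 1e-8) -> List[int]:
    """Absmean ternary quantisation: round(w / mean|w|), clipped to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def fraction_unchanged(w_early: Sequence[float], w_late: Sequence[float]) -> float:
    """Share of weights whose ternary state is identical in both snapshots."""
    q_early, q_late = ternarize(w_early), ternarize(w_late)
    same = sum(a == b for a, b in zip(q_early, q_late))
    return same / len(q_early)


# Toy latent weights from two training steps (illustrative values only).
early = [0.90, -0.80, 0.05, 0.40, -0.02, -0.60]
late = [0.85, -0.82, 0.07, 0.15, -0.03, -0.58]

print(ternarize(early))
print(ternarize(late))
print(f"unchanged: {fraction_unchanged(early, late):.1%}")
```

A value close to 100% between an early and a late step means the latent weights move too little to cross a rounding threshold, which is the behaviour the audit reports.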

Compare the round-1 validation losses of the FP16 sibling architectures:

- run_02 (Transformer + byte + FP16, 38.8M): **1.72**
- run_03 (MLGRU + byte + FP16, 38.8M): **1.87**
- run_04 (MLGRU + byte + **ternary**, 38.9M): **5.57** ← collapsed

Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param;
Slender-Mamba with FP16 warm-start) reach functional performance, but the
cold-start regime at <50 tok/param and <100M parameters that this study
operates in is not covered by their results.

**Throughput** is also worth flagging: even with ideal BitNet.cpp /
T-MAC-class kernels (4–6× speedup on the matmul fraction), the projected
end-to-end throughput of the 39M ternary models is only **0.30–0.50×** that
of their FP16 Transformer sibling at the same scale. The 1.58-bit speed
advantage only materialises above ~700M parameters, where weight RAM
bandwidth becomes the bottleneck.
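
The shape of this argument follows Amdahl's law: a kernel that accelerates only the matmul fraction of the runtime yields a bounded end-to-end gain. The sketch below is an illustration, not a reproduction of the projection in the main repository. The matmul fractions are placeholder values (this card does not state them), and taking the measured round-1 throughputs of `run_04` (9.1) and `run_02` (75.4) as the starting point is likewise an assumption.

```python
def end_to_end_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only `matmul_fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


TERNARY_TPUT = 9.1  # run_04, round-1 table
FP16_TPUT = 75.4    # run_02, round-1 table

for fraction in (0.7, 0.8, 0.9):   # placeholder matmul fractions
    for kernel in (4.0, 6.0):      # kernel speedup range quoted above
        gain = end_to_end_speedup(fraction, kernel)
        ratio = TERNARY_TPUT * gain / FP16_TPUT
        print(f"f={fraction:.1f}, kernel={kernel:.0f}x -> "
              f"{gain:.2f}x end-to-end, {ratio:.2f}x of FP16")
```

Under these assumptions, matmul fractions of roughly 0.8 to 0.9 land inside the quoted 0.30–0.50× range, while the ternary model stays slower than its FP16 sibling for any kernel speedup in the 4–6× band.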

> **Practical implication.** The byte-level ternary chain (runs 04–10, 14–16,
> and their r2 counterparts) cannot distinguish between architectural
> variants because all of them are stuck at the same numerical floor. The
> architectures themselves may be sound; the training recipe needs to
> either (a) warm-start from FP weights, (b) anneal the quantisation, or
> (c) use a much larger token budget.

Full mechanistic analysis, weight-flip diagnostics and throughput
projections are in the
[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).

## License

Apache-2.0, the same license as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).