Add README: describe compact_2bit source repo + audit summary
README.md
ADDED
@@ -0,0 +1,121 @@
---
language:
- en
license: apache-2.0
tags:
- pytorch
- text-generation
- 1.58bit
- ternary
- byte-level
- mlgru
- ablation
- checkpoints
library_name: pytorch
pipeline_tag: text-generation
model_type: custom
---

# CPU-1 Ablation Study: Source Checkpoints (compact 2-bit)

- Repo: `Cukinator/cpu1-ablation-checkpoints`
- Unpacked: [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final)
- Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
- Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)

This repository stores the **raw training checkpoints** produced by
`train_ablation.py` from the [1.58bits repo](https://github.com/Cukinator/1.58bits).
Each run has its own folder, which holds checkpoints in two formats under three filename patterns:

| Filename pattern | Format | Purpose |
|------------------|--------|---------|
| `<run>/checkpoint_<run>_final.pt` | `compact_2bit` (2-bit packed ternary + bf16 scales) | Final inference checkpoint; minimal size, ~9 MB for a 39M ternary model |
| `<run>/checkpoint_<run>_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 1 intermediate resume points |
| `<run>/checkpoint_<run>_phase2_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 2 intermediate resume points (delete-gate runs only) |

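The filename patterns in the table can be turned into repo-relative paths with a small helper. The sketch below is illustrative only: `checkpoint_filename` and `download_checkpoint` are hypothetical names, not part of the training code, and it assumes that `<N>` is written as a plain, unpadded integer. The download step uses `hf_hub_download` from the `huggingface_hub` package.

```python
from typing import Optional

REPO_ID = "Cukinator/cpu1-ablation-checkpoints"


def checkpoint_filename(run: str, step: Optional[int] = None, phase2: bool = False) -> str:
    """Build a repo-relative checkpoint path from the documented patterns.

    Hypothetical helper: assumes the step number is an unpadded integer.
    """
    if step is None:
        if phase2:
            raise ValueError("phase 2 checkpoints always carry a step number")
        suffix = "final"
    elif phase2:
        suffix = f"phase2_step{step}"
    else:
        suffix = f"step{step}"
    return f"{run}/checkpoint_{run}_{suffix}.pt"


def download_checkpoint(run: str, step: Optional[int] = None, phase2: bool = False) -> str:
    """Fetch one checkpoint into the local Hugging Face cache and return its path."""
    # Imported lazily so the filename helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(run, step, phase2))
```

For example, `checkpoint_filename("run_02")` yields `run_02/checkpoint_run_02_final.pt`, the same path used in the quick start.
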
> If you just want **ready-to-use float32 weights**, use the unpacked mirror
> at [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final). Those are plain
> `.pt` files you can load with `torch.load(...)` and `model.load_state_dict(...)`
> without any unpacking step.

This source repo exists so that (a) training jobs can resume from the latest
step checkpoint after preemption, and (b) the `compact_2bit` format itself
can be inspected and benchmarked.

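Resuming after preemption means picking the most advanced step checkpoint for a run. The following is a minimal sketch of that selection, not the logic used by `train_ablation.py`: `latest_step_checkpoint` is a hypothetical helper, and it assumes that phase 2 always follows phase 1, so any phase 2 checkpoint outranks every phase 1 checkpoint.

```python
import re
from typing import Iterable, Optional


def latest_step_checkpoint(filenames: Iterable[str], run: str) -> Optional[str]:
    """Return the most advanced intermediate checkpoint for `run`, or None."""
    pattern = re.compile(
        rf"^{re.escape(run)}/checkpoint_{re.escape(run)}_(phase2_)?step(\d+)\.pt$"
    )
    best_key, best_name = None, None
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue  # final checkpoints and files of other runs are ignored
        # Assumption: phase 2 comes after phase 1, so sort by (is_phase2, step).
        key = (match.group(1) is not None, int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name
```

The list of filenames could come from `huggingface_hub.list_repo_files("Cukinator/cpu1-ablation-checkpoints")` or from a local directory listing.
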
## Repository contents

22 trained runs from two completed rounds, plus a third round that is queued but not yet uploaded:

| Round | Tokens/param | Runs |
|-------|:------------:|------|
| **r1** (original ablation budget) | 2 | `run_01`, `run_02`, `run_02a_byte_only_heads`, `run_03`, `run_04`, `run_05`, `run_05b_kernel_strict`, `run_06`, `run_07`, `run_08`, `run_09`, `run_10`, `run_13`, `run_14`, `run_15`, `run_16` |
| **r2** (re-run at higher budget) | 15 | `run_04_r2`, `run_07_r2` (partial), `run_13_r2`, `run_14_r2`, `run_15_r2`, `run_16_r2` |
| **r3** (cold-start rescue, queued) | 50 | `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3` *(not yet uploaded)* |

The names and architectures of the runs are defined in `RUN_CONFIGS` / `SMALL_RUN_CONFIGS`
in [`train_ablation.py`](https://github.com/Cukinator/1.58bits/blob/main/train_ablation.py).

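The tokens/param column translates into an absolute token budget once a parameter count is fixed. A rough illustration, assuming the nominal 39M and 10M model sizes quoted elsewhere in this README (actual parameter counts per run may differ):

```python
# Tokens/param per round, taken from the table above.
TOKENS_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}

# Nominal model sizes; assumed round numbers, not exact parameter counts.
NOMINAL_PARAMS = {"39M": 39_000_000, "10M": 10_000_000}


def token_budget(params: int, tokens_per_param: int) -> int:
    """Total training tokens implied by a tokens-per-parameter budget."""
    return params * tokens_per_param


for size, params in NOMINAL_PARAMS.items():
    for rnd, tpp in TOKENS_PER_PARAM.items():
        print(f"{size} {rnd}: {token_budget(params, tpp) / 1e6:.0f}M tokens")
```

Under these assumptions a 39M model sees about 78M tokens in r1 and about 585M in r2.
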
## Quick start (compact_2bit)

Loading a `compact_2bit` checkpoint requires the unpacking helper that
ships with the training code:

```python
import sys

import torch

# Make the training code importable; point this at your local clone of 1.58bits.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate, load_ablation_checkpoint

# Load and unpack the checkpoint; returns the state dict and the run config.
state, config = load_ablation_checkpoint("run_02/checkpoint_run_02_final.pt")

# Rebuild the architecture for this run and load the unpacked weights.
model = build_ablation_model(config)
model.load_state_dict(state, strict=False)
model.eval()

print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))
```

To load the same weights **without** depending on the training code, use
[`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final).

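To give a feel for what "2-bit packed ternary" means, here is a minimal, pure-Python sketch that stores four ternary weights per byte. It is not the actual `compact_2bit` layout: the bit codes, the bit order and the function names are assumptions made for illustration, and the bf16 scales are left out. Use `load_ablation_checkpoint` for real checkpoints.

```python
from typing import List, Sequence

# Hypothetical 2-bit codes; the real compact_2bit encoding may differ.
_ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
_DECODE = {code: value for value, code in _ENCODE.items()}


def pack_ternary(weights: Sequence[int]) -> bytes:
    """Pack ternary weights (-1, 0, +1) four to a byte, lowest bits first."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= _ENCODE[w] << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` ternary weights from a packed buffer."""
    return [_DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


weights = [-1, 0, 1, 1, 0, -1, 0]
packed = pack_ternary(weights)
assert len(packed) == 2  # 7 weights fit in 2 bytes
assert unpack_ternary(packed, len(weights)) == weights
```

The element count has to be stored alongside the buffer, because the last byte may contain unused padding slots.
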
## Final-checkpoint sizes (compact_2bit)

Sizes are measured from the actual `_final.pt` files on disk.

| Run family | Architecture | d_model | Final size |
|------------|-------------|--------:|-----------:|
| `run_01` | Transformer + BPE (16K vocab) + FP16 | 512 | ~210 MB |
| `run_02`, `run_02a`, `run_03` | FP16 byte-level baselines | 512 | ~75 MB |
| `run_04`..`run_10` | 39M ternary chain | 512 | ~9 MB |
| `run_05b_kernel_strict` | MLGRU without W_o | 512 | ~8 MB |
| `run_13` | 10M BPE + ternary (4K vocab) | 320 | ~5 MB |
| `run_14`, `run_15`, `run_16` | 10M byte + ternary variants | 320 | ~3 MB |

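As a sanity check on the measured sizes, the weight payload can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: it assumes the nominal 39M and 10M parameter counts, ignores the bf16 scales, any tensors kept at higher precision and file metadata, and `payload_mb` is a hypothetical helper name.

```python
def payload_mb(num_params: float, bits_per_weight: float) -> float:
    """Weight payload in decimal megabytes, ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e6


print(payload_mb(39e6, 2))   # 9.75 -> compare with ~9 MB for the 39M ternary runs
print(payload_mb(39e6, 16))  # 78.0 -> compare with ~75 MB for the FP16 byte-level baselines
print(payload_mb(10e6, 2))   # 2.5  -> compare with ~3 MB for the 10M byte + ternary runs
```

The estimates land in the same range as the table, which is what one would expect if most of each file is the packed weight payload.
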
## Training results

The full table of `val_loss`, `perplexity`, throughput and architecture per
run is published in the
[README of the unpacked mirror](https://huggingface.co/Cukinator/cpu1-ablations-final).

A summary of the 2026-05 audit:

- **FP16 baselines** (`run_01`, `run_02`, `run_02a`, `run_03`) converge as
  designed: byte + LocalByteDecoder reaches val_loss 1.72, and MLGRU FP16 reaches 1.87.
- **All byte-level ternary runs collapse to `ln(256) ≈ 5.545 nats`**, the
  loss of a uniform prediction over the 256 byte values. This holds across both
  scales (10M and 39M) and both token budgets (2 and 15 tokens/param).
- A 7.5× increase in tokens per parameter (r2) moved the validation loss
  by 0.0001 nats. The bottleneck at this scale is the cold-start dynamics of
  straight-through-estimator ternary training, not the budget.
- An r3 set with four corrections (bf16 AMP, `lr_scale=2.0` on BitLinear,
  CE-only training signal, 50 tok/param) is queued in `RUN_CONFIGS` but
  has not yet been trained.

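The `ln(256)` figure is simply the cross-entropy of a model that assigns probability 1/256 to every byte. A short check with the standard library, which also converts the reported losses into perplexity for comparison:

```python
import math

VOCAB = 256  # one class per byte value

# Cross-entropy of a uniform prediction: -ln(1/256) = ln(256).
uniform_loss_nats = math.log(VOCAB)
uniform_perplexity = math.exp(uniform_loss_nats)
bits_per_byte = uniform_loss_nats / math.log(2)

print(f"{uniform_loss_nats:.3f} nats")     # 5.545 nats
print(f"perplexity {uniform_perplexity:.0f}")  # perplexity 256
print(f"{bits_per_byte:.1f} bits per byte")    # 8.0 bits per byte

# For contrast, the FP16 byte-level baseline at val_loss 1.72:
baseline_perplexity = math.exp(1.72)  # roughly 5.6
```

A run sitting at 5.545 nats therefore carries no more information about the next byte than a uniform guess.
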
Details, mechanistic analysis and throughput projections are documented in
the
[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).

## License

Apache-2.0, the same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).