---
language:
- en
license: apache-2.0
tags:
- pytorch
- text-generation
- 1.58bit
- ternary
- byte-level
- mlgru
- ablation
- checkpoints
library_name: pytorch
pipeline_tag: text-generation
model_type: custom
---

# CPU-1 Ablation Study: Source Checkpoints (compact 2-bit)
|
- Repo: `Cukinator/cpu1-ablation-checkpoints`
- Unpacked: [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final)
- Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
- Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)
|
This repository stores the **raw training checkpoints** produced by
`train_ablation.py` from the [1.58bits repo](https://github.com/Cukinator/1.58bits).
Each run has its own folder holding two checkpoint flavours: a compact final
checkpoint and full-state resume checkpoints.
|
| Filename pattern | Format | Purpose |
|------------------|--------|---------|
| `<run>/checkpoint_<run>_final.pt` | `compact_2bit` (2-bit packed ternary + bf16 scales) | Final inference checkpoint; minimal size, ~9 MB for a 39M ternary model |
| `<run>/checkpoint_<run>_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 1 intermediate resume points |
| `<run>/checkpoint_<run>_phase2_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 2 intermediate resume points (delete-gate runs only) |
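
The exact on-disk layout of `compact_2bit` is defined by the training code and is not reproduced here. As a rough illustration of why 2-bit packing stores about four ternary weights per byte, the sketch below packs values from {-1, 0, +1} into bytes and unpacks them again. The bit encoding and the helper names `pack_ternary` and `unpack_ternary` are assumptions made for this example, not the repository's format.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four weights per byte.

    Assumed encoding (illustrative only): -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.
    """
    packed = bytearray()
    for start in range(0, len(weights), 4):
        byte = 0
        for offset, w in enumerate(weights[start:start + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            byte |= (w + 1) << (2 * offset)  # 2 bits per weight, low bits first
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Inverse of pack_ternary; `count` trims the padding in the last byte."""
    weights = []
    for byte in packed:
        for offset in range(4):
            weights.append(((byte >> (2 * offset)) & 0b11) - 1)
    return weights[:count]


weights = [1, 0, -1, 1, 0, 0, -1]
packed = pack_ternary(weights)           # 7 weights fit in 2 bytes
restored = unpack_ternary(packed, len(weights))
```

A real format would also need per-tensor shapes and the bf16 scales mentioned in the table; those are left out of this sketch.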
|
> If you just want **ready-to-use float32 weights**, use the unpacked mirror
> at [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final). Those are plain
> `.pt` files you can load with `torch.load(...)` and `model.load_state_dict(...)`
> without any unpacking step.
|
This source repo exists so that (a) training jobs can resume from the latest
step checkpoint after preemption, and (b) the `compact_2bit` format itself
can be inspected and benchmarked.

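
Resuming after preemption means picking the newest step checkpoint in a run folder. The sketch below does that from the filename patterns in the table above. The helper name `latest_step_checkpoint`, the step numbers in the demo, and the rule that any Phase 2 checkpoint is newer than every Phase 1 checkpoint are assumptions for illustration; the training script may select checkpoints differently.

```python
import re
import tempfile
from pathlib import Path

# Matches e.g. checkpoint_run_04_step2000.pt or checkpoint_run_04_phase2_step500.pt
STEP_RE = re.compile(
    r"^checkpoint_(?P<run>.+?)(?P<phase2>_phase2)?_step(?P<step>\d+)\.pt$"
)


def latest_step_checkpoint(run_dir):
    """Return the newest step checkpoint in run_dir, or None if there is none.

    Assumption: Phase 2 follows Phase 1, so phase2 checkpoints sort last.
    """
    best_key, best_path = None, None
    for path in Path(run_dir).glob("checkpoint_*_step*.pt"):
        match = STEP_RE.match(path.name)
        if match is None:
            continue
        key = (match["phase2"] is not None, int(match["step"]))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path


# Demo with empty placeholder files (hypothetical step numbers).
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "checkpoint_run_04_step1000.pt",
        "checkpoint_run_04_step2000.pt",
        "checkpoint_run_04_phase2_step500.pt",
        "checkpoint_run_04_final.pt",
    ):
        (Path(tmp) / name).touch()
    resume_name = latest_step_checkpoint(tmp).name
```

The `_final.pt` file is skipped on purpose: it holds packed inference weights without optimizer state, so it is not a resume point.
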
## Repository contents

22 trained runs from rounds r1 and r2, plus a queued third round (r3):

| Round | Tokens/param | Runs |
|-------|:------------:|------|
| **r1**: original ablation budget | 2 | `run_01`, `run_02`, `run_02a_byte_only_heads`, `run_03`, `run_04`, `run_05`, `run_05b_kernel_strict`, `run_06`, `run_07`, `run_08`, `run_09`, `run_10`, `run_13`, `run_14`, `run_15`, `run_16` |
| **r2**: re-run at higher budget | 15 | `run_04_r2`, `run_07_r2` (partial), `run_13_r2`, `run_14_r2`, `run_15_r2`, `run_16_r2` |
| **r3**: cold-start rescue (queued) | 50 | `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3` *(not yet uploaded)* |
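
When iterating over run folders it can help to recover the round, and with it the token budget, from the folder name. The sketch below relies on the naming visible in the table above: r2 and r3 runs carry an `_r2` or `_r3` suffix, and runs without a suffix belong to r1. The helper name `round_of` is hypothetical.

```python
import re

# Tokens-per-parameter budget per round, copied from the table above.
TOKENS_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}


def round_of(run_name):
    """Return 'r1', 'r2' or 'r3' based on the run-name suffix.

    Assumption: a run without an `_r2` / `_r3` suffix belongs to round r1.
    """
    match = re.search(r"_(r[23])$", run_name)
    return match.group(1) if match else "r1"


budget = TOKENS_PER_PARAM[round_of("run_04_r2")]  # tokens per parameter for r2
```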
|
The naming and architecture of each run are defined in `RUN_CONFIGS` / `SMALL_RUN_CONFIGS`
in [`train_ablation.py`](https://github.com/Cukinator/1.58bits/blob/main/train_ablation.py).
|
## Quick start (compact_2bit)

Loading a `compact_2bit` checkpoint requires the unpacking helper that
ships with the training code:
|
```python
import sys

import torch

# Make the cloned 1.58bits repo importable (adjust the path to your checkout).
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import load_ablation_checkpoint, build_ablation_model, generate

# Load the final checkpoint: returns a state dict and the run config.
state, config = load_ablation_checkpoint(
    "run_02/checkpoint_run_02_final.pt"
)
model = build_ablation_model(config)
model.load_state_dict(state, strict=False)
model.eval()

# Generate text from a prompt on CPU.
print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))
```
|
To load the same weights **without** depending on the training code, use
[`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final).
|
## Final-checkpoint sizes (compact_2bit)

Sizes are measured from the actual `_final.pt` files on disk.
|
| Run family | Architecture | d_model | Final size |
|------------|-------------|--------:|-----------:|
| `run_01` | Transformer + BPE (16K vocab) + FP16 | 512 | ~210 MB |
| `run_02`, `run_02a`, `run_03` | FP16 byte-level baselines | 512 | ~75 MB |
| `run_04`..`run_10` | 39M ternary chain | 512 | ~9 MB |
| `run_05b_kernel_strict` | MLGRU without W_o | 512 | ~8 MB |
| `run_13` | 10M BPE + ternary (4K vocab) | 320 | ~5 MB |
| `run_14`, `run_15`, `run_16` | 10M byte + ternary variants | 320 | ~3 MB |
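
As a sanity check on these figures, a back-of-the-envelope estimate from parameter count and bits per weight lands in the same range. This is only a sketch: it ignores the bf16 scales, any layers stored unpacked, and file metadata, and it assumes the FP16 byte-level baselines have roughly the same 39M parameters as the ternary chain, which the table does not state.

```python
def weights_size_mb(n_params, bits_per_weight):
    """Size of the weights alone in MB (10**6 bytes), ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e6


ternary_39m = weights_size_mb(39e6, 2)   # 9.75 MB, table reports ~9 MB
fp16_39m = weights_size_mb(39e6, 16)     # 78.0 MB, table reports ~75 MB (assumed 39M params)
ternary_10m = weights_size_mb(10e6, 2)   # 2.5 MB, table reports ~3 MB
```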
|
## Training results

The full table of `val_loss`, `perplexity`, throughput and architecture per
run is published in the
[README of the unpacked mirror](https://huggingface.co/Cukinator/cpu1-ablations-final).
|
A summary of the 2026-05 audit:
|
- **FP16 baselines** (`run_01`, `run_02`, `run_02a`, `run_03`) converge as
  designed: byte + LocalByteDecoder reaches val_loss 1.72, MLGRU FP16 reaches 1.87.
- **All byte-level ternary runs collapse to `ln(256) ≈ 5.545 nats`**, the
  loss of a uniform output distribution over the 256 byte values. This holds
  across both scales (10M and 39M) and both token budgets (2 and 15 tokens/param).
- The 7.5× increase in tokens per parameter from r1 to r2 moved the validation
  loss by 0.0001 nats. The bottleneck at this scale is the cold-start dynamics
  of straight-through-estimator ternary training, not the budget.
- An r3 set with four corrections (bf16 AMP, `lr_scale=2.0` on BitLinear,
  CE-only training signal, 50 tok/param) is queued in `RUN_CONFIGS` but
  has not yet been trained.
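
The collapse value quoted above can be reproduced with a few lines of arithmetic: a model that assigns equal probability to all 256 byte values has a cross-entropy of `ln(256)` nats per byte, which is 8 bits per byte and a perplexity of 256. The variable names are only for this illustration.

```python
import math

uniform_loss = math.log(256)                  # about 5.545 nats per byte
bits_per_byte = uniform_loss / math.log(2)    # 8 bits: nothing learned about the byte
perplexity = math.exp(uniform_loss)           # 256 equally likely choices

budget_ratio = 15 / 2                         # r2 vs r1 tokens per parameter: 7.5
best_fp16_bits = 1.72 / math.log(2)           # best FP16 baseline, about 2.48 bits per byte
```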
|
Details, mechanistic analysis and throughput projections are documented in
the
[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).
|
## License

Apache-2.0, the same license as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).
|