language:
- en
license: apache-2.0
tags:
- pytorch
- text-generation
- 1.58bit
- ternary
- byte-level
- mlgru
- ablation
- checkpoints
library_name: pytorch
pipeline_tag: text-generation
model_type: custom
CPU-1 Ablation Study — Source Checkpoints (compact 2-bit)
Repo: Cukinator/cpu1-ablation-checkpoints
Unpacked: Cukinator/cpu1-ablations-final
Code: github.com/Cukinator/1.58bits
Dataset: Cukinator/cpu1-ablation-dataset
This repository stores the raw training checkpoints produced by
train_ablation.py from the 1.58bits repo.
There are two checkpoint flavours, each saved inside the run's own folder:
| Filename pattern | Format | Purpose |
|---|---|---|
| `<run>/checkpoint_<run>_final.pt` | compact_2bit (2-bit packed ternary + bf16 scales) | Final inference checkpoint; minimal size, ~9 MB for a 39M ternary model |
| `<run>/checkpoint_<run>_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 1 intermediate resume points |
| `<run>/checkpoint_<run>_phase2_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 2 intermediate resume points (delete-gate runs only) |
If you just want ready-to-use float32 weights, use the unpacked mirror at
Cukinator/cpu1-ablations-final: those are plain `.pt` files you can load with `torch.load(...)` and `model.load_state_dict(...)` without any unpacking step.
This source repo exists so that (a) training jobs can resume from the latest step checkpoint after preemption, and (b) the compact_2bit format itself can be inspected and benchmarked.
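Resuming after preemption means finding the most recent resume checkpoint in a run folder. The sketch below does this from the filename patterns in the table alone. `latest_step_checkpoint` is a hypothetical helper, not part of `train_ablation.py`, and it assumes that any Phase 2 checkpoint is later than every Phase 1 checkpoint of the same run.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the resume-checkpoint names listed in the table, e.g.
#   checkpoint_run_04_step2000.pt  or  checkpoint_run_04_phase2_step500.pt
_STEP_RE = re.compile(
    r"^checkpoint_(?P<run>.+?)_(?P<phase2>phase2_)?step(?P<step>\d+)\.pt$"
)


def latest_step_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the most advanced resume checkpoint in run_dir, or None."""
    best_key, best_path = None, None
    for path in run_dir.glob("checkpoint_*_step*.pt"):
        match = _STEP_RE.match(path.name)
        if match is None:
            continue
        # Assumption: Phase 2 follows Phase 1, so it sorts after any Phase 1 step.
        phase = 1 if match.group("phase2") else 0
        # Compare step numbers as integers, not as strings.
        key = (phase, int(match.group("step")))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

The `_final.pt` file is deliberately ignored here, since it holds the compact_2bit inference weights rather than a resumable model and optimizer state.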
Repository contents
22 trained runs, organised in three rounds:
| Round | Tokens/param | Runs |
|---|---|---|
| r1 — original ablation budget | 2 | run_01, run_02, run_02a_byte_only_heads, run_03, run_04, run_05, run_05b_kernel_strict, run_06, run_07, run_08, run_09, run_10, run_13, run_14, run_15, run_16 |
| r2 — re-run at higher budget | 15 | run_04_r2, run_07_r2 (partial), run_13_r2, run_14_r2, run_15_r2, run_16_r2 |
| r3 — cold-start rescue (queued) | 50 | run_04_r3, run_07_r3, run_14_r3, run_15_r3 (not yet uploaded) |
The naming and architecture of each run are defined in RUN_CONFIGS / SMALL_RUN_CONFIGS
in train_ablation.py.
Quick start (compact_2bit)
Loading a compact_2bit checkpoint requires the unpacking helper that ships with the training code:
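```python
import sys

import torch

# Make the training code importable; adjust to your local clone of 1.58bits.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import load_ablation_checkpoint, build_ablation_model, generate

# Unpack the compact_2bit checkpoint into a state dict and its run config.
state, config = load_ablation_checkpoint(
    "run_02/checkpoint_run_02_final.pt"
)

model = build_ablation_model(config)
# strict=False tolerates keys that differ between the checkpoint and the model.
model.load_state_dict(state, strict=False)
model.eval()

print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))
```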
For the same checkpoint without an external dependency, use
Cukinator/cpu1-ablations-final.
Final-checkpoint sizes (compact_2bit)
Sizes are measured from the actual _final.pt files on disk.
| Run family | Architecture | d_model | Final size |
|---|---|---|---|
| `run_01` | Transformer + BPE (16K vocab) + FP16 | 512 | ~210 MB |
| `run_02`, `run_02a`, `run_03` | FP16 byte-level baselines | 512 | ~75 MB |
| `run_04`..`run_10` | 39M ternary chain | 512 | ~9 MB |
| `run_05b_kernel_strict` | MLGRU without W_o | 512 | ~8 MB |
| `run_13` | 10M BPE + ternary (4K vocab) | 320 | ~5 MB |
| `run_14`, `run_15`, `run_16` | 10M byte + ternary variants | 320 | ~3 MB |
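The ternary sizes above are roughly what 2 bits per weight would predict. The sketch below shows one way to pack ternary weights four to a byte, and the resulting size arithmetic. The bit layout, the function names, and the choice to ignore bf16 scales and any non-ternary tensors are all assumptions made for illustration; the actual compact_2bit layout is defined by `train_ablation.py` and may differ.

```python
def pack_ternary(values):
    """Pack ternary weights (-1, 0, +1) into bytes, four weights per byte.

    Illustrative layout only: weight i occupies bits 2*(i % 4) and
    2*(i % 4) + 1 of byte i // 4, stored as the code value + 1.
    """
    packed = bytearray((len(values) + 3) // 4)
    for i, value in enumerate(values):
        if value not in (-1, 0, 1):
            raise ValueError(f"not a ternary weight: {value!r}")
        code = value + 1  # map -1, 0, +1 to 0, 1, 2
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, count):
    """Inverse of pack_ternary; count is the number of weights to recover."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


def packed_size_mb(n_params, bits_per_weight=2):
    """Size in MB (10^6 bytes) of the packed weights alone."""
    return n_params * bits_per_weight / 8 / 1e6


weights = [-1, 0, 1, 1, 0, -1, 0]
assert unpack_ternary(pack_ternary(weights), len(weights)) == weights

print(packed_size_mb(39_000_000))      # 2-bit weights for a 39M model
print(packed_size_mb(39_000_000, 16))  # the same model at 16 bits per weight
```

Under these assumptions a 39M-parameter model needs about 9.75 MB at 2 bits per weight and about 78 MB at 16 bits, which is the same order as the ~9 MB and ~75 MB rows in the table.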
Training results
The full table of val_loss, perplexity, throughput and architecture per
run is published in the
README of the unpacked mirror.
A summary of the 2026-05 audit:
- FP16 baselines (`run_01`, `run_02`, `run_02a`, `run_03`) converge as designed: byte + LocalByteDecoder reaches val_loss 1.72, MLGRU FP16 reaches 1.87.
- All byte-level ternary runs collapse to `ln(256) ≈ 5.545 nats`, the uniform-output entropy floor. This holds across both scales (10M and 39M) and both token budgets (2 tok/p and 15 tok/p).
- A 7.5× increase in tokens-per-parameter (r2) moved the validation loss by 0.0001 nats. The cold-start dynamics of straight-through-estimator ternary training, not the budget, are the bottleneck at this scale.
- An r3 set with four corrections (bf16 AMP, `lr_scale=2.0` on BitLinear, CE-only training signal, 50 tok/param) is queued in `RUN_CONFIGS` but has not yet been trained.
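The two numbers quoted in the audit can be checked by hand. The sketch below computes the cross-entropy of a model that assigns equal probability to all 256 byte values, and the ratio between the r2 and r1 token budgets. The variable names are illustrative only.

```python
import math

VOCAB_SIZE = 256  # byte-level vocabulary

# Cross-entropy (in nats) of a uniform prediction over all byte values.
uniform_loss_nats = math.log(VOCAB_SIZE)

# The same quantity expressed as perplexity and as bits per byte.
uniform_perplexity = math.exp(uniform_loss_nats)
uniform_bits_per_byte = uniform_loss_nats / math.log(2)

# Ratio of the r2 budget (15 tokens/param) to the r1 budget (2 tokens/param).
budget_ratio = 15 / 2

print(f"uniform loss:  {uniform_loss_nats:.3f} nats")
print(f"perplexity:    {uniform_perplexity:.1f}")
print(f"bits per byte: {uniform_bits_per_byte:.1f}")
print(f"budget ratio:  {budget_ratio}x")
```

A validation loss that sits at this value means the model's predictions are no better than guessing every byte with equal probability.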
Details, mechanistic analysis and throughput projections are documented in the main repository README.
License
Apache-2.0. Same as the source code at github.com/Cukinator/1.58bits.