language:
  - en
license: apache-2.0
tags:
  - pytorch
  - text-generation
  - 1.58bit
  - ternary
  - byte-level
  - mlgru
  - ablation
  - checkpoints
library_name: pytorch
pipeline_tag: text-generation
model_type: custom

CPU-1 Ablation Study — Source Checkpoints (compact 2-bit)

Repo: Cukinator/cpu1-ablation-checkpoints · Unpacked: Cukinator/cpu1-ablations-final · Code: github.com/Cukinator/1.58bits · Dataset: Cukinator/cpu1-ablation-dataset

This repository stores the raw training checkpoints produced by train_ablation.py from the 1.58bits repo. Each run has its own folder, which holds two checkpoint flavours: a compact_2bit final checkpoint and bf16 intermediate resume checkpoints.

| Filename pattern | Format | Purpose |
|---|---|---|
| `<run>/checkpoint_<run>_final.pt` | compact_2bit (2-bit packed ternary + bf16 scales) | Final inference checkpoint; minimal size, ~9 MB for a 39M ternary model |
| `<run>/checkpoint_<run>_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 1 intermediate resume points |
| `<run>/checkpoint_<run>_phase2_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 2 intermediate resume points (delete-gate runs only) |

If you just want ready-to-use float32 weights, use the unpacked mirror at Cukinator/cpu1-ablations-final — those are plain .pt files you can load with torch.load(...) and model.load_state_dict(...) without any unpacking step.
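For context, the unpacking step that the mirror spares you amounts to reversing a 2-bit packing of ternary weights. The following is a minimal sketch of that idea, not the format used by train_ablation.py: the 2-bit code assignment, the bit order and the function names `pack_ternary` / `unpack_ternary` are assumptions made for illustration, and the bf16 scales are left out.

```python
from typing import List, Sequence

# Assumed 2-bit code for each ternary value; the real mapping may differ.
_ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
_DECODE = {code: value for value, code in _ENCODE.items()}


def pack_ternary(values: Sequence[int]) -> bytes:
    """Pack ternary values (-1, 0, +1) four to a byte, lowest bits first."""
    out = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        out[i // 4] |= _ENCODE[v] << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` ternary values from `packed`."""
    return [_DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


weights = [1, -1, 0, 0, 1]
packed = pack_ternary(weights)
print(len(weights), "weights ->", len(packed), "bytes")
print(unpack_ternary(packed, len(weights)) == weights)
```

Four weights per byte is what brings a ternary layer down to roughly a quarter of a byte per parameter; a real loader would additionally multiply the unpacked values by the stored scales.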

This source repo exists so that (a) training jobs can resume from the latest step checkpoint after preemption, and (b) the compact_2bit format itself can be inspected and benchmarked.
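Resuming after preemption comes down to finding the most advanced step checkpoint in a run folder. Here is a minimal sketch based only on the filename patterns listed above; the helper name `latest_step_checkpoint`, the rule that phase 2 checkpoints rank after all phase 1 checkpoints, and the step numbers in the example are assumptions, and train_ablation.py may select its resume point differently.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, Optional


def latest_step_checkpoint(filenames: Iterable[str], run: str) -> Optional[str]:
    """Return the most advanced step checkpoint for `run`, or None if there is none.

    Assumes phase 2 checkpoints are always later than phase 1 checkpoints.
    """
    pattern = re.compile(rf"^checkpoint_{re.escape(run)}_(phase2_)?step(\d+)\.pt$")
    best_key, best_name = None, None
    for name in filenames:
        m = pattern.match(PurePosixPath(name).name)
        if m is None:
            continue  # skips *_final.pt and checkpoints that belong to other runs
        key = (1 if m.group(1) else 0, int(m.group(2)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


# Hypothetical listing; real step numbers depend on the run.
files = [
    "run_04/checkpoint_run_04_final.pt",
    "run_04/checkpoint_run_04_step500.pt",
    "run_04/checkpoint_run_04_step1000.pt",
    "run_04/checkpoint_run_04_phase2_step200.pt",
    "run_04_r2/checkpoint_run_04_r2_step9000.pt",
]
print(latest_step_checkpoint(files, "run_04"))
```

Escaping the run name in the pattern keeps run_04 from accidentally matching files that belong to run_04_r2.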

Repository contents

22 trained runs across rounds r1 and r2, plus a queued third round (r3):

| Round | Tokens/param | Runs |
|---|---|---|
| r1: original ablation budget | 2 | run_01, run_02, run_02a_byte_only_heads, run_03, run_04, run_05, run_05b_kernel_strict, run_06, run_07, run_08, run_09, run_10, run_13, run_14, run_15, run_16 |
| r2: re-run at higher budget | 15 | run_04_r2, run_07_r2 (partial), run_13_r2, run_14_r2, run_15_r2, run_16_r2 |
| r3: cold-start rescue (queued) | 50 | run_04_r3, run_07_r3, run_14_r3, run_15_r3 (not yet uploaded) |

The naming and architecture of each run is defined in RUN_CONFIGS / SMALL_RUN_CONFIGS in train_ablation.py.
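The tokens/param figures for the three rounds translate into absolute token budgets once a model size is fixed. The sketch below does that arithmetic for the nominal 39M and 10M sizes quoted elsewhere in this README; the exact parameter counts, and therefore the exact budgets used by train_ablation.py, may differ, and `token_budget` is a hypothetical helper.

```python
# Nominal model sizes (assumed round numbers, not exact parameter counts).
PARAMS = {"39M": 39_000_000, "10M": 10_000_000}
# Tokens per parameter for each round.
TOKENS_PER_PARAM = {"r1": 2, "r2": 15, "r3": 50}


def token_budget(n_params: int, tokens_per_param: int) -> int:
    """Total training tokens implied by a tokens-per-parameter budget."""
    return n_params * tokens_per_param


for round_name, tpp in TOKENS_PER_PARAM.items():
    for size, n_params in PARAMS.items():
        tokens = token_budget(n_params, tpp)
        print(f"{round_name} {size}: {tokens / 1e6:,.0f}M tokens")
```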

Quick start (compact_2bit)

Loading a compact_2bit checkpoint requires the unpacking helper that ships with the training code:

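```python
import sys

import torch

# Make the training code importable; point this at your clone of the 1.58bits repo.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import load_ablation_checkpoint, build_ablation_model, generate

# Load the checkpoint into a state dict plus the run config.
state, config = load_ablation_checkpoint(
    "run_02/checkpoint_run_02_final.pt"
)
# Rebuild the architecture from the config, then load the weights.
model = build_ablation_model(config)
model.load_state_dict(state, strict=False)
model.eval()

# Generate a continuation of the prompt on CPU.
print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))
```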

To load the same weights without depending on the training code, use the unpacked mirror at Cukinator/cpu1-ablations-final.

Final-checkpoint sizes (compact_2bit)

Sizes are measured from the actual _final.pt files on disk.

| Run family | Architecture | d_model | Final size |
|---|---|---|---|
| run_01 | Transformer + BPE (16K vocab) + FP16 | 512 | ~210 MB |
| run_02, run_02a, run_03 | FP16 byte-level baselines | 512 | ~75 MB |
| run_04..run_10 | 39M ternary chain | 512 | ~9 MB |
| run_05b_kernel_strict | MLGRU without W_o | 512 | ~8 MB |
| run_13 | 10M BPE + ternary (4K vocab) | 320 | ~5 MB |
| run_14, run_15, run_16 | 10M byte + ternary variants | 320 | ~3 MB |
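As a rough sanity check on these sizes, the payload of a checkpoint can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch only: it uses the nominal 39M and 10M parameter counts, assumes the FP16 baselines are of similar size to the 39M ternary models, and ignores scales, embeddings, any layers kept at higher precision and file metadata. `payload_size_mb` is a hypothetical helper.

```python
def payload_size_mb(n_params: int, bits_per_param: int) -> float:
    """Weight payload in megabytes (1 MB = 10**6 bytes) for a given bit width."""
    return n_params * bits_per_param / 8 / 1e6


estimates = {
    "39M ternary, 2 bits/weight": payload_size_mb(39_000_000, 2),
    "39M FP16, 16 bits/weight": payload_size_mb(39_000_000, 16),
    "10M ternary, 2 bits/weight": payload_size_mb(10_000_000, 2),
}
for label, size in estimates.items():
    print(f"{label}: {size:.2f} MB")
```

The estimates land in the same range as the measured ~9 MB, ~75 MB and ~3 MB entries; exact agreement is not expected given the simplifications above.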

Training results

The full table of val_loss, perplexity, throughput and architecture per run is published in the README of the unpacked mirror.

A summary of the 2026-05 audit:

  • FP16 baselines (run_01, run_02, run_02a, run_03) converge as designed: byte + LocalByteDecoder reaches val_loss 1.72, MLGRU FP16 reaches 1.87.
  • All byte-level ternary runs collapse to ln(256) ≈ 5.545 nats, the cross-entropy of a uniform prediction over the 256 byte values (chance level). This holds across both scales (10M and 39M) and both token budgets (2 tok/p and 15 tok/p).
  • A 7.5× increase in tokens-per-parameter (r2) moved the validation loss by 0.0001 nats. The cold-start dynamics of straight-through-estimator ternary training, not the budget, are the bottleneck at this scale.
  • An r3 set with four corrections (bf16 AMP, lr_scale=2.0 on BitLinear, CE-only training signal, 50 tok/param) is queued in RUN_CONFIGS but has not yet been trained.
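The numbers quoted in the audit summary can be reproduced with a few lines of arithmetic. The sketch below computes the loss of a model that predicts every byte with equal probability, the matching perplexity and bits per byte, and the budget ratio between r2 and r1; the function name `uniform_cross_entropy` is chosen here for illustration.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform prediction over `vocab_size` symbols."""
    return math.log(vocab_size)


collapse_loss = uniform_cross_entropy(256)       # byte-level vocabulary
collapse_perplexity = math.exp(collapse_loss)    # equals the vocabulary size
collapse_bits_per_byte = collapse_loss / math.log(2)
budget_ratio = 15 / 2                            # r2 tokens/param over r1 tokens/param

print(f"uniform loss:   {collapse_loss:.3f} nats")
print(f"perplexity:     {collapse_perplexity:.1f}")
print(f"bits per byte:  {collapse_bits_per_byte:.2f}")
print(f"budget ratio:   {budget_ratio}x")
```

A validation loss sitting at this value means the model's predictions carry no more information than a uniform guess, which is why the collapsed runs are read as not having learned anything.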

Details, mechanistic analysis and throughput projections are documented in the main repository README.

License

Apache-2.0, the same license as the source code at github.com/Cukinator/1.58bits.