language:
  - en
license: apache-2.0
tags:
  - pytorch
  - text-generation
  - 1.58bit
  - ternary
  - byte-level
  - mlgru
  - ablation
  - checkpoints
library_name: pytorch
pipeline_tag: text-generation
model_type: custom

CPU-1 Ablation Study — Source Checkpoints (compact 2-bit)

Repo: Cukinator/cpu1-ablation-checkpoints
Unpacked: Cukinator/cpu1-ablations-final
Code: github.com/Cukinator/1.58bits
Dataset: Cukinator/cpu1-ablation-dataset

This repository stores the raw training checkpoints produced by train_ablation.py from the 1.58bits repo. Each run has its own folder, which holds two checkpoint flavours: a compact_2bit final checkpoint and bf16 resume checkpoints.

Filename pattern	Format	Purpose
`<run>/checkpoint_<run>_final.pt`	`compact_2bit` (2-bit packed ternary + bf16 scales)	Final inference checkpoint — minimal size, ~9 MB for a 39M ternary model
`<run>/checkpoint_<run>_step<N>.pt`	bf16 model + bf16 optimizer state	Phase 1 intermediate resume points
`<run>/checkpoint_<run>_phase2_step<N>.pt`	bf16 model + bf16 optimizer state	Phase 2 intermediate resume points (delete-gate runs only)
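
The filename patterns above are enough to locate the newest resume point for a run by parsing the step number. The following is a minimal sketch, assuming the patterns in the table are exact. `latest_step_checkpoint` is a hypothetical helper and not part of train_ablation.py, and the step numbers in the usage note are made up for illustration.

```python
import re
from pathlib import Path
from typing import Iterable, Optional


def latest_step_checkpoint(
    run: str, filenames: Iterable[str], phase2: bool = False
) -> Optional[str]:
    """Return the filename with the highest step number for `run`, or None.

    Hypothetical helper: it relies only on the naming patterns
    checkpoint_<run>_step<N>.pt and checkpoint_<run>_phase2_step<N>.pt.
    """
    infix = "phase2_step" if phase2 else "step"
    pattern = re.compile(rf"checkpoint_{re.escape(run)}_{infix}(\d+)\.pt")
    best_step, best_name = -1, None
    for name in filenames:
        # Match on the basename so "<run>/..." folder prefixes are ignored.
        match = pattern.fullmatch(Path(name).name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name
```

Given a listing of a run folder, `latest_step_checkpoint("run_04", files)` picks the highest Phase 1 step, and `phase2=True` restricts the search to Phase 2 checkpoints. The full-match on the basename keeps `run_04` from picking up `run_04_r2` files.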

If you just want ready-to-use float32 weights, use the unpacked mirror at Cukinator/cpu1-ablations-final — those are plain .pt files you can load with torch.load(...) and model.load_state_dict(...) without any unpacking step.

This source repo exists so that (a) training jobs can resume from the latest step checkpoint after preemption, and (b) the compact_2bit format itself can be inspected and benchmarked.
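
To make the 2-bit idea concrete, the sketch below packs four ternary weights into one byte and unpacks them again. The real byte layout, code assignment, and bf16 scale handling of compact_2bit are defined in train_ablation.py and may differ. The encoding used here (-1 → 0, 0 → 1, +1 → 2, lowest bits first) and both function names are assumptions chosen purely for illustration.

```python
from typing import List


def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary values (-1, 0, +1) four per byte, 2 bits each."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary (-1, 0, or +1)")
    out = bytearray((len(weights) + 3) // 4)  # round up to whole bytes
    for i, w in enumerate(weights):
        code = w + 1  # assumed mapping: -1 -> 0, 0 -> 1, +1 -> 2
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_ternary; `count` is the number of original weights."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]
```

Because the last byte may be only partly filled, the unpacker needs the original weight count; a real format would typically store the tensor shape alongside the packed data.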

Repository contents

22 trained runs across two rounds (16 in r1, 6 in r2), plus a queued third round that has not been trained yet:

Round	Tokens/param	Runs
r1 — original ablation budget	2	`run_01`, `run_02`, `run_02a_byte_only_heads`, `run_03`, `run_04`, `run_05`, `run_05b_kernel_strict`, `run_06`, `run_07`, `run_08`, `run_09`, `run_10`, `run_13`, `run_14`, `run_15`, `run_16`
r2 — re-run at higher budget	15	`run_04_r2`, `run_07_r2` (partial), `run_13_r2`, `run_14_r2`, `run_15_r2`, `run_16_r2`
r3 — cold-start rescue (queued)	50	`run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3` (not yet uploaded)

The naming and architecture of each run is defined in RUN_CONFIGS / SMALL_RUN_CONFIGS in train_ablation.py.

Quick start (compact_2bit)

Loading a compact_2bit checkpoint requires the unpacking helper that ships with the training code:

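import sys

import torch

# Make the training code importable; point this at your clone of the 1.58bits repo.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import load_ablation_checkpoint, build_ablation_model, generate

# Load the compact_2bit checkpoint; returns the state dict and the run config.
state, config = load_ablation_checkpoint(
    "run_02/checkpoint_run_02_final.pt"
)

# Rebuild the architecture from the config, then load the weights.
# strict=False makes load_state_dict ignore missing or unexpected keys.
model = build_ablation_model(config)
model.load_state_dict(state, strict=False)
model.eval()

# Generate a continuation of the prompt on CPU.
print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))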

To load the same weights without depending on the training code, use the unpacked mirror at Cukinator/cpu1-ablations-final.
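
The quick start assumes the checkpoint already exists at a local path. One way to fetch it is `hf_hub_download` from the `huggingface_hub` package. This is a sketch that assumes the repository is a regular model repo and that the final checkpoint follows the filename pattern described earlier. `REPO_ID`, `final_checkpoint_filename`, and `download_final_checkpoint` are illustrative names, not part of the training code.

```python
REPO_ID = "Cukinator/cpu1-ablation-checkpoints"


def final_checkpoint_filename(run: str) -> str:
    """Build the repo-relative path of a run's final checkpoint."""
    return f"{run}/checkpoint_{run}_final.pt"


def download_final_checkpoint(run: str) -> str:
    """Download the final checkpoint for `run` and return its local cache path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=final_checkpoint_filename(run))
```

The returned path can be passed to `load_ablation_checkpoint` in place of the relative path used in the quick start.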

Final-checkpoint sizes (compact_2bit)

Sizes are measured from the actual _final.pt files on disk.

Run family	Architecture	d_model	Final size
`run_01`	Transformer + BPE (16K vocab) + FP16	512	~210 MB
`run_02`, `run_02a`, `run_03`	FP16 byte-level baselines	512	~75 MB
`run_04`..`run_10`	39M ternary chain	512	~9 MB
`run_05b_kernel_strict`	MLGRU without W_o	512	~8 MB
`run_13`	10M BPE + ternary (4K vocab)	320	~5 MB
`run_14`, `run_15`, `run_16`	10M byte + ternary variants	320	~3 MB
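
As a sanity check on the ternary rows, the packed size can be estimated from the parameter count alone. This is a rough sketch: it treats every parameter as ternary at 2 bits, uses the nominal 39M and 10M counts, and ignores bf16 scales, embeddings, any non-ternary layers, and file metadata, so it will not match the measured sizes exactly. `packed_ternary_bytes` is a hypothetical helper.

```python
def packed_ternary_bytes(n_params: int) -> int:
    """Bytes needed to store n_params ternary weights at 2 bits each (rounded up)."""
    return (n_params * 2 + 7) // 8


for label, n_params in [("39M", 39_000_000), ("10M", 10_000_000)]:
    size_bytes = packed_ternary_bytes(n_params)
    # Report both decimal megabytes and binary mebibytes, since "MB" is ambiguous.
    print(f"{label}: {size_bytes / 1e6:.2f} MB ({size_bytes / 2**20:.2f} MiB)")
```

The estimates land in the same range as the measured ~9 MB and ~3 MB entries; the larger `run_13` file is consistent with the extra weight of a BPE vocabulary, though that breakdown is not stated in this README.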

Training results

The full table of val_loss, perplexity, throughput and architecture per run is published in the README of the unpacked mirror.

A summary of the 2026-05 audit:

FP16 baselines (run_01, run_02, run_02a, run_03) converge as designed: byte + LocalByteDecoder reaches val_loss 1.72, MLGRU FP16 reaches 1.87.
All byte-level ternary runs collapse to ln(256) ≈ 5.545 nats, the cross-entropy of a uniform prediction over the 256 byte values, so their output is no better than a uniform guess. This holds across both scales (10M and 39M) and both token budgets (2 tok/param and 15 tok/param).
A 7.5× increase in tokens-per-parameter (r2) moved the validation loss by 0.0001 nats. The cold-start dynamics of straight-through-estimator ternary training, not the budget, are the bottleneck at this scale.
An r3 set with four corrections (bf16 AMP, lr_scale=2.0 on BitLinear, CE-only training signal, 50 tok/param) is queued in RUN_CONFIGS but has not yet been trained.
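
The reference numbers in the audit summary can be reproduced with a few lines of arithmetic. The sketch below derives the uniform-prediction loss for a 256-symbol byte vocabulary, the r1 to r2 budget ratio, and converts a loss in nats to bits per byte and perplexity. The helper names are illustrative, and the loss values are the ones quoted above rather than recomputed from checkpoints.

```python
import math

# Cross-entropy (in nats) of a uniform prediction over the 256 byte values.
uniform_loss = math.log(256)

# Tokens-per-parameter ratio between round r2 (15) and round r1 (2).
budget_ratio = 15 / 2


def nats_to_bits(loss_nats: float) -> float:
    """Convert a per-byte loss from nats to bits."""
    return loss_nats / math.log(2)


def perplexity(loss_nats: float) -> float:
    """Perplexity corresponding to a cross-entropy loss in nats."""
    return math.exp(loss_nats)


print(f"uniform loss: {uniform_loss:.3f} nats = {nats_to_bits(uniform_loss):.1f} bits/byte")
print(f"budget ratio r2/r1: {budget_ratio}x")
for name, loss in [("byte + LocalByteDecoder FP16", 1.72), ("MLGRU FP16", 1.87)]:
    print(f"{name}: {nats_to_bits(loss):.2f} bits/byte, perplexity {perplexity(loss):.2f}")
```

A loss of ln(256) corresponds to exactly 8 bits per byte and a perplexity of 256, which is what a model scores when it spreads probability evenly over all byte values.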

Details, mechanistic analysis and throughput projections are documented in the main repository README.

License

Apache-2.0. Same as the source code at github.com/Cukinator/1.58bits.