	---
	language:
	- en
	license: apache-2.0
	tags:
	- pytorch
	- text-generation
	- 1.58bit
	- ternary
	- byte-level
	- mlgru
	- ablation
	- checkpoints
	library_name: pytorch
	pipeline_tag: text-generation
	model_type: custom
	---

	# CPU-1 Ablation Study — Source Checkpoints (compact 2-bit)

	Repo: `Cukinator/cpu1-ablation-checkpoints`
	Unpacked: [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final)
	Code: [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits)
	Dataset: [`Cukinator/cpu1-ablation-dataset`](https://huggingface.co/datasets/Cukinator/cpu1-ablation-dataset)

	This repository stores the raw training checkpoints produced by
	`train_ablation.py` from the [1.58bits repo](https://github.com/Cukinator/1.58bits).
	Each run is saved in its own folder. Checkpoints come in two formats, spread over three filename patterns:

	| Filename pattern | Format | Purpose |
	|------------------|--------|---------|
	| `<run>/checkpoint_<run>_final.pt` | `compact_2bit` (2-bit packed ternary + bf16 scales) | Final inference checkpoint; minimal size, ~9 MB for a 39M ternary model |
	| `<run>/checkpoint_<run>_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 1 intermediate resume points |
	| `<run>/checkpoint_<run>_phase2_step<N>.pt` | bf16 model + bf16 optimizer state | Phase 2 intermediate resume points (delete-gate runs only) |
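
The filename patterns above can be matched mechanically when a job needs to find its most recent resume point. The following is a minimal sketch, not the resume logic of `train_ablation.py`: the helper name `latest_resume_checkpoint` is hypothetical, and it assumes that Phase 2 checkpoints are always later than Phase 1 checkpoints of the same run.

```python
import re
from pathlib import Path
from typing import Optional

# Matches Phase 1 and Phase 2 resume checkpoints as named in the table.
STEP_RE = re.compile(
    r"^checkpoint_(?P<run>.+?)_(?P<phase2>phase2_)?step(?P<step>\d+)\.pt$"
)


def latest_resume_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the most advanced step checkpoint in run_dir, or None."""
    best_key, best_path = None, None
    for path in run_dir.glob("checkpoint_*.pt"):
        m = STEP_RE.match(path.name)
        if m is None:  # e.g. checkpoint_<run>_final.pt is not a resume point
            continue
        # Assumption: Phase 2 sorts after Phase 1; within a phase, higher step wins.
        key = (m.group("phase2") is not None, int(m.group("step")))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

Calling `latest_resume_checkpoint(Path("run_04"))` on a local copy of a run folder returns the path to resume from, or `None` if only the final checkpoint is present.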

	> If you just want ready-to-use float32 weights, use the unpacked mirror
	> at [`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final) — those are plain
	> `.pt` files you can load with `torch.load(...)` and `model.load_state_dict(...)`
	> without any unpacking step.

	This source repo exists so that (a) training jobs can resume from the latest
	step checkpoint after preemption, and (b) the compact_2bit format itself
	can be inspected and benchmarked.
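
To make the idea behind 2-bit packed ternary weights concrete, here is a small pure-Python sketch. It is illustrative only: the code assignment (`ENCODE`), the bit order within a byte, and the function names are assumptions, and the actual `compact_2bit` layout written by `train_ablation.py` may differ. The scale is a plain Python float here, whereas the real format stores bf16 scales.

```python
from typing import List

# Hypothetical 2-bit codes; the real compact_2bit layout may differ.
ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
DECODE = {code: value for value, code in ENCODE.items()}


def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary values (-1, 0, +1) four to a byte, lowest bits first."""
    out = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        out[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(packed: bytes, count: int, scale: float = 1.0) -> List[float]:
    """Recover `count` weights and multiply each by the given scale."""
    return [
        scale * DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(count)
    ]
```

Five weights fit in two bytes under this scheme, and `unpack_ternary(pack_ternary(w), len(w))` returns the original values as floats.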

	## Repository contents

	22 trained runs from rounds r1 (16 runs) and r2 (6 runs), plus a queued r3 round that has not been uploaded:

	| Round | Tokens/param | Runs |
	|-------|:------------:|------|
	| r1: original ablation budget | 2 | `run_01`, `run_02`, `run_02a_byte_only_heads`, `run_03`, `run_04`, `run_05`, `run_05b_kernel_strict`, `run_06`, `run_07`, `run_08`, `run_09`, `run_10`, `run_13`, `run_14`, `run_15`, `run_16` |
	| r2: re-run at higher budget | 15 | `run_04_r2`, `run_07_r2` (partial), `run_13_r2`, `run_14_r2`, `run_15_r2`, `run_16_r2` |
	| r3: cold-start rescue (queued) | 50 | `run_04_r3`, `run_07_r3`, `run_14_r3`, `run_15_r3` (not yet uploaded) |

	The naming and architecture of each run is defined in `RUN_CONFIGS` / `SMALL_RUN_CONFIGS`
	in [`train_ablation.py`](https://github.com/Cukinator/1.58bits/blob/main/train_ablation.py).

	## Quick start (compact_2bit)

	Loading a compact_2bit checkpoint requires the unpacking helper that
	ships with the training code:

	```python
	import sys

	import torch

	# Make the cloned 1.58bits repo importable; adjust the path to your checkout.
	sys.path.insert(0, "/path/to/1.58bits")
	from train_ablation import build_ablation_model, generate, load_ablation_checkpoint

	# The helper unpacks the checkpoint into a state dict plus the run's config.
	state, config = load_ablation_checkpoint("run_02/checkpoint_run_02_final.pt")
	model = build_ablation_model(config)
	model.load_state_dict(state, strict=False)
	model.eval()

	print(generate(model, "The quick brown fox", 128, config, torch.device("cpu")))
	```

	To load the same weights without depending on the training code, use
	[`Cukinator/cpu1-ablations-final`](https://huggingface.co/Cukinator/cpu1-ablations-final).
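
The quick start expects the checkpoint file to exist locally. One way to fetch it is `hf_hub_download` from the `huggingface_hub` package, sketched below. The two helper names are hypothetical, and the sketch assumes this repository is a model repo whose files follow the `<run>/checkpoint_<run>_final.pt` pattern exactly.

```python
def final_checkpoint_filename(run: str) -> str:
    """Build the in-repo path of a run's final checkpoint from the naming pattern."""
    return f"{run}/checkpoint_{run}_final.pt"


def download_final_checkpoint(
    run: str, repo_id: str = "Cukinator/cpu1-ablation-checkpoints"
) -> str:
    """Download one final checkpoint and return its local cache path."""
    # Imported lazily so the filename helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=final_checkpoint_filename(run))
```

The returned path can be passed to `load_ablation_checkpoint(...)` in place of the relative path used in the quick start.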

	## Final-checkpoint sizes (compact_2bit)

	Sizes are measured from the actual `_final.pt` files on disk.

	| Run family | Architecture | d_model | Final size |
	|------------|-------------|--------:|-----------:|
	| `run_01` | Transformer + BPE (16K vocab) + FP16 | 512 | ~210 MB |
	| `run_02`, `run_02a`, `run_03` | FP16 byte-level baselines | 512 | ~75 MB |
	| `run_04`..`run_10` | 39M ternary chain | 512 | ~9 MB |
	| `run_05b_kernel_strict` | MLGRU without W_o | 512 | ~8 MB |
	| `run_13` | 10M BPE + ternary (4K vocab) | 320 | ~5 MB |
	| `run_14`, `run_15`, `run_16` | 10M byte + ternary variants | 320 | ~3 MB |
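
As a rough sanity check on the ternary rows, the sizes can be estimated from the parameter count and bit width alone. This back-of-the-envelope sketch ignores the bf16 scales, any tensors kept at higher precision, and file-format overhead, so it only approximates the measured sizes. The 16-bit line is for comparison and assumes an FP16 model with the same 39M parameter count, which the table does not state.

```python
def packed_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Size in decimal megabytes for n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 1e6


print(packed_size_mb(39e6, 2))   # 39M weights at 2 bits each
print(packed_size_mb(10e6, 2))   # 10M weights at 2 bits each
print(packed_size_mb(39e6, 16))  # 39M weights at 16 bits each, for comparison
```

The estimates come out at roughly 9.75 MB, 2.5 MB and 78 MB, in line with the ~9 MB, ~3 MB and ~75 MB rows.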

	## Training results

	The full table of `val_loss`, `perplexity`, throughput and architecture per
	run is published in the
	[README of the unpacked mirror](https://huggingface.co/Cukinator/cpu1-ablations-final).

	A summary of the 2026-05 audit:

	- FP16 baselines (`run_01`, `run_02`, `run_02a`, `run_03`) converge as
	designed: byte + LocalByteDecoder reaches val_loss 1.72, MLGRU FP16 reaches 1.87.
	- All byte-level ternary runs collapse to `ln(256) ≈ 5.545 nats`, the
	cross-entropy of a uniform prediction over the 256 byte values, i.e. no
	better than guessing. This holds across both scales (10M and 39M)
	and both token budgets (2 tok/p and 15 tok/p).
	- A 7.5× increase in tokens-per-parameter (r2) moved the validation loss
	by 0.0001 nats. The cold-start dynamics of straight-through-estimator
	ternary training, not the budget, are the bottleneck at this scale.
	- An r3 set with four corrections (bf16 AMP, `lr_scale=2.0` on BitLinear,
	CE-only training signal, 50 tok/param) is queued in `RUN_CONFIGS` but
	has not yet been trained.
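
The two numbers quoted in the audit summary can be reproduced with a few lines of arithmetic. This sketch only restates the figures above; the variable names are illustrative.

```python
import math

vocab_size = 256  # one class per byte value

# Cross-entropy in nats of a model that predicts every byte with equal probability.
uniform_loss = math.log(vocab_size)

# Perplexity is exp(loss), so a uniform predictor has perplexity equal to vocab_size.
uniform_ppl = math.exp(uniform_loss)

# Ratio of the r2 budget (15 tokens/param) to the r1 budget (2 tokens/param).
budget_ratio = 15 / 2

print(f"{uniform_loss:.3f} nats, perplexity {uniform_ppl:.0f}, budget x{budget_ratio}")
```

A run whose `val_loss` sits at this value is doing no better than a uniform guess over bytes, which is why a near-zero change between r1 and r2 indicates that the extra budget did not help.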

	Details, mechanistic analysis and throughput projections are documented in
	the
	[main repository README](https://github.com/Cukinator/1.58bits/blob/main/README.md#ablation-audit--2026-05-findings).

	## License

	Apache-2.0. Same as the source code at [github.com/Cukinator/1.58bits](https://github.com/Cukinator/1.58bits).