metadata
language:
  - en
license: apache-2.0
tags:
  - pytorch
  - text-generation
  - 1.58bit
  - ternary
  - byte-level
  - mlgru
  - ablation
library_name: pytorch
pipeline_tag: text-generation
model_type: custom

CPU-1 Ablation Study: Ready-to-use Checkpoints (fp32 unpacked)

Repo: Cukinator/cpu1-ablations-final
Source: Cukinator/cpu1-ablation-checkpoints (compact 2-bit, raw training output)
Code: github.com/Cukinator/1.58bits
Dataset: Cukinator/cpu1-ablation-dataset

Each checkpoint is a standard PyTorch .pt file with float32 weights, so no unpacking is needed. The checkpoints are compatible with train_ablation.py from the 1.58bits repo.

Currently uploaded: 20 trained runs. The remaining 39M *_r2 runs have not been unpacked yet, and the four *_r3 configs are queued but not yet trained.

Quick start

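```python
import sys

import torch

# Make train_ablation.py from the 1.58bits repo importable.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate

# Each checkpoint stores the run config next to the fp32 state dict.
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()

# Generate a continuation of the prompt on CPU.
text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)
```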

Ablation chain: 50M runs (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity | Throughput (P/s, CPU) |
| --- | --- | --- | --- | --- | --- |
| run_01 | Transformer + BPE (16K vocab) + FP16 | 54.7M | 4.66 | 106.1 | 72.4 |
| run_02a_byte_only_heads | Transformer + Byte + 4 indep. heads (no LBD) | 38.5M | 2.31 | 10.1 | 84.8 |
| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | 1.72 | 5.56 | 75.4 |
| run_03 | MLGRU + Byte + FP16 | 38.8M | 1.87 | 6.49 | 58.0 |
| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 | 9.1 |
| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 | 9.7 |
| run_05b_kernel_strict | MLGRU kernel-strict (no W_o) | 35.8M | 5.59 | 268.8 | 10.2 |
| run_06 | + Bolmo patch embedding | 39.0M | 5.56 | 258.8 | 9.1 |
| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 | 7.3 |
| run_08 | Folded Transformer + Byte + Ternary | 38.9M | 5.56 | 260.4 | 9.7 |
| run_09 | + PFNet | 39.4M | 5.53 | 252.9 | 8.5 |
| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 | 8.7 |
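
The Perplexity column is consistent with the usual definition, perplexity = exp(val loss), with the loss in nats. Here is a minimal sketch of that conversion, assuming this is how the column was derived. The function name is illustrative, and small differences from the table come from the losses being rounded to two decimals there.

```python
import math


def perplexity_from_loss(val_loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(val_loss_nats)


# Val losses copied from the table above (rounded to two decimals there).
for run, loss in [("run_02", 1.72), ("run_03", 1.87), ("run_04", 5.57)]:
    ppl = perplexity_from_loss(loss)
    print(f"{run}: loss={loss:.2f} nats -> perplexity ~ {ppl:.2f}")
```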

Small runs: 10M (round 1, 2 tok/param)

| Run | Architecture | Params | Val Loss | Perplexity |
| --- | --- | --- | --- | --- |
| run_13 | Small CPU-1 + BPE (4K vocab) | 12.5M | 30.54 | 4.85×10⁸ |
| run_14 | Small CPU-1 byte-level (Qwen logprob distillation) | 10.7M | 5.58 | 263.9 |
| run_15 | Small CPU-1 byte + hidden distillation (EmbeddingAligner) | 10.7M | 5.58 | 263.9 |
| run_16 | Small CPU-1 raw bytes, no teacher (FineWeb) | 10.7M | 5.58 | 263.9 |

Round 2: re-runs at 15 tok/param (7.5× the original budget)

These runs were re-trained at 15 tok/param (7.5× the original budget) to test whether the ~5.55 nat floor on the ternary runs was caused by under-training.

10M re-runs (1336 steps each, all uploaded)

| Run | Architecture | Val Loss (r1) | Val Loss (r2) | Δ |
| --- | --- | --- | --- | --- |
| run_13_r2 | Small CPU-1 + BPE | 30.54 | 25.43 | −5.1 |
| run_14_r2 | Small CPU-1 byte-level | 5.5754 | 5.5755 | +0.0001 |
| run_15_r2 | Small CPU-1 byte + hidden distill | 5.5755 | 5.5755 | 0 |
| run_16_r2 | Small CPU-1 raw bytes | 5.5754 | 5.5754 | 0 |

39M re-runs (in progress)

run_04_r2, run_07_r2, run_05_r2, run_08_r2, run_09_r2 and run_10_r2 were started at 15 tok/param. As of 2026-05, only run_04_r2 and run_07_r2 have intermediate step checkpoints (≈97% of training). None of these runs has been unpacked to fp32 in this repo yet, because the audit below showed they were converging to the same uniform floor as their r1 counterparts. They remain available in the source compact_2bit repo, Cukinator/cpu1-ablation-checkpoints.

Manually-measured byte CE loss on the latest run_04_r2 and run_07_r2 step checkpoints (sample of English narrative):

| Run | Loss (sample) | Δ vs uniform (5.545) |
| --- | --- | --- |
| run_02 (FP16 Transformer, reference) | 4.37 | −1.18 |
| run_03 (FP16 MLGRU, reference) | 3.97 | −1.58 |
| run_04 (Ternary r1, 2 tok/p) | 5.55 | +0.01 |
| run_04_r2 (Ternary r2, step 4840 / ~5000) | 5.57 | +0.02 |
| run_07 (CPU-1 complete r1) | 5.56 | +0.02 |
| run_07_r2 (CPU-1 complete r2, step 4860) | 5.56 | +0.02 |

Round 3: cold-start rescue (queued, not uploaded)

After the audit, four configurations were added to RUN_CONFIGS / SMALL_RUN_CONFIGS that apply all four corrections in concert:

  1. bf16 AMP instead of fp16 (no GradScaler underflow on BitLinear rescale chain)
  2. lr_scale=2.0 on BitLinear (BitNet b1.58 §3.1 prescription)
  3. CE-only training signal (drop the 90/10 KL distillation + boundary/emb_align auxiliary losses)
  4. 50 tok/param (3.3× r2, close to BitNet's lower bound at 3B scale)

Configs: run_04_r3, run_07_r3, run_14_r3, run_15_r3. Not yet trained. If they reach val_loss < ~5.0 they will be uploaded here.
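
To put the three rounds side by side, the token budgets can be worked out from the tok/param figures quoted in this card. The sketch below assumes the budget is simply parameter count × tok/param; the helper name is illustrative, and actual step counts depend on batch size and sequence length, which are not derived here.

```python
def token_budget(n_params: int, tokens_per_param: float) -> float:
    """Total training tokens for a model of n_params parameters."""
    return n_params * tokens_per_param


# tok/param per round, as stated in this card.
ROUNDS = {"r1": 2, "r2": 15, "r3": 50}

# Parameter counts taken from the tables (run_14 and run_04).
MODELS = {"run_14 (10.7M)": 10_700_000, "run_04 (38.9M)": 38_900_000}

for name, n_params in MODELS.items():
    for round_name, tpp in ROUNDS.items():
        budget = token_budget(n_params, tpp)
        print(f"{name} {round_name}: {budget / 1e6:.1f}M tokens")

print(f"r2 / r1 = {ROUNDS['r2'] / ROUNDS['r1']:.1f}x")
print(f"r3 / r2 = {ROUNDS['r3'] / ROUNDS['r2']:.1f}x")
```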

Key finding (2026-05 audit): ternary cold-start collapses to the uniform baseline

The reference value is ln(256) = 5.545 nats, the entropy of a uniform distribution over the 256-byte vocabulary. All byte-level ternary runs (round 1 and round 2, both scales) plateau within 0.05 nats of this floor: the models are effectively producing uniform predictions over bytes.
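
The uniform baseline is easy to check numerically. The sketch below computes the entropy of a uniform distribution over 256 bytes and the gap between a run's validation loss and that floor; the function names are illustrative, and the loss values are copied from the round 1 table.

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gap_to_uniform(loss_nats: float, vocab_size: int = 256) -> float:
    """Loss minus the uniform-prediction cross-entropy ln(vocab_size)."""
    return loss_nats - math.log(vocab_size)


uniform = [1.0 / 256] * 256
print(f"uniform entropy: {entropy_nats(uniform):.3f} nats")

# Val losses from the round 1 table: one FP16 run, two ternary runs.
for run, loss in [("run_03", 1.87), ("run_04", 5.57), ("run_09", 5.53)]:
    print(f"{run}: gap to uniform = {gap_to_uniform(loss):+.3f} nats")
```

A negative gap means the model predicts bytes better than chance; a gap near zero means its predictions carry almost no information about the next byte.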

More training does not help. A 7.5× increase in tokens-per-parameter moves the loss by 0.0001 nats. A weight-evolution audit on run_14_r2 showed that >99.9% of BitLinear weights are in the same ternary state at step 10 and at step 1326 (the entire training). The bottleneck is the cold-start dynamics of straight-through-estimator (STE) ternary quantisation, not the data budget.
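
The weight-evolution audit can be illustrated with a small, self-contained sketch. It assumes the absmean ternary quantiser described for BitNet b1.58 (scale by the mean absolute weight, round, clip to {−1, 0, +1}). The audit script actually used for run_14_r2 is not shown here, and the function names and toy weight values below are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map latent float weights to ternary states in {-1, 0, +1}.

    Assumed absmean scheme: divide by the mean absolute weight,
    round to the nearest integer, then clip to [-1, 1].
    Note: Python's round() resolves exact .5 ties to the even integer.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / scale))) for w in weights]


def fraction_unchanged(weights_a, weights_b):
    """Fraction of positions whose ternary state is identical in both snapshots."""
    if len(weights_a) != len(weights_b):
        raise ValueError("snapshots must have the same length")
    states_a = ternarize(weights_a)
    states_b = ternarize(weights_b)
    same = sum(a == b for a, b in zip(states_a, states_b))
    return same / len(states_a)


# Toy latent weights standing in for an early and a late checkpoint.
early = [0.90, -0.80, 0.05, 0.40, -0.02]
late = [0.91, -0.79, 0.06, 0.10, -0.02]
print(f"unchanged ternary states: {fraction_unchanged(early, late):.1%}")
```

On real checkpoints the same comparison would be run per BitLinear layer over the latent weights stored at two training steps.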

Compare against the FP16 sibling architectures:

  • run_02 (Transformer + byte + FP16, 38M): 1.72
  • run_03 (MLGRU + byte + FP16, 38M): 1.87
  • run_04 (MLGRU + byte + ternary, same 38M): 5.57 ← collapsed

Published 1.58-bit models (BitNet b1.58 at 700M+ with 30–143 tok/param; Slender-Mamba with FP16 warm-start) reach functional performance, but the cold-start regime at <50 tok/param and <100M parameters that this study operates in is not covered by their results.

Throughput is also worth flagging: even with the ideal BitNet.cpp / T-MAC class kernels (4–6× speedup on the matmul fraction), the projected end-to-end throughput of the 39M ternary models is 0.30–0.50× of their FP16 Transformer sibling at the same scale. The 1.58-bit speed advantage only materialises above ~700M parameters where weight RAM bandwidth becomes the bottleneck.
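
Why a 4–6× kernel speedup does not translate into a 4–6× end-to-end gain can be seen from Amdahl's law: only the matmul share of the runtime gets faster. The sketch below is a generic illustration, not the projection method used in the main repository; the matmul fractions are hypothetical values chosen for the example, not measurements from this study.

```python
def amdahl_speedup(matmul_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only matmul_fraction of the runtime is accelerated."""
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / kernel_speedup)


# Hypothetical matmul shares of total runtime (illustrative only).
for fraction in (0.5, 0.8, 0.9):
    # Kernel speedup range quoted in the text above.
    for kernel in (4.0, 6.0):
        total = amdahl_speedup(fraction, kernel)
        print(f"matmul share {fraction:.0%}, kernel {kernel:.0f}x -> end-to-end {total:.2f}x")
```

The smaller the matmul share at a given model scale, the less of the kernel speedup survives end to end.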

Practical implication. The byte-level ternary chain (runs 04–10, 14–16, and their r2 counterparts) cannot distinguish between architectural variants because all of them are stuck at the same numerical floor. The architectures themselves may be sound; the training recipe needs to either (a) warm-start from FP weights, (b) anneal the quantisation, or (c) use a much larger token budget.
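
As an illustration of option (b), one common way to anneal quantisation is to blend full-precision and quantised weights with a strength that ramps from 0 to 1 during training. The sketch below is one possible formulation under stated assumptions (absmean ternary quantiser, linear ramp); it is not taken from train_ablation.py, every name in it is hypothetical, and it shows only the forward-pass weights, not the gradient handling.

```python
def ternary_dequantized(weights):
    """Absmean ternary quantisation, returned on the original scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0.0] * len(weights)
    return [max(-1, min(1, round(w / scale))) * scale for w in weights]


def quantisation_strength(step: int, ramp_steps: int) -> float:
    """Linear ramp from 0 (full precision) to 1 (fully ternary)."""
    return min(1.0, max(0.0, step / ramp_steps))


def annealed_weights(weights, step: int, ramp_steps: int):
    """Blend latent and quantised weights: (1 - lam) * w + lam * q(w)."""
    lam = quantisation_strength(step, ramp_steps)
    quantised = ternary_dequantized(weights)
    return [(1.0 - lam) * w + lam * q for w, q in zip(weights, quantised)]


latent = [0.9, -0.8, 0.05]
for step in (0, 50, 100):
    blended = annealed_weights(latent, step, ramp_steps=100)
    print(step, [round(w, 3) for w in blended])
```

At step 0 the model sees its full-precision weights; once the ramp completes it sees only ternary values, so the network starts from a trainable regime instead of a cold ternary start.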

Full mechanistic analysis, weight-flip diagnostics and throughput projections are in the main repository README.

License

Apache-2.0. Same as the source code at github.com/Cukinator/1.58bits.