
Notes index: research project status

v76 paper: synth findings PARTIALLY refuted at 75M scale

Synth-scale story (mechanistically interesting, prescriptively weak)

  • v76_paper_outline.md: paper outline with honest 75M empirical results.
  • v76_paper_synthesis.md: synth mechanism narrative.
  • v76_math_derivation.md: original math memo (mostly refuted).
  • d_sweep_synth.md, split_q_synth.md: raw synth data.
  • synth_results_summary.csv: 38 synth experiments.

75M empirical results (this session)

  • 75m_kshift_offset_validation.md: the honest result is that the synth-predicted offset=2 advantage REVERSED at 75M; offset=1 wins by 0.014 BPC at 5K steps.
  • synth_alibi_offset_ablation.md: synth+ALiBi+n=8 placement sweep. At 4L (9 seeds × 4 layers): monotonic L0→L3 with lottery distributions. At 8L, L0 fails on 3/3 seeds while L4 and L7 are deterministic (3/3 hit). L1 is reliable on 4/5 seeds, but seed=100 fails at L1 (17%), so L1 still has a lottery-fail tail; L4-L7 are strictly more reliable. K-shift at L0 is uniquely bad. For 75M: K-shift at L1 is a ~80% confident win, and K-shift at L5+ is high confidence even at a single seed.

Empirical 75M head-to-head (softmax+K-shift L0, 5000 steps each):

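| Step | offset=2 | offset=1 | Δ (offset=2 - offset=1) | Note |
|------|----------|----------|-------------------------|------|
| 1500 | 7.9281 | 7.9606 | -0.033 | ✅ early |
| 2000 | 7.6452 | 7.5921 | +0.053 | ❌ post-warmup |
| 5000 | 6.7779 | 6.7639 | +0.014 | ❌ |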

The synth-predicted offset=2 advantage is real during warmup (steps 1000-1500) but is erased post-warmup. At full LR, offset=1 (the canonical v76 setting) is slightly better. Likely causes: ALiBi attenuation and the variable-distance induction of natural text (vs. synth's fixed distance of 24).
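
The Δ column in the head-to-head table can be recomputed directly from the two loss columns. The snippet below is a small bookkeeping sketch: the numbers are copied from the table, while the names `results` and `delta` are illustrative and do not come from the repository.

```python
# Validation values copied from the 75M head-to-head table.
# Each entry maps step -> (offset=2 value, offset=1 value).
results = {
    1500: (7.9281, 7.9606),
    2000: (7.6452, 7.5921),
    5000: (6.7779, 6.7639),
}


def delta(step):
    """Return offset=2 minus offset=1; a negative value means offset=2 is ahead."""
    off2, off1 = results[step]
    return off2 - off1


for step in sorted(results):
    d = delta(step)
    leader = "offset=2" if d < 0 else "offset=1"
    print(f"step {step}: delta={d:+.4f}, {leader} ahead")
```

Only the sign of the difference matters for the conclusion: it is negative at step 1500 and positive at steps 2000 and 5000.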

Synth K-shift findings = MECHANISTIC INSIGHT, not optimization prescription.
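
This index does not define K-shift, so the following is only a sketch under an assumption: that K-shift with a given offset replaces the key at position t with the key from position t - offset, so a query can match on what preceded each position. The function name `shift_keys` and the padding choice are hypothetical and are not taken from model_v47b.py.

```python
def shift_keys(keys, offset, pad=None):
    """Delay a key sequence by `offset` positions: out[t] = keys[t - offset].

    Assumed semantics only; the first `offset` positions have no source key
    and are filled with `pad`.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    n = len(keys)
    padding = [pad] * min(offset, n)
    return padding + keys[: max(n - offset, 0)]


keys = ["k0", "k1", "k2", "k3"]
print(shift_keys(keys, 1))  # offset=1, the canonical v76 setting per these notes
print(shift_keys(keys, 2))  # offset=2, the synth-predicted alternative
```

Under this reading, offset=1 and offset=2 differ only in how far back each position's key looks, which is the quantity the head-to-head comparison varies.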

Outstanding (VM lost connectivity mid-run)

  • K-shift L11 placement test at 75M: started, but the VM dropped mid-run, so it needs a re-run. Synth predicted that placement was the bigger effect; whether that transfers to 75M is the remaining open question.

⭐ Read this first: SUMMARY_bonsai_1bit.md consolidates the full bonsai 1-bit research arc.

End-to-end 1-bit Bonsai training (CPU): W1A1 + a4.8 + AVX-512 popcount

  • bonsai_e2e_1bit.md: full Q1_0_g128 format applied DURING training, end-to-end (token embed + attention + FFN + LM head). Held-out scaling with PLATEAU @ ~0.04 nat: d=128/2L 76K +0.41 nat → d=192/4L 340K +0.09 → d=256/4L 1.1M +0.045 → d=512/4L 5M +0.044 (plateau, multi-seed). This is not a clean power law: the data suggests an inherent ~0.04 nat floor from W1A1+a4.8+STE that doesn't shrink with more capacity. Sweet spot is the d=256/1M-param range. 14.8× weight compression, AVX-512 popcount forward GEMM, 3-seed verified at d=128, d=192, d=256, d=512.
    • W1A1 vanilla penalty: 0.52 nat (vs fp32). Most reclaimed via a4.8 boundary trick (down to 0.22 nat penalty).
    • AVX-512 popcount + per-group fp16 scale fusion: equal-quality, faster.
    • Lion underperforms AdamW at this scale; hard top-k attention loses 2+ nats. Both are kept as flags but are not in the recommended recipe.
    • Two backward bugs fixed: STE clip mask + per-group scale chain rule.
    • Files: bonsai_layers.py, train_bonsai_cpu.py, lion_optim.py, bit_cpu_avx.py, bench_bit_avx.py
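
As a rough illustration of what a 1-bit, group-of-128 weight format with a per-group scale involves, here is a pure-Python sketch. It rests on assumptions not confirmed by these notes: the scale is taken as the mean absolute weight of the group, and the straight-through estimator (STE) clip mask passes gradients only where |w| <= 1. All function names are hypothetical and are not the ones in bonsai_layers.py.

```python
GROUP = 128  # group size suggested by the "g128" suffix


def quantize_group(weights):
    """Return (sign_bits, scale): one bit per weight plus one scale per group."""
    scale = sum(abs(w) for w in weights) / len(weights)  # assumed: mean |w|
    bits = [1 if w >= 0 else 0 for w in weights]
    return bits, scale


def dequantize_group(bits, scale):
    """Map each bit back to +scale or -scale."""
    return [scale if b else -scale for b in bits]


def quantize_tensor(weights, group=GROUP):
    """Split a flat weight list into groups and quantize each independently."""
    chunks = [weights[i:i + group] for i in range(0, len(weights), group)]
    return [quantize_group(chunk) for chunk in chunks]


def ste_clip_grad(weights, upstream, clip=1.0):
    """STE backward with a clip mask: pass the gradient only where |w| <= clip."""
    return [g if abs(w) <= clip else 0.0 for w, g in zip(weights, upstream)]


bits, scale = quantize_group([0.5, -1.5, 1.0, -1.0])
print(bits, scale, dequantize_group(bits, scale))
```

The sketch covers only the forward quantizer and the clip mask; the gradient path through the per-group scale (the second backward bug mentioned above) is not reproduced here.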

CPU 1-bit training (this session): AVX-512 popcount WINS

  • cpu_1bit_training.md: built the bit_cpu_avx.py C extension using __builtin_popcountll (AVX-512 VPOPCNTQ on Zen 4). 3.34× faster than BLAS bf16 at 256×1536×1536 (300M training shape). End-to-end training speedup is 10-18% (grows with model d) AND loss is lower thanks to the exact int32 dot product vs bf16 truncation. This is the actual analog of bitnet.cpp's inference path applied to training.
  • Pure-PyTorch bit-packing (uint8 + LUT popcount) was a dead end: 367× SLOWER than BLAS. C-level popcount intrinsics are needed.
  • Files: bit_cpu_avx.py, train_1bit_cpu_avx.py, bench_bit_avx.py (AVX path); bit_cpu_kernel.py, train_1bit_cpu.py, bench_bit_cpu.py (slow pure-PyTorch path); train_fp32_cpu_baseline.py (control).
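
The identity behind a popcount-based binary dot product is standard: for two ±1 vectors of length n packed as bits, positions where the bits differ contribute -1 and the rest contribute +1, so the dot product is n - 2·popcount(a XOR b). The sketch below shows this in plain Python, with `bin(...).count("1")` standing in for the hardware popcount; the helper names are illustrative and are not the bit_cpu_avx.py API.

```python
def pack_signs(values):
    """Pack a +1/-1 vector into an int: bit i is set when values[i] is +1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Differing bit positions contribute -1 and matching ones +1,
    so dot = n - 2 * popcount(a XOR b).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
packed = popcount_dot(pack_signs(a), pack_signs(b), len(a))
print(reference, packed)
```

Because the result is an exact integer, this path has no rounding error, which is consistent with the exact-int32-dot observation in the notes above. A C kernel would apply the same identity to 64-bit words.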

Int8 GEMM kernel work (synth-validated; awaits 75M benchmark)

  • session_2026-05-08_int8_gemm_kernel.md: full session writeup for Triton int8 forward GEMM + cached weight buffer + int8 dx backward.

Status: 13 CPU tests pass with zero numerical diff. Files ready to sync to VM and benchmark. Awaits VM (port 1145 timing out since session start).

Files: bit_int8_kernel.py, model_v47.py (with enable_int8_gemm()), bench_bit_gemm.py, test_int8_gemm_cpu.py, test_int8_e2e_cpu.py, train_fineweb_v75.py and train_300m_flex.py (with --use-int8-gemm and --int8-bwd CLI flags).
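
For orientation, an int8 GEMM path generally quantizes activations to int8, multiplies against integer weights with exact integer accumulation, and rescales the result. The pure-Python reference below sketches that pattern under assumptions: symmetric per-row activation scaling to the range [-127, 127] and ±1 integer weights. It is not the Triton kernel in bit_int8_kernel.py, and all names are hypothetical.

```python
def quantize_int8(row):
    """Symmetric per-row quantization to integers in [-127, 127]."""
    amax = max(abs(x) for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def int_gemm(a_q, w_q):
    """Integer matmul of an M x K matrix by a K x N matrix (exact accumulation)."""
    rows, inner, cols = len(a_q), len(w_q), len(w_q[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = sum(a_q[i][k] * w_q[k][j] for k in range(inner))
    return out


def linear_int8(x_row, w_q):
    """Quantize one activation row, run the integer GEMM, then rescale."""
    q, scale = quantize_int8(x_row)
    acc = int_gemm([q], w_q)[0]
    return [scale * v for v in acc]


w = [[1], [-1], [1]]  # a single output column of +1/-1 weights
print(linear_int8([2.0, -1.0, 0.0], w))  # close to the fp result 2 + 1 + 0 = 3
```

Since the accumulation is purely integer, two correct implementations of `int_gemm` must agree bit-for-bit, which is the kind of property a zero-numerical-diff test can check.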

Bonsai (anneal_binarize): complete

  • session_2026-05-07_bonsai_alibi_finetune.md: bonsai is now the most efficient 300M training recipe. 38% training reaches 65% peak strength. Different attractor than cold-start (canonical L0H0 specialist).
  • Implementation in model_v47.py (set_binarize_mix, _maybe_mixed_sign).
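
One plausible reading of an annealed binarization mix is a weight that interpolates between the latent full-precision value and its scaled sign, with the mix coefficient ramped from 0 to 1 during training. The sketch below illustrates that reading only. The blend formula, the mean-|w| scale, and the linear schedule are all assumptions, and `mixed_sign` and `linear_anneal` are hypothetical names rather than the bodies of set_binarize_mix or _maybe_mixed_sign.

```python
def mixed_sign(weights, mix):
    """Blend latent weights with their scaled sign.

    mix=0.0 returns the full-precision weights; mix=1.0 returns fully
    binarized weights (+scale or -scale). Assumed formula, not the source's.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    out = []
    for w in weights:
        sign = 1.0 if w >= 0 else -1.0
        out.append((1.0 - mix) * w + mix * scale * sign)
    return out


def linear_anneal(step, total_steps):
    """Ramp the mix linearly from 0 to 1 over total_steps, then hold at 1."""
    return min(1.0, max(0.0, step / total_steps))


latent = [0.5, -1.5]
for step in (0, 50, 100):
    mix = linear_anneal(step, 100)
    print(step, mix, mixed_sign(latent, mix))
```

A gradual schedule of this kind lets early training use near-full-precision weights before the model is pushed onto the binary grid.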

Negative results

  • truebit_no_latent_negative.md: flip-counter int weights without fp32 latent: copy_acc 9.4% (vs 100% control). Gradient-magnitude info loss is fundamental.
  • Q-shift, split-Q, embedding orthogonality (refuted in v76_paper_synthesis).

Currently blocked

  • VM at 174.0.242.177:1145 unreachable since session start. All GPU-dependent verification awaits VM return:
    • 75M K-shift L11 vs L0 head-to-head
    • Triton int8 kernel benchmark
    • 300M-scale anything

Code added/modified this multi-session period

| File | Purpose |
|------|---------|
| bit_int8_kernel.py | Triton int8 GEMM kernels (forward + dx backward) |
| model_v47.py | int8 GEMM toggle, bonsai mix |
| model_v47b.py | NEW: kshift_offset parameter through IntBinaryAttention/Block/LM |
| synth_induction_train.py | Split-Q implementation, K-shift CLI override + offset |
| train_fineweb_v75.py | --use-int8-gemm, --int8-bwd, --k-shift-offset flags |
| train_300m_flex.py | Same int8 flags + --k-shift/--k-shift-layers/--k-shift-offset/--rwsp/--rwsp-k/--q-shift/--q-shift-layers flags. Also fixed FlexAttention monkey-patch to fall back to original forward when K-shift/Q-shift/RWSP enabled |
| bench_bit_gemm.py | int8 vs bf16 benchmarks at training shapes |
| test_int8_gemm_cpu.py | 8 unit tests |
| test_int8_e2e_cpu.py | 5 end-to-end tests |
| analyze_synth_attention.py | Attention argmax distribution analyzer (depth-general) |

All files in /home/nathan/1bitllm/. Ready to sync to VM when accessible.

Ready-to-run experiments at scale (VM-blocked)

Once VM access is restored, the following commands should work end-to-end:

# v76 paper's "easy fix" prediction: change offset from 1 to 5
python train_fineweb_v75.py \
    --rwsp --k-shift --k-shift-layers 0 --k-shift-offset 5 \
    --max-steps 20000 --tag v76b_offset5

# Compare to baseline (current v76):
python train_fineweb_v75.py \
    --rwsp --k-shift --k-shift-layers 0 --k-shift-offset 1 \
    --max-steps 20000 --tag v76_offset1_baseline

Predicted: offset=5 reaches lower val_bpc and >40% top-per-head IPMR vs baseline ~10%.