	# Notes index — research project status

	## v76 paper — synth findings PARTIALLY refuted at 75M scale

	### Synth-scale story (mechanistically interesting, prescriptively weak)

	- `v76_paper_outline.md` — paper outline with honest 75M empirical results.
	- `v76_paper_synthesis.md` — synth mechanism narrative.
	- `v76_math_derivation.md` — original math memo (mostly refuted).
	- `d_sweep_synth.md`, `split_q_synth.md` — raw synth data.
	- `synth_results_summary.csv` — 38 synth experiments.

	### 75M empirical results (this session)

	- `75m_kshift_offset_validation.md` — the honest result: synth-predicted
	offset=2 advantage REVERSED at 75M. offset=1 wins by 0.014 BPC at 5K steps.
	- `synth_alibi_offset_ablation.md` — synth+ALiBi+n=8 placement sweep at
	4L (9 seeds × 4 layers): monotonic L0→L3 with lottery distributions.
	At 8L, L0 fails on 3/3 seeds while L4 and L7 are deterministic (3/3 hit).
	L1 hits on 4/5 seeds, but seed=100 fails at L1 (17%): **L1 still has a
	lottery-fail tail; L4-L7 are strictly more reliable**. K-shift at L0 is
	uniquely bad. For 75M: K-shift at L1 is a win with ~80% confidence, and
	K-shift at L5+ is high confidence even from a single seed.
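
The seed counts above are small, so the reliability figures carry wide error bars. As a rough illustration (not part of the original analysis), a Wilson score interval for a hit rate can be sketched as below. The function name `wilson_interval` and the 95% z-value are assumptions made for this example.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

# Seed counts quoted in the notes for the 8L sweep.
for label, hits, n in [("L1", 4, 5), ("L4/L7", 3, 3), ("L0", 0, 3)]:
    lo, hi = wilson_interval(hits, n)
    print(f"{label}: {hits}/{n} hits, interval [{lo:.2f}, {hi:.2f}]")
```

With only 3 to 5 seeds per placement the intervals overlap heavily, which is consistent with treating the L1 estimate as provisional.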

	Empirical 75M head-to-head (softmax+K-shift L0, 5000 steps each):

	| Step | offset=2 | offset=1 | Δ (offset=2 - offset=1) |
	|---|---|---|---|
	| 1500 | 7.9281 | 7.9606 | -0.033 ✅ early |
	| 2000 | 7.6452 | 7.5921 | +0.053 ❌ post-warmup |
	| 5000 | 6.7779 | 6.7639 | +0.014 ❌ |

	The synth-predicted offset=2 advantage is real during warmup (steps 1000-1500)
	but is erased post-warmup. At full LR, offset=1 (the canonical v76 setting)
	is slightly better. Likely causes: ALiBi attenuation and variable-distance
	induction in natural text (vs synth's fixed distance of 24).
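
The Δ column in the head-to-head table can be recomputed directly from the two value columns. The snippet below is a small sanity check using the numbers copied from that table; the helper name `deltas` is hypothetical, and lower values are assumed to be better.

```python
# (step, offset=2 value, offset=1 value) copied from the head-to-head table.
rows = [
    (1500, 7.9281, 7.9606),
    (2000, 7.6452, 7.5921),
    (5000, 6.7779, 6.7639),
]

def deltas(rows):
    """Return (step, offset2 - offset1, setting that is ahead) per row."""
    out = []
    for step, off2, off1 in rows:
        d = round(off2 - off1, 4)
        ahead = "offset=2" if d < 0 else "offset=1"
        out.append((step, d, ahead))
    return out

for step, d, ahead in deltas(rows):
    print(f"step {step}: delta={d:+.4f}, {ahead} ahead")
```

The sign flips between step 1500 and step 2000, matching the warmup versus post-warmup reading given above.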

	Synth K-shift findings = MECHANISTIC INSIGHT, not optimization prescription.

	### Outstanding (VM lost connectivity mid-run)

	- K-shift L11 placement test at 75M — started but VM dropped. Needs re-run.
	Synth predicted placement was the bigger effect; whether that transfers
	to 75M is the remaining open question.

	## ⭐ Read this first: `SUMMARY_bonsai_1bit.md` consolidates the full bonsai 1-bit research arc.

	## End-to-end 1-bit Bonsai training (CPU) — W1A1 + a4.8 + AVX-512 popcount

	- `bonsai_e2e_1bit.md` — full Q1_0_g128 format applied DURING training,
	end-to-end (token embed + attention + FFN + LM head). **Held-out scaling
	with PLATEAU @ ~0.04 nat:** d=128/2L 76K +0.41 nat → d=192/4L 340K +0.09
	→ d=256/4L 1.1M +0.045 → d=512/4L 5M +0.044 (plateau, multi-seed).
	Not a clean power law — the data suggests an inherent ~0.04 nat floor
	from W1A1+a4.8+STE that doesn't shrink with more capacity. Sweet spot
	is d=256/1M-param range. 14.8× weight compression, AVX-512 popcount
	forward GEMM, 3-seed verified at d=128, d=192, d=256, d=512.
	- W1A1 vanilla penalty: 0.52 nat (vs fp32). Most reclaimed via a4.8
	boundary trick (down to 0.22 nat penalty).
	- AVX-512 popcount + per-group fp16 scale fusion: equal-quality, faster.
	- Lion underperforms AdamW at this scale; hard top-k attention loses 2+
	nats — both kept as flags but not in the recommended recipe.
	- Two backward bugs fixed: STE clip mask + per-group scale chain rule.
	- Files: `bonsai_layers.py`, `train_bonsai_cpu.py`, `lion_optim.py`,
	`bit_cpu_avx.py`, `bench_bit_avx.py`
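
The two backward fixes mentioned above (STE clip mask and per-group scale chain rule) are easier to follow with a concrete formula. The following is a minimal pure-Python sketch, not the code in `bonsai_layers.py`: it assumes each group stores one sign bit per weight plus one shared scale equal to the mean absolute value, and that the straight-through estimator passes gradient only where `|w| <= clip`. The names `quantize_group`, `backward_group`, and `clip` are hypothetical.

```python
def quantize_group(w):
    """Forward pass for one group: shared scale times the sign of each weight."""
    scale = sum(abs(x) for x in w) / len(w)
    signs = [1.0 if x >= 0 else -1.0 for x in w]
    return [scale * s for s in signs], scale, signs

def backward_group(w, grad_out, clip=1.0):
    """Gradient w.r.t. the latent weights of one group (assumed formulation).

    w_q[i] = scale * sign(w[i]) with scale = mean(|w|), so each weight gets
      - an STE term: scale * grad_out[k], masked where |w[k]| > clip
      - a scale term: sign(w[k]) / n * sum_i grad_out[i] * sign(w[i])
    """
    _, scale, signs = quantize_group(w)
    n = len(w)
    scale_term = sum(g * s for g, s in zip(grad_out, signs)) / n
    grads = []
    for x, g, s in zip(w, grad_out, signs):
        ste = scale * g if abs(x) <= clip else 0.0
        grads.append(ste + s * scale_term)
    return grads

def quantize(weights, group_size=128):
    """Apply quantize_group to consecutive groups (128 matches the g128 suffix)."""
    out = []
    for i in range(0, len(weights), group_size):
        q, _, _ = quantize_group(weights[i:i + group_size])
        out.extend(q)
    return out

w = [0.5, -1.5, 0.25, -0.25]
print(quantize(w, group_size=4))
print(backward_group(w, [1.0, 0.0, 0.0, 0.0]))
```

Dropping the scale term or the clip mask changes the gradients silently rather than raising an error, which is presumably why bugs of this kind are easy to miss.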

	## CPU 1-bit training (this session) — AVX-512 popcount WINS

	- `cpu_1bit_training.md` — built `bit_cpu_avx.py` C extension using
	`__builtin_popcountll` (AVX-512 VPOPCNTQ on Zen 4). **3.34× faster than
	BLAS bf16 at 256×1536×1536 (300M training shape).** End-to-end training
	speedup 10-18% (grows with model d) AND lower loss thanks to int32
	exact dot vs bf16 truncation. This is the actual analog of bitnet.cpp's
	inference path applied to training.
	- Pure-PyTorch bit-packing (uint8 + LUT popcount) was a dead end: 367×
	SLOWER than BLAS. Need C-level popcount intrinsics.
	- Files: `bit_cpu_avx.py`, `train_1bit_cpu_avx.py`, `bench_bit_avx.py`
	(AVX path); `bit_cpu_kernel.py`, `train_1bit_cpu.py`, `bench_bit_cpu.py`
	(slow pure-PyTorch path); `train_fp32_cpu_baseline.py` (control).
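
The C extension itself is not reproduced in these notes. The identity that popcount-based kernels for ±1 vectors generally rely on can be shown in plain Python: with one bit per element, the dot product equals `n - 2 * popcount(a XOR b)`. The helper names below are hypothetical, and Python integers stand in for the 64-bit words a C kernel would pass to `__builtin_popcountll`.

```python
def pack_signs(values):
    """Pack a +/-1 vector into an int: bit i is set when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n.

    Equal bits contribute +1 and differing bits contribute -1, so
    dot = (n - differing) - differing = n - 2 * popcount(a XOR b).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
packed = popcount_dot(pack_signs(a), pack_signs(b), len(a))
print(reference, packed)
```

Because the result is an exact integer, this path has no rounding error. That fits the note's attribution of the lower loss to the int32 exact dot product.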

	## Int8 GEMM kernel work (synth-validated; awaits 75M benchmark)

	- `session_2026-05-08_int8_gemm_kernel.md` — full session writeup for
	Triton int8 forward GEMM + cached weight buffer + int8 dx backward.

	Status: 13 CPU tests pass with zero numerical diff. Files ready to sync
	to VM and benchmark. Awaits VM (port 1145 timing out since session start).

	Files: `bit_int8_kernel.py`, `model_v47.py` (with `enable_int8_gemm()`),
	`bench_bit_gemm.py`, `test_int8_gemm_cpu.py`, `test_int8_e2e_cpu.py`,
	`train_fineweb_v75.py` and `train_300m_flex.py` (with `--use-int8-gemm`
	and `--int8-bwd` CLI flags).
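
For orientation, here is a pure-Python reference for what an int8 GEMM computes. It is a sketch only and assumes symmetric per-row scales for both activations and weights; the actual scheme in `bit_int8_kernel.py` is not described in these notes. The function names are hypothetical.

```python
def quantize_int8(row):
    """Symmetric per-row quantization to the int8 range; returns (ints, scale)."""
    amax = max(abs(x) for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def int8_gemm(a, w):
    """a is M x K floats, w is N x K floats (one row per output feature)."""
    qa = [quantize_int8(r) for r in a]
    qw = [quantize_int8(r) for r in w]
    out = []
    for a_q, a_s in qa:
        row = []
        for w_q, w_s in qw:
            acc = sum(x * y for x, y in zip(a_q, w_q))  # exact integer accumulation
            row.append(acc * a_s * w_s)                 # dequantize once per output
        out.append(row)
    return out

def float_gemm(a, w):
    """Plain float reference with the same layout."""
    return [[sum(x * y for x, y in zip(ar, wr)) for wr in w] for ar in a]

a = [[1.0, -2.0, 0.5]]
w = [[0.5, 0.25, -1.0], [2.0, 0.0, 1.0]]
print(int8_gemm(a, w), float_gemm(a, w))
```

The integer accumulation is exact, so two correct implementations of the same quantized GEMM should agree bit for bit. Any difference from the float result comes from the quantization step alone.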

	## Bonsai (anneal_binarize) — complete

	- `session_2026-05-07_bonsai_alibi_finetune.md` — bonsai is now the most
	efficient 300M training recipe. 38% training reaches 65% peak strength.
	Different attractor than cold-start (canonical L0H0 specialist).
	- Implementation in `model_v47.py` (`set_binarize_mix`, `_maybe_mixed_sign`).
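
The bodies of `set_binarize_mix` and `_maybe_mixed_sign` are not shown in these notes. One plausible reading of "anneal_binarize" is a blend between the latent weights and their binarized form, with the blend factor ramped from 0 to 1 during training. The sketch below illustrates that reading only. The function names, the mean-absolute-value scale, and the linear ramp are all assumptions.

```python
def mixed_sign_sketch(w, mix):
    """Blend latent weights with scale * sign(w); mix=0 is full precision, mix=1 is 1-bit."""
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be in [0, 1]")
    scale = sum(abs(x) for x in w) / len(w)
    return [(1.0 - mix) * x + mix * scale * (1.0 if x >= 0 else -1.0) for x in w]

def linear_mix(step, start, end):
    """Hypothetical schedule: mix ramps linearly from 0 at `start` to 1 at `end`."""
    if step <= start:
        return 0.0
    if step >= end:
        return 1.0
    return (step - start) / (end - start)

w = [0.5, -1.5]
for step in (0, 50, 100):
    m = linear_mix(step, 0, 100)
    print(step, m, mixed_sign_sketch(w, m))
```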

	## Negative results

	- `truebit_no_latent_negative.md` — flip-counter int weights without fp32
	latent: copy_acc 9.4% (vs 100% control). Gradient-magnitude info loss is
	fundamental.
	- Q-shift, split-Q, embedding orthogonality (refuted in v76_paper_synthesis).

	## Currently blocked

	- VM at `174.0.242.177:1145` has been unreachable since session start. All
	GPU-dependent verification awaits VM return:
	  - 75M K-shift L11 vs L0 head-to-head
	  - Triton int8 kernel benchmark
	  - 300M-scale anything

	## Code added/modified this multi-session period

	| File | Purpose |
	|---|---|
	| `bit_int8_kernel.py` | Triton int8 GEMM kernels (forward + dx backward) |
	| `model_v47.py` | int8 GEMM toggle, bonsai mix |
	| `model_v47b.py` | NEW: `kshift_offset` parameter through IntBinaryAttention/Block/LM |
	| `synth_induction_train.py` | Split-Q implementation, K-shift CLI override + offset |
	| `train_fineweb_v75.py` | `--use-int8-gemm`, `--int8-bwd`, `--k-shift-offset` flags |
	| `train_300m_flex.py` | Same int8 flags + `--k-shift`/`--k-shift-layers`/`--k-shift-offset`/`--rwsp`/`--rwsp-k`/`--q-shift`/`--q-shift-layers` flags. Also fixed FlexAttention monkey-patch to fall back to original forward when K-shift/Q-shift/RWSP enabled |
	| `bench_bit_gemm.py` | int8 vs bf16 benchmarks at training shapes |
	| `test_int8_gemm_cpu.py` | 8 unit tests |
	| `test_int8_e2e_cpu.py` | 5 end-to-end tests |
	| `analyze_synth_attention.py` | Attention argmax distribution analyzer (depth-general) |

	All files in `/home/nathan/1bitllm/`. Ready to sync to VM when accessible.

	## Ready-to-run experiments at scale (VM-blocked)

	Once VM access is restored, the following commands should work end-to-end:

	```bash
	# v76 paper's "easy fix" prediction: change offset from 1 to 5
	python train_fineweb_v75.py \
	--rwsp --k-shift --k-shift-layers 0 --k-shift-offset 5 \
	--max-steps 20000 --tag v76b_offset5

	# Compare to baseline (current v76):
	python train_fineweb_v75.py \
	--rwsp --k-shift --k-shift-layers 0 --k-shift-offset 1 \
	--max-steps 20000 --tag v76_offset1_baseline
	```

	Predicted: offset=5 reaches lower val_bpc and >40% top-per-head IPMR (vs ~10% for the baseline).