| # Notes index: research project status |
|
|
| ## v76 paper: synth findings PARTIALLY refuted at 75M scale |
|
|
| ### Synth-scale story (mechanistically interesting, prescriptively weak) |
|
|
| - `v76_paper_outline.md`: paper outline with **honest 75M empirical results**. |
| - `v76_paper_synthesis.md`: synth mechanism narrative. |
| - `v76_math_derivation.md`: original math memo (mostly refuted). |
| - `d_sweep_synth.md`, `split_q_synth.md`: raw synth data. |
| - `synth_results_summary.csv`: 38 synth experiments. |
|
|
| ### 75M empirical results (this session) |
|
|
| - `75m_kshift_offset_validation.md`: **the honest result**. The synth-predicted |
| offset=2 advantage REVERSED at 75M: offset=1 wins by 0.014 BPC at 5K steps. |
| - `synth_alibi_offset_ablation.md`: synth+ALiBi+n=8 placement sweep at |
| 4L (9 seeds × 4 layers): monotonic L0→L3 with lottery distributions. |
| At 8L, L0 fails 3/3 seeds while L4/L7 are deterministic (3/3 hit). L1 |
| is reliable on 4/5 seeds, but seed=100 fails at L1 (17%): **L1 still has a |
| lottery-fail tail; L4-L7 are strictly more reliable**. K-shift L0 is |
| uniquely bad. For 75M: K-shift L1 is a ~80% confident win; K-shift L5+ is high |
| confidence even at a single seed. |
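As a rough check on how much the seed counts above can support, a Wilson score interval gives a plausible range for the true hit rate behind a handful of seeds. This is an illustrative sketch, not part of the original analysis: the helper name `wilson_interval` and the 95% level (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Seed counts quoted in the notes: L1 hits on 4/5 seeds, L4/L7 hit on 3/3.
l1_low, l1_high = wilson_interval(4, 5)
deep_low, deep_high = wilson_interval(3, 3)
print(f"L1 (4/5):    {l1_low:.2f} to {l1_high:.2f}")
print(f"L4/L7 (3/3): {deep_low:.2f} to {deep_high:.2f}")
```

With this few seeds both intervals are wide, so the reliability ranking is best read as suggestive until more seeds are run.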
|
|
| **Empirical 75M head-to-head** (softmax+K-shift L0, 5000 steps each): |
|
|
| | Step | offset=2 | offset=1 | Δ (offset=2 - offset=1) | |
| |---|---|---|---| |
| | 1500 | 7.9281 | 7.9606 | -0.033 (early) | |
| | 2000 | 7.6452 | 7.5921 | +0.053 (post-warmup) | |
| | **5000** | **6.7779** | **6.7639** | **+0.014** | |
|
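The sign convention in the last column can be recomputed from the two value columns. The sketch below assumes the difference is offset=2 minus offset=1 and that lower is better (consistent with offset=1 "winning" at step 5000). The variable names are illustrative, and `Decimal` is used only to avoid float rounding noise.

```python
from decimal import Decimal

# (step, offset=2 value, offset=1 value) copied from the head-to-head table.
rows = [
    (1500, "7.9281", "7.9606"),
    (2000, "7.6452", "7.5921"),
    (5000, "6.7779", "6.7639"),
]

deltas = {}
for step, offset2, offset1 in rows:
    # Negative delta: offset=2 is lower (better). Positive: offset=1 is better.
    deltas[step] = Decimal(offset2) - Decimal(offset1)

for step, delta in deltas.items():
    leader = "offset=2" if delta < 0 else "offset=1"
    print(f"step {step}: delta = {delta:+} ({leader} ahead)")
```

The exact difference at step 1500 is -0.0325, which the table reports as -0.033.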
|
| The synth-predicted offset=2 advantage is real during warmup (steps 1000-1500) |
| but **erased post-warmup**. At full LR, offset=1 (the canonical v76 setting) |
| is slightly better. Likely causes: ALiBi attenuation, natural-text variable- |
| distance induction (vs synth's fixed distance 24). |
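For context on what the offset controls, here is a toy sketch of one plausible reading of K-shift: each position's key is replaced by the key from `offset` positions earlier, zero-padded at the start. This reading is an assumption, and `shift_keys` is a hypothetical helper, not the code in `model_v47b.py`. With one-hot embeddings and offset=1, a repeated token attends to the position right after its earlier occurrence, which is the induction pattern.

```python
def shift_keys(keys, offset):
    """Delay keys by `offset` positions, zero-padding the start (assumed K-shift)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if not keys:
        return []
    dim = len(keys[0])
    pad = [[0.0] * dim for _ in range(min(offset, len(keys)))]
    return pad + [list(k) for k in keys[: len(keys) - len(pad)]]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


tokens = ["a", "b", "c", "a"]
vocab = sorted(set(tokens))
embed = [one_hot(vocab.index(t), len(vocab)) for t in tokens]

# Toy attention: queries and keys are both the raw token embeddings.
shifted = shift_keys(embed, offset=1)
query = embed[-1]  # the second "a"
scores = [dot(query, k) for k in shifted]
best = max(range(len(scores)), key=scores.__getitem__)
print(f"last token attends to position {best} (token {tokens[best]!r})")
```

In this toy the match distance is fixed by construction; natural text has no such fixed distance, which is one of the likely causes named above.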
|
|
| **Synth K-shift findings = MECHANISTIC INSIGHT, not optimization prescription.** |
|
|
| ### Outstanding (VM lost connectivity mid-run) |
|
|
| - K-shift L11 placement test at 75M: started, but the VM dropped. Needs re-run. |
| Synth predicted placement was the bigger effect; whether that transfers |
| to 75M is the remaining open question. |
|
|
| ## Read this first: `SUMMARY_bonsai_1bit.md` consolidates the full bonsai 1-bit research arc. |
|
|
| ## End-to-end 1-bit Bonsai training (CPU): W1A1 + a4.8 + AVX-512 popcount |
|
|
| - `bonsai_e2e_1bit.md`: full Q1_0_g128 format applied DURING training, |
| end-to-end (token embed + attention + FFN + LM head). **Held-out scaling |
| with PLATEAU @ ~0.04 nat:** d=128/2L 76K +0.41 nat → d=192/4L 340K +0.09 |
| → d=256/4L 1.1M +0.045 → d=512/4L 5M +0.044 (plateau, multi-seed). |
| Not a clean power law: the data suggests an inherent ~0.04 nat floor |
| from W1A1+a4.8+STE that doesn't shrink with more capacity. Sweet spot |
| is the d=256/1M-param range. 14.8× weight compression, AVX-512 popcount |
| forward GEMM, 3-seed verified at d=128, d=192, d=256, d=512. |
| - W1A1 vanilla penalty: 0.52 nat (vs fp32). Most reclaimed via a4.8 |
| boundary trick (down to 0.22 nat penalty). |
| - AVX-512 popcount + per-group fp16 scale fusion: equal-quality, faster. |
| - Lion underperforms AdamW at this scale; hard top-k attention loses 2+ |
| nats. Both are kept as flags but are not in the recommended recipe. |
| - Two backward bugs fixed: STE clip mask + per-group scale chain rule. |
| - Files: `bonsai_layers.py`, `train_bonsai_cpu.py`, `lion_optim.py`, |
| `bit_cpu_avx.py`, `bench_bit_avx.py` |
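To make the weight format concrete, the sketch below shows one common way a grouped 1-bit format can work: each group of weights keeps only its signs plus a single shared scale. Reading `g128` as "groups of 128" and using the group's mean absolute value as the scale are both assumptions here. The helper names are hypothetical and are not taken from `bonsai_layers.py`, and no STE backward pass is shown.

```python
def quantize_groups(weights, group_size=128):
    """Return (signs, scales): one +1/-1 per weight, one scale per group."""
    signs, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumed scale choice: mean absolute value of the group.
        scales.append(sum(abs(w) for w in group) / len(group))
        signs.extend(1 if w >= 0 else -1 for w in group)
    return signs, scales


def dequantize_groups(signs, scales, group_size=128):
    """Rebuild approximate weights as sign * scale of the owning group."""
    return [s * scales[i // group_size] for i, s in enumerate(signs)]


# Tiny demo with group_size=2 so the groups are easy to read.
weights = [0.5, -1.5, 2.0, -2.0]
signs, scales = quantize_groups(weights, group_size=2)
restored = dequantize_groups(signs, scales, group_size=2)
print(signs, scales, restored)
```

Only the signs and the per-group scales would need to be stored, which is where the weight compression comes from.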
| |
| ## CPU 1-bit training (this session): AVX-512 popcount WINS |
|
|
| - `cpu_1bit_training.md`: built `bit_cpu_avx.py` C extension using |
| `__builtin_popcountll` (AVX-512 VPOPCNTQ on Zen 4). **3.34× faster than |
| BLAS bf16 at 256×1536×1536 (300M training shape).** End-to-end training |
| speedup 10-18% (grows with model d) AND lower loss thanks to int32 |
| exact dot vs bf16 truncation. This is the actual analog of bitnet.cpp's |
| inference path applied to training. |
| - Pure-PyTorch bit-packing (uint8 + LUT popcount) was a dead end: 367× |
| SLOWER than BLAS. Need C-level popcount intrinsics. |
| - Files: `bit_cpu_avx.py`, `train_1bit_cpu_avx.py`, `bench_bit_avx.py` |
| (AVX path); `bit_cpu_kernel.py`, `train_1bit_cpu.py`, `bench_bit_cpu.py` |
| (slow pure-PyTorch path); `train_fp32_cpu_baseline.py` (control). |
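The identity behind a popcount-based dot product can be sketched in plain Python. For two ±1 vectors of length n packed as bits, the dot product equals n minus twice the number of mismatching bits, so one XOR plus one popcount replaces n multiply-adds. The helper names are illustrative and this is not the code in `bit_cpu_avx.py`; it shows the arithmetic only, not the speed.

```python
def pack_signs(values):
    """Pack +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XOR + popcount."""
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of differing bits
    return n - 2 * mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
print(reference, fast)
```

Because the result is an exact integer, there is no rounding in the accumulation, which matches the note about exact int32 dot products versus bf16 truncation.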
|
|
| ## Int8 GEMM kernel work (synth-validated; awaits 75M benchmark) |
|
|
| - `session_2026-05-08_int8_gemm_kernel.md`: full session writeup for |
| Triton int8 forward GEMM + cached weight buffer + int8 dx backward. |
|
|
| **Status**: 13 CPU tests pass with zero numerical diff. Files ready to sync |
| to VM and benchmark. Awaits VM (port 1145 timing out since session start). |
|
|
| Files: `bit_int8_kernel.py`, `model_v47.py` (with `enable_int8_gemm()`), |
| `bench_bit_gemm.py`, `test_int8_gemm_cpu.py`, `test_int8_e2e_cpu.py`, |
| `train_fineweb_v75.py` and `train_300m_flex.py` (with `--use-int8-gemm` |
| and `--int8-bwd` CLI flags). |
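For readers new to int8 GEMM, here is a minimal pure-Python sketch of the general idea: quantize both operands to the int8 range, accumulate in integers, then rescale. Symmetric per-tensor scaling is an assumption made for brevity. The function names are hypothetical, and this is not the Triton kernel in `bit_int8_kernel.py`.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def int8_matmul(a, b):
    """Quantize a (m x k) and b (k x n), multiply in integers, then rescale."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    cols = list(zip(*qb))
    # Python ints are exact, standing in for an int32 accumulator.
    acc = [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in qa]
    return [[v * sa * sb for v in row] for row in acc]


a = [[1.0, -2.0], [0.5, 4.0]]
b = [[2.0, 0.0], [1.0, -1.0]]
approx = int8_matmul(a, b)
exact = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
max_err = max(abs(p - q) for pr, qr in zip(approx, exact) for p, q in zip(pr, qr))
print(f"max abs error vs float matmul: {max_err:.4f}")
```

The integer accumulation itself is exact; the only error comes from rounding the inputs to int8, which is why a kernel can be checked against an integer reference with zero numerical difference.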
|
|
| ## Bonsai (anneal_binarize): complete |
| |
| - `session_2026-05-07_bonsai_alibi_finetune.md`: bonsai is now the most |
| efficient 300M training recipe. 38% training reaches 65% peak strength. |
| Different attractor than cold-start (canonical L0H0 specialist). |
| - Implementation in `model_v47.py` (`set_binarize_mix`, `_maybe_mixed_sign`). |
| |
| ## Negative results |
| |
| - `truebit_no_latent_negative.md`: flip-counter int weights without fp32 |
| latent: copy_acc 9.4% (vs 100% control). Gradient-magnitude info loss is |
| fundamental. |
| - Q-shift, split-Q, embedding orthogonality (refuted in v76_paper_synthesis). |
| |
| ## Currently blocked |
| |
| - VM at `174.0.242.177:1145` unreachable since session start. All |
| GPU-dependent verification awaits VM return: |
| - 75M K-shift L11 vs L0 head-to-head |
| - Triton int8 kernel benchmark |
| - 300M-scale anything |
| |
| ## Code added/modified this multi-session period |
| |
| | File | Purpose | |
| |---|---| |
| | `bit_int8_kernel.py` | Triton int8 GEMM kernels (forward + dx backward) | |
| | `model_v47.py` | int8 GEMM toggle, bonsai mix | |
| | `model_v47b.py` | **NEW**: `kshift_offset` parameter through IntBinaryAttention/Block/LM | |
| | `synth_induction_train.py` | Split-Q implementation, K-shift CLI override + offset | |
| | `train_fineweb_v75.py` | `--use-int8-gemm`, `--int8-bwd`, **`--k-shift-offset`** flags | |
| | `train_300m_flex.py` | Same int8 flags + **`--k-shift`/`--k-shift-layers`/`--k-shift-offset`/`--rwsp`/`--rwsp-k`/`--q-shift`/`--q-shift-layers`** flags. Also fixed FlexAttention monkey-patch to fall back to original forward when K-shift/Q-shift/RWSP enabled | |
| | `bench_bit_gemm.py` | int8 vs bf16 benchmarks at training shapes | |
| | `test_int8_gemm_cpu.py` | 8 unit tests | |
| | `test_int8_e2e_cpu.py` | 5 end-to-end tests | |
| | `analyze_synth_attention.py` | Attention argmax distribution analyzer (depth-general) | |
|
|
| All files in `/home/nathan/1bitllm/`. Ready to sync to VM when accessible. |
|
|
| ## Ready-to-run experiments at scale (VM-blocked) |
|
|
| Once VM access is restored, the following commands should work end-to-end: |
|
|
| ```bash |
| # v76 paper's "easy fix" prediction: change offset from 1 to 5 |
| python train_fineweb_v75.py \ |
| --rwsp --k-shift --k-shift-layers 0 --k-shift-offset 5 \ |
| --max-steps 20000 --tag v76b_offset5 |
| |
| # Compare to baseline (current v76): |
| python train_fineweb_v75.py \ |
| --rwsp --k-shift --k-shift-layers 0 --k-shift-offset 1 \ |
| --max-steps 20000 --tag v76_offset1_baseline |
| ``` |
|
|
| Predicted: offset=5 reaches lower val_bpc and >40% top-per-head IPMR vs baseline ~10%. |