# H4 Polytopic Attention – Autonomous Research Protocol
|
|
| You are an autonomous research agent optimizing a **hybrid transformer** that combines frozen H4 geometric attention with trainable adapter layers. Your goal is to minimize `val_bpb` (bits per byte on held-out text) within a fixed compute budget. |
|
|
| ## The Architecture |
|
|
| The system has two parts: |
|
|
| 1. **Frozen geometric backbone** (DO NOT MODIFY): |
   - 600-cell vertices (120 × 4) – the H4 polytope
   - H4 simple roots (4 × 4) – Coxeter reflection hyperplanes
   - E8→H4 projection matrix (4 × 8) – golden ratio eigenvalues
   - ChamberTree structure – O(log t) spatial partitioning of S³
   - E8LatticeIndex – Voronoi cell memory backend
|
|
| 2. **Trainable adapters** (YOU MODIFY THESE): |
   - `W_q_proj`, `W_k_proj`, `W_v_proj` – input projections to H4/value space
   - `W_nudge` – per-head 4×4 rotation of query direction
   - `chamber_bonus` – per-head 16-dim learnable attention bias per chamber
   - `W_out` – output projection
| - FFN layers (fully trainable) |
| - Token/positional embeddings |
| - LM head |
|
|
| ## What You Can Change |
|
|
| In `python/train_cpu.py`, you may modify: |
|
|
| - **Hyperparameters:** learning rate, batch size, sequence length, warmup, grad clip, weight decay |
| - **Architecture of trainable layers:** d_model, n_heads, n_layers, d_value, d_ffn, top_k, dropout |
| - **Optimizer setup:** scheduler, betas, epsilon |
| - **Training loop:** loss weighting, gradient accumulation, evaluation strategy |
| - **Adapter architecture:** number of nudge parameters, chamber bonus structure, FFN design |
|
|
| ## What You CANNOT Change |
|
|
- `python/h4_polytopic_attention.py` – frozen geometry
- `python/utils/phi_positional.py` – golden-angle encoding (you can change how it's used, not the encoding itself)
- `python/utils/chamber_index.py` – chamber lookup bridge
| - The H4 vertices, simple roots, E8 projection, ChamberTree structure |
- The fundamental constraint that Q and K live on S³ (the unit 4-sphere)
|
|
| ## The Loop |
|
|
| ``` |
| while forever: |
| 1. Read current train_cpu.py and results.tsv |
| 2. Form a hypothesis about what change will improve val_bpb |
| 3. Modify train_cpu.py (the ONLY file you modify) |
| 4. Run: cd python && python train_cpu.py |
| 5. Parse the "---" summary block from stdout |
| 6. If val_bpb improved or the experiment is informative: |
| - git add python/train_cpu.py && git commit |
| - Append to results.tsv: commit<TAB>val_bpb<TAB>val_loss<TAB>chamber_entropy<TAB>status<TAB>description |
| - status = "keep" |
| 7. If val_bpb did not improve: |
| - git checkout python/train_cpu.py (discard changes) |
| - Append to results.tsv with status = "discard" |
| 8. If the run crashed: |
| - Fix the crash, do NOT count it as an experiment |
| - Append to results.tsv with status = "crash" |
| 9. Think about what you learned. Update your mental model of: |
| - Which hyperparameters matter most |
| - Whether the geometry is being utilized (check chamber_entropy) |
| - Whether W_nudge is learning meaningful directions (check geo_alignment) |
| - What the loss landscape looks like |
| 10. Repeat |
| ``` |
|
|
| ## Time Budget |
|
|
| - **2 minutes per experiment** on CPU (TIME_BUDGET = 120 in train_cpu.py) |
| - This allows ~24 experiments in an overnight run |
- If an experiment takes longer than 3 minutes, it has a bug – fix it
|
|
| ## Metrics |
|
|
Primary: `val_bpb` – lower is better. Bits per byte on held-out character-level text.
|
|
| Diagnostic (track but don't optimize directly): |
- `chamber_entropy` – Shannon entropy of chamber utilization. High = using the full geometry; low = collapsed onto a few chambers.
- `avg_nudge_rank` – effective rank of W_nudge's deviation from identity. Near 1 = focused, rank-1 direction (good); higher = diffuse.
- `avg_geo_alignment` – max dot product of the W_nudge dominant direction with the 600-cell vertices. >0.9 = strongly aligned with the geometry.
- `num_steps` – training throughput indicator.
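
The chamber-entropy diagnostic reduces to Shannon entropy over empirical chamber-visit counts. A sketch (the real diagnostic lives in the training script):

```python
import numpy as np

def chamber_entropy(chamber_ids: np.ndarray) -> float:
    """Shannon entropy (nats) of the empirical chamber-utilization distribution."""
    _, counts = np.unique(chamber_ids, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```

All tokens landing in one chamber gives 0; a uniform spread over k chambers gives log(k), the maximum.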
|
|
| ## results.tsv Format |
|
|
| ``` |
| commit val_bpb val_loss chamber_entropy status description |
| a1b2c3d 2.345678 1.625000 2.1234 keep baseline: d_model=64, 2 layers, lr=3e-4 |
| e4f5g6h 2.298765 1.592500 2.3456 keep increased d_ffn from 256 to 512 |
| ``` |
|
|
| Tab-separated. Short 7-char commit hashes. Do not commit results.tsv to git. |
|
|
| ## Strategy Hints |
|
|
Based on the Fibonacci proof-of-concept (26 trainable params on the frozen H4 backbone, φ gap 0.025→0.001):
|
|
1. **The geometry provides strong inductive bias.** W_nudge naturally converges to rank-1, aligning with 600-cell vertices. Don't fight this – let the nudge stay small.
| |
| 2. **Chamber utilization matters.** If entropy is low, the model is only using a few chambers. Try: different init for W_nudge, larger top_k, or adding noise to queries during training. |
| |
| 3. **Start with the simplest change.** Learning rate and d_model matter most. Don't change 5 things at once. |
|
|
4. **The tree is for long sequences.** For seq_len ≤ 256, full attention is faster than tree lookup (Python overhead). Set `use_tree = MAX_SEQ_LEN > 256`.
|
|
| 5. **Watch for mode collapse.** If all heads learn the same nudge direction, the model wastes capacity. Consider adding a diversity loss or different initializations per head. |
|
|
6. **The golden ratio is not arbitrary.** φ⁻¹ is the most irrational number – golden-angle positions are maximally separated. The positional encoding exploits this. Don't replace it with sinusoidal.
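
Hint 6 can be checked numerically: stepping around the circle by the golden angle keeps points well separated, while any rational step piles them onto a few slots. A standalone illustration, not code from phi_positional.py:

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2
golden_angle = 2 * np.pi / phi**2   # ~2.400 rad (~137.5 degrees)

def min_gap(n, step):
    """Smallest circular gap among n points placed at successive multiples of `step`."""
    angles = np.sort((np.arange(n) * step) % (2 * np.pi))
    gaps = np.diff(np.append(angles, angles[0] + 2 * np.pi))
    return float(gaps.min())

golden = min_gap(100, golden_angle)        # stays comfortably bounded away from zero
rational = min_gap(100, 2 * np.pi / 10)    # 100 points collapse onto 10 slots: gap ~0
```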
|
|
| ## Data |
|
|
| Currently using character-level text (Shakespeare if available, synthetic Fibonacci-structured text otherwise). To add real data: |
|
|
| 1. Download TinyStories: `wget https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean/resolve/main/TinyStories_all_data.tar.gz` |
| 2. Extract to `data/` |
| 3. Update `load_text_data()` in train_cpu.py |
| |
| For now, the synthetic data is fine for proving the architecture works. Switch to real data once val_bpb is decreasing consistently. |
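
A hedged sketch of what the updated `load_text_data()` might look like; the filenames and the fallback generator are assumptions, not the current implementation:

```python
import os

def load_text_data(data_dir="data"):
    """Return training text: a real corpus if present, a synthetic fallback otherwise."""
    # Hypothetical filenames -- adjust to whatever actually lands in data/.
    for name in ("TinyStories_all_data.txt", "shakespeare.txt"):
        path = os.path.join(data_dir, name)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
    # Fallback: the Fibonacci word, a self-similar aperiodic character stream.
    a, b = "a", "ab"
    while len(b) < 100_000:
        a, b = b, b + a
    return b[:100_000]
```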
|
|
| ## Phase 6: BitLinear Experiments |
|
|
| Toggle ternary mode with `USE_BITLINEAR = True` in hyperparameters section. |
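
For reference, the core of ternary quantization in the BitNet b1.58 style is absmean scaling followed by rounding to {-1, 0, +1}. This is a sketch, not necessarily the in-repo BitLinear implementation:

```python
import numpy as np

def ternarize(W: np.ndarray):
    """Absmean quantization: one scale per tensor, weights rounded to {-1, 0, +1}."""
    scale = np.abs(W).mean() + 1e-8
    W_t = np.clip(np.round(W / scale), -1, 1)
    return W_t, scale

def bitlinear_forward(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Inference-style forward pass; training would add a straight-through estimator."""
    W_t, scale = ternarize(W)
    return (x @ W_t) * scale
```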
|
|
| ### Experiment Ideas |
|
|
| 1. **Baseline comparison**: Same architecture, float vs ternary. |
| Measure val_bpb gap. BitNet Reloaded shows <0.1 bpb gap at 100K+ params |
| with 2x hidden size. |
| |
| 2. **Hidden size scaling**: If ternary hurts val_bpb, try doubling d_model |
| (64->128) while keeping ternary. The 2x scaling law from BitNet Reloaded |
| predicts this recovers the gap. |
| |
| 3. **Selective ternary**: Make only the large projections (q/k/v/out) ternary, |
| keep the 4x4 nudge layers in float. The nudge is where geometric alignment |
| happens --- it might need float precision. |
| |
| 4. **Zero ratio tracking**: Monitor the zero percentage in ternary nudge |
| layers across training. High zero% means the head learned feature |
| selection (ignoring some Coxeter directions). Plot zero% vs |
| chamber_entropy --- they should be inversely correlated. |
|
|
| 5. **Chamber preservation sweep**: Run ternary_diagnostics.chamber_preservation |
| after each experiment. If preservation drops below 85%, the ternary |
| quantization is too aggressive for that architecture. |
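
Zero-ratio tracking (idea 4) reduces to counting exact zeros in the quantized tensors. A sketch, assuming the ternarized weights are accessible after quantization:

```python
import numpy as np

def zero_pct(W_ternary: np.ndarray) -> float:
    """Percentage of exactly-zero entries in a ternary weight tensor."""
    return 100.0 * float((W_ternary == 0).mean())

def mean_zero_pct(ternary_layers) -> float:
    """The mean_zero_pct metric: average zero% across all BitLinear layers."""
    return float(np.mean([zero_pct(W) for W in ternary_layers]))
```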
|
|
| ### Additional Metrics |
|
|
| - `ternary`: yes/no --- whether BitLinear is active |
| - `chamber_preserve`: mean chamber preservation rate (float vs ternary) |
| - `mean_zero_pct`: mean zero% across BitLinear layers |
| - `compression`: weight compression ratio vs float32 |
| - `model_size_kb`: total model size in KB |
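
The `compression` figure can be sanity-checked from bit widths alone: a ternary weight carries log2(3) ≈ 1.585 bits of information versus 32 for float32 (per-tensor scales ignored):

```python
import math

bits_float32 = 32
bits_ternary = math.log2(3)                  # ~1.585 bits per weight at the information-theoretic limit
compression = bits_float32 / bits_ternary    # ~20x, consistent with the keep/discard rules below
```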
|
|
| ### Keep/Discard Rules for Ternary |
|
|
| Same as float rules, plus: |
| - A ternary experiment that matches float val_bpb within 0.05 is a WIN |
| (same quality, ~20x smaller weights) |
| - A ternary experiment with >5% chamber preservation drop is SUSPECT |
| (check if val_bpb actually suffered --- preservation can drop without |
| hurting quality if the dropped chambers weren't being used) |
|
|