# AGILLM-3 Attention Experiments

**Date:** January 15, 2026

**Author:** Silicon Goddess + Scott Bisset
## Summary
Tested 13+ attention mechanisms for joint AR+SAT training. **Result: GQA wins** on combined loss, by a narrow 0.1% margin, but the standard attention used by the current checkpoints holds up well.
## Files
| File | Purpose |
|------|---------|
| `n.py` | **ORIGINAL** - Use this for existing checkpoints |
| `n_gqa.py` | GQA variant - backward compatible with `n.py` checkpoints (via weight conversion) |
| `experiments/n_heavy.py` | Heavy attention tests (iterative, triplet, multi-hop) |
| `experiments/n_heavy2.py` | More heavy tests (slot, edge, memory, recurrent) |
| `experiments/n_ultra.py` | Ultra-heavy tests (NTM, energy, N-body, hyper) |
| `experiments/n_flex.py` | Flexible attention (linear, cosine, MQA, GQA, retention) |
| `experiments/joint_test.py` | Joint AR+SAT training comparison |
| `experiments/final_showdown.py` | Compute-matched depth vs complexity |
| `experiments/infer_bench.py` | Inference speed + KV cache benchmarks |
## Key Results
### Joint AR+SAT Training (what AGILLM-3 does)
Lower combined loss is better. Percentages are relative to standard attention; a positive value marks an improvement.
| Attention | Combined Loss | KV Cache Size |
|-----------|---------------|---------------|
| GQA (2 heads) | 78.49 (+0.1%) | 0.25x |
| **Standard** | **78.58** | **1.00x** |
| MQA | 78.82 (-0.3%) | 0.12x |
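
The percentages in the table can be reproduced from the raw losses. This is a minimal sketch, assuming the sign convention that a positive value means a lower (better) loss than standard attention. The helper name `improvement_pct` is illustrative and not part of the repository.

```python
def improvement_pct(loss: float, baseline: float) -> float:
    """Relative improvement over the baseline in percent (positive = lower loss)."""
    return (baseline - loss) / baseline * 100.0


STANDARD = 78.58  # combined loss of standard attention, taken from the table

for name, loss in [("GQA", 78.49), ("MQA", 78.82)]:
    print(f"{name}: {improvement_pct(loss, STANDARD):+.1f}%")
# GQA: +0.1%
# MQA: -0.3%
```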
### Inference Memory Savings
| Attention | KV Cache | Inference Speed |
|-----------|----------|-----------------|
| Standard | 64 MB | baseline |
| GQA | 16 MB | 0.84x |
| MQA | 8 MB | 0.87x |
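
The cache sizes scale with the number of key/value heads each variant stores. The sketch below is a back-of-the-envelope calculation, not the benchmark code. The configuration (8 layers, 8 query heads, head dimension 64, 4096-token context, fp16) is hypothetical, chosen only because it reproduces the 64 MB baseline; the actual AGILLM-3 config is not stated here. It also assumes "GQA (2 heads)" means 2 KV heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical config (an assumption, not taken from the repository).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 8, 8, 64, 4096

# Number of KV heads per variant; GQA = 2 is an assumed reading of the table.
variants = {"Standard": N_HEADS, "GQA": 2, "MQA": 1}
baseline = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)

for name, kv_heads in variants.items():
    size = kv_cache_bytes(N_LAYERS, kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 2**20:.0f} MB ({size / baseline:.3f}x)")
# Standard: 64 MB (1.000x)
# GQA: 16 MB (0.250x)
# MQA: 8 MB (0.125x)
```

Because the cache grows linearly with `seq_len`, the same ratios apply at any context length.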
### The Bitter Lesson Confirmed
Heavy attention mechanisms (iterative, memory-augmented, physics-based) **all lose** to standard attention at an equal compute budget. A simpler mechanism runs faster, so the same budget covers more training data, and more data yields a better model.
## Recommendation
**Keep using `n.py` with standard attention for now.** The 0.1% improvement from GQA isn't worth giving up direct compatibility with existing checkpoints. GQA becomes valuable when:
- Inference memory is constrained
- Context length needs to increase significantly
- A fresh training run is starting
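
For reference, the mechanism behind the memory savings can be sketched as follows: several query heads share one key/value head, so only the smaller K/V tensors need to be cached. This is a generic PyTorch illustration, not the code in `n_gqa.py`. The function name, tensor layout, and causal masking are assumptions.

```python
import torch
import torch.nn.functional as F


def gqa_attention(q, k, v, is_causal=True):
    """Grouped-query attention (illustrative sketch).

    q:    (batch, n_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), n_heads divisible by n_kv_heads
    """
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_heads // n_kv_heads
    # Each KV head serves `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)


batch, seq, head_dim = 1, 16, 64
q = torch.randn(batch, 8, seq, head_dim)
k = torch.randn(batch, 2, seq, head_dim)  # only these 2 heads would be cached
v = torch.randn(batch, 2, seq, head_dim)
out = gqa_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention, and with `n_kv_heads == 1` it is MQA.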
## Checkpoint Compatibility
```python
import torch
# Assumes the model classes are defined in the files listed above
from n import AGILLM3
from n_gqa import AGILLM3_GQA

# Load an existing checkpoint with the original n.py
# (cfg is the model config the checkpoint was trained with)
model = AGILLM3(cfg)
model.load_state_dict(torch.load("checkpoint.pt"))

# For GQA: use n_gqa.py with convert_from_standard=True
model = AGILLM3_GQA(cfg, convert_from_standard=True)
model.load_from_standard("checkpoint.pt")  # Converts weights
```
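
How `load_from_standard` converts the weights is not stated here. One common approach, described in the GQA paper (Ainslie et al., 2023), is to mean-pool the key and value projection heads within each group. The sketch below works under that assumption. The function name and the `nn.Linear`-style weight layout are hypothetical and may not match `n_gqa.py`.

```python
import torch


def mean_pool_kv_weight(weight, n_heads, n_kv_heads):
    """Collapse a K or V projection from n_heads to n_kv_heads by averaging.

    weight:  (n_heads * head_dim, d_model), as in nn.Linear(d_model, n_heads * head_dim)
    returns: (n_kv_heads * head_dim, d_model)
    """
    out_features, d_model = weight.shape
    head_dim = out_features // n_heads
    group = n_heads // n_kv_heads
    # Split rows into (kv_head, head_within_group, head_dim), then average each group.
    grouped = weight.reshape(n_kv_heads, group, head_dim, d_model)
    return grouped.mean(dim=1).reshape(n_kv_heads * head_dim, d_model)


w_k = torch.randn(8 * 64, 512)  # hypothetical: 8 heads, head_dim 64, d_model 512
w_k_gqa = mean_pool_kv_weight(w_k, n_heads=8, n_kv_heads=2)
print(w_k_gqa.shape)  # torch.Size([128, 512])
```

The query and output projections keep their original shape under this scheme; only the key and value projections shrink.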