AGILLM-3 Attention Experiments
Date: January 15, 2026
Author: Silicon Goddess + Scott Bisset
Summary
Tested 13+ attention mechanisms for joint AR+SAT training. Result: GQA wins on KV-cache memory at near-identical loss, but the standard attention in current checkpoints remains solid.
Files
| File | Purpose |
|---|---|
| n.py | ORIGINAL - use this for existing checkpoints |
| n_gqa.py | GQA variant - backward compatible with n.py checkpoints |
| experiments/n_heavy.py | Heavy attention tests (iterative, triplet, multi-hop) |
| experiments/n_heavy2.py | More heavy tests (slot, edge, memory, recurrent) |
| experiments/n_ultra.py | Ultra-heavy tests (NTM, energy, N-body, hyper) |
| experiments/n_flex.py | Flexible attention (linear, cosine, MQA, GQA, retention) |
| experiments/joint_test.py | Joint AR+SAT training comparison |
| experiments/final_showdown.py | Compute-matched depth vs. complexity |
| experiments/infer_bench.py | Inference speed + KV cache benchmarks |
Key Results
Joint AR+SAT Training (what AGILLM-3 does)
| Attention | Combined Loss | KV Cache Size |
|---|---|---|
| GQA (2 heads) | 78.49 (0.1% better) | 0.25x |
| Standard | 78.58 | 1.00x |
| MQA | 78.82 (0.3% worse) | 0.12x |
Inference Memory Savings
| Attention | KV Cache | Inference Speed (relative) |
|---|---|---|
| Standard | 64 MB | baseline |
| GQA | 16 MB | 0.84x |
| MQA | 8 MB | 0.87x |
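The cache sizes above fall straight out of the KV-head count. A minimal sketch of the arithmetic (the layer count, head dim, and sequence length here are illustrative assumptions, not the actual AGILLM-3 config; with these dims, 8 query heads sharing 2 KV heads reproduces the 64 MB → 16 MB drop):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Size of the K+V cache in bytes; the leading 2 covers K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative dims (assumptions): 16 layers, head_dim 64, 2048 tokens, fp16.
standard = kv_cache_bytes(n_layers=16, n_kv_heads=8, head_dim=64, seq_len=2048)
gqa = kv_cache_bytes(n_layers=16, n_kv_heads=2, head_dim=64, seq_len=2048)
print(standard // 2**20, gqa // 2**20)  # → 64 16
```

Only `n_kv_heads` enters the formula, which is why GQA's saving is exactly the ratio of query heads to KV heads regardless of the other dims.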
The Bitter Lesson Confirmed
Heavy attention mechanisms (iterative, memory-augmented, physics-based) all lose to standard attention at an equal compute budget. Simpler = faster = more data = better.
Recommendation
Keep using n.py with standard attention for now. The 0.1% improvement from GQA isn't worth breaking checkpoint compatibility. GQA becomes valuable when:
- Inference memory is constrained
- Context length needs to increase significantly
- Starting a fresh training run
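For reference, the core GQA trick is just sharing each K/V head across a group of query heads. A minimal sketch (function name and shapes are illustrative, not n_gqa.py's actual API):

```python
import torch


def gqa_attention(q, k, v, n_kv_heads):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    b, n_heads, t, d = q.shape
    group = n_heads // n_kv_heads
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    att = (q @ k.transpose(-2, -1)) / d ** 0.5
    return att.softmax(dim=-1) @ v
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention; with `n_kv_heads == 1` it is MQA, matching the 1.00x / 0.25x / 0.12x cache ratios in the tables above.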
Checkpoint Compatibility
```python
import torch

# Load an existing checkpoint with the original n.py
model = AGILLM3(cfg)
model.load_state_dict(torch.load("checkpoint.pt"))

# For GQA: use n_gqa.py with convert_from_standard=True
model = AGILLM3_GQA(cfg, convert_from_standard=True)
model.load_from_standard("checkpoint.pt")  # converts standard-attention weights to GQA
```
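One plausible way `load_from_standard` could map standard K/V projections onto fewer shared heads is mean-pooling each group of head projections, as in the GQA uptraining recipe. A hedged sketch (assumption: n_gqa.py may do this differently; the function name here is illustrative):

```python
import torch


def pool_kv_for_gqa(w, n_heads, n_kv_heads, head_dim):
    """Mean-pool a standard K or V projection of shape (n_heads * head_dim, d_model)
    down to n_kv_heads shared heads. Illustrative sketch, not n_gqa.py's code."""
    d_model = w.shape[1]
    group = n_heads // n_kv_heads
    # Regroup rows by (kv_head, query heads in group, head_dim) and average the group.
    w = w.view(n_kv_heads, group, head_dim, d_model)
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, d_model)
```

Query and output projections are unaffected by the conversion; only the K and V projection matrices shrink, which is what makes the variant checkpoint-convertible rather than drop-in compatible.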