
AGILLM-3 Attention Experiments

Date: January 15, 2026
Author: Silicon Goddess + Scott Bisset

Summary

Tested 13+ attention mechanisms for joint AR+SAT training. Result: GQA wins on combined loss and KV-cache size, but the standard attention in current checkpoints remains solid.

Files

| File | Purpose |
|------|---------|
| n.py | ORIGINAL - use this for existing checkpoints |
| n_gqa.py | GQA variant - backward compatible with n.py checkpoints |
| experiments/n_heavy.py | Heavy attention tests (iterative, triplet, multi-hop) |
| experiments/n_heavy2.py | More heavy tests (slot, edge, memory, recurrent) |
| experiments/n_ultra.py | Ultra-heavy tests (NTM, energy, N-body, hyper) |
| experiments/n_flex.py | Flexible attention (linear, cosine, MQA, GQA, retention) |
| experiments/joint_test.py | Joint AR+SAT training comparison |
| experiments/final_showdown.py | Compute-matched depth vs complexity |
| experiments/infer_bench.py | Inference speed + KV cache benchmarks |

Key Results

Joint AR+SAT Training (what AGILLM-3 does)

| Attention | Combined Loss | KV Cache Size |
|-----------|---------------|---------------|
| GQA (2 heads) | 78.49 (+0.1%) | 0.25x |
| Standard | 78.58 | 1.00x |
| MQA | 78.82 (-0.3%) | 0.12x |
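The GQA variant in the table groups the 8 query heads so that each group shares a single key/value head. A minimal sketch of that attention pattern (assuming 8 query heads and 2 KV heads; the names and shapes here are illustrative, not taken from n_gqa.py):

```python
import torch

def gqa_attention(q, k, v, n_kv_heads):
    # q: (batch, n_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)
    # Each group of n_heads // n_kv_heads query heads shares one KV head.
    b, n_heads, s, d = q.shape
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 4, 16)
k = torch.randn(1, 2, 4, 16)  # 2 KV heads, as in the winning config
v = torch.randn(1, 2, 4, 16)
out = gqa_attention(q, k, v, n_kv_heads=2)  # shape (1, 8, 4, 16)
```

With 2 KV heads out of 8, the cache holds 2/8 = 0.25x of the standard K/V tensors, matching the table.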

Inference Memory Savings

| Attention | KV Cache | Inference Speed |
|-----------|----------|-----------------|
| Standard | 64 MB | baseline |
| GQA | 16 MB | 0.84x |
| MQA | 8 MB | 0.87x |
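The cache sizes above follow directly from the number of KV heads. A quick arithmetic sketch (the layer count, head dim, sequence length, and fp16 assumption are illustrative values chosen so standard attention lands at 64 MB; the actual AGILLM-3 config may differ):

```python
def kv_cache_mb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # One K and one V tensor per layer: seq_len x n_kv_heads x head_dim each,
    # at bytes_per bytes per element (2 for fp16).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 2**20

standard = kv_cache_mb(16, 8, 64, 2048)  # 8 KV heads = full multi-head -> 64.0 MB
gqa      = kv_cache_mb(16, 2, 64, 2048)  # 2 shared KV heads            -> 16.0 MB
mqa      = kv_cache_mb(16, 1, 64, 2048)  # 1 shared KV head             ->  8.0 MB
```

Cache size scales linearly with KV heads, so the 0.25x and 0.12x ratios in the training table reappear here as 16 MB and 8 MB.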

The Bitter Lesson Confirmed

Heavy attention mechanisms (iterative, memory-augmented, physics-based) all lose to standard attention at equal compute budget. Simpler = faster = more data = better.

Recommendation

Keep using n.py with standard attention for now. The 0.1% improvement from GQA isn't worth the checkpoint incompatibility. GQA becomes valuable when:

  • Inference memory is constrained
  • Context length needs to increase significantly
  • Starting a fresh training run

Checkpoint Compatibility

# Load an existing checkpoint with the original n.py
import torch

model = AGILLM3(cfg)
model.load_state_dict(torch.load("checkpoint.pt"))

# For GQA: use n_gqa.py with convert_from_standard=True
model = AGILLM3_GQA(cfg, convert_from_standard=True)
model.load_from_standard("checkpoint.pt")  # converts standard-attention weights
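One plausible way load_from_standard could map standard-attention weights onto GQA is to mean-pool each query group's K/V projection rows into one shared KV head, as in the GQA paper's uptraining recipe. This is an assumption about n_gqa.py's internals, not its actual code; the function name and shapes below are hypothetical:

```python
import torch

def kv_to_gqa(w_kv, n_heads, n_kv_heads, head_dim):
    # Hypothetical conversion helper, NOT the real n_gqa.py implementation.
    # w_kv: (n_heads * head_dim, d_model) K or V projection from the standard model.
    # Mean-pool each group of query heads' K/V rows into one shared KV head.
    d_model = w_kv.shape[1]
    w = w_kv.view(n_kv_heads, n_heads // n_kv_heads, head_dim, d_model)
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, d_model)

w = torch.randn(8 * 16, 128)                              # standard K projection
w_gqa = kv_to_gqa(w, n_heads=8, n_kv_heads=2, head_dim=16)  # shape (2 * 16, 128)
```

Mean-pooling preserves the average K/V signal per group, which is why a converted checkpoint can start close to the standard model's behavior before fine-tuning.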