# AGILLM-3 Attention Experiments

**Date:** January 15, 2026  
**Authors:** Silicon Goddess + Scott Bisset

## Summary

Tested 13+ attention mechanisms for joint AR+SAT training. **Result: GQA wins narrowly** (0.1% lower combined loss at 0.25x the KV cache), but the standard attention used in current checkpoints remains solid.

## Files

| File | Purpose |
|------|---------|
| `n.py` | **ORIGINAL** - Use this for existing checkpoints |
| `n_gqa.py` | GQA variant - backward compatible with n.py checkpoints |
| `experiments/n_heavy.py` | Heavy attention tests (iterative, triplet, multi-hop) |
| `experiments/n_heavy2.py` | More heavy tests (slot, edge, memory, recurrent) |
| `experiments/n_ultra.py` | Ultra-heavy tests (NTM, energy, N-body, hyper) |
| `experiments/n_flex.py` | Flexible attention (linear, cosine, MQA, GQA, retention) |
| `experiments/joint_test.py` | Joint AR+SAT training comparison |
| `experiments/final_showdown.py` | Compute-matched depth vs complexity |
| `experiments/infer_bench.py` | Inference speed + KV cache benchmarks |

## Key Results

### Joint AR+SAT Training (what AGILLM-3 does)
| Attention | Combined Loss (lower is better) | KV Cache Size |
|-----------|---------------------------------|---------------|
| GQA (2 KV heads) | 78.49 (0.1% better) | 0.25x |
| **Standard** | **78.58** (baseline) | **1.00x** |
| MQA | 78.82 (0.3% worse) | 0.125x |
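
The KV cache column follows directly from how many K/V heads each variant keeps. As a minimal sketch (not taken from `n_gqa.py`), the mapping from query heads to shared K/V heads can be written as below. The count of 8 query heads is an assumption, inferred from the cache ratios in the table (1/4 for two KV heads, roughly 1/8 for MQA); the function names are hypothetical.

```python
def kv_head_for_query(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the K/V head that query head `q_head` reads from."""
    if n_kv_heads <= 0 or n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a positive multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("q_head out of range")
    group_size = n_q_heads // n_kv_heads  # query heads sharing one K/V head
    return q_head // group_size


def kv_cache_ratio(n_q_heads: int, n_kv_heads: int) -> float:
    """KV cache size relative to standard attention (one K/V head per query head)."""
    return n_kv_heads / n_q_heads


N_Q_HEADS = 8  # assumption: inferred from the table, not read from the model config

gqa_map = [kv_head_for_query(q, N_Q_HEADS, 2) for q in range(N_Q_HEADS)]
mqa_map = [kv_head_for_query(q, N_Q_HEADS, 1) for q in range(N_Q_HEADS)]

print(gqa_map)                       # [0, 0, 0, 0, 1, 1, 1, 1]
print(mqa_map)                       # [0, 0, 0, 0, 0, 0, 0, 0]
print(kv_cache_ratio(N_Q_HEADS, 2))  # 0.25
print(kv_cache_ratio(N_Q_HEADS, 1))  # 0.125
```

Standard attention is the special case `n_kv_heads == n_q_heads`, and MQA is the case `n_kv_heads == 1`, so all three rows of the table are points on the same axis.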

### Inference Memory Savings
| Attention | KV Cache | Inference Speed |
|-----------|----------|-----------------|
| Standard | 64 MB | baseline |
| GQA | 16 MB | 0.84x |
| MQA | 8 MB | 0.87x |
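
For reference, the cache sizes above scale linearly with the number of K/V heads. The sketch below shows the arithmetic with a hypothetical model shape (8 layers, head dimension 64, 4096-token context, fp16) chosen only because it reproduces the 64 / 16 / 8 figures in binary megabytes; it is not the actual AGILLM-3 configuration, which is not stated here.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # Leading factor of 2: one tensor for keys, one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


MIB = 1024 ** 2

# Hypothetical shape, NOT the real AGILLM-3 config.
shape = dict(n_layers=8, head_dim=64, seq_len=4096)

sizes_mib = {}
for name, n_kv_heads in [("Standard", 8), ("GQA", 2), ("MQA", 1)]:
    sizes_mib[name] = kv_cache_bytes(n_kv_heads=n_kv_heads, **shape) / MIB
    print(f"{name}: {sizes_mib[name]:.0f} MiB")
# Standard: 64 MiB
# GQA: 16 MiB
# MQA: 8 MiB
```

Because `seq_len` enters the product linearly, the same memory that holds a standard-attention cache would hold roughly 4x the context with 2 KV heads under these assumptions.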

### The Bitter Lesson Confirmed
Heavy attention mechanisms (iterative, memory-augmented, physics-based) **all lose** to standard attention at an equal compute budget. Simpler = faster = more data seen in the same budget = better.

## Recommendation

**Keep using n.py with standard attention for now.** The 0.1% loss improvement from GQA isn't worth checkpoint incompatibility. GQA becomes valuable when:
- Inference memory is constrained
- Context length needs to increase significantly
- Starting a fresh training run

## Checkpoint Compatibility

```python
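import torch

from n import AGILLM3          # assumed import path: original model in n.py
from n_gqa import AGILLM3_GQA  # assumed import path: GQA variant in n_gqa.py

# cfg must be the same model config the checkpoint was trained with.

# Load existing checkpoint with original n.py
model = AGILLM3(cfg)
model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))

# For GQA: use n_gqa.py with convert_from_standard=True
model = AGILLM3_GQA(cfg, convert_from_standard=True)
model.load_from_standard("checkpoint.pt")  # Converts weights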
```
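
How `load_from_standard` converts the weights is not documented here. One common approach for turning a standard multi-head checkpoint into GQA is to mean-pool the K and V projection weights of the heads within each group. The sketch below illustrates that idea in plain Python on a toy example; `mean_pool_kv_heads` is a hypothetical helper, and whether `n_gqa.py` uses this scheme is an assumption.

```python
from typing import List

Matrix = List[List[float]]  # one head's projection weight, as rows of floats


def mean_pool_kv_heads(heads: List[Matrix], n_kv_heads: int) -> List[Matrix]:
    """Average consecutive groups of per-head K (or V) weights into n_kv_heads heads."""
    n_heads = len(heads)
    if n_kv_heads <= 0 or n_heads % n_kv_heads != 0:
        raise ValueError("n_kv_heads must be positive and divide the number of heads")
    group = n_heads // n_kv_heads
    pooled = []
    for g in range(n_kv_heads):
        members = heads[g * group:(g + 1) * group]
        rows, cols = len(members[0]), len(members[0][0])
        # Element-wise mean over the heads in this group.
        pooled.append([
            [sum(m[r][c] for m in members) / group for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Toy example: 4 heads, each a 1x2 weight matrix, pooled down to 2 KV heads.
toy_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[10.0, 20.0]], [[30.0, 40.0]]]
pooled_heads = mean_pool_kv_heads(toy_heads, n_kv_heads=2)
print(pooled_heads)  # [[[2.0, 3.0]], [[20.0, 30.0]]]
```

With real tensors the same operation is usually a reshape to `(n_kv_heads, group, head_dim, d_model)` followed by a mean over the group axis. Query and output projections are left unchanged, since only K and V are shared.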