	# AGILLM-3 Attention Experiments

	Date: January 15, 2026
	Author: Silicon Goddess + Scott Bisset

	## Summary

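	Tested 13+ attention mechanisms under joint AR+SAT training. Result: GQA achieves the lowest combined loss (78.49 vs. 78.58 for standard attention), but the margin is only about 0.1%, so the standard attention in current checkpoints remains a solid choice.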

	## Files

	| File | Purpose |
	|------|---------|
	| `n.py` | Original model; use this for existing checkpoints |
	| `n_gqa.py` | GQA variant; loads `n.py` checkpoints through weight conversion |
	| `experiments/n_heavy.py` | Heavy attention tests (iterative, triplet, multi-hop) |
	| `experiments/n_heavy2.py` | More heavy tests (slot, edge, memory, recurrent) |
	| `experiments/n_ultra.py` | Ultra-heavy tests (NTM, energy, N-body, hyper) |
	| `experiments/n_flex.py` | Flexible attention (linear, cosine, MQA, GQA, retention) |
	| `experiments/joint_test.py` | Joint AR+SAT training comparison |
	| `experiments/final_showdown.py` | Compute-matched depth vs complexity |
	| `experiments/infer_bench.py` | Inference speed + KV cache benchmarks |

	## Key Results

	### Joint AR+SAT Training (what AGILLM-3 does)
	| Attention | Combined Loss (change vs. Standard; + is better) | KV Cache Size (relative to Standard) |
	|-----------|---------------|---------------|
	| GQA (2 KV heads) | 78.49 (+0.1%) | 0.25x |
	| Standard | 78.58 | 1.00x |
	| MQA | 78.82 (-0.3%) | 0.12x |
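
The percentages in the table appear to follow the convention that a positive value means lower loss than standard attention. The snippet below is a small check of that reading, using only the loss values from the table; the helper name `relative_improvement` is made up for this illustration.

```python
def relative_improvement(baseline_loss: float, loss: float) -> float:
    """Percent improvement over the baseline; positive means lower loss."""
    return (baseline_loss - loss) / baseline_loss * 100.0


standard = 78.58                          # combined loss, standard attention
results = {"GQA": 78.49, "MQA": 78.82}    # combined losses from the table

for name, loss in results.items():
    print(f"{name}: {relative_improvement(standard, loss):+.1f}%")
# GQA: +0.1%
# MQA: -0.3%
```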

	### Inference Memory Savings
	| Attention | KV Cache | Inference Speed (relative to Standard) |
	|-----------|----------|-----------------|
	| Standard | 64 MB | 1.00x (baseline) |
	| GQA | 16 MB | 0.84x |
	| MQA | 8 MB | 0.87x |
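
The KV cache sizes scale with the number of key/value heads. As a rough sketch, the function below uses the usual size formula (keys plus values, per layer, per position). The configuration is hypothetical: 8 layers, 8 query heads, head dimension 64, 4096 tokens, and fp16 were chosen only because they reproduce the 64/16/8 ratio in the table, with "MB" read as MiB. They are not taken from AGILLM-3.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # The factor 2 accounts for the separate key and value tensors.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration (not AGILLM-3's actual settings).
cfg = dict(n_layers=8, head_dim=64, seq_len=4096)
MIB = 1024 ** 2

for name, kv_heads in [("Standard", 8), ("GQA", 2), ("MQA", 1)]:
    size = kv_cache_bytes(n_kv_heads=kv_heads, **cfg)
    print(f"{name}: {size // MIB} MiB")
# Standard: 64 MiB
# GQA: 16 MiB
# MQA: 8 MiB
```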

	### The Bitter Lesson Confirmed
	Heavy attention mechanisms (iterative, memory-augmented, physics-based) all lose to standard attention at an equal compute budget. A simpler mechanism runs faster, so it sees more data for the same compute and ends up with a better result.
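
The compute-matched argument can be illustrated with a toy calculation: under a fixed budget, a mechanism that costs more per token trains on proportionally fewer tokens. All numbers below are hypothetical and are not measurements from these experiments.

```python
def tokens_seen(flop_budget: int, flops_per_token: int) -> int:
    """Number of training tokens that fit in a fixed FLOP budget."""
    return flop_budget // flops_per_token


BUDGET = 10 ** 15                  # hypothetical total training FLOPs
STANDARD_COST = 10 ** 9            # hypothetical FLOPs per token, standard attention
HEAVY_COST = 3 * STANDARD_COST     # hypothetical: heavy mechanism costs 3x per token

print(tokens_seen(BUDGET, STANDARD_COST))  # 1000000
print(tokens_seen(BUDGET, HEAVY_COST))     # 333333
```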

	## Recommendation

	Keep using `n.py` with standard attention for now. The 0.1% improvement from GQA isn't worth the checkpoint incompatibility: existing `n.py` checkpoints must be converted before `n_gqa.py` can load them. GQA becomes valuable when:
	- Inference memory is constrained
	- Context length needs to increase significantly
	- A fresh training run is starting
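
To make the context-length point concrete, the sketch below estimates the longest sequence whose KV cache fits in a fixed memory budget. The model shape (8 layers, head dimension 64, fp16, batch size 1) and the 64 MiB budget are assumptions for illustration, not AGILLM-3's real configuration.

```python
def max_context_len(budget_bytes, n_layers, n_kv_heads, head_dim,
                    bytes_per_elem=2):
    """Longest sequence whose KV cache fits in budget_bytes (batch size 1)."""
    # Keys and values (factor 2) are stored for every layer and KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return budget_bytes // bytes_per_token


BUDGET = 64 * 1024 ** 2  # hypothetical 64 MiB reserved for the KV cache

for name, kv_heads in [("Standard", 8), ("GQA", 2), ("MQA", 1)]:
    print(name, max_context_len(BUDGET, n_layers=8, n_kv_heads=kv_heads, head_dim=64))
# Standard 4096
# GQA 16384
# MQA 32768
```

With fewer KV heads, the same budget covers a proportionally longer context.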

	## Checkpoint Compatibility

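	```python
	import torch
	from n import AGILLM3          # assumed import path; class defined in n.py
	from n_gqa import AGILLM3_GQA  # assumed import path; class defined in n_gqa.py

	# cfg is the model config the checkpoint was trained with (construction not shown)

	# Load existing checkpoint with original n.py
	model = AGILLM3(cfg)
	model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))

	# For GQA: use n_gqa.py with convert_from_standard=True
	model = AGILLM3_GQA(cfg, convert_from_standard=True)
	model.load_from_standard("checkpoint.pt")  # Converts standard weights to GQA
	```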
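
What `load_from_standard` does internally is not documented here. One common way to turn multi-head K/V projections into GQA projections is to mean-pool the heads within each group, and the pure-Python sketch below illustrates that idea only. The function name `mean_pool_kv_heads` and the weight layout (heads stacked along the rows, as in a `(n_heads * head_dim, d_model)` matrix) are assumptions, and a real conversion would operate on tensors rather than lists.

```python
def mean_pool_kv_heads(weight, n_heads, n_kv_heads):
    """Average groups of per-head projection rows down to n_kv_heads heads.

    weight: list of rows with shape (n_heads * head_dim, d_model),
            heads stacked along the row dimension (assumed layout).
    """
    assert n_heads % n_kv_heads == 0, "heads must divide evenly into groups"
    head_dim = len(weight) // n_heads
    group = n_heads // n_kv_heads
    pooled = []
    for g in range(n_kv_heads):
        for r in range(head_dim):
            # Row r of every head that belongs to group g.
            rows = [weight[(g * group + h) * head_dim + r] for h in range(group)]
            # Element-wise mean across the heads in the group.
            pooled.append([sum(col) / group for col in zip(*rows)])
    return pooled


# Toy K projection: 4 heads, head_dim 1, d_model 2, pooled into 2 KV heads.
w_k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(mean_pool_kv_heads(w_k, n_heads=4, n_kv_heads=2))
# [[2.0, 3.0], [6.0, 7.0]]
```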
