---
license: mit
tags:
- test-time-learning
- memory-augmented
- atlas
- nested-learning
- polynomial-memory
- omega-rule
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
pipeline_tag: text-generation
---

# Atlas-MAG with Omega Rule

A 43M-parameter implementation of the Atlas paper's Memory-As-Gate (MAG) architecture with polynomial memory and test-time learning (TTL).

**Paper**: [Atlas: Learning to Optimally Memorize the Context at Test Time](https://arxiv.org/abs/2505.23735) (Behrouz et al., Google Research)

**Code**: [toddwbucy/Atlas-MAG_OmegaRule](https://github.com/toddwbucy/Atlas-MAG_OmegaRule)

## What This Model Demonstrates

This checkpoint exists to demonstrate a concrete infrastructure problem: **test-time learning models cannot be served by existing deployment stacks**.

Atlas-MAG uses gradient descent *during the forward pass* to update its memory. PyTorch gates that update behind `if self.training`, and every serving framework switches models into the equivalent of inference mode before handling requests. The result: the model's memory architecture is silenced.
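
The gating problem can be sketched with a toy module. Everything here (`TTLMemory`, the reconstruction loss, the plain SGD step) is an illustrative stand-in, not the repo's actual class, but it shows the exact mechanism that `eval()` turns off:

```python
import torch
import torch.nn as nn

class TTLMemory(nn.Module):
    """Toy test-time-learning memory (illustrative, not the repo's class)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.zeros(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:  # <- the gate that serving stacks switch off
            with torch.enable_grad():
                # One inner gradient step on a reconstruction loss (schematic).
                W = self.W.detach().requires_grad_(True)
                loss = ((x @ W - x) ** 2).mean()
                (grad,) = torch.autograd.grad(loss, W)
            self.W.data = self.W.data - 0.01 * grad  # eta = 0.01
        return x @ self.W

mem = TTLMemory(4)
x = torch.ones(2, 4)

mem.train()
mem(x)                               # training mode: memory updates
w_trained = mem.W.detach().clone()

mem.eval()
mem(x)                               # eval mode: the TTL branch never runs
assert torch.equal(mem.W.detach(), w_trained)
```

Every serving framework effectively calls `mem.eval()` before the first request, so the memory stays frozen for the model's entire deployment.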

Two scripts in the GitHub repo let you see this firsthand.

## Quick Start

```bash
git clone https://github.com/toddwbucy/Atlas-MAG_OmegaRule.git
cd Atlas-MAG_OmegaRule
pip install torch huggingface_hub tokenizers

# Demo: same model, same weights, different outputs depending on training flag
python scripts/demo_ttl_inference.py

# Benchmark: NIAH memory probe, TTL ON vs TTL OFF side by side
python scripts/benchmark_niah.py
```

Both scripts auto-download this checkpoint. No manual download needed.

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Atlas-MAG (Memory-As-Gate) |
| **Parameters** | 43M |
| **Dimensions** | dim=512, 6 layers, 8 heads |
| **Memory** | Polynomial degree-2, rank-512 |
| **Attention** | Sliding window, size=512 |
| **TTL** | Muon optimizer (NS-5), theta=0.9, alpha=0.999, eta=0.01 |
| **Vocab** | 49,152 (SmolLM tokenizer) |
| **Training Steps** | 8,800 |
| **Training Hardware** | 2x NVIDIA A6000 48GB |
| **Training Data** | SmolLM-Corpus (cosmopedia 40%, fineweb-edu 50%, python-edu 10%) |
| **NIAH Accuracy** | 85.9% |
| **Checkpoint Size** | 473MB |
| **Format** | PyTorch (.pt) |
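
The TTL row above names Muon with a 5-step Newton-Schulz orthogonalization (NS-5). As a rough sketch of what that inner step does, here is the quintic Newton-Schulz iteration with the coefficients from the public Muon reference implementation; the function name and details below are assumptions, not code from this repo:

```python
import torch

def ns5_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix with quintic
    Newton-Schulz steps, as Muon does before applying an update.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)          # scale so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                         # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

update = ns5_orthogonalize(torch.randn(64, 64))
```

When TTL is active this kind of inner-loop work runs on every forward pass, which is exactly the computation that `eval()` gates off at serving time.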

## Architecture

```
Input -> Embedding -> [MAGBlock x 6] -> RMSNorm -> LM Head -> Output

MAGBlock:
  x --+--> [Sliding Window Attention] --> attn_out
      |                                      |
      +--> [Deep Polynomial Memory] --> mem_out
                                             |
  output = x + attn_out * sigmoid(mem_out)
```

The polynomial feature map increases memory capacity from O(d_k) to O(d_k^2) per layer, roughly 64x more associations.
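
The capacity arithmetic checks out directly: with dim=512 and 8 heads, each head's key dimension is d_k = 64. A degree-2 feature map, sketched here as a flattened self outer product (the repo's exact map may differ), expands that to d_k^2 = 4096 features:

```python
import torch

def poly2_features(k: torch.Tensor) -> torch.Tensor:
    # Degree-2 polynomial features: flattened outer product k (x) k.
    # A linear associative memory over these features can store
    # O(d_k^2) associations instead of O(d_k).
    return torch.einsum('...i,...j->...ij', k, k).flatten(-2)

d_k = 512 // 8                     # per-head key dim from the table above
phi = poly2_features(torch.randn(d_k))
print(phi.shape[0] // d_k)         # capacity factor: 64
```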

## Loading

```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub
ckpt_path = hf_hub_download("r3d91ll/Atlas-MAG_OmegaRule", "checkpoint_step008800.pt")

# weights_only=False is needed because the checkpoint bundles plain Python
# dicts (the config) alongside tensors; only load checkpoints you trust.
checkpoint = torch.load(ckpt_path, map_location="cuda:0", weights_only=False)

# The checkpoint contains:
# - "model_state_dict": model weights
# - "config": full training configuration dict
print(checkpoint["config"])
```

For full model loading, see the [GitHub repository](https://github.com/toddwbucy/Atlas-MAG_OmegaRule), which includes the model class and demo scripts.
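
Since the 473MB checkpoint also carries optimizer state (see the Files table below), an inference-only copy can keep just the two documented keys. The helper, the `optimizer_state_dict` key name, and the stand-in dict here are assumptions about the file's layout; inspect `checkpoint.keys()` on the real file first:

```python
import io
import torch

def slim_checkpoint(ckpt: dict) -> dict:
    # Keep only what inference needs; optimizer state exists to resume training.
    return {k: ckpt[k] for k in ("model_state_dict", "config")}

# Stand-in mirroring the documented layout (key names assumed, not verified)
ckpt = {
    "model_state_dict": {"w": torch.zeros(2, 2)},
    "config": {"dim": 512, "n_layers": 6, "n_heads": 8},
    "optimizer_state_dict": {"step": 8800},
}

buf = io.BytesIO()
torch.save(slim_checkpoint(ckpt), buf)  # smaller inference-only artifact
```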

## Files

| File | Size | Description |
|------|------|-------------|
| `checkpoint_step008800.pt` | 473MB | Model weights + config + optimizer state |
| `tokenizer_smollm.json` | 2.2MB | BPE tokenizer (SmolLM) |

## Citation

```bibtex
@article{behrouz2025atlas,
  title={Atlas: Learning to Optimally Memorize the Context at Test Time},
  author={Behrouz, Ali and Li, Yingcong and Kacham, Praneeth and Daliri, Poria and Deng, Zhihao and Zhong, Peilin and Razaviyayn, Meisam and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2505.23735},
  year={2025}
}
```

## License

MIT