# 🤖 Ultron: Recurrent-Depth Transformer
> **An open-source, research-grounded looped transformer for latent reasoning.**
Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.
## Architecture
```
Input tokens (B, T)
↓
[Embedding + RoPE]
↓
[Prelude] - L_p standard transformer blocks, run once
↓
[LayerNorm(e)] - Prelude normalization (Parcae stability trick)
↓
[Recurrent Block ×T] - L_r transformer layers, looped T times
↑_________↓ h_{t+1} = A·h_t + B·e + R(h_t, e) [LTI-stable]
↓ + depth-wise LoRA + ACT halting
[C · h_T] - Output projection
↓
[Coda] - L_c standard transformer blocks, run once
↓
[RMSNorm → LM Head]
↓
Output logits (B, T, vocab_size)
```
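
The recurrence in the diagram can be illustrated with a small plain-Python sketch. It is not Ultron's implementation: it assumes a diagonal `A` whose entries are squashed into (0, 1) by a sigmoid (one possible way to get ρ(A) < 1 by construction), diagonal `B` and `C`, and a hypothetical bounded stand-in `toy_block` in place of the transformer block `R`. The names `lti_step`, `run_loops`, and `toy_block` are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_block(h, e):
    # Hypothetical stand-in for R(h_t, e); a real model uses transformer layers.
    return [math.tanh(hi + ei) for hi, ei in zip(h, e)]


def lti_step(h, e, a_raw, b):
    # Diagonal A with entries sigmoid(a_raw) in (0, 1), so rho(A) < 1 (assumed form).
    a = [sigmoid(x) for x in a_raw]
    r = toy_block(h, e)
    # h_{t+1} = A*h_t + B*e + R(h_t, e), element-wise because A and B are diagonal.
    return [ai * hi + bi * ei + ri for ai, hi, bi, ei, ri in zip(a, h, b, e, r)]


def run_loops(e, a_raw, b, c, n_loops):
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = lti_step(h, e, a_raw, b)
    return [ci * hi for ci, hi in zip(c, h)]  # diagonal output projection C


e = [0.5, -1.0, 2.0]        # normalized prelude output for one token (toy values)
a_raw = [0.0, 1.0, -1.0]    # unconstrained parameters behind A
b = [1.0, 1.0, 1.0]
c = [1.0, 1.0, 1.0]

spectral_radius = max(sigmoid(x) for x in a_raw)
out_8 = run_loops(e, a_raw, b, c, 8)
out_64 = run_loops(e, a_raw, b, c, 64)
print(f"rho(A) = {spectral_radius:.4f}")
print("8 loops: ", [round(v, 4) for v in out_8])
print("64 loops:", [round(v, 4) for v in out_64])
```

Because the linear part is contractive and the stand-in block is bounded, the state stays bounded however many loops are run, which is the property that makes running more loops at inference time safe in this toy setting.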
### Key Design Principles
1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
3. **Depth extrapolation**: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early while hard tokens get the full loop depth.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
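
As a rough illustration of principle 4, the sketch below computes Graves-style ACT mixing weights for a single token. It assumes the halting probability at each loop is a sigmoid of a scalar logit and that a token halts once the cumulative probability reaches `1 - eps`. The function name `act_weights`, the value of `eps`, and the constant logits are assumptions for the example, not Ultron's actual code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def act_weights(halt_logits, max_loops, eps=0.01):
    """Return per-loop mixing weights for one token (Graves-style ACT sketch)."""
    weights = []
    cumulative = 0.0
    for t in range(max_loops):
        p = sigmoid(halt_logits[t])
        is_last = t == max_loops - 1
        if is_last or cumulative + p >= 1.0 - eps:
            # The remainder goes to the final executed loop so weights sum to 1.
            weights.append(1.0 - cumulative)
            break
        weights.append(p)
        cumulative += p
    return weights


# In a real model the logits come from a learned head; constants are used here.
easy = act_weights([2.0] * 8, max_loops=8)    # high halting probability
hard = act_weights([-3.0] * 8, max_loops=8)   # low halting probability
print("easy token loops:", len(easy))
print("hard token loops:", len(hard))
```

With a constant logit of -3 the per-loop halting probability is about 0.047, so the cumulative sum cannot reach the threshold within 8 loops and the token uses the full depth.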
## Installation
```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```
## Quick Start
```python
import torch
from ultron.model import Ultron, UltronConfig
# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)
model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")
# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids) # (2, 128, 32000)
# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16) # deeper reasoning
```
## Pre-configured Variants
```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large
cfg = ultron_small() # ~75M params, effective depth 36 layers
cfg = ultron_base() # ~166M params, effective depth 78 layers
cfg = ultron_medium() # ~1B params, effective depth 136 layers
cfg = ultron_large() # ~3B params, effective depth 300 layers
```
| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
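
The "Effective Depth" column follows from the other columns as prelude + recurrent × loops + coda. The short check below recomputes it from the table values; the helper name `effective_depth` and the `VARIANTS` dictionary are illustrative and not part of the `ultron` package.

```python
def effective_depth(prelude: int, recurrent: int, coda: int, loops: int) -> int:
    # Prelude and coda run once; the recurrent block runs once per loop.
    return prelude + recurrent * loops + coda


# (prelude, recurrent, coda, loops) copied from the table above.
VARIANTS = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}

depths = {name: effective_depth(*shape) for name, shape in VARIANTS.items()}
for name, depth in depths.items():
    print(f"{name}: {depth}")

# Depth extrapolation: doubling the loops of ultron_small at inference time.
extrapolated_small = effective_depth(2, 4, 2, 16)
print("ultron_small with 16 loops:", extrapolated_small)  # 68
```

The same formula shows what depth extrapolation buys: the parameter count is unchanged, and only the loop count in the middle term grows.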
## Improvements over OpenMythos
| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
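
To make the "LoRA init" row concrete, here is a minimal plain-Python sketch of depth-wise LoRA: one shared weight plus a separate low-rank delta per loop index. It assumes the common LoRA convention of a small random `A` and an all-zero `B`, so every delta starts at exactly zero; Ultron's "near-zero" initialization may differ in detail. The class `DepthwiseLoRA` and its fields are hypothetical names for illustration.

```python
import random


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


class DepthwiseLoRA:
    """Shared weight W plus one low-rank delta B_t @ A_t per loop index (sketch)."""

    def __init__(self, dim, rank, n_loops, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
        # A_t is small and random; B_t is zero, so each delta starts at zero.
        self.a = [[[rng.gauss(0, 0.01) for _ in range(dim)] for _ in range(rank)]
                  for _ in range(n_loops)]
        self.b = [[[0.0] * rank for _ in range(dim)] for _ in range(n_loops)]

    def forward(self, x, loop_idx):
        base = matvec(self.w, x)                                   # shared across loops
        delta = matvec(self.b[loop_idx], matvec(self.a[loop_idx], x))  # loop-specific
        return [bi + di for bi, di in zip(base, delta)]


layer = DepthwiseLoRA(dim=4, rank=2, n_loops=3)
x = [1.0, -2.0, 0.5, 3.0]

y0 = layer.forward(x, loop_idx=0)
y1 = layer.forward(x, loop_idx=1)
print("identical at init:", y0 == y1)

# Simulate a training update that touches only loop 1's adapter.
layer.b[1][0][0] = 1.0
y1_after = layer.forward(x, loop_idx=1)
y0_after = layer.forward(x, loop_idx=0)
print("loop 1 changed:", y1_after != y1, "| loop 0 unchanged:", y0_after == y0)
```

At initialization every loop applies the same shared transformation, and the loops only diverge as their adapters are trained.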
## Research Foundation
Every component is grounded in published work:
| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2025)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |
## Proven vs. Experimental
### ✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention
### ⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding
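
For readers unfamiliar with the last item, a loop-index sinusoidal embedding can be sketched by applying the standard sinusoidal positional-encoding formula to the loop counter instead of the token position. The function name, the `base` value of 10000, and the interleaved sin/cos layout are assumptions for this example; how Ultron combines the embedding with the hidden state is not shown here.

```python
import math


def loop_index_embedding(loop_idx: int, dim: int, base: float = 10000.0):
    """Sinusoidal embedding of the loop counter (illustrative sketch)."""
    emb = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        emb.append(math.sin(loop_idx * freq))
        emb.append(math.cos(loop_idx * freq))
    return emb[:dim]  # trim in case dim is odd


emb_0 = loop_index_embedding(0, dim=8)
emb_3 = loop_index_embedding(3, dim=8)
print([round(v, 3) for v in emb_0])
print([round(v, 3) for v in emb_3])
```

Each loop index gets a distinct vector, which gives the shared recurrent block a signal about which iteration it is running.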
## Training Recipe (from Parcae)
Based on published scaling laws:
| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
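
Two rows of the recipe can be written down directly: the warmup-plus-cosine schedule and the truncated-backprop depth μ_bwd = ⌈μ_rec/2⌉. The sketch below uses the table's peak learning rate (3e-4) and warmup length (2000 steps). The linear warmup shape, `total_steps=100_000`, and `min_lr=0.0` are assumptions that the table does not specify, and the helper names are illustrative.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup followed by cosine decay (total_steps and min_lr are assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def backward_loops(mu_rec: int) -> int:
    # mu_bwd = ceil(mu_rec / 2): backpropagate through only the last half of the loops.
    return math.ceil(mu_rec / 2)


for step in (0, 1999, 2000, 51_000, 100_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")

print("mu_rec=8 -> mu_bwd =", backward_loops(8))  # 4
print("mu_rec=7 -> mu_bwd =", backward_loops(7))  # 4
```

In a PyTorch training loop the same function could be wrapped in a `LambdaLR` scheduler by returning the ratio to `peak_lr` instead of the absolute value.
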
## License
MIT License
## Citation
```bibtex
@software{ultron2026,
title = {Ultron: An Open-Source Recurrent-Depth Transformer},
year = {2026},
url = {https://huggingface.co/trojan0x/ultron},
note = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```