# 🚀 Ultron: Recurrent-Depth Transformer

> **An open-source, research-grounded looped transformer for latent reasoning.**

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.

## Architecture

```
Input tokens (B, T)
        ↓
[Embedding + RoPE]
        ↓
[Prelude]              - L_p standard transformer blocks, run once
        ↓
[LayerNorm(e)]         - Prelude normalization (Parcae stability trick)
        ↓
[Recurrent Block ×T]   - L_r transformer layers, looped T times
   ↑_________↓           h_{t+1} = A·h_t + B·e + R(h_t, e)   [LTI-stable]
        ↓                + depth-wise LoRA + ACT halting
[C · h_T]              - Output projection
        ↓
[Coda]                 - L_c standard transformer blocks, run once
        ↓
[RMSNorm → LM Head]
        ↓
Output logits (B, T, vocab_size)
```
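
The recurrence `h_{t+1} = A·h_t + B·e + R(h_t, e)` in the diagram can be illustrated with a minimal pure-Python sketch. It assumes a diagonal `A` whose entries are `exp(-softplus(a))`, one common way to keep every entry strictly inside (0, 1); this parameterization, the helper names, and the bounded `toy_residual` standing in for the transformer block `R` are all illustrative assumptions, not Ultron's or Parcae's actual code.

```python
import math


def softplus(x: float) -> float:
    # log(1 + e^x), always positive for finite x
    return math.log1p(math.exp(x))


def stable_diagonal(raw):
    # Assumed parameterization: exp(-softplus(a)) lies strictly in (0, 1),
    # so the spectral radius of diag(A) is below 1 for any finite raw value.
    return [math.exp(-softplus(a)) for a in raw]


def toy_residual(h, e):
    # Hypothetical bounded stand-in for the transformer block R(h_t, e).
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]


def recurrent_step(h, e, a_diag, b_diag, residual):
    # One loop iteration: h_{t+1} = A*h_t + B*e + R(h_t, e), elementwise.
    r = residual(h, e)
    return [a * hi + b * ei + ri
            for a, hi, b, ei, ri in zip(a_diag, h, b_diag, e, r)]


a_diag = stable_diagonal([-2.0, 0.0, 3.0])   # unconstrained raw parameters
b_diag = [1.0, 1.0, 1.0]
e = [0.5, -1.0, 2.0]                         # normalized prelude output (toy values)
h = [0.0, 0.0, 0.0]

for _ in range(200):                         # far more loops than used in training
    h = recurrent_step(h, e, a_diag, b_diag, toy_residual)

spectral_radius = max(abs(a) for a in a_diag)
print(f"rho(A) = {spectral_radius:.4f}, h = {[round(x, 3) for x in h]}")
```

Because each diagonal entry is below 1 and the toy residual is bounded, the state stays bounded however many loops are run, which is the property the `[LTI-stable]` annotation refers to.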

### Key Design Principles

1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
3. **Depth extrapolation**: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early while hard tokens get the full depth.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
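
The adaptive-compute principle above can be sketched for a single token position, following the halting rule in Graves (2016): accumulate per-loop halting probabilities, stop once the running sum reaches `1 - eps`, and give the final loop the remaining probability mass. The function name `act_weights`, the default `eps=0.01`, and the example probabilities are assumptions for illustration; Ultron's own implementation may differ.

```python
def act_weights(halt_probs, eps=0.01):
    """Per-loop mixing weights for one position under ACT-style halting.

    halt_probs: halting probability emitted at each loop iteration.
    The list length acts as the hard cap on the number of loops.
    """
    weights = []
    cumulative = 0.0
    for step, p in enumerate(halt_probs):
        is_last_allowed = step == len(halt_probs) - 1
        if cumulative + p >= 1.0 - eps or is_last_allowed:
            # Final step receives the remainder so the weights sum to 1.
            weights.append(1.0 - cumulative)
            break
        weights.append(p)
        cumulative += p
    return weights


# An "easy" token halts confidently; a "hard" token keeps low halting probability.
easy = act_weights([0.7, 0.6, 0.5, 0.5])
hard = act_weights([0.05] * 8)

print(len(easy), len(hard))  # 2 8
```

The final hidden state for the position would then be the weighted sum of the per-loop states, using these weights.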

## Installation

```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```

## Quick Start

```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```

## Pre-configured Variants

```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```

| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
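
The "Effective Depth" column is consistent with `prelude + recurrent × loops + coda`. The short check below reproduces the table values from the other columns; the helper name `effective_depth` and the `variants` dictionary are illustrative and not part of the package.

```python
def effective_depth(prelude: int, recurrent: int, coda: int, loops: int) -> int:
    # Prelude and coda run once; the recurrent block runs `loops` times.
    return prelude + recurrent * loops + coda


# (prelude, recurrent, coda, loops) copied from the table above
variants = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}

depths = {name: effective_depth(*shape) for name, shape in variants.items()}
for name, depth in depths.items():
    print(f"{name}: {depth} layers")

# Depth extrapolation: same weights, more loops at inference time.
small_at_16_loops = effective_depth(2, 4, 2, 16)
```

Doubling the loop count for `ultron_small` from 8 to 16 raises its effective depth from 36 to 68 layers without adding parameters.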

## Improvements over OpenMythos

| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
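
To see why an ACT bias of -3 discourages early halting, consider the halting probability at initialization. The sketch below assumes the halting unit is a sigmoid whose pre-activation is dominated by the bias early in training, and uses the `eps = 0.01` threshold from Graves (2016); both are simplifying assumptions, and `steps_until_halt` is a hypothetical helper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(p: float, eps: float = 0.01) -> int:
    # Number of loops before the accumulated halting probability reaches 1 - eps,
    # assuming the same probability p is emitted at every loop.
    steps, cumulative = 0, 0.0
    while cumulative < 1.0 - eps:
        cumulative += p
        steps += 1
    return steps


p_default = sigmoid(0.0)    # zero bias: 0.5 per loop
p_ultron = sigmoid(-3.0)    # bias = -3: about 0.047 per loop

print(f"bias  0: p = {p_default:.3f}, halts after {steps_until_halt(p_default)} loops")
print(f"bias -3: p = {p_ultron:.3f}, halts after {steps_until_halt(p_ultron)} loops")
```

Under these assumptions a zero bias would halt after 2 loops, while a bias of -3 would need 21, more than the `max_loop_iters=8` used in the Quick Start config, so every token runs the full loop budget until the halting unit learns otherwise.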

## Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2025)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |

## Proven vs. Experimental

### ✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention

### ⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding
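
As a rough illustration of the last experimental item, a loop-index embedding can be built the same way as the classic sinusoidal position encoding, with the loop counter in place of the token position. The function below is a sketch under that assumption; the name `loop_index_embedding`, the `base=10000.0` constant, and the interleaved sin/cos layout are not taken from Ultron's code.

```python
import math


def loop_index_embedding(loop_idx: int, dim: int, base: float = 10000.0):
    """Sinusoidal vector identifying which loop iteration is running."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)      # geometrically spaced frequencies
        emb.append(math.sin(loop_idx * freq))
        emb.append(math.cos(loop_idx * freq))
    return emb


first_loop = loop_index_embedding(0, dim=8)
fifth_loop = loop_index_embedding(4, dim=8)
print(first_loop)
```

Such a vector would typically be added to the hidden state at the start of each loop so that the shared recurrent weights can distinguish early from late iterations.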

## Training Recipe (from Parcae)

Based on published scaling laws:

| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
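
The schedule row can be made concrete with a small learning-rate function. The peak rate (3e-4) and warmup length (2000 steps) come from the table; the linear warmup shape, `total_steps=100_000`, and `min_lr=0.0` are placeholder assumptions, since the table does not specify them.

```python
import math


def lr_at(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
          total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


tokens_per_step = 512 * 1280   # batch size from the table, in tokens

for step in (0, 1999, 2000, 51_000, 100_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```

The values returned by `lr_at` can be assigned to each optimizer parameter group's learning rate once per step in a training loop.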

## License

MIT License

## Citation

```bibtex
@software{ultron2026,
  title  = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year   = {2026},
  url    = {https://huggingface.co/trojan0x/ultron},
  note   = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```