# 🤖 Ultron: Recurrent-Depth Transformer

> **An open-source, research-grounded looped transformer for latent reasoning.**

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.

## Architecture

```
Input tokens (B, T)
    ↓
[Embedding + RoPE]
    ↓
[Prelude]              - L_p standard transformer blocks, run once
    ↓
[LayerNorm(e)]         - Prelude normalization (Parcae stability trick)
    ↓
[Recurrent Block ×T]   - L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)   [LTI-stable]
    ↓                    + depth-wise LoRA + ACT halting
[C · h_T]              - Output projection
    ↓
[Coda]                 - L_c standard transformer blocks, run once
    ↓
[RMSNorm → LM Head]
    ↓
Output logits (B, T, vocab_size)
```

### Key Design Principles

1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
3. **Depth extrapolation**: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early, hard tokens get full depth.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
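To make the "ρ(A) < 1 by construction" idea concrete, here is a minimal, dependency-free sketch of the recurrence `h_{t+1} = A·h_t + B·e + R(h_t, e)` with a diagonal `A`. It is an illustration under stated assumptions, not Ultron's actual code: the sigmoid parameterization and the helper names `stable_diag`, `spectral_radius`, `recurrent_step`, `run_loops`, and `zero_residual` are all hypothetical, and the transformer block `R` is replaced by a zero function so that only the linear part is visible.

```python
import math


def stable_diag(raw):
    """Map unconstrained reals into (0, 1) so diag(A) has spectral radius < 1.

    Assumption: a sigmoid parameterization, chosen for illustration only.
    """
    return [1.0 / (1.0 + math.exp(-r)) for r in raw]


def spectral_radius(a_diag):
    """For a diagonal matrix, rho(A) is the largest absolute diagonal entry."""
    return max(abs(a) for a in a_diag)


def recurrent_step(h, e, a_diag, b_diag, residual):
    """One loop iteration: h_next = A*h + B*e + R(h, e), elementwise."""
    r = residual(h, e)
    return [a * hi + b * ei + ri
            for a, b, hi, ei, ri in zip(a_diag, b_diag, h, e, r)]


def run_loops(e, a_diag, b_diag, residual, n_loops):
    """Start from h_0 = 0 and apply the recurrence n_loops times."""
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = recurrent_step(h, e, a_diag, b_diag, residual)
    return h


def zero_residual(h, e):
    """Stand-in for the transformer block R; returns zeros to isolate the LTI part."""
    return [0.0] * len(h)


# Any moderate raw values give diagonal entries strictly inside (0, 1).
a_diag = stable_diag([0.0, -2.0, 3.0])
print("rho(A) =", spectral_radius(a_diag))

# With A = 0.5, B = 1, e = 2 and R = 0, the fixed point is B*e / (1 - A) = 4.
h8 = run_loops([2.0], [0.5], [1.0], zero_residual, n_loops=8)
h16 = run_loops([2.0], [0.5], [1.0], zero_residual, n_loops=16)
print("8 loops:", h8[0], "| 16 loops:", h16[0])
```

In this toy setting, running 16 loops instead of 8 moves the state closer to the fixed point rather than letting it grow, which is the behavior that makes extra inference-time loops safe for the linear part of the update. The real recurrent block is nonlinear, so this shows only why the constraint on `A` matters, not a stability guarantee for the full model.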
## Installation

```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```

## Quick Start

```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000,
    dim=768,
    n_heads=12,
    n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2,
    coda_layers=2,
    recurrent_layers=4,
    max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```

## Pre-configured Variants

```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```

| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |

## Improvements over OpenMythos

| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |

## Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2025)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |

## Proven vs. Experimental

### ✅ Proven (default ON)

- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention

### ⚠️ Experimental (optional, default OFF)

- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding

## Training Recipe (from Parcae)

Based on published scaling laws:

| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |

## License

MIT License

## Citation

```bibtex
@software{ultron2026,
  title = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year  = {2026},
  url   = {https://huggingface.co/trojan0x/ultron},
  note  = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```