Ultron: Recurrent-Depth Transformer
An open-source, research-grounded looped transformer for latent reasoning.
Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines only techniques with published results. Unlike speculative reconstructions, Ultron backs every architectural choice with a published result and clear attribution.
Architecture
Input tokens (B, T)
        ↓
[Embedding + RoPE]
        ↓
[Prelude]              ← L_p standard transformer blocks, run once
        ↓
[LayerNorm(e)]         ← Prelude normalization (Parcae stability trick)
        ↓
[Recurrent Block ×T]   ← L_r transformer layers, looped T times
   ↑_________↓           h_{t+1} = A·h_t + B·e + R(h_t, e)   [LTI-stable]
        ↓                + depth-wise LoRA + ACT halting
[C · h_T]              ← Output projection
        ↓
[Coda]                 ← L_c standard transformer blocks, run once
        ↓
[RMSNorm → LM Head]
        ↓
Output logits (B, T, vocab_size)
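The recurrent update in the diagram can be illustrated with a toy, dependency-free sketch. It assumes a diagonal A whose entries are squashed through a sigmoid so that ρ(A) = max|a_i| < 1; this parameterization is an illustrative assumption, not necessarily the one Ultron or Parcae uses, and the names `make_diag_a`, `lti_step`, and `run_loops` are hypothetical. The nonlinear term R defaults to zero so the contraction is easy to see.

```python
import math

def make_diag_a(raw):
    """Map unconstrained values to (0, 1) so the diagonal A has rho(A) < 1.

    Assumption: sigmoid squashing; the real model may constrain A differently.
    """
    return [1.0 / (1.0 + math.exp(-r)) for r in raw]

def spectral_radius_diag(a):
    """For a diagonal matrix the spectral radius is the largest |entry|."""
    return max(abs(x) for x in a)

def zero_residual(h, e):
    """Stand-in for the transformer block R(h_t, e); returns zeros."""
    return [0.0] * len(h)

def lti_step(h, e, a, b, residual=zero_residual):
    """One loop iteration: h_{t+1} = A*h_t + B*e + R(h_t, e), elementwise."""
    r = residual(h, e)
    return [a_i * h_i + b_i * e_i + r_i
            for a_i, h_i, b_i, e_i, r_i in zip(a, h, b, e, r)]

def run_loops(e, a, b, n_loops, residual=zero_residual):
    """Iterate the update n_loops times starting from h_0 = 0."""
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = lti_step(h, e, a, b, residual)
    return h

a = make_diag_a([0.0, 2.0])   # entries 0.5 and about 0.88
b = [1.0, 1.0]
e = [2.0, -1.0]
h = run_loops(e, a, b, n_loops=300)
# With R = 0 the state approaches the fixed point b*e / (1 - a) per coordinate.
fixed_point = [b_i * e_i / (1.0 - a_i) for a_i, b_i, e_i in zip(a, b, e)]
```

Because every |a_i| < 1, the contribution of the initial state decays geometrically, so running extra loops at inference cannot blow up the linear part of the state.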
Key Design Principles
- Only proven components: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
- Parcae stability: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
- Depth extrapolation: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
- Adaptive compute: ACT halting lets easy tokens exit early while hard tokens get the full depth.
- Parameter efficiency: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
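The adaptive-compute bullet can be made concrete with a per-position sketch of ACT halting in the style of Graves (2016): halting probabilities accumulate across loops, the position stops once the running sum reaches 1 − ε, and the final step receives the leftover probability mass. The function name `act_weights` and the value ε = 0.01 are assumptions for illustration; Ultron's own halting code may differ.

```python
def act_weights(halt_probs, eps=0.01):
    """Return (steps_used, weights) for one token position.

    halt_probs[t] is the halting probability emitted at loop t.
    The returned weights sum to 1 and would be used to mix per-loop states.
    """
    if not halt_probs:
        raise ValueError("need at least one loop")
    cumulative = 0.0
    weights = []
    last = len(halt_probs) - 1
    for t, p in enumerate(halt_probs):
        # Halt when the probability budget is spent or the loop limit is hit.
        if cumulative + p >= 1.0 - eps or t == last:
            weights.append(1.0 - cumulative)  # remainder goes to the final step
            return t + 1, weights
        weights.append(p)
        cumulative += p

easy_steps, easy_w = act_weights([0.9] * 8)    # confident position
hard_steps, hard_w = act_weights([0.05] * 8)   # uncertain position
```

In this toy run the confident position stops after 2 of the 8 available loops, while the uncertain one uses all 8.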
Installation
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
Quick Start
import torch
from ultron.model import Ultron, UltronConfig
# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)
model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")
# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids) # (2, 128, 32000)
# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16) # deeper reasoning
Pre-configured Variants
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large
cfg = ultron_small() # ~75M params, effective depth 36 layers
cfg = ultron_base() # ~166M params, effective depth 78 layers
cfg = ultron_medium() # ~1B params, effective depth 136 layers
cfg = ultron_large() # ~3B params, effective depth 300 layers
| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
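The Effective Depth column follows from the other columns: prelude and coda layers run once, and the recurrent layers run once per loop. A small sketch (the helper name `effective_depth` is hypothetical, not part of the `ultron` package) reproduces the table values and shows what raising the loop count at inference does:

```python
def effective_depth(prelude, recurrent, coda, loops):
    """Layers applied per forward pass: prelude + recurrent * loops + coda."""
    return prelude + recurrent * loops + coda

# (prelude, recurrent, coda, loops) copied from the variant table
variants = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}
depths = {name: effective_depth(*cfg) for name, cfg in variants.items()}
print(depths)

# Depth extrapolation: same weights, more loops at inference.
small_at_16_loops = effective_depth(2, 4, 2, 16)
```

Doubling the loop count of `ultron_small` from 8 to 16 raises its effective depth from 36 to 68 layers without adding parameters.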
Improvements over OpenMythos
| Feature | OpenMythos | Ultron | Rationale |
|---|---|---|---|
| Prelude norm | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| C output projection | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| Recurrent depth | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| ACT bias init | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| Grad checkpointing | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| MoE | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| Top-p sampling | Missing | ✅ Nucleus sampling support | Better generation quality |
| LoRA init | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
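The ACT bias choice in the table can be checked with quick arithmetic. Assuming the halting unit is a sigmoid over a linear projection whose weight term is near zero at initialization (an assumption about early training, not a measured property of Ultron), the initial halting probability is roughly sigmoid(bias). The helper names and the threshold ε = 0.01 below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loops_until_halt(bias, eps=0.01):
    """Loops needed for a constant halting probability sigmoid(bias)
    to accumulate to the ACT threshold 1 - eps."""
    p = sigmoid(bias)
    return math.ceil((1.0 - eps) / p)

p_default = sigmoid(0.0)              # 0.5 with a zero bias
p_biased = sigmoid(-3.0)              # about 0.047 with bias = -3
halt_default = loops_until_halt(0.0)  # 2 loops
halt_biased = loops_until_halt(-3.0)  # 21 loops
```

Under these assumptions a zero bias would halt a position after about 2 loops, whereas bias = -3 would need about 21, so the 8-loop `ultron_small` configuration would run its full depth early in training.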
Research Foundation
Every component is grounded in published work:
| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | Parcae (Prairie et al., 2026) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | Parcae, Appendix J | Critical for stability at 1.3B+ scale |
| Depth extrapolation | Loop, Think, & Generalize (2025) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | Relaxed Recursive Transformers (Bae et al., 2024) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | Saunshi et al., 2025 | Formally proven: T loops simulate T steps of CoT |
| ACT halting | Graves, 2016 | Per-position adaptive computation |
| GQA | Ainslie et al., 2023 | Efficient KV cache, proven with looping |
| RMSNorm | Zhang & Sennrich, 2019 | Standard normalization |
| RoPE | Su et al., 2021 | Rotary positional encoding |
| MLA (optional) | DeepSeek-V2, 2024 | 10-20× smaller KV cache |
| MoE (optional) | DeepSeekMoE, 2024 | Fine-grained expert routing |
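To illustrate the GQA row, the sketch below counts the key/value entries cached per token per attention layer for the Quick Start configuration (dim=768, n_heads=12, n_kv_heads=4). It assumes the usual convention head_dim = dim / n_heads, which this README does not state explicitly, and `kv_entries_per_token` is a hypothetical helper.

```python
def kv_entries_per_token(dim, n_heads, n_kv_heads):
    """Cached K and V values per token for one attention layer."""
    if dim % n_heads or n_heads % n_kv_heads:
        raise ValueError("dim must divide by n_heads, n_heads by n_kv_heads")
    head_dim = dim // n_heads          # assumed convention
    return 2 * n_kv_heads * head_dim   # factor 2: one K and one V vector per KV head

mha = kv_entries_per_token(768, 12, 12)   # every query head has its own KV head
gqa = kv_entries_per_token(768, 12, 4)    # 3 query heads share each KV head
ratio = mha / gqa
```

With these settings the per-layer cache shrinks from 1536 to 512 values per token, a factor of 3.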
Proven vs. Experimental
✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention
⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding
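The loop-index embedding listed above could look like the following sketch, which applies the standard sinusoidal position formula (Vaswani et al., 2017) to the loop counter t instead of the token position. This is an assumption about the form of the embedding; the function name and the base of 10000 are illustrative rather than taken from Ultron.

```python
import math

def loop_index_embedding(t, dim, base=10000.0):
    """Sinusoidal embedding of loop index t with `dim` (even) channels.

    Channels 2i and 2i+1 hold sin and cos of t / base**(2i/dim).
    """
    if dim % 2:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        angle = t / (base ** (2 * i / dim))
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb

e0 = loop_index_embedding(0, 8)   # loop 0: every sin term is 0, every cos term is 1
e3 = loop_index_embedding(3, 8)   # a later loop gets a distinct vector
```

Adding such a vector to the recurrent state would let the shared block distinguish early loops from late ones, at the cost of tying behavior to loop counts seen in training.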
Training Recipe (from Parcae)
Based on published scaling laws:
| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
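As a rough illustration of the schedule and truncation rows, the sketch below implements linear warmup followed by cosine decay, plus the rule μ_bwd = ⌈μ_rec/2⌉. The total step count, the floor learning rate of 0, the linear warmup shape, and the function names are assumptions; only the peak rate, the 2000 warmup steps, and the ceiling rule come from the table.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def backward_depth(mu_rec):
    """Backward-pass depth from the table: ceil(mu_rec / 2)."""
    return math.ceil(mu_rec / 2)

print(lr_at(1000), lr_at(2000), lr_at(100_000))
print(backward_depth(8), backward_depth(5))
```

For a mean recurrence of 8 loops the rule gives a backward depth of 4; for an odd mean such as 5 it rounds up to 3.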
License
MIT License
Citation
@software{ultron2026,
title = {Ultron: An Open-Source Recurrent-Depth Transformer},
year = {2026},
url = {https://huggingface.co/trojan0x/ultron},
note = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}