🤖 Ultron — Recurrent-Depth Transformer

An open-source, research-grounded looped transformer for latent reasoning.

Ultron is an implementation of a Recurrent-Depth Transformer (RDT) whose default configuration uses only techniques with published results. Every default architectural choice is attributed to the paper it comes from; components without published evidence in the looped setting, such as MoE, are optional and off by default.

Architecture

Input tokens (B, T)
    ↓
[Embedding + RoPE]
    ↓
[Prelude]              — L_p standard transformer blocks, run once
    ↓
[LayerNorm(e)]         — Prelude normalization (Parcae stability trick)
    ↓
[Recurrent Block ×T]   — L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
    ↓                    + depth-wise LoRA + ACT halting
[C · h_T]              — Output projection
    ↓
[Coda]                 — L_c standard transformer blocks, run once
    ↓
[RMSNorm → LM Head]
    ↓
Output logits (B, T, vocab_size)
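
The recurrent update in the diagram can be illustrated with a small sketch in plain Python. It assumes a diagonal A parameterized as a_i = exp(-softplus(raw_i)), which is one common way to keep ρ(A) < 1 by construction and is not necessarily the parameterization Ultron or Parcae uses. The names `stable_diagonal` and `recur` are hypothetical, and `n_loops` is the loop count written T in the diagram (not the sequence length T in the tensor shapes).

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)); positive for every finite x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stable_diagonal(raw: list[float]) -> list[float]:
    # Assumed parameterization: a_i = exp(-softplus(raw_i)) lies in (0, 1)
    # in exact arithmetic, so the spectral radius of A = diag(a) is below 1.
    return [math.exp(-softplus(r)) for r in raw]

def recur(a, b, e, n_loops, residual=None):
    # Iterate h_{t+1} = A*h_t + B*e + R(h_t, e) with diagonal A and B.
    # `residual` stands in for the transformer block R; None means R = 0.
    h = [0.0] * len(e)
    for _ in range(n_loops):
        r = residual(h, e) if residual else [0.0] * len(e)
        h = [a_i * h_i + b_i * e_i + r_i
             for a_i, h_i, b_i, e_i, r_i in zip(a, h, b, e, r)]
    return h

raw = [-2.0, 0.0, 3.0]          # unconstrained parameters (illustrative values)
a = stable_diagonal(raw)        # diagonal of A
b = [1.0, 1.0, 1.0]             # diagonal of B
e = [0.5, -1.0, 2.0]            # normalized prelude output for one position

rho = max(abs(x) for x in a)    # spectral radius of a diagonal matrix
h_64 = recur(a, b, e, n_loops=64)

# With R = 0 the iteration converges to the fixed point B*e / (1 - A).
fixed_point = [b_i * e_i / (1.0 - a_i) for a_i, b_i, e_i in zip(a, b, e)]
print(f"rho(A) = {rho:.4f}")
```

Because every |a_i| is below 1, the linear part of the update contracts: running extra loops moves the state toward a fixed point instead of letting it grow, which is the property the [LTI-stable] label refers to.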

Key Design Principles

Only proven components: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
Parcae stability: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
Depth extrapolation: Train on N loops, test on N+k. Running more loops at inference gives the model more latent computation per token.
Adaptive compute: ACT halting lets easy tokens exit early while hard tokens run the full depth.
Parameter efficiency: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
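
The adaptive-compute principle can be sketched for a single token position, following the halting rule from Graves (2016): accumulate per-loop halting probabilities, stop once the running sum would reach 1 - ε, and give the final loop the remaining weight. This is a minimal illustration in plain Python, not Ultron's implementation; `act_weights` and the value ε = 0.01 are assumptions.

```python
def act_weights(halt_probs: list[float], eps: float = 0.01) -> list[float]:
    # Return the per-loop mixing weights for one token position.
    weights = []
    total = 0.0
    for i, p in enumerate(halt_probs):
        is_last = i == len(halt_probs) - 1
        if total + p >= 1.0 - eps or is_last:
            # Halting step: assign the remainder so the weights sum to 1.
            weights.append(1.0 - total)
            break
        weights.append(p)
        total += p
    return weights

# A confident ("easy") position halts early; an uncertain ("hard") one
# keeps looping until the loop budget runs out.
easy = act_weights([0.7, 0.6, 0.5, 0.5])
hard = act_weights([0.05, 0.05, 0.05, 0.05])
print(len(easy), len(hard))
```

The output state for a position is then the weighted sum of its per-loop hidden states, so positions that halt early contribute nothing from the later loops.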

Installation

pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron

Quick Start

import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation: n_loops=16 is twice max_loop_iters=8
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)

Pre-configured Variants

from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers

Variant	dim	heads	Prelude	Recurrent	Coda	Loops	Effective Depth	Params
`ultron_small`	768	12	2	4	2	8	36	~75M
`ultron_base`	1024	16	3	6	3	12	78	~166M
`ultron_medium`	2048	16	4	8	4	16	136	~1B
`ultron_large`	4096	32	6	12	6	24	300	~3B
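
The Effective Depth column is consistent with counting the prelude and coda once and the recurrent block once per loop. The check below reproduces the column from the other table entries; `effective_depth` is a hypothetical helper, not part of the `ultron` package.

```python
def effective_depth(prelude: int, recurrent: int, coda: int, loops: int) -> int:
    # Prelude and coda run once; the recurrent block runs once per loop.
    return prelude + recurrent * loops + coda

# (prelude, recurrent, coda, loops) copied from the table above.
variants = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}

depths = {name: effective_depth(*layers) for name, layers in variants.items()}
for name, depth in depths.items():
    print(f"{name}: {depth} layers")
```

By the same formula, running `ultron_small` with 16 loops at inference, as in the Quick Start generation call, unrolls to 2 + 4·16 + 2 = 68 layers without adding parameters.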

Improvements over OpenMythos

Feature	OpenMythos	Ultron	Rationale
Prelude norm	Missing	✅ RMSNorm on encoded input	Critical for stability at 1.3B+ scale (Parcae Appendix J)
C output projection	Missing	✅ Diagonal C matrix	Completes the LTI dynamical system (Parcae)
Recurrent depth	1 layer per loop	✅ Multiple layers per loop	More expressive recurrent block
ACT bias init	Default	✅ Bias = -3 (encourage full loops early)	Prevents premature halting during early training
Grad checkpointing	None	✅ Built-in	Required for memory-efficient loop unrolling
MoE	Always on (64 experts)	✅ Optional (default OFF)	MoE + looping is unproven
Top-p sampling	Missing	✅ Nucleus sampling support	Better generation quality
LoRA init	Random	✅ Near-zero initialization	Starts as near-identity, prevents early instability
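
The ACT bias row can be made concrete with a little arithmetic. The sketch assumes the halting unit is a sigmoid over a linear layer whose weights start near zero, so the initial halting probability is roughly sigmoid(bias), and it assumes a halting threshold of 1 - ε with ε = 0.01. Both are assumptions for illustration, and `loops_until_halt` is a hypothetical helper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def loops_until_halt(p: float, eps: float = 0.01) -> int:
    # Smallest n with n * p >= 1 - eps, assuming the same halting
    # probability p at every loop.
    return math.ceil((1.0 - eps) / p)

p_init = sigmoid(-3.0)      # about 0.047 with bias = -3
p_default = sigmoid(0.0)    # 0.5 with a zero bias

n_halt_biased = loops_until_halt(p_init)
n_halt_default = loops_until_halt(p_default)
print(n_halt_biased, n_halt_default)
```

Under these assumptions a zero bias would halt every position after two loops at initialization, while a bias of -3 pushes the halting point past the loop budget of every variant except `ultron_large`, so early training exercises the full recurrent depth.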

Research Foundation

Every component is grounded in published work:

Component	Paper	Key Result
LTI-stable injection	Parcae (Prairie et al., 2026)	6.3% lower PPL, eliminates training instability
Prelude normalization	Parcae, Appendix J	Critical for stability at 1.3B+ scale
Depth extrapolation	Loop, Think, & Generalize (2025)	Train 5-hop, test 10-hop by increasing loops
Depth-wise LoRA	Relaxed Recursive Transformers (Bae et al., 2024)	Recursive Gemma 1B recovers most of Gemma 2B
Looped = implicit CoT	Saunshi et al., 2025	Formally proven: T loops simulate T steps of CoT
ACT halting	Graves, 2016	Per-position adaptive computation
GQA	Ainslie et al., 2023	Efficient KV cache, proven with looping
RMSNorm	Zhang & Sennrich, 2019	Standard normalization
RoPE	Su et al., 2021	Rotary positional encoding
MLA (optional)	DeepSeek-V2, 2024	10-20× smaller KV cache
MoE (optional)	DeepSeekMoE, 2024	Fine-grained expert routing

Proven vs. Experimental

✅ Proven (default ON)

LTI-stable injection with spectral radius < 1
Prelude normalization
Depth extrapolation via inference-time loops
ACT halting for adaptive compute
Depth-wise LoRA adaptation
GQA attention

⚠️ Experimental (optional, default OFF)

MoE FFN in recurrent block (use_moe=True)
MLA attention (attn_type="mla")
Loop-index sinusoidal embedding

Training Recipe (from Parcae)

Hyperparameters follow the published recipes listed in the Source column:

Setting	Value	Source
Optimizer	AdamW (β1=0.9, β2=0.95)	Standard
Learning rate	3e-4 (140M), 2e-4 (370M+)	Parcae
Schedule	Cosine decay with warmup	Parcae
Warmup steps	2000	Parcae
Weight decay	0.1	Parcae
Batch size	512 × 1280 tokens	Saunshi et al.
Dataset	FineWeb-Edu	Parcae / FineWeb
μ_bwd	⌈μ_rec/2⌉	Parcae (backprop truncation)
Depth sampling	Per-sequence within micro-batch	Parcae
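
Two rows of the table translate directly into small helpers: the cosine schedule with warmup and the truncated backward depth μ_bwd = ⌈μ_rec/2⌉. The sketch uses the 140M learning rate and the 2000 warmup steps from the table. The total step count, the linear warmup shape, and a final learning rate of zero are assumptions, and the function names are hypothetical.

```python
import math

def lr_at(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
          total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    # Linear warmup to peak_lr, then cosine decay to min_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def backward_depth(mu_rec: int) -> int:
    # Number of final loop iterations that receive gradients.
    return math.ceil(mu_rec / 2)

for step in (0, 2000, 51_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
print(backward_depth(8), backward_depth(7))
```

In a PyTorch training loop the same function could drive `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / peak_lr` as the multiplier.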

License

MIT License

Citation

@software{ultron2026,
  title   = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year    = {2026},
  url     = {https://huggingface.co/trojan0x/ultron},
  note    = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}