# 🤖 Ultron — Recurrent-Depth Transformer

> **An open-source, research-grounded looped transformer for latent reasoning.**

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from published research. Unlike speculative reconstructions, every default architectural choice in Ultron is backed by published results with clear attribution; components without such evidence are optional and off by default.

## Architecture

```
Input tokens (B, T)
    ↓
[Embedding + RoPE]
    ↓
[Prelude]              — L_p standard transformer blocks, run once
    ↓
[LayerNorm(e)]         — Prelude normalization (Parcae stability trick)
    ↓
[Recurrent Block ×T]   — L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
    ↓                    + depth-wise LoRA + ACT halting
[C · h_T]              — Output projection
    ↓
[Coda]                 — L_c standard transformer blocks, run once
    ↓
[RMSNorm → LM Head]
    ↓
Output logits (B, T, vocab_size)
```
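
The recurrence `h_{t+1} = A·h_t + B·e + R(h_t, e)` can be illustrated with a small pure-Python sketch. It assumes a diagonal `A` whose entries are parameterized as `exp(-softplus(θ))`, which places each entry in (0, 1) and so gives ρ(A) < 1 by construction. This parameterization, the scalar `b`, and the toy `residual` function standing in for the transformer block are illustrative assumptions, not Ultron's actual code.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stable_diag(theta):
    """Map unconstrained parameters to diagonal entries of A.

    exp(-softplus(t)) equals 1 / (1 + exp(t)), which lies in (0, 1)
    in exact arithmetic (assumed parameterization, for illustration).
    """
    return [math.exp(-softplus(t)) for t in theta]

def spectral_radius(diag):
    """For a diagonal matrix, rho(A) is the largest absolute entry."""
    return max(abs(a) for a in diag)

def residual(h, e):
    """Toy bounded stand-in for the recurrent block R(h, e)."""
    return [math.tanh(hi + ei) for hi, ei in zip(h, e)]

def recur(e, theta, b=1.0, n_loops=8):
    """Iterate h <- A*h + b*e + R(h, e) starting from h = 0."""
    a = stable_diag(theta)
    h = [0.0] * len(e)
    for _ in range(n_loops):
        r = residual(h, e)
        h = [ai * hi + b * ei + ri for ai, hi, ei, ri in zip(a, h, e, r)]
    return h

if __name__ == "__main__":
    theta = [-3.0, 0.0, 3.0]
    print("rho(A) =", spectral_radius(stable_diag(theta)))
    print("h after 8 loops: ", recur([0.5, -1.0, 2.0], theta, n_loops=8))
    print("h after 64 loops:", recur([0.5, -1.0, 2.0], theta, n_loops=64))
```

Because every diagonal entry is below 1 and the toy residual is bounded, the state stays bounded no matter how many loops are run, which is the property that makes running extra loops at inference safe in this sketch.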

### Key Design Principles

1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
3. **Depth extrapolation**: Train with N loops, then run N+k loops at inference; more loops at inference give deeper reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early, while hard tokens receive the full loop budget.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
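
The adaptive-compute principle follows the ACT scheme of Graves (2016): each position emits a halting probability per loop, stops once the running total reaches 1 − ε, and assigns the leftover mass to its final step. The sketch below shows this for a single position with scalar states. The function names and the `eps` default are illustrative assumptions, not Ultron's API.

```python
def act_weights(halt_probs, eps=0.01):
    """Per-step mixing weights for one position; they sum to 1.

    The position halts at the first step where the cumulative halting
    probability reaches 1 - eps, or at the last available step.
    """
    weights, total = [], 0.0
    for i, p in enumerate(halt_probs):
        is_last = i == len(halt_probs) - 1
        if total + p >= 1.0 - eps or is_last:
            weights.append(1.0 - total)  # remainder goes to the final step
            break
        weights.append(p)
        total += p
    return weights

def act_output(states, weights):
    """Weighted sum of the per-loop states up to the halting step."""
    return sum(w * s for w, s in zip(weights, states))

if __name__ == "__main__":
    easy = act_weights([0.9, 0.5, 0.5])   # confident early: halts at step 2
    hard = act_weights([0.05] * 8)        # never confident: uses all 8 loops
    print(len(easy), len(hard))
```

A position with large early halting probabilities uses few loops, while one with small probabilities runs to the maximum loop count.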

## Installation

```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```

## Quick Start

```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```

## Pre-configured Variants

```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```

| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
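
The Effective Depth column is the prelude and coda layers plus the recurrent layers multiplied by the loop count. The sketch below reproduces the column from the table values. `Depths` is a hypothetical helper written for this example and is not part of the `ultron` package.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Depths:
    """Layer counts for one variant (hypothetical helper for illustration)."""
    prelude: int
    recurrent: int
    coda: int
    loops: int

    def effective_depth(self, n_loops=None):
        """prelude + recurrent * loops + coda; n_loops overrides the default."""
        loops = self.loops if n_loops is None else n_loops
        return self.prelude + self.recurrent * loops + self.coda

# Values copied from the variants table above.
VARIANTS = {
    "ultron_small": Depths(prelude=2, recurrent=4, coda=2, loops=8),
    "ultron_base": Depths(prelude=3, recurrent=6, coda=3, loops=12),
    "ultron_medium": Depths(prelude=4, recurrent=8, coda=4, loops=16),
    "ultron_large": Depths(prelude=6, recurrent=12, coda=6, loops=24),
}

if __name__ == "__main__":
    for name, d in VARIANTS.items():
        print(f"{name}: {d.effective_depth()} layers")
```

Passing a larger `n_loops` shows how depth grows at inference without adding parameters: `ultron_small` at 16 loops has an effective depth of 2 + 4·16 + 2 = 68.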

## Improvements over OpenMythos

| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |

## Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2026)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |
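
The depth-wise LoRA row can be made concrete with a small sketch: one weight matrix is shared across all loop iterations, and each loop index adds its own low-rank correction. The code below is pure Python and assumes the common LoRA convention of zero-initializing the up-projection, so every loop starts out identical to the shared layer. The class name, initialization scales, and exact-zero init are illustrative assumptions rather than Ultron's implementation.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class DepthwiseLoRALinear:
    """Shared linear map plus one low-rank delta per loop index."""

    def __init__(self, dim, rank, n_loops, seed=0):
        rng = random.Random(seed)
        std = dim ** -0.5
        # Weight shared by every loop iteration.
        self.shared = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
        # Per-loop down-projection (rank x dim), small random init.
        self.down = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
            for _ in range(n_loops)
        ]
        # Per-loop up-projection (dim x rank), zero init so the delta starts at 0.
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(n_loops)]

    def forward(self, x, loop_idx):
        """Compute shared(x) + up[loop_idx] @ (down[loop_idx] @ x)."""
        base = matvec(self.shared, x)
        delta = matvec(self.up[loop_idx], matvec(self.down[loop_idx], x))
        return [b + d for b, d in zip(base, delta)]

if __name__ == "__main__":
    layer = DepthwiseLoRALinear(dim=4, rank=2, n_loops=8)
    x = [1.0, -0.5, 0.25, 2.0]
    print(layer.forward(x, loop_idx=0) == layer.forward(x, loop_idx=7))  # same at init
```

Only the per-loop `down` and `up` matrices add parameters, so the cost of specializing each loop grows with `rank` rather than with the full layer size.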

## Proven vs. Experimental

### ✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention

### ⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding

## Training Recipe (from Parcae)

Settings follow published recipes and scaling laws; the Source column gives the origin of each value:

| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
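
The schedule and truncation rows can be written out as a short sketch. It assumes linear warmup, a hypothetical `total_steps` of 100,000, and a floor of zero for the cosine decay, none of which are specified in the table. It also reads μ_bwd as the number of loop iterations that receive gradients and the batch size as 512 sequences of 1280 tokens. Treat the function names and those defaults as assumptions.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    total_steps and min_lr are placeholder values, not from the recipe.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def mu_bwd(mu_rec):
    """Backprop truncation depth: ceil(mu_rec / 2)."""
    return math.ceil(mu_rec / 2)

# Tokens per optimizer step if the batch is 512 sequences of 1280 tokens.
TOKENS_PER_BATCH = 512 * 1280

if __name__ == "__main__":
    for step in (0, 1999, 2000, 51_000, 100_000):
        print(step, f"{lr_at(step):.2e}")
    print("mu_bwd for 8 loops:", mu_bwd(8))
    print("tokens per batch:", TOKENS_PER_BATCH)
```

Under these assumptions, a mean recurrence of 8 loops backpropagates through 4, and each batch contains 655,360 tokens.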

## License

MIT License

## Citation

```bibtex
@software{ultron2026,
  title   = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year    = {2026},
  url     = {https://huggingface.co/trojan0x/ultron},
  note    = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```