# 🤖 Ultron: Recurrent-Depth Transformer

> **An open-source, research-grounded looped transformer for latent reasoning.**

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from recent research. Unlike speculative reconstructions, Ultron backs every architectural choice with published results and clear attribution.

## Architecture

```
Input tokens (B, T)
    ↓
[Embedding + RoPE]
    ↓
[Prelude]              - L_p standard transformer blocks, run once
    ↓
[LayerNorm(e)]         - Prelude normalization (Parcae stability trick)
    ↓
[Recurrent Block ×T]   - L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
    ↓                    + depth-wise LoRA + ACT halting
[C · h_T]              - Output projection
    ↓
[Coda]                 - L_c standard transformer blocks, run once
    ↓
[RMSNorm → LM Head]
    ↓
Output logits (B, T, vocab_size)
```

### Key Design Principles

1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
3. **Depth extrapolation**: Train on N loops, test on N+k. More loops at inference = deeper reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early, hard tokens get full depth.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
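
To make the stability constraint in principle 2 concrete, here is a minimal sketch of the linear part of the recurrence, `h_{t+1} = A·h_t + B·e`, with a diagonal `A`. The sigmoid parameterization, the function names, and the omission of the nonlinear term `R(h_t, e)` are assumptions made for illustration; they are not taken from Ultron's source or from Parcae.

```python
import math

def stable_diagonal(raw):
    """Map unconstrained parameters to diagonal entries in (0, 1).

    Assumption: a sigmoid parameterization; the real model may use another.
    """
    return [1.0 / (1.0 + math.exp(-r)) for r in raw]

def spectral_radius(diag):
    # For a diagonal matrix, the eigenvalues are the diagonal entries.
    return max(abs(a) for a in diag)

def lti_step(h, e, a_diag, b_diag):
    # Linear part of the update only: h_{t+1} = A*h_t + B*e (R(h_t, e) omitted).
    return [a * hi + b * ei for a, b, hi, ei in zip(a_diag, b_diag, h, e)]

raw_params = [-2.0, 0.0, 3.0]   # hypothetical unconstrained values
a_diag = stable_diagonal(raw_params)
b_diag = [1.0, 1.0, 1.0]
e = [0.5, -1.0, 2.0]            # hypothetical injected prelude output

h = [10.0, 10.0, 10.0]
for _ in range(500):
    h = lti_step(h, e, a_diag, b_diag)

# With rho(A) < 1 the state converges to the fixed point B*e / (1 - A),
# no matter how many extra loops are run.
fixed_point = [b * ei / (1.0 - a) for a, b, ei in zip(a_diag, b_diag, e)]
print(f"rho(A) = {spectral_radius(a_diag):.4f}")
```

Because every diagonal entry lies strictly inside the unit interval, repeated application contracts the state toward a fixed point instead of letting it grow, which is the property that makes running more loops at inference safe in this simplified setting.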

## Installation

```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```

## Quick Start

```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```

## Pre-configured Variants

```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```

| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
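
The Effective Depth column appears to follow from the other columns: the prelude and coda run once, while the recurrent layers run once per loop. The helper below is a hypothetical illustration (it is not part of `ultron.variants`) that reproduces the four rows of the table under that reading.

```python
def effective_depth(prelude, recurrent, coda, loops):
    # Prelude and coda run once; each recurrent layer runs once per loop.
    return prelude + recurrent * loops + coda

# (prelude, recurrent, coda, loops), copied from the table above.
variants = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}

for name, (p, r, c, n) in variants.items():
    print(f"{name}: {effective_depth(p, r, c, n)} layers")
```

The same formula gives the depth reached when more loops are requested at inference, for example `effective_depth(2, 4, 2, 16)` for `ultron_small` run with 16 loops.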

## Improvements over OpenMythos

| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |

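
The effect of the ACT bias row can be checked with a little arithmetic. The sketch below follows the halting rule from Graves (2016), where a position stops once its cumulative halting probability reaches `1 - eps`. It assumes a sigmoid halting unit whose weights contribute nothing at initialization and `eps = 0.01`; both are illustrative assumptions, and the function names are hypothetical rather than Ultron's API.

```python
import math

def halting_prob(bias, pre_activation=0.0):
    # Sigmoid halting unit; pre_activation=0.0 models weights that
    # contribute nothing at initialization (an assumption).
    return 1.0 / (1.0 + math.exp(-(pre_activation + bias)))

def steps_until_halt(bias, max_loops, eps=0.01):
    """Count loop iterations until cumulative halting probability reaches 1 - eps."""
    total = 0.0
    for step in range(1, max_loops + 1):
        total += halting_prob(bias)
        if total >= 1.0 - eps:
            return step
    return max_loops  # threshold never reached: the full depth is used

print(f"p(halt) at bias -3: {halting_prob(-3.0):.4f}")
print("loops used, bias -3:", steps_until_halt(-3.0, max_loops=8))
print("loops used, bias  0:", steps_until_halt(0.0, max_loops=8))
```

Under these assumptions a bias of -3 gives a per-step halting probability of about 0.047, so all 8 loops run, whereas a bias of 0 would halt after 2 loops.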
## Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2025)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |

## Proven vs. Experimental

### ✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention

### ⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding
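
For readers unfamiliar with the last item, one plausible reading is the standard Transformer sinusoidal encoding applied to the loop index instead of the token position. The sketch below assumes exactly that; the function name, the `base` value, and the interleaved sin/cos layout are assumptions, and Ultron's actual embedding may differ.

```python
import math

def loop_index_embedding(t, dim, base=10000.0):
    """Sinusoidal embedding of loop index t (standard formula, applied to depth)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        emb.append(math.sin(t * freq))  # even slot
        emb.append(math.cos(t * freq))  # odd slot
    return emb

# One vector per loop iteration; each would be added to the hidden state
# so the shared recurrent block can tell iterations apart.
for t in range(3):
    print(t, [round(x, 3) for x in loop_index_embedding(t, dim=4)])
```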

## Training Recipe (from Parcae)

Based on published scaling laws:

| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |
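
Two rows of the recipe reduce to short formulas, sketched below. The table does not give the warmup shape, the total step count, or a final learning rate, so linear warmup, `total_steps`, and `min_lr=0.0` are assumptions. `mu_rec` is read here as the mean number of recurrent iterations, which is an interpretation and not something the table states. Function names are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Assumptions: linear warmup shape, total_steps, and min_lr are not in the table.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def mu_bwd(mu_rec):
    # Backprop truncation depth from the table: ceil(mu_rec / 2).
    return math.ceil(mu_rec / 2)

total = 50_000  # hypothetical training length
for step in (0, 1_999, 26_000, 50_000):
    print(f"step {step:>6}: lr = {lr_at(step, total):.2e}")
print("mu_bwd for mu_rec = 8:", mu_bwd(8))
```

In a PyTorch training loop the same schedule could be supplied to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.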

## License

MIT License

## Citation

```bibtex
@software{ultron2026,
  title   = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year    = {2026},
  url     = {https://huggingface.co/trojan0x/ultron},
  note    = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```