Commit f61c86a by trojan0x · verified · 1 parent: 02ba0f0

Add Ultron README

  1. README.md +167 -0
README.md ADDED
@@ -0,0 +1,167 @@
# 🤖 Ultron: Recurrent-Depth Transformer

> **An open-source, research-grounded looped transformer for latent reasoning.**

Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from recent research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.

## Architecture

```
Input tokens (B, T)
        ↓
[Embedding + RoPE]
        ↓
[Prelude]              - L_p standard transformer blocks, run once
        ↓
[LayerNorm(e)]         - Prelude normalization (Parcae stability trick)
        ↓
[Recurrent Block ×T]   - L_r transformer layers, looped T times
    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)   [LTI-stable]
        ↓                + depth-wise LoRA + ACT halting
[C · h_T]              - Output projection
        ↓
[Coda]                 - L_c standard transformer blocks, run once
        ↓
[RMSNorm → LM Head]
        ↓
Output logits (B, T, vocab_size)
```

### Key Design Principles

1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE combined with looping is untested at scale.
2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, and per-sequence depth sampling.
3. **Depth extrapolation**: Train with N loops, test with N+k. Running more loops at inference gives the model more effective depth for reasoning.
4. **Adaptive compute**: ACT halting lets easy tokens exit early while hard tokens get the full depth.
5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).

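To make "ρ(A) < 1 by construction" concrete, here is a minimal sketch of the linear part of the recurrence, `h_{t+1} = A·h_t + B·e`, with a diagonal `A`. The parameterization `a_i = exp(-softplus(raw_i))` is an assumption chosen for illustration, and it may differ from what Parcae or Ultron actually use. The nonlinear term `R(h_t, e)` is left out, and all function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always > 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_diagonal(raw: list[float]) -> list[float]:
    """Map unconstrained parameters to diagonal entries in (0, 1).

    Assumed parameterization (illustrative only): a_i = exp(-softplus(raw_i)).
    """
    return [math.exp(-softplus(r)) for r in raw]


def spectral_radius(diag: list[float]) -> float:
    """For a diagonal matrix the spectral radius is the largest |entry|."""
    return max(abs(a) for a in diag)


def iterate(diag_a: list[float], diag_b: list[float],
            e: list[float], n_loops: int) -> list[float]:
    """Run h_{t+1} = A*h_t + B*e from h_0 = 0 with diagonal A and B."""
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = [a * hi + b * ei for a, b, hi, ei in zip(diag_a, diag_b, h, e)]
    return h


raw = [-4.0, 0.0, 3.0]   # arbitrary unconstrained values
A = stable_diagonal(raw)
B = [1.0, 1.0, 1.0]
e = [0.5, -1.0, 2.0]     # stand-in for the normalized prelude output

rho = spectral_radius(A)
h_8 = iterate(A, B, e, 8)
h_64 = iterate(A, B, e, 64)
# With |a_i| < 1 the linear system converges to B*e / (1 - A).
fixed_point = [b * ei / (1.0 - a) for a, b, ei in zip(A, B, e)]

print(f"rho(A) = {rho:.4f}")
print("h_8  =", [round(v, 4) for v in h_8])
print("h_64 =", [round(v, 4) for v in h_64])
```

Because every diagonal entry lies strictly between 0 and 1, the state stays bounded no matter how many loops are run. That boundedness is what makes adding extra loops at inference safe for the linear part of the update.
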
## Installation

```bash
pip install torch
git clone https://huggingface.co/trojan0x/ultron
cd ultron
```

## Quick Start

```python
import torch
from ultron.model import Ultron, UltronConfig

# Minimal config for testing
cfg = UltronConfig(
    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
    max_seq_len=2048,
    prelude_layers=2, coda_layers=2,
    recurrent_layers=4, max_loop_iters=8,
    lora_rank=8,
)

model = Ultron(cfg)
print(f"Parameters: {model.get_num_params():,}")
print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

# Forward pass
ids = torch.randint(0, 32000, (2, 128))
logits = model(ids)  # (2, 128, 32000)

# Generation with depth extrapolation
prompt = torch.randint(0, 32000, (1, 16))
output = model.generate(prompt, max_new_tokens=64, n_loops=16)  # deeper reasoning
```

## Pre-configured Variants

```python
from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

cfg = ultron_small()   # ~75M params, effective depth 36 layers
cfg = ultron_base()    # ~166M params, effective depth 78 layers
cfg = ultron_medium()  # ~1B params, effective depth 136 layers
cfg = ultron_large()   # ~3B params, effective depth 300 layers
```

| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
|---|---|---|---|---|---|---|---|---|
| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |

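The "Effective Depth" column is consistent with counting the prelude and coda once and the recurrent block once per loop: `prelude + recurrent × loops + coda`. The sketch below reproduces the table values. `DepthSpec` is a hypothetical helper written for this example and is not part of the `ultron` package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DepthSpec:
    """Hypothetical container for the layer counts listed in the table."""
    prelude: int
    recurrent: int
    coda: int
    loops: int

    def effective_depth(self, loops: Optional[int] = None) -> int:
        """Layers traversed per forward pass; `loops` overrides the default."""
        n = self.loops if loops is None else loops
        return self.prelude + self.recurrent * n + self.coda


VARIANTS = {
    "ultron_small": DepthSpec(prelude=2, recurrent=4, coda=2, loops=8),
    "ultron_base": DepthSpec(prelude=3, recurrent=6, coda=3, loops=12),
    "ultron_medium": DepthSpec(prelude=4, recurrent=8, coda=4, loops=16),
    "ultron_large": DepthSpec(prelude=6, recurrent=12, coda=6, loops=24),
}

for name, spec in VARIANTS.items():
    print(f"{name}: {spec.effective_depth()} layers")

# Depth extrapolation: the small variant run with 16 loops instead of 8.
print("ultron_small @ 16 loops:", VARIANTS["ultron_small"].effective_depth(loops=16))
```

The parameter count does not change when `loops` changes. Only the number of times the shared recurrent weights are applied grows.
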
## Improvements over OpenMythos

| Feature | OpenMythos | **Ultron** | Rationale |
|---|---|---|---|
| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |

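The effect of the ACT bias row can be checked with a little arithmetic. The sketch below assumes a Graves-style rule: a sigmoid halting unit whose per-step probabilities are accumulated until they reach `1 - eps`. It also assumes that at initialization the unit outputs roughly `sigmoid(bias)` at every step. Both are simplifications, not a description of Ultron's code. The zero bias is included only for comparison, and `eps = 0.01` is an assumed threshold.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(halt_prob: float, max_loops: int, eps: float = 0.01) -> int:
    """Loops executed before the cumulative halting probability reaches 1 - eps.

    Assumes the same halting probability at every step (roughly true at init,
    when the halting unit's output is dominated by its bias).
    """
    cumulative = 0.0
    for step in range(1, max_loops + 1):
        cumulative += halt_prob
        if cumulative >= 1.0 - eps:
            return step
    return max_loops  # loop budget exhausted before the threshold was reached


p_biased = sigmoid(-3.0)  # about 0.047 per step
p_zero = sigmoid(0.0)     # 0.5 per step

print(f"bias=-3: p={p_biased:.4f}, loops used of 8: {steps_until_halt(p_biased, 8)}")
print(f"bias= 0: p={p_zero:.4f}, loops used of 8: {steps_until_halt(p_zero, 8)}")
```

Under these assumptions a bias of -3 leaves a freshly initialized model running all 8 loops, while a zero bias would halt after 2. This matches the stated rationale of preventing premature halting early in training.
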
## Research Foundation

Every component is grounded in published work:

| Component | Paper | Key Result |
|---|---|---|
| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
| Depth extrapolation | [Loop, Think, & Generalize (2025)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |

## Proven vs. Experimental

### ✅ Proven (default ON)
- LTI-stable injection with spectral radius < 1
- Prelude normalization
- Depth extrapolation via inference-time loops
- ACT halting for adaptive compute
- Depth-wise LoRA adaptation
- GQA attention

### ⚠️ Experimental (optional, default OFF)
- MoE FFN in recurrent block (`use_moe=True`)
- MLA attention (`attn_type="mla"`)
- Loop-index sinusoidal embedding

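As a rough illustration of the last experimental item, a loop-index embedding can reuse the standard sinusoidal positional-encoding formula with the loop counter `t` in place of the token position. This is an assumption about the general idea only. The function name, the `base` of 10000, and the interleaved sin/cos layout are hypothetical and may not match Ultron's implementation.

```python
import math


def loop_index_embedding(loop_idx: int, dim: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal encoding of a loop index (hypothetical helper).

    Pair i uses frequency base**(-2i/dim); even slots hold sin, odd slots cos.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb: list[float] = []
    for i in range(dim // 2):
        angle = loop_idx * base ** (-2.0 * i / dim)
        emb.extend((math.sin(angle), math.cos(angle)))
    return emb


# One vector per loop iteration; it would be added to the hidden state so the
# shared recurrent weights can tell iterations apart.
table = [loop_index_embedding(t, dim=8) for t in range(4)]
for t, row in enumerate(table):
    print(t, [round(v, 3) for v in row])
```

Because the encoding is a fixed function of `t`, it is defined for loop counts beyond those seen in training, which fits the depth-extrapolation setting.
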
## Training Recipe (from Parcae)

Based on published scaling laws:

| Setting | Value | Source |
|---|---|---|
| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
| Schedule | Cosine decay with warmup | Parcae |
| Warmup steps | 2000 | Parcae |
| Weight decay | 0.1 | Parcae |
| Batch size | 512 × 1280 tokens | Saunshi et al. |
| Dataset | FineWeb-Edu | Parcae / FineWeb |
| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
| Depth sampling | Per-sequence within micro-batch | Parcae |

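Two rows of the table translate directly into small formulas. The sketch below shows one common reading of "cosine decay with warmup" and the truncated-backprop depth `μ_bwd = ⌈μ_rec/2⌉`. The linear warmup shape, `total_steps`, and `min_lr` are assumptions that the table does not specify. The function names are hypothetical.

```python
import math


def lr_at(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
          total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay.

    total_steps and min_lr are illustrative defaults, not values from the recipe.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def backward_depth(mu_rec: int) -> int:
    """mu_bwd = ceil(mu_rec / 2): loop iterations that receive gradients."""
    return math.ceil(mu_rec / 2)


for step in (0, 1999, 2000, 51_000, 100_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")

for mu_rec in (1, 7, 8, 16):
    print(f"mu_rec = {mu_rec:>2} -> mu_bwd = {backward_depth(mu_rec)}")
```

For the 370M+ setting in the table, the same function applies with `peak_lr=2e-4`.
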
## License

MIT License

## Citation

```bibtex
@software{ultron2026,
  title = {Ultron: An Open-Source Recurrent-Depth Transformer},
  year = {2026},
  url = {https://huggingface.co/trojan0x/ultron},
  note = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
}
```