	# 🤖 Ultron — Recurrent-Depth Transformer

	> An open-source, research-grounded looped transformer for latent reasoning.

	Ultron is an implementation of a Recurrent-Depth Transformer (RDT) whose default configuration uses only techniques with published results. Unlike speculative reconstructions, Ultron attributes every architectural choice to a specific paper, and components that lack evidence at scale are optional and off by default.

	## Architecture

	```
	Input tokens (B, T)
	↓
	[Embedding + RoPE]
	↓
	[Prelude] — L_p standard transformer blocks, run once
	↓
	[LayerNorm(e)] — Prelude normalization (Parcae stability trick)
	↓
	[Recurrent Block ×N] — L_r transformer layers, looped N times
	↑_________↓ h_{t+1} = A·h_t + B·e + R(h_t, e) [LTI-stable]
	↓ + depth-wise LoRA + ACT halting
	[C · h_N] — Output projection
	↓
	[Coda] — L_c standard transformer blocks, run once
	↓
	[RMSNorm → LM Head]
	↓
	Output logits (B, T, vocab_size)
	```
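
The injection rule in the diagram can be illustrated with a framework-free sketch. It assumes a diagonal A whose entries are squashed through a sigmoid so that each lies between 0 and 1, which keeps the spectral radius below 1. This parameterization, the helper names `stable_diagonal` and `recurrent_step`, and the zero residual used in the demo are illustrative assumptions, not Ultron's or Parcae's actual code.

```python
import math

def stable_diagonal(raw_params):
    """Squash unconstrained values into (0, 1) with a sigmoid.

    For a diagonal A the spectral radius is the largest |entry|, so this
    keeps rho(A) < 1 for moderate inputs. (Assumed parameterization.)
    """
    return [1.0 / (1.0 + math.exp(-r)) for r in raw_params]

def recurrent_step(h, e, a_diag, b_diag, residual):
    """One loop iteration: h_next = A*h + B*e + R(h, e), elementwise."""
    r = residual(h, e)
    return [a * hi + b * ei + ri
            for a, b, hi, ei, ri in zip(a_diag, b_diag, h, e, r)]

def zero_residual(h, e):
    """Stand-in for the transformer block R; returns zeros."""
    return [0.0] * len(h)

a_diag = stable_diagonal([-2.0, 0.0, 3.0])
b_diag = [1.0, 1.0, 1.0]
e = [1.0, -0.5, 2.0]   # toy values standing in for the normalized prelude output
h = [0.0, 0.0, 0.0]

for _ in range(500):
    h = recurrent_step(h, e, a_diag, b_diag, zero_residual)

spectral_radius = max(abs(a) for a in a_diag)
# With R = 0 the state converges to the fixed point B*e / (1 - A).
fixed_point = [b * ei / (1.0 - a) for a, b, ei in zip(a_diag, b_diag, e)]
print(f"rho(A) = {spectral_radius:.4f}")
print("state after 500 loops:", [round(x, 4) for x in h])
```

Because every diagonal entry is below 1 in magnitude, the linear part of the update contracts, so the state settles instead of growing as the loop count increases.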

	### Key Design Principles

	1. Only proven components: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
	2. Parcae stability: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
	3. Depth extrapolation: Train with N loops, then run N+k loops at inference. Extra loops add effective depth without adding parameters.
	4. Adaptive compute: ACT halting lets easy tokens exit early while hard tokens use the full loop budget.
	5. Parameter efficiency: Parcae (2026) reports that a 770M looped model matches a 1.3B standard transformer.
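
The adaptive-compute principle can be sketched with the halting rule from Graves (2016): a token accumulates halting probability across loops and stops once the running sum reaches 1 - ε, with the leftover probability assigned to its final loop. The function name `act_weights`, the ε value, and the assumption that Ultron uses exactly this remainder rule are illustrative, not taken from the repository.

```python
def act_weights(halt_probs, eps=0.01):
    """Return per-iteration mixing weights for one token (Graves-style ACT).

    halt_probs: halting probability emitted at each loop iteration.
    The token stops at the first iteration where the running sum
    reaches 1 - eps, or at the last available iteration.
    """
    weights = []
    cumulative = 0.0
    last = len(halt_probs) - 1
    for step, p in enumerate(halt_probs):
        if cumulative + p >= 1.0 - eps or step == last:
            weights.append(1.0 - cumulative)  # remainder goes to the final step
            break
        weights.append(p)
        cumulative += p
    return weights

easy = act_weights([0.6, 0.5, 0.9, 0.9])   # confident early: halts after 2 loops
hard = act_weights([0.05] * 8)             # never confident: uses all 8 loops
print(len(easy), len(hard))
```

The weights for each token sum to 1, so the final hidden state can be formed as a weighted average of the per-loop states.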

	## Installation

	```bash
	pip install torch
	git clone https://huggingface.co/trojan0x/ultron
	cd ultron
	```

	## Quick Start

	```python
	import torch
	from ultron.model import Ultron, UltronConfig

	# Minimal config for testing
	cfg = UltronConfig(
	    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
	    max_seq_len=2048,
	    prelude_layers=2, coda_layers=2,
	    recurrent_layers=4, max_loop_iters=8,
	    lora_rank=8,
	)

	model = Ultron(cfg)
	print(f"Parameters: {model.get_num_params():,}")
	print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")

	# Forward pass
	ids = torch.randint(0, 32000, (2, 128))
	logits = model(ids) # (2, 128, 32000)

	# Generation with depth extrapolation
	prompt = torch.randint(0, 32000, (1, 16))
	output = model.generate(prompt, max_new_tokens=64, n_loops=16) # deeper reasoning
	```

	## Pre-configured Variants

	```python
	from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large

	cfg = ultron_small() # ~75M params, effective depth 36 layers
	cfg = ultron_base() # ~166M params, effective depth 78 layers
	cfg = ultron_medium() # ~1B params, effective depth 136 layers
	cfg = ultron_large() # ~3B params, effective depth 300 layers
	```

	| Variant | dim | Heads | Prelude layers | Recurrent layers | Coda layers | Loops | Effective depth | Params |
	|---|---|---|---|---|---|---|---|---|
	| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
	| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
	| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
	| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
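
The effective-depth figures in the table are consistent with counting the prelude and coda once and the recurrent block once per loop. A small check, with the helper name `effective_depth` chosen here for illustration:

```python
def effective_depth(prelude, recurrent, coda, loops):
    """Layers a token passes through: prelude and coda once, recurrent block per loop."""
    return prelude + recurrent * loops + coda

# (prelude, recurrent, coda, loops) copied from the table above
variants = {
    "ultron_small":  (2, 4, 2, 8),
    "ultron_base":   (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large":  (6, 12, 6, 24),
}
depths = {name: effective_depth(*layout) for name, layout in variants.items()}
for name, depth in depths.items():
    print(name, depth)

# Doubling the loop count at inference deepens the model without new parameters.
extrapolated_small = effective_depth(2, 4, 2, loops=16)
print("ultron_small at 16 loops:", extrapolated_small)
```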

	## Improvements over OpenMythos

	| Feature | OpenMythos | Ultron | Rationale |
	|---|---|---|---|
	| Prelude norm | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
	| C output projection | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
	| Recurrent depth | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
	| ACT bias init | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
	| Grad checkpointing | None | ✅ Built-in | Required for memory-efficient loop unrolling |
	| MoE | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
	| Top-p sampling | Missing | ✅ Nucleus sampling support | Better generation quality |
	| LoRA init | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
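
To see why a halting bias of -3 encourages full loops early in training, consider the halting probability it implies. The sketch assumes the halting head applies a sigmoid to its logit, that the learned weights contribute roughly zero at initialization, and that halting triggers at a cumulative probability of 0.99; none of these details are confirmed by the repository.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Halting probability per loop when the logit is just the bias.
initial_halt_prob = sigmoid(-3.0)

# Loops needed to reach the assumed 0.99 threshold if the
# probability stayed at its initial value.
loops_to_halt = math.ceil(0.99 / initial_halt_prob)
print(f"{initial_halt_prob:.4f}", loops_to_halt)
```

Under these assumptions a token would need about 21 loops to halt on its own, well above the 8 loops in the Quick Start config, so every token runs the full budget until the halting head learns otherwise.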

	## Research Foundation

	Every component is grounded in published work:

	| Component | Paper | Key Result |
	|---|---|---|
	| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
	| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
	| Depth extrapolation | [Loop, Think, & Generalize (2026)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
	| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
	| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
	| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
	| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
	| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
	| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
	| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
	| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |
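
The KV-cache saving from GQA can be estimated with simple arithmetic using the Quick Start values (`dim=768`, `n_heads=12`, `n_kv_heads=4`). The sketch assumes `head_dim = dim // n_heads` and 2-byte (fp16/bf16) cache entries, and counts a single attention layer only. How Ultron caches keys and values across loop iterations is not covered here.

```python
def kv_cache_bytes_per_token(n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for one token in one attention layer."""
    return 2 * n_kv_heads * head_dim * bytes_per_value

dim, n_heads, n_kv_heads = 768, 12, 4   # Quick Start config
head_dim = dim // n_heads               # assumed: 64

mha = kv_cache_bytes_per_token(n_heads, head_dim)      # one K/V head per query head
gqa = kv_cache_bytes_per_token(n_kv_heads, head_dim)   # 3 query heads share each K/V head
print(mha, gqa, mha // gqa)
```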

	## Proven vs. Experimental

	### ✅ Proven (default ON)
	- LTI-stable injection with spectral radius < 1
	- Prelude normalization
	- Depth extrapolation via inference-time loops
	- ACT halting for adaptive compute
	- Depth-wise LoRA adaptation
	- GQA attention
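
Depth-wise LoRA can be pictured as one set of shared weights plus a small low-rank correction per loop index, in the spirit of Relaxed Recursive Transformers. The sketch below uses plain Python lists and hypothetical helpers (`make_lora`, `effective_weight`). It initializes the up-projection to exactly zero, the limiting case of near-zero initialization, so every loop starts with the shared weights unchanged.

```python
import random

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def make_lora(dim, rank, rng, up_scale=0.0):
    """Create a (down, up) pair; up_scale=0 makes the initial update zero."""
    down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]          # rank x dim
    up = [[up_scale * rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(dim)]  # dim x rank
    return down, up

def effective_weight(shared, lora):
    """Shared weight plus this loop's low-rank correction: W + up @ down."""
    down, up = lora
    delta = matmul(up, down)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(shared, delta)]

rng = random.Random(0)
dim, rank, n_loops = 4, 2, 3
shared = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(dim)]
loras = [make_lora(dim, rank, rng) for _ in range(n_loops)]   # one adapter per loop index

weights = [effective_weight(shared, lora) for lora in loras]
print(all(w == shared for w in weights))   # every loop starts identical to the shared block
```

As training moves the up-projections away from zero, each loop iteration gets its own slightly different weights while the bulk of the parameters stays shared.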

	### ⚠️ Experimental (optional, default OFF)
	- MoE FFN in recurrent block (`use_moe=True`)
	- MLA attention (`attn_type="mla"`)
	- Loop-index sinusoidal embedding

	## Training Recipe (from Parcae)

	Hyperparameters follow the published recipes listed in the Source column:

	| Setting | Value | Source |
	|---|---|---|
	| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
	| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
	| Schedule | Cosine decay with warmup | Parcae |
	| Warmup steps | 2000 | Parcae |
	| Weight decay | 0.1 | Parcae |
	| Batch size | 512 × 1280 tokens | Saunshi et al. |
	| Dataset | FineWeb-Edu | Parcae / FineWeb |
	| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
	| Depth sampling | Per-sequence within micro-batch | Parcae |
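
The schedule and truncation rows can be turned into a short sketch. The peak learning rate and warmup length come from the table; `total_steps` and `min_lr` are placeholder assumptions, since the table does not state them. Here μ_rec is read as the mean number of loops and μ_bwd as the number of loops that receive gradients, which is an interpretation of the table rather than a quoted definition.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=2000,
                  total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def backward_depth(mu_rec):
    """mu_bwd = ceil(mu_rec / 2), as given in the table."""
    return math.ceil(mu_rec / 2)

for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
print(backward_depth(8), backward_depth(5))
```

With PyTorch, the same function could be passed to a lambda-based scheduler by dividing its result by the peak learning rate.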

	## License

	MIT License

	## Citation

	```bibtex
	@software{ultron2026,
	  title = {Ultron: An Open-Source Recurrent-Depth Transformer},
	  year  = {2026},
	  url   = {https://huggingface.co/trojan0x/ultron},
	  note  = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
	}
	```