trojan0x
/

ultron

trojan0x committed on Apr 21

Commit

f61c86a

verified ·

1 Parent(s): 02ba0f0

Add Ultron README

Files changed (1)

README.md +167 -0

README.md ADDED

	@@ -0,0 +1,167 @@

+# 🤖 Ultron — Recurrent-Depth Transformer
+> **An open-source, research-grounded looped transformer for latent reasoning.**
+Ultron is a clean implementation of a Recurrent-Depth Transformer (RDT) that combines **only proven techniques** from the latest research. Unlike speculative reconstructions, every architectural choice in Ultron is backed by published results with clear attribution.
+## Architecture
+```
+Input tokens (B, T)
+    ↓
+[Embedding + RoPE]
+    ↓
+[Prelude]              — L_p standard transformer blocks, run once
+    ↓
+[LayerNorm(e)]         — Prelude normalization (Parcae stability trick)
+    ↓
+[Recurrent Block ×N]   — L_r transformer layers, looped N times
+    ↑_________↓          h_{t+1} = A·h_t + B·e + R(h_t, e)  [LTI-stable]
+    ↓                    + depth-wise LoRA + ACT halting
+[C · h_N]              — Output projection
+    ↓
+[Coda]                 — L_c standard transformer blocks, run once
+    ↓
+[RMSNorm → LM Head]
+    ↓
+Output logits (B, T, vocab_size)
+```
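As a rough illustration of the recurrence in the diagram, the sketch below iterates only the linear part `h_{t+1} = A·h_t + B·e` with a diagonal `A` whose entries are squashed into (0, 1) by a sigmoid, so that ρ(A) < 1 holds in exact arithmetic for any finite raw parameter. The sigmoid parameterization, the function names, and the omission of the nonlinear term `R(h_t, e)` are assumptions made for illustration; the README does not state how Ultron or Parcae actually construct `A`.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def stable_diagonal(raw: list[float]) -> list[float]:
    """Map unconstrained parameters to diagonal entries of A in (0, 1)."""
    return [sigmoid(r) for r in raw]


def spectral_radius(diag_a: list[float]) -> float:
    """For a diagonal matrix, rho(A) is the largest absolute entry."""
    return max(abs(a) for a in diag_a)


def run_linear_loop(diag_a, diag_b, e, n_loops):
    """Iterate h_{t+1} = A*h_t + B*e elementwise, starting from h_0 = 0."""
    h = [0.0] * len(e)
    for _ in range(n_loops):
        h = [a * h_i + b * e_i for a, b, h_i, e_i in zip(diag_a, diag_b, h, e)]
    return h


raw_a = [-2.0, 0.0, 2.0]          # hypothetical unconstrained parameters
diag_a = stable_diagonal(raw_a)   # every entry lies strictly between 0 and 1
diag_b = [1.0, 1.0, 1.0]
e = [1.0, -2.0, 0.5]              # stand-in for the normalized prelude output

h_8 = run_linear_loop(diag_a, diag_b, e, n_loops=8)
h_64 = run_linear_loop(diag_a, diag_b, e, n_loops=64)

# With rho(A) < 1 the linear state approaches the fixed point b*e / (1 - a),
# so running extra loops cannot make it blow up.
fixed_point = [b * x / (1.0 - a) for a, b, x in zip(diag_a, diag_b, e)]
```

In this toy setting the state after 64 loops is closer to the fixed point than the state after 8 loops, which is the boundedness property that the ρ(A) < 1 constraint is meant to provide.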
+### Key Design Principles
+1. **Only proven components**: Every technique has published results. MoE is optional (default OFF) because MoE + looping is untested at scale.
+2. **Parcae stability**: LTI-constrained injection (ρ(A) < 1 by construction), prelude normalization, per-sequence depth sampling.
+3. **Depth extrapolation**: Train on N loops, test on N+k. Running more loops at inference time gives the model more latent reasoning steps.
+4. **Adaptive compute**: ACT halting lets easy tokens exit early while hard tokens get the full loop depth.
+5. **Parameter efficiency**: A 770M looped model matches a 1.3B standard transformer (Parcae, 2026).
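The adaptive-compute principle above can be sketched for a single token in the style of Graves (2016): halting probabilities are accumulated across loops, and the output is a probability-weighted mix of the per-loop states. This is a minimal illustration with scalar states; the function name `act_combine` and the threshold `eps=0.01` are assumptions, not Ultron's actual implementation.

```python
def act_combine(states, halt_probs, eps=0.01):
    """Graves-style ACT for one token.

    states[t]     : hidden value after loop t (a scalar here for simplicity)
    halt_probs[t] : halting probability emitted at loop t
    Returns (weighted_state, n_loops_used).
    """
    cumulative = 0.0
    output = 0.0
    for t, (state, p) in enumerate(zip(states, halt_probs)):
        is_last = t == len(states) - 1
        if cumulative + p >= 1.0 - eps or is_last:
            # The halting step receives the remaining probability mass,
            # so the weights always sum to exactly 1.
            remainder = 1.0 - cumulative
            output += remainder * state
            return output, t + 1
        cumulative += p
        output += p * state
    raise ValueError("states must be non-empty")


states = [1.0, 2.0, 3.0, 4.0]
easy_out, easy_loops = act_combine(states, [0.6, 0.5, 0.1, 0.1])
hard_out, hard_loops = act_combine(states, [0.05, 0.05, 0.05, 0.05])
```

Here the "easy" token halts after two loops, while the "hard" token never reaches the threshold and uses all four.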
+## Installation
+```bash
+pip install torch
+git clone https://huggingface.co/trojan0x/ultron
+cd ultron
+```
+## Quick Start
+```python
+import torch
+from ultron.model import Ultron, UltronConfig
+# Minimal config for testing
+cfg = UltronConfig(
+    vocab_size=32000, dim=768, n_heads=12, n_kv_heads=4,
+    max_seq_len=2048,
+    prelude_layers=2, coda_layers=2,
+    recurrent_layers=4, max_loop_iters=8,
+    lora_rank=8,
+)
+model = Ultron(cfg)
+print(f"Parameters: {model.get_num_params():,}")
+print(f"Spectral radius ρ(A): {model.get_spectral_radius():.6f} (must be < 1)")
+# Forward pass on random token ids
+ids = torch.randint(0, 32000, (2, 128))
+logits = model(ids)  # shape: (2, 128, 32000)
+# Generation with depth extrapolation: n_loops=16 exceeds max_loop_iters=8 from the config
+prompt = torch.randint(0, 32000, (1, 16))
+output = model.generate(prompt, max_new_tokens=64, n_loops=16)
+```
+## Pre-configured Variants
+```python
+from ultron.variants import ultron_small, ultron_base, ultron_medium, ultron_large
+cfg = ultron_small()   # ~75M params, effective depth 36 layers
+cfg = ultron_base()    # ~166M params, effective depth 78 layers
+cfg = ultron_medium()  # ~1B params, effective depth 136 layers
+cfg = ultron_large()   # ~3B params, effective depth 300 layers
+```
+| Variant | dim | heads | Prelude | Recurrent | Coda | Loops | Effective Depth | Params |
+|---|---|---|---|---|---|---|---|---|
+| `ultron_small` | 768 | 12 | 2 | 4 | 2 | 8 | 36 | ~75M |
+| `ultron_base` | 1024 | 16 | 3 | 6 | 3 | 12 | 78 | ~166M |
+| `ultron_medium` | 2048 | 16 | 4 | 8 | 4 | 16 | 136 | ~1B |
+| `ultron_large` | 4096 | 32 | 6 | 12 | 6 | 24 | 300 | ~3B |
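The Effective Depth column appears to follow `prelude + recurrent × loops + coda`, and that formula reproduces every value in the table. The helper below is a small check of that arithmetic; `effective_depth` and the `variants` dictionary are illustrative names and are not part of the `ultron` package.

```python
def effective_depth(prelude: int, recurrent: int, coda: int, loops: int) -> int:
    """Layers a token passes through when the recurrent block runs `loops` times."""
    return prelude + recurrent * loops + coda


# (prelude, recurrent, coda, loops) copied from the table above
variants = {
    "ultron_small": (2, 4, 2, 8),
    "ultron_base": (3, 6, 3, 12),
    "ultron_medium": (4, 8, 4, 16),
    "ultron_large": (6, 12, 6, 24),
}

depths = {name: effective_depth(*cfg) for name, cfg in variants.items()}

# Depth extrapolation changes only the loop count, for example running
# ultron_small with 16 loops instead of 8.
small_extrapolated = effective_depth(2, 4, 2, loops=16)
```

The parameter count stays fixed under extrapolation because the recurrent layers are reused; only the compute per token grows.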
+## Improvements over OpenMythos
+| Feature | OpenMythos | **Ultron** | Rationale |
+|---|---|---|---|
+| **Prelude norm** | Missing | ✅ RMSNorm on encoded input | Critical for stability at 1.3B+ scale (Parcae Appendix J) |
+| **C output projection** | Missing | ✅ Diagonal C matrix | Completes the LTI dynamical system (Parcae) |
+| **Recurrent depth** | 1 layer per loop | ✅ Multiple layers per loop | More expressive recurrent block |
+| **ACT bias init** | Default | ✅ Bias = -3 (encourage full loops early) | Prevents premature halting during early training |
+| **Grad checkpointing** | None | ✅ Built-in | Required for memory-efficient loop unrolling |
+| **MoE** | Always on (64 experts) | ✅ Optional (default OFF) | MoE + looping is unproven |
+| **Top-p sampling** | Missing | ✅ Nucleus sampling support | Better generation quality |
+| **LoRA init** | Random | ✅ Near-zero initialization | Starts as near-identity, prevents early instability |
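To see why an ACT bias of -3 discourages early halting, the sketch below assumes that at initialization the halting logit is roughly equal to the bias (weights near zero), so every loop emits the same halting probability `sigmoid(bias)`. The threshold `eps=0.01` and the helper `loops_until_halt` are assumptions for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def loops_until_halt(bias: float, eps: float = 0.01) -> int:
    """Smallest n such that n * sigmoid(bias) >= 1 - eps."""
    p = sigmoid(bias)
    return math.ceil((1.0 - eps) / p)


p_init = sigmoid(-3.0)                 # about 0.047 halting probability per loop
loops_bias_m3 = loops_until_halt(-3.0)  # loops needed before ACT would halt
loops_bias_0 = loops_until_halt(0.0)    # same calculation with a zero bias
```

Under these assumptions a zero bias would halt after about two loops, whereas a bias of -3 needs roughly 21, which is more than the 8 to 24 loops used by the variants above, so early in training every token runs the full loop count.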
+## Research Foundation
+Every component is grounded in published work:
+| Component | Paper | Key Result |
+|---|---|---|
+| LTI-stable injection | [Parcae (Prairie et al., 2026)](https://arxiv.org/abs/2604.12946) | 6.3% lower PPL, eliminates training instability |
+| Prelude normalization | [Parcae, Appendix J](https://arxiv.org/abs/2604.12946) | Critical for stability at 1.3B+ scale |
+| Depth extrapolation | [Loop, Think, & Generalize (2026)](https://arxiv.org/abs/2604.07822) | Train 5-hop, test 10-hop by increasing loops |
+| Depth-wise LoRA | [Relaxed Recursive Transformers (Bae et al., 2024)](https://arxiv.org/abs/2410.20672) | Recursive Gemma 1B recovers most of Gemma 2B |
+| Looped = implicit CoT | [Saunshi et al., 2025](https://arxiv.org/abs/2502.17416) | Formally proven: T loops simulate T steps of CoT |
+| ACT halting | [Graves, 2016](https://arxiv.org/abs/1603.08983) | Per-position adaptive computation |
+| GQA | [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245) | Efficient KV cache, proven with looping |
+| RMSNorm | [Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467) | Standard normalization |
+| RoPE | [Su et al., 2021](https://arxiv.org/abs/2104.09864) | Rotary positional encoding |
+| MLA (optional) | [DeepSeek-V2, 2024](https://arxiv.org/abs/2405.04434) | 10-20× smaller KV cache |
+| MoE (optional) | [DeepSeekMoE, 2024](https://arxiv.org/abs/2401.06066) | Fine-grained expert routing |
+## Proven vs. Experimental
+### ✅ Proven (default ON)
+- LTI-stable injection with spectral radius < 1
+- Prelude normalization
+- Depth extrapolation via inference-time loops
+- ACT halting for adaptive compute
+- Depth-wise LoRA adaptation
+- GQA attention
+### ⚠️ Experimental (optional, default OFF)
+- MoE FFN in recurrent block (`use_moe=True`)
+- MLA attention (`attn_type="mla"`)
+- Loop-index sinusoidal embedding
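The README does not specify how the loop-index sinusoidal embedding is computed. One plausible reading, sketched below, applies the standard sinusoidal position encoding to the loop counter instead of the token position, so the shared recurrent block can tell iterations apart. The function name, the `base` constant, and the interleaved sin/cos layout are all assumptions.

```python
import math


def loop_index_embedding(loop_idx: int, dim: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal encoding of the loop counter (hypothetical layout)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        emb.append(math.sin(loop_idx * freq))
        emb.append(math.cos(loop_idx * freq))
    return emb


emb_loop0 = loop_index_embedding(0, dim=8)
emb_loop3 = loop_index_embedding(3, dim=8)
```

Because the encoding is a fixed function of the loop index, it adds no parameters and is defined for loop counts beyond those seen in training.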
+## Training Recipe (from Parcae)
+Hyperparameters are taken from published work; the Source column gives the origin of each setting:
+| Setting | Value | Source |
+|---|---|---|
+| Optimizer | AdamW (β1=0.9, β2=0.95) | Standard |
+| Learning rate | 3e-4 (140M), 2e-4 (370M+) | Parcae |
+| Schedule | Cosine decay with warmup | Parcae |
+| Warmup steps | 2000 | Parcae |
+| Weight decay | 0.1 | Parcae |
+| Batch size | 512 × 1280 tokens | Saunshi et al. |
+| Dataset | FineWeb-Edu | Parcae / FineWeb |
+| μ_bwd | ⌈μ_rec/2⌉ | Parcae (backprop truncation) |
+| Depth sampling | Per-sequence within micro-batch | Parcae |
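The schedule and truncation rows in the table can be sketched as plain functions. The peak learning rate and warmup length come from the table; the linear warmup shape, `total_steps=100_000`, and `min_lr=0.0` are placeholder assumptions, and `backward_loops` reads μ_bwd as the number of loops that receive gradients.

```python
import math


def lr_at(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000,
          total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def backward_loops(mu_rec: int) -> int:
    """mu_bwd = ceil(mu_rec / 2), as listed in the table."""
    return math.ceil(mu_rec / 2)


lr_start = lr_at(0)           # first warmup step
lr_peak = lr_at(2000)         # warmup finished, cosine decay begins
lr_end = lr_at(100_000)       # end of the assumed training run
mu_bwd_8 = backward_loops(8)  # 8 mean loops -> backprop through 4
mu_bwd_7 = backward_loops(7)  # odd counts round up
```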
+## License
+MIT License
+## Citation
+```bibtex
+@software{ultron2026,
+  title   = {Ultron: An Open-Source Recurrent-Depth Transformer},
+  year    = {2026},
+  url     = {https://huggingface.co/trojan0x/ultron},
+  note    = {Grounded in Parcae, Relaxed Recursive Transformers, and looped transformer theory}
+}
+```