Ultron-Small Baseline

Recurrent-Depth Transformer pretrained on FineWeb-Edu

Architecture

Ultron is a Recurrent-Depth Transformer (RDT) that loops a small set of transformer layers multiple times per forward pass, enabling "thinking in latent space" — implicit chain-of-thought reasoning without generating intermediate tokens.

Input → [Embedding + RoPE] → [Prelude (2 layers)] → [LN(e)] → [RecurrentBlock ×8 loops (4 layers/loop)] → [C·h] → [Coda (2 layers)] → [LM Head]
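
The snippet below is a minimal sketch of this recurrent-depth pattern (embed, prelude, a weight-shared core block applied n_loops times, then coda and LM head), not the actual Ultron implementation. The module names, the use of nn.TransformerEncoderLayer, and the state re-injection h + e are illustrative assumptions, and RoPE, causal masking, and GQA are omitted for brevity.

import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Toy recurrent-depth transformer: one small core block is re-applied
    n_loops times between a fixed prelude and coda (weights are shared
    across loops, so effective depth grows without adding parameters)."""

    def __init__(self, d_model=768, n_heads=12, n_prelude=2, n_core=4, n_coda=2, vocab=32000):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        self.prelude = nn.Sequential(*[make() for _ in range(n_prelude)])
        self.core = nn.Sequential(*[make() for _ in range(n_core)])   # looped block
        self.coda = nn.Sequential(*[make() for _ in range(n_coda)])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, input_ids, n_loops=8):
        e = self.norm(self.prelude(self.embed(input_ids)))   # "LN(e)" in the diagram
        h = e
        for _ in range(n_loops):      # 8 loops at train time, more for deeper reasoning
            h = self.core(h + e)      # re-inject the prelude state each loop (assumed)
        return self.lm_head(self.coda(h))

With the defaults this gives an effective depth of 2 + 8 × 4 + 2 = 36 layers, matching the table below, while only 2 + 4 + 2 = 8 layers of weights are stored.
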
Property                    Value
Total Parameters            88.9M
Non-embedding Parameters    50.4M
Effective Depth             36 layers (8 loops × 4 layers + 2 prelude + 2 coda)
Attention                   GQA (12 heads, 4 KV heads)
Hidden Dimension            768
Sequence Length             1024
MoE                         No (dense FFN)
Spectral Radius             ρ(A) < 1 by construction (LTI stability)
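
The ρ(A) < 1 row refers to keeping the looped recurrence contractive, so that applying the core block many times remains stable (the linear-time-invariant view of the loop). The helper below is only a generic illustration of one way to obtain that property, by rescaling a recurrence matrix with its largest singular value; it is not taken from the Ultron code, which enforces the bound by construction.

import torch

def rescale_to_contractive(W: torch.Tensor, target: float = 0.99) -> torch.Tensor:
    """Rescale a square recurrence matrix so its spectral radius is < 1.
    Generic illustration only; Ultron's actual mechanism is not shown here."""
    # The spectral radius (max |eigenvalue|) is bounded above by the largest
    # singular value, so shrinking that bound below 1 is sufficient.
    sigma_max = torch.linalg.matrix_norm(W, ord=2)
    if sigma_max >= 1.0:
        W = W * (target / sigma_max)
    return W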

Training

  • Dataset: FineWeb-Edu (sample-10BT)
  • Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8, weight decay=0.1)
  • LR Schedule: Linear warmup (1000 steps) + cosine decay (3e-4 → 3e-5)
  • Batch Size: 8 × 8 gradient accumulation = 64 effective
  • Precision: bf16
  • Depth Sampling: Parcae geometric per-sequence (μ=8); see the sampling sketch after this list
  • Gradient Checkpointing: Enabled
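
The depth-sampling bullet can be read as drawing the number of loops for each training sequence from a geometric distribution with mean μ = 8 rather than fixing it at 8, which is what lets the model tolerate n_loops values larger than those seen in training. The sketch below is one plausible reading of that scheme; the exact Parcae recipe (support, truncation, per-sequence vs. per-batch draws) is an assumption.

import torch

def sample_n_loops(mu: float = 8.0, max_loops: int = 32) -> int:
    """Draw a per-sequence loop count with mean mu (assumed scheme)."""
    # torch's Geometric has support {0, 1, 2, ...} with mean (1 - p) / p,
    # so shifting by +1 gives support {1, 2, ...} with mean 1 / p = mu.
    p = 1.0 / mu
    n = int(torch.distributions.Geometric(probs=p).sample().item()) + 1
    return min(n, max_loops)   # cap to keep activation memory bounded

# e.g. once per training sequence:
# n_loops = sample_n_loops(); logits = model(input_ids, n_loops=n_loops)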

Usage

from ultron.model import Ultron, UltronConfig
import torch

# Load from checkpoint
ckpt = torch.load("ultron_final.pt", weights_only=False)
cfg = UltronConfig(**ckpt["config"])
model = Ultron(cfg)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Example input: a batch of token ids, shape [batch, seq_len <= 1024]
# (replace the random ids with tokenized text; assumes the config exposes vocab_size)
input_ids = torch.randint(0, cfg.vocab_size, (1, 128))

# Inference with depth extrapolation
with torch.no_grad():
    logits = model(input_ids, n_loops=8)   # standard (training depth)
    logits = model(input_ids, n_loops=16)  # deeper latent reasoning at inference time

Repository

trojan0x/ultron — Full source code, training scripts, and notebook
