Ultron-Small Baseline

Recurrent-Depth Transformer pretrained on FineWeb-Edu

Architecture

Ultron is a Recurrent-Depth Transformer (RDT) that loops a small set of transformer layers multiple times per forward pass, enabling "thinking in latent space" — implicit chain-of-thought reasoning without generating intermediate tokens.

Input → [Embedding + RoPE] → [Prelude (2 layers)] → [LN(e)] → [RecurrentBlock ×8 loops (4 layers/loop)] → [C·h] → [Coda (2 layers)] → [LM Head]
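
The snippet below is a minimal sketch of this recurrent-depth pattern (embed, prelude, a weight-shared core block applied n_loops times, then coda and LM head), not the actual Ultron implementation. The module names, the use of nn.TransformerEncoderLayer, and the state re-injection h + e are illustrative assumptions, and RoPE, causal masking, and GQA are omitted for brevity.

import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Toy recurrent-depth transformer: one small core block is re-applied
    n_loops times between a fixed prelude and coda (weights are shared
    across loops, so effective depth grows without adding parameters)."""

    def __init__(self, d_model=768, n_heads=12, n_prelude=2, n_core=4, n_coda=2, vocab=32000):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        self.prelude = nn.Sequential(*[make() for _ in range(n_prelude)])
        self.core = nn.Sequential(*[make() for _ in range(n_core)])   # looped block
        self.coda = nn.Sequential(*[make() for _ in range(n_coda)])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, input_ids, n_loops=8):
        e = self.norm(self.prelude(self.embed(input_ids)))   # "LN(e)" in the diagram
        h = e
        for _ in range(n_loops):      # 8 loops at train time, more for deeper reasoning
            h = self.core(h + e)      # re-inject the prelude state each loop (assumed)
        return self.lm_head(self.coda(h))

With the defaults this gives an effective depth of 2 + 8 × 4 + 2 = 36 layers, matching the table below, while only 2 + 4 + 2 = 8 layers of weights are stored.
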
Property                    Value
Total Parameters            88.9M
Non-embedding Parameters    50.4M
Effective Depth             36 layers (8 loops × 4 layers + 2 prelude + 2 coda)
Attention                   GQA (12 heads, 4 KV heads)
Hidden Dimension            768
Sequence Length             1024
MoE                         No (dense FFN)
Spectral Radius             ρ(A) < 1 by construction (LTI stability)
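
The ρ(A) < 1 row refers to keeping the looped recurrence contractive, so that applying the core block many times remains stable (the linear-time-invariant view of the loop). The helper below is only a generic illustration of one way to obtain that property, by rescaling a recurrence matrix with its largest singular value; it is not taken from the Ultron code, which enforces the bound by construction.

import torch

def rescale_to_contractive(W: torch.Tensor, target: float = 0.99) -> torch.Tensor:
    """Rescale a square recurrence matrix so its spectral radius is < 1.
    Generic illustration only; Ultron's actual mechanism is not shown here."""
    # The spectral radius (max |eigenvalue|) is bounded above by the largest
    # singular value, so shrinking that bound below 1 is sufficient.
    sigma_max = torch.linalg.matrix_norm(W, ord=2)
    if sigma_max >= 1.0:
        W = W * (target / sigma_max)
    return W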

Training

  • Dataset: FineWeb-Edu (sample-10BT)
  • Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8, weight decay=0.1)
  • LR Schedule: Linear warmup (1000 steps) + cosine decay (3e-4 → 3e-5)
  • Batch Size: 8 × 8 gradient accumulation = 64 effective
  • Precision: bf16
  • Depth Sampling: Parcae geometric per-sequence (μ=8); see the sampling sketch after this list
  • Gradient Checkpointing: Enabled
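
The depth-sampling bullet can be read as drawing the number of loops for each training sequence from a geometric distribution with mean μ = 8 rather than fixing it at 8, which is what lets the model tolerate n_loops values larger than those seen in training. The sketch below is one plausible reading of that scheme; the exact Parcae recipe (support, truncation, per-sequence vs. per-batch draws) is an assumption.

import torch

def sample_n_loops(mu: float = 8.0, max_loops: int = 32) -> int:
    """Draw a per-sequence loop count with mean mu (assumed scheme)."""
    # torch's Geometric has support {0, 1, 2, ...} with mean (1 - p) / p,
    # so shifting by +1 gives support {1, 2, ...} with mean 1 / p = mu.
    p = 1.0 / mu
    n = int(torch.distributions.Geometric(probs=p).sample().item()) + 1
    return min(n, max_loops)   # cap to keep activation memory bounded

# e.g. once per training sequence:
# n_loops = sample_n_loops(); logits = model(input_ids, n_loops=n_loops)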

Usage

from ultron.model import Ultron, UltronConfig
import torch

# Load from checkpoint
ckpt = torch.load("ultron_final.pt", weights_only=False)
cfg = UltronConfig(**ckpt["config"])
model = Ultron(cfg)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Example input: a batch of token ids, shape [batch, seq_len <= 1024]
# (replace the random ids with tokenized text; assumes the config exposes vocab_size)
input_ids = torch.randint(0, cfg.vocab_size, (1, 128))

# Inference with depth extrapolation
with torch.no_grad():
    logits = model(input_ids, n_loops=8)   # standard (training depth)
    logits = model(input_ids, n_loops=16)  # deeper latent reasoning at inference time

Repository

trojan0x/ultron — Full source code, training scripts, and notebook
