Paper: Parcae: Scaling Laws For Stable Looped Language Models (2604.12946)
Recurrent-Depth Transformer pretrained on FineWeb-Edu
Ultron is a Recurrent-Depth Transformer (RDT) that loops a small set of transformer layers multiple times per forward pass, enabling "thinking in latent space" — implicit chain-of-thought reasoning without generating intermediate tokens.
Input → [Embedding + RoPE] → [Prelude (2 layers)] → [LN(e)] → [RecurrentBlock ×8 loops (4 layers/loop)] → [C·h] → [Coda (2 layers)] → [LM Head]
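For orientation, below is a minimal PyTorch sketch of how a prelude / looped block / coda forward pass of this shape can be wired together. It is not the Ultron implementation from the repo: the class and helper names, the zero-initialized latent state, the additive re-injection of the prelude output `e`, and the per-channel `C` gate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def make_layers(n, d_model=768, n_heads=12):
    """Stack of standard encoder layers standing in for the prelude/loop/coda blocks."""
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        for _ in range(n)
    ])

class RecurrentDepthSketch(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.prelude = make_layers(2, d_model)       # 2 prelude layers
        self.loop_block = make_layers(4, d_model)    # 4 layers reused every loop
        self.coda = make_layers(2, d_model)          # 2 coda layers
        self.ln_e = nn.LayerNorm(d_model)            # LN(e) on the prelude output
        self.c = nn.Parameter(torch.ones(d_model))   # per-channel gate for C·h (assumed form)

    def forward(self, x, n_loops=8):
        # x: (batch, seq, d_model) hidden states after embedding + RoPE
        e = self.ln_e(self.prelude(x))               # injected representation, fixed across loops
        h = torch.zeros_like(e)                      # latent "thinking" state (assumed zero init)
        for _ in range(n_loops):                     # same weights applied n_loops times
            h = self.loop_block(h + e)               # re-inject e each loop (assumed additive coupling)
        return self.coda(self.c * h)                 # C·h scaling, then coda; LM head omitted
```

The key point the sketch illustrates is weight sharing: the loop block's parameters are reused on every iteration, so the effective depth can be changed at inference time without changing the parameter count.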
| Property | Value |
|---|---|
| Total Parameters | 88.9M |
| Non-embedding Parameters | 50.4M |
| Effective Depth | 36 layers (8 loops × 4 layers + 2 prelude + 2 coda) |
| Attention | GQA (12 heads, 4 KV heads) |
| Hidden Dimension | 768 |
| Sequence Length | 1024 |
| MoE | No (dense FFN) |
| Spectral Radius ρ(A) | < 1 by construction (LTI stability) |
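The stability claim in the last row (ρ(A) < 1) means the linear part of the latent recurrence is a contractive LTI map, so repeated loops cannot blow up the state. Below is a minimal sketch of how such a bound could be checked or enforced on a recurrence matrix; the matrix `A`, the `enforce_stability` helper, and the rescaling rule are illustrative assumptions, not the mechanism used in the repo.

```python
import torch

def spectral_radius(A: torch.Tensor) -> float:
    """Largest absolute eigenvalue of a square matrix A."""
    return torch.linalg.eigvals(A).abs().max().item()

def enforce_stability(A: torch.Tensor, target: float = 0.99) -> torch.Tensor:
    """Rescale A so its spectral radius stays below `target` (one possible 'by construction' rule)."""
    rho = spectral_radius(A)
    return A * (target / rho) if rho > target else A

A = torch.randn(768, 768) / 768 ** 0.5     # illustrative recurrence matrix, not from the checkpoint
A = enforce_stability(A)
assert spectral_radius(A) < 1.0            # a stable LTI recurrence h_{t+1} = A h_t + ... stays bounded
```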
```python
from ultron.model import Ultron, UltronConfig
import torch

# Load from checkpoint
ckpt = torch.load("ultron_final.pt", weights_only=False)
cfg = UltronConfig(**ckpt["config"])
model = Ultron(cfg)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Inference with depth extrapolation
input_ids = torch.randint(0, 1000, (1, 128))  # placeholder (batch, seq) token IDs; use your tokenizer in practice
with torch.no_grad():
    logits = model(input_ids, n_loops=8)   # standard depth (8 loops, as trained)
    logits = model(input_ids, n_loops=16)  # depth extrapolation: deeper latent reasoning
```
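Since the snippet above only produces logits, a small decoding loop is needed to actually generate text. Here is a hedged sketch of greedy decoding that keeps the deeper `n_loops` setting at every step; `greedy_generate` is an illustrative helper, not part of the `ultron` package, and it assumes the model returns logits of shape (batch, seq, vocab) as used above.

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=32, n_loops=16):
    """Greedy decoding sketch that reuses the chosen loop depth at every step."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids, n_loops=n_loops)                    # (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True) # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```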
trojan0x/ultron — Full source code, training scripts, and notebook