Ultron-Small MoE (Experimental)

Recurrent-Depth Transformer with Mixture-of-Experts, pretrained on FineWeb-Edu

⚠️ Experimental: MoE inside a recurrent loop has never been validated in published research. This is an exploration to test whether sparse expert routing helps or hurts when weights are shared across loop iterations.

Architecture

Same as ultron-small-baseline, but with MoE replacing the dense FFN in the recurrent block.

| Property | Value |
|---|---|
| Total Parameters | 105.5M |
| Non-embedding Parameters | 66.9M |
| Effective Depth | 36 layers (8 loops × 4 layers + 2 prelude + 2 coda) |
| Attention | GQA (12 heads, 4 KV heads) |
| Hidden Dimension | 768 |
| MoE | 8 experts, 1 shared, top-2 routing |
| Expert Dimension | 384 |
| Spectral Radius | ρ(A) < 1 by construction |
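
For orientation, here is a minimal PyTorch sketch of what top-2 expert routing with one shared expert inside a weight-shared recurrent block can look like. All names (Expert, RecurrentMoEBlock, n_loops, and so on) are illustrative assumptions rather than the actual ultron source, and attention, normalization, and the prelude/coda layers are omitted.

```python
# Illustrative sketch only -- not the ultron implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small gated-free FFN expert: d_model -> d_expert -> d_model."""

    def __init__(self, d_model: int, d_expert: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_expert)
        self.down = nn.Linear(d_expert, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class RecurrentMoEBlock(nn.Module):
    """One weight-shared block; the same parameters are reused at every loop iteration."""

    def __init__(self, d_model=768, d_expert=384, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(Expert(d_model, d_expert) for _ in range(n_experts))
        self.shared = Expert(d_model, d_expert)  # always-active shared expert
        self.top_k = top_k

    def moe(self, x):
        # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)                 # (tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)          # top-2 routing
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
        out = self.shared(x)                                   # shared-expert path
        for e, expert in enumerate(self.experts):
            # combined routing weight of expert e for each token (0 if not selected)
            gate = (weights * (idx == e)).sum(dim=-1)          # (tokens,)
            sel = gate.nonzero(as_tuple=True)[0]
            if sel.numel() > 0:
                out = out.index_add(0, sel, gate[sel, None] * expert(x[sel]))
        return out

    def forward(self, h, n_loops=8):
        # Identical weights at every iteration; only the routing decisions
        # (and the hidden state) change with "depth".
        for _ in range(n_loops):
            h = h + self.moe(h)
        return h


# Toy forward pass: 16 tokens of width 768 through 8 loop iterations.
h = torch.randn(16, 768)
print(RecurrentMoEBlock()(h, n_loops=8).shape)  # torch.Size([16, 768])
```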

Research Question

Does MoE help inside a recurrent loop?

The hypothesis: different loop iterations may activate different experts, effectively creating depth-conditional routing. If expert specialization emerges per-depth, MoE could unlock more expressive recurrent computation without proportionally increasing FLOPs.
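
One way to probe this is to record which experts the router selects at each loop iteration and compare the distributions across depths. The sketch below is hypothetical; routing_histograms and its arguments are not part of the released code.

```python
# Hypothetical probe: count router top-k selections per loop iteration.
# `router` is any module mapping (tokens, d_model) -> (tokens, n_experts);
# `hidden_per_loop` is a list of hidden states captured at each iteration.
from collections import Counter

import torch


@torch.no_grad()
def routing_histograms(router, hidden_per_loop, top_k=2):
    hists = []
    for h in hidden_per_loop:
        picks = router(h).topk(top_k, dim=-1).indices.flatten().tolist()
        hists.append(Counter(picks))
    # Histograms that diverge across iterations would hint at depth-conditional routing.
    return hists
```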

Training

Same recipe as the baseline; the only difference is use_moe=True.
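
For illustration, the intended config diff looks roughly like the following. Only use_moe is confirmed by this card; the remaining keys are assumptions made for the sketch.

```python
# Illustrative training-config diff; only `use_moe` is stated above,
# the other keys are assumed for the example.
baseline_cfg = dict(d_model=768, n_heads=12, n_kv_heads=4, n_loops=8, use_moe=False)
moe_cfg = {**baseline_cfg, "use_moe": True,
           "n_experts": 8, "n_shared_experts": 1, "top_k": 2, "d_expert": 384}
```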

Repository

trojan0x/ultron — Full source code, training scripts, and notebook
