# Ultron-Small MoE (Experimental)
Recurrent-Depth Transformer with Mixture-of-Experts, pretrained on FineWeb-Edu
⚠️ Experimental: MoE inside a recurrent loop has never been validated in published research. This is an exploration to test whether sparse expert routing helps or hurts when weights are shared across loop iterations.
## Architecture
Same as ultron-small-baseline, but with MoE replacing the dense FFN in the recurrent block.
| Property | Value |
|---|---|
| Total Parameters | 105.5M |
| Non-embedding Parameters | 66.9M |
| Effective Depth | 36 layers (8 loops × 4 layers + 2 prelude + 2 coda) |
| Attention | GQA (12 heads, 4 KV heads) |
| Hidden Dimension | 768 |
| MoE | 8 experts, 1 shared, top-2 routing |
| Expert Dimension | 384 |
| Spectral Radius ρ(A) | < 1 by construction |
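The routed-plus-shared-expert layout above can be summarized in a few lines of PyTorch. The sketch below is illustrative only: class and argument names (`MoEFFN`, `d_expert`, and so on) are assumptions rather than the repository's API, and it omits attention, normalization, and any load-balancing loss.

```python
# Illustrative sketch of the MoE FFN described in the table above. Names are
# assumptions, not the repository's API; attention, normalization, and any
# load-balancing loss are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Top-2 routed FFN: 8 routed experts plus 1 always-active shared expert."""

    def __init__(self, d_model=768, d_expert=384, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model)
        )

    def forward(self, x):                                 # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (B, S, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)     # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.shared(x)                              # shared expert sees every token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens whose k-th choice is expert e
                if mask.any():                            # dense compute here; real code would gather
                    out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

# The same block is reused across loop iterations, so the router can, in
# principle, pick different experts at different depths.
moe_ffn = MoEFFN()
h = torch.randn(2, 16, 768)
for _ in range(8):                                        # 8 recurrent loops, shared weights
    h = h + moe_ffn(h)                                    # residual update; attention omitted
```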
## Research Question
Does MoE help inside a recurrent loop?
The hypothesis: different loop iterations may activate different experts, effectively creating depth-conditional routing. If expert specialization emerges per-depth, MoE could unlock more expressive recurrent computation without proportionally increasing FLOPs.
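One way to test this hypothesis is to log which experts the router selects at each loop iteration. The sketch below is a hypothetical probe, not a script from the repository; it reuses the illustrative `MoEFFN` class from the architecture sketch above.

```python
# Hypothetical probe for depth-conditional routing. Reuses the illustrative
# MoEFFN class from the architecture sketch above; with untrained weights the
# histograms are essentially random.
import torch
from collections import Counter

moe_ffn = MoEFFN()
h = torch.randn(1, 32, 768)

for loop in range(8):
    gates = torch.softmax(moe_ffn.router(h), dim=-1)
    _, idx = gates.topk(2, dim=-1)                    # top-2 expert ids per token
    usage = Counter(idx.flatten().tolist())
    print(f"loop {loop}: expert usage {dict(sorted(usage.items()))}")
    h = h + moe_ffn(h)                                # shared-weight recurrent update
```

After training, expert-usage histograms that diverge across loop iterations would support the depth-conditional routing hypothesis; near-identical histograms would suggest the router ignores loop depth.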
## Training
Same recipe as the baseline; the only difference is `use_moe=True`.
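For reference, a hypothetical configuration might look like the following. Apart from `use_moe=True`, which is stated above, the field names and values merely restate the architecture table and are not the repository's actual config schema.

```python
# Hypothetical configuration; field names are assumptions, not the repo's schema.
config = dict(
    hidden_dim=768,
    n_heads=12,
    n_kv_heads=4,
    n_loops=8,
    n_recurrent_layers=4,
    use_moe=True,          # the only change relative to ultron-small-baseline
    n_experts=8,
    n_shared_experts=1,
    top_k=2,
    expert_dim=384,
)
```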
## Repository
trojan0x/ultron — Full source code, training scripts, and notebook