# Ultron-Small MoE (Experimental)
Recurrent-Depth Transformer with Mixture-of-Experts, pretrained on FineWeb-Edu
⚠️ Experimental: MoE inside a recurrent loop has never been validated in published research. This is an exploration to test whether sparse expert routing helps or hurts when weights are shared across loop iterations.
## Architecture
Same as ultron-small-baseline, but with MoE replacing the dense FFN in the recurrent block.
| Property | Value |
|---|---|
| Total Parameters | 105.5M |
| Non-embedding Parameters | 66.9M |
| Effective Depth | 36 layers (8 loops × 4 layers + 2 prelude + 2 coda) |
| Attention | GQA (12 heads, 4 KV heads) |
| Hidden Dimension | 768 |
| MoE | 8 experts, 1 shared, top-2 routing |
| Expert Dimension | 384 |
| Spectral Radius ρ(A) | < 1 by construction |
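The routed-plus-shared-expert layout above can be summarized in a few lines of PyTorch. The sketch below is illustrative only: class and argument names (`MoEFFN`, `d_expert`, and so on) are assumptions rather than the repository's API, and it omits attention, normalization, and any load-balancing loss.

```python
# Illustrative sketch of the MoE FFN described in the table above. Names are
# assumptions, not the repository's API; attention, normalization, and any
# load-balancing loss are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Top-2 routed FFN: 8 routed experts plus 1 always-active shared expert."""

    def __init__(self, d_model=768, d_expert=384, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model)
        )

    def forward(self, x):                                 # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (B, S, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)     # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.shared(x)                              # shared expert sees every token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens whose k-th choice is expert e
                if mask.any():                            # dense compute here; real code would gather
                    out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

# The same block is reused across loop iterations, so the router can, in
# principle, pick different experts at different depths.
moe_ffn = MoEFFN()
h = torch.randn(2, 16, 768)
for _ in range(8):                                        # 8 recurrent loops, shared weights
    h = h + moe_ffn(h)                                    # residual update; attention omitted
```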
## Research Question
Does MoE help inside a recurrent loop?
The hypothesis: different loop iterations may activate different experts, effectively creating depth-conditional routing. If expert specialization emerges per-depth, MoE could unlock more expressive recurrent computation without proportionally increasing FLOPs.
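One way to test this hypothesis is to log which experts the router selects at each loop iteration. The sketch below is a hypothetical probe, not a script from the repository; it reuses the illustrative `MoEFFN` class from the architecture sketch above.

```python
# Hypothetical probe for depth-conditional routing. Reuses the illustrative
# MoEFFN class from the architecture sketch above; with untrained weights the
# histograms are essentially random.
import torch
from collections import Counter

moe_ffn = MoEFFN()
h = torch.randn(1, 32, 768)

for loop in range(8):
    gates = torch.softmax(moe_ffn.router(h), dim=-1)
    _, idx = gates.topk(2, dim=-1)                    # top-2 expert ids per token
    usage = Counter(idx.flatten().tolist())
    print(f"loop {loop}: expert usage {dict(sorted(usage.items()))}")
    h = h + moe_ffn(h)                                # shared-weight recurrent update
```

After training, expert-usage histograms that diverge across loop iterations would support the depth-conditional routing hypothesis; near-identical histograms would suggest the router ignores loop depth.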
## Training
Same recipe as the baseline; the only difference is `use_moe=True`.
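For reference, a hypothetical configuration might look like the following. Apart from `use_moe=True`, which is stated above, the field names and values merely restate the architecture table and are not the repository's actual config schema.

```python
# Hypothetical configuration; field names are assumptions, not the repo's schema.
config = dict(
    hidden_dim=768,
    n_heads=12,
    n_kv_heads=4,
    n_loops=8,
    n_recurrent_layers=4,
    use_moe=True,          # the only change relative to ultron-small-baseline
    n_experts=8,
    n_shared_experts=1,
    top_k=2,
    expert_dim=384,
)
```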
## Repository
trojan0x/ultron — Full source code, training scripts, and notebook