# COMPLEXITY-DEEP Token-Routed MoE (187M)

## Model Details
- Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
- Parameters: 187M total
- Hidden size: 768
- Layers: 18
- Attention heads: 12 (GQA, 4 KV heads)
- Intermediate size: 2048 (512 per expert)
- Experts: 4 (deterministic Zipf-balanced routing)
- Shared expert: 512 intermediate
- Vocabulary: 32,000
- Max context: 2,048
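The card does not spell out the exact Zipf-balanced routing rule, so the sketch below is one plausible construction, not the model's actual scheme: assume token ids are ordered by frequency rank with Zipfian frequencies (freq ∝ 1/rank), then greedily assign each token to the currently lightest expert. The result is a fixed lookup table, which is what makes routing deterministic at inference time.

```python
import heapq

def zipf_balanced_assignment(vocab_size=32000, num_experts=4):
    """Hypothetical Zipf-balanced router (sketch; the card does not give
    the real rule). Walk token ids in frequency-rank order, assuming a
    Zipfian distribution (freq of rank r is proportional to 1/r), and
    always hand the next token to the expert with the lowest cumulative
    load. Returns a fixed id -> expert lookup table."""
    heap = [(0.0, e) for e in range(num_experts)]  # (load, expert id)
    table = [0] * vocab_size
    for rank in range(vocab_size):
        load, e = heapq.heappop(heap)       # lightest expert so far
        table[rank] = e
        heapq.heappush(heap, (load + 1.0 / (rank + 1), e))
    return table
```

Because the table is precomputed, runtime routing is a single array lookup per token, with no learned router and no data-dependent control flow.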
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 500M (954 steps)
- Optimizer: AdamW (lr=3e-4, auto-scaled to 6e-4)
- Scheduler: Cosine with 5% warmup
- Precision: BF16
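The learning-rate schedule above can be sketched as follows. Step count (954) and the auto-scaled peak rate (6e-4) come from the card; the linear-warmup shape and the rounding of the 5% warmup window are assumptions.

```python
import math

def lr_at(step, total_steps=954, base_lr=6e-4, warmup_frac=0.05):
    """Cosine decay with linear warmup over the first 5% of steps.
    base_lr is the auto-scaled peak from the card; the exact warmup
    implementation is an assumption."""
    warmup = max(1, int(total_steps * warmup_frac))  # 47 steps here
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear ramp
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these numbers the rate ramps to 6e-4 by step 47 and decays close to zero by step 953.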
## Results

### Loss (Ablation, 500M tokens)
| Configuration | Params | Avg Loss | vs Dense |
|---|---|---|---|
| Run 1: Dense (SwiGLU) | 171M | 4.905 | --- |
| Run 2: TR + Shared + Mu + Zipf | 187M | 4.793 | -0.112 |
| Run 3: TR + Shared + Zipf (no Mu) | 187M | 4.916 | +0.011 |
| Run 4: Mixtral (learned router) | 187M | 4.843 | -0.062 |
### Inference (vLLM 0.18, RTX PRO 6000 96GB)
- Sustained throughput: 8,078 tok/s
- Peak throughput: 10,179 tok/s
- Median TTFT: 29.3 ms
- Median ITL: 7.9 ms
- CUDA graphs: natively compatible (routing is deterministic)
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
- `benchmark_throughput.png` - vLLM inference benchmark figure
## Usage

```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Build the model from the shipped architecture config.
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load the checkpoint; strict=False tolerates buffers that are
# recomputed at init and not stored in the weights file.
state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026