# COMPLEXITY-DEEP Token-Routed MoE (187M)

## Model Details
- Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
- Parameters: 187M total
- Hidden size: 768
- Layers: 18
- Attention heads: 12 (GQA, 4 KV heads)
- Intermediate size: 2048 (512 per expert)
- Experts: 4 (deterministic Zipf-balanced routing)
- Shared expert: 512 intermediate
- Vocabulary: 32,000
- Max context: 2,048
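The card does not spell out the exact Zipf-balanced routing rule, so the sketch below is one plausible construction, not the model's actual scheme: assume token ids are ordered by frequency rank with Zipfian frequencies (freq ∝ 1/rank), then greedily assign each token to the currently lightest expert. The result is a fixed lookup table, which is what makes routing deterministic at inference time.

```python
import heapq

def zipf_balanced_assignment(vocab_size=32000, num_experts=4):
    """Hypothetical Zipf-balanced router (sketch; the card does not give
    the real rule). Walk token ids in frequency-rank order, assuming a
    Zipfian distribution (freq of rank r is proportional to 1/r), and
    always hand the next token to the expert with the lowest cumulative
    load. Returns a fixed id -> expert lookup table."""
    heap = [(0.0, e) for e in range(num_experts)]  # (load, expert id)
    table = [0] * vocab_size
    for rank in range(vocab_size):
        load, e = heapq.heappop(heap)       # lightest expert so far
        table[rank] = e
        heapq.heappush(heap, (load + 1.0 / (rank + 1), e))
    return table
```

Because the table is precomputed, runtime routing is a single array lookup per token, with no learned router and no data-dependent control flow.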
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 500M (954 steps)
- Optimizer: AdamW (lr=3e-4, auto-scaled to 6e-4)
- Scheduler: Cosine with 5% warmup
- Precision: BF16
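The learning-rate schedule above can be sketched as follows. Step count (954) and the auto-scaled peak rate (6e-4) come from the card; the linear-warmup shape and the rounding of the 5% warmup window are assumptions.

```python
import math

def lr_at(step, total_steps=954, base_lr=6e-4, warmup_frac=0.05):
    """Cosine decay with linear warmup over the first 5% of steps.
    base_lr is the auto-scaled peak from the card; the exact warmup
    implementation is an assumption."""
    warmup = max(1, int(total_steps * warmup_frac))  # 47 steps here
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear ramp
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these numbers the rate ramps to 6e-4 by step 47 and decays close to zero by step 953.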
## Results

### Loss (Ablation, 500M tokens)
| Configuration | Params | Avg Loss | vs Dense |
|---|---|---|---|
| Run 1: Dense (SwiGLU) | 171M | 4.905 | --- |
| Run 2: TR + Shared + Mu + Zipf | 187M | 4.793 | -0.112 |
| Run 3: TR + Shared + Zipf (no Mu) | 187M | 4.916 | +0.011 |
| Run 4: Mixtral (learned router) | 187M | 4.843 | -0.062 |
### Inference (vLLM 0.18, RTX PRO 6000 96GB)
- Sustained throughput: 8,078 tok/s
- Peak throughput: 10,179 tok/s
- Median TTFT: 29.3 ms
- Median ITL: 7.9 ms
- CUDA graphs: natively compatible (routing is deterministic)
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
- `benchmark_throughput.png` - vLLM inference benchmark figure
## Usage

```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Build the model from the shipped architecture config.
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load the checkpoint; strict=False tolerates buffers that are
# recomputed at init and not stored in the weights file.
state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026