COMPLEXITY-DEEP Token-Routed MoE (187M)

Model Details

  • Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
  • Parameters: 187M total
  • Hidden size: 768
  • Layers: 18
  • Attention heads: 12 (GQA, 4 KV heads)
  • Intermediate size: 2048 (512 per expert)
  • Experts: 4 (deterministic Zipf-balanced routing)
  • Shared expert: 512 intermediate
  • Vocabulary: 32,000
  • Max context: 2,048
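
The deterministic Zipf-balanced routing above can be sketched as follows. Assuming token IDs are roughly frequency-ranked and frequencies follow Zipf's law (f(rank) ∝ 1/rank), a fixed token-ID → expert table can be built offline so that each of the 4 experts receives a similar expected token mass, with no learned router. The greedy assignment and the function name below are illustrative assumptions, not the released implementation:

```python
def zipf_balanced_routing(vocab_size: int, num_experts: int) -> list[int]:
    """Build a fixed token-ID -> expert lookup table.

    Assumes token IDs are frequency-ranked (rank 1 = most frequent) and
    frequencies follow Zipf's law, f(rank) ~ 1/rank. Greedily assigns each
    token to the currently lightest-loaded expert, balancing expected token
    mass. Deterministic, so routing needs no learned gate at inference.
    """
    loads = [0.0] * num_experts
    table = []
    for rank in range(1, vocab_size + 1):
        e = min(range(num_experts), key=loads.__getitem__)  # lightest expert
        table.append(e)
        loads[e] += 1.0 / rank  # Zipf-weighted expected load
    return table

# Table for this model's sizes: 32k vocabulary, 4 experts.
table = zipf_balanced_routing(32_000, 4)
```

Because the token → expert mapping is a static table, routing adds no data-dependent control flow, which is what makes the model natively compatible with CUDA graphs (see Inference below).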

Training

  • Dataset: FineWeb-Edu (streaming)
  • Tokens: 500M (954 steps)
  • Optimizer: AdamW (lr=3e-4, auto-scaled to 6e-4)
  • Scheduler: Cosine with 5% warmup
  • Precision: BF16
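
The schedule above (linear warmup for the first 5% of the 954 steps, then cosine decay) can be written out explicitly. The function below is an illustrative sketch of that schedule, not the training code; the peak learning rate is taken as the auto-scaled 6e-4:

```python
import math

def lr_at(step: int, total_steps: int = 954, base_lr: float = 6e-4,
          warmup_frac: float = 0.05) -> float:
    """Cosine learning-rate schedule with linear warmup.

    Warms up linearly over the first `warmup_frac` of training, then
    decays with a half-cosine from `base_lr` toward zero.
    """
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```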

Results

Loss (Ablation, 500M tokens)

  Configuration                        Params   Avg Loss   vs Dense
  Run 1: Dense (SwiGLU)                171M     4.905      ---
  Run 2: TR + Shared + Mu + Zipf       187M     4.793      -0.112
  Run 3: TR + Shared + Zipf (no Mu)    187M     4.916      +0.011
  Run 4: Mixtral (learned router)      187M     4.843      -0.062

Inference (vLLM 0.18, RTX PRO 6000 96GB)

  • Sustained throughput: 8,078 tok/s
  • Peak throughput: 10,179 tok/s
  • Median TTFT: 29.3 ms
  • Median ITL: 7.9 ms
  • CUDA graph: natively compatible (deterministic routing)

Inference Benchmark

(Throughput figure: see benchmark_throughput.png in the repository files.)

Files

  • model.safetensors - Model weights
  • model_config.yaml - Architecture configuration
  • config.json - HuggingFace-compatible config
  • benchmark_throughput.png - vLLM inference benchmark figure

Usage

# Load the released checkpoint and prepare the model for inference.
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

config = ModelConfig.load("model_config.yaml")        # architecture definition
model = ComplexityModel(config)
state = load_file("model.safetensors", device="cpu")  # load weights on CPU first
model.load_state_dict(state, strict=False)            # strict=False: checkpoint may omit derived buffers
model.eval().cuda()                                   # inference mode on GPU

Paper

Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6

License

CC-BY-NC-4.0

Complexity-ML -- 2026
