# COMPLEXITY-DEEP Token-Routed MoE (383.5M)

## Model Details
- Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
- Parameters: 383.5M total, ~105M active per token
- Hidden size: 1024
- Layers: 20
- Attention heads: 16 (GQA, 4 KV heads)
- Intermediate size: 3200 (800 per expert)
- Experts: 4 (deterministic Zipf-balanced routing)
- Shared expert: 800 intermediate
- Vocabulary: 32,000
- Max context: 4,096
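Routing here is deterministic: each token's expert is a fixed function of its vocabulary id rather than a learned gate. As an illustration of how a Zipf-balanced static table could be built (the `zipf_balanced_assignment` helper and its greedy least-loaded rule are assumptions for this sketch, not the released implementation):

```python
def zipf_balanced_assignment(vocab_size: int, num_experts: int) -> list[int]:
    """Map each token id to one expert so that expected expert loads are
    balanced under a Zipf frequency model (freq ~ 1/rank).

    Assumes token ids are ordered by frequency rank (id 0 = most frequent),
    as is common for BPE vocabularies. Illustrative sketch only.
    """
    loads = [0.0] * num_experts
    assignment = [0] * vocab_size
    for rank in range(vocab_size):
        freq = 1.0 / (rank + 1)  # Zipf's law: frequency proportional to 1/rank
        expert = min(range(num_experts), key=loads.__getitem__)
        assignment[rank] = expert  # greedy: send token to least-loaded expert
        loads[expert] += freq
    return assignment

# Routing a token at runtime is then a single table lookup:
table = zipf_balanced_assignment(32_000, 4)
```

Because the table is static, routing introduces no data-dependent control flow at inference time, which is what makes the model natively compatible with CUDA graphs.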
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 8B (15,259 steps)
- Batch size: 128 per GPU x 2 GPUs = 256 effective
- Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
- Scheduler: Cosine with 5% warmup (762 steps)
- Precision: BF16
- Hardware: 2x NVIDIA RTX PRO 6000 (96GB each)
- Training time: ~30 hours
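The schedule is easy to reproduce: 5% of 15,259 steps gives the 762 warmup steps quoted above. A minimal sketch, assuming linear warmup and a cosine decay to zero:

```python
import math

TOTAL_STEPS = 15_259
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% warmup -> 762 steps
PEAK_LR = 4.2e-4                        # auto-scaled from the 2.1e-4 base

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The exact warmup shape and the decay floor (zero here) are assumptions; only the step counts and peak learning rate come from the card.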
## Results

### Loss
- Final loss: ~2.96
- vs Dense baseline (384.5M): +0.09 gap (stable from step 5K)
- Gap trend: 0.28 (step 1K) -> 0.09 (step 5K+)
### Zero-Shot Benchmarks
| Benchmark | MoE (383.5M) | Dense (384.5M) |
|---|---|---|
| ARC-Easy | 43.6% | 45.9% |
| HellaSwag | 28.7% | 30.1% |
| MMLU | 23.0% | 23.1% |
### Inference (vLLM 0.18, RTX PRO 6000 96GB)
- Sustained throughput: 4,900 tok/s
- Peak throughput: 5,700 tok/s
- Median TTFT: 39.6 ms
- Median ITL: 16.0 ms
- CUDA graph: natively compatible (deterministic routing)
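TTFT (time to first token) and ITL (inter-token latency) are standard serving metrics. Given per-token arrival timestamps for a request, they can be computed as below (a generic sketch, not the benchmark harness used for these numbers):

```python
import statistics

def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) for one request, in the timestamps' units.

    TTFT = delay from request submission to the first generated token.
    ITL  = gap between consecutive generated tokens.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0
    return ttft, itl
```

Across many requests, the card reports the median of each per-request metric.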
## Generation Example
No supervised fine-tuning. Raw base model output:
Prompt: "The meaning of life is"
Output: "very much the same. The same thing happens to all living things. They live in a constant state of flux. The single cell of a living cell, in this case a cell nucleus, constantly changes to become an organism, and that organism is the organism. The human body is a system of interconnected cells. Each cell is made up of a set of parts, which are connected by a network of specialized cells. The set of all the parts is composed of different cells. These cells themselves are"
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
## Usage
```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Build the model from the architecture config shipped with the weights
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load weights on CPU first, then move the model to GPU for inference
state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026