COMPLEXITY-DEEP Token-Routed MoE (383.5M)

Model Details

  • Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
  • Parameters: 383.5M total, ~105M active per token
  • Hidden size: 1024
  • Layers: 20
  • Attention heads: 16 (GQA, 4 KV heads)
  • Intermediate size: 3200 (800 per expert)
  • Experts: 4 (deterministic Zipf-balanced routing)
  • Shared expert: 800 intermediate
  • Vocabulary: 32,000
  • Max context: 4,096
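
The card does not spell out how the deterministic Zipf-balanced routing assigns tokens to experts. One plausible reading is that each token ID is mapped to a fixed expert ahead of time, chosen so that every expert receives a similar share of total token frequency. The sketch below illustrates that idea with a greedy assignment over a synthetic 1/rank frequency table. The function name, the greedy strategy, and the synthetic frequencies are all assumptions made for illustration, not the released implementation.

```python
import heapq

NUM_EXPERTS = 4        # from the card
VOCAB_SIZE = 32_000    # from the card


def zipf_balanced_assignment(token_freqs, num_experts=NUM_EXPERTS):
    """Hypothetical helper: map every token ID to one fixed expert.

    Tokens are visited from most to least frequent, and each one goes to
    the expert with the smallest accumulated frequency so far.
    """
    order = sorted(range(len(token_freqs)), key=lambda t: -token_freqs[t])
    heap = [(0.0, e) for e in range(num_experts)]  # (accumulated load, expert)
    heapq.heapify(heap)
    expert_of = [0] * len(token_freqs)
    for tok in order:
        load, expert = heapq.heappop(heap)
        expert_of[tok] = expert
        heapq.heappush(heap, (load + token_freqs[tok], expert))
    return expert_of


# Synthetic Zipf-like frequencies (assumption): token ID r has weight 1/(r+1).
freqs = [1.0 / (rank + 1) for rank in range(VOCAB_SIZE)]
expert_of = zipf_balanced_assignment(freqs)

# Frequency mass that each expert would see under this mapping.
loads = [0.0] * NUM_EXPERTS
for tok, expert in enumerate(expert_of):
    loads[expert] += freqs[tok]

print([round(load, 3) for load in loads])
```

In a scheme like this the mapping is a fixed lookup table, so routing involves no data-dependent control flow at inference time. That would be consistent with the CUDA graph compatibility reported under Inference, though the card does not confirm the mechanism.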

Training

  • Dataset: FineWeb-Edu (streaming)
  • Tokens: 8B (15,259 steps)
  • Batch size: 128 per GPU x 2 GPUs = 256 effective
  • Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
  • Scheduler: Cosine with 5% warmup (762 steps)
  • Precision: BF16
  • Hardware: 2x NVIDIA RTX PRO 6000 (96GB each)
  • Training time: ~30 hours
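
The schedule figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming linear warmup, cosine decay to zero (the card gives no minimum learning rate), and a peak equal to the auto-scaled value of 4.2e-4. The function name `lr_at` and the `min_lr` parameter are hypothetical. The last lines back out the sequence length implied by the token and step counts; the card does not state the training sequence length, so treat that number as an inference from rounded figures.

```python
import math

TOTAL_STEPS = 15_259     # from the card
WARMUP_STEPS = 762       # from the card (5% of total, rounded down)
PEAK_LR = 4.2e-4         # auto-scaled learning rate from the card
EFFECTIVE_BATCH = 256    # 128 per GPU x 2 GPUs
TOTAL_TOKENS = 8e9       # "8B" on the card, presumably rounded


def lr_at(step, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


# Tokens consumed per optimizer step, and the sequence length that implies.
tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS
implied_seq_len = tokens_per_step / EFFECTIVE_BATCH

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
print(round(tokens_per_step), round(implied_seq_len))
```

The 2x ratio between the base and auto-scaled learning rates matches the GPU count, which suggests linear scaling with the number of devices; the card does not say so explicitly.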

Results

Loss

  • Final loss: ~2.96
  • Gap vs. dense baseline (384.5M): +0.09 (stable from step 5K onward)
  • Gap trend: 0.28 (step 1K) -> 0.09 (step 5K+)
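
To make the loss gap easier to interpret, it can be converted to perplexity. The sketch below assumes the reported loss is mean cross-entropy in nats per token and that the +0.09 gap means the MoE loss sits above the dense loss; neither is stated on the card, and the dense loss here is derived rather than reported.

```python
import math

moe_loss = 2.96                  # final loss from the card (approximate)
gap = 0.09                       # reported gap to the dense baseline
dense_loss = moe_loss - gap      # derived under the assumption above

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
moe_ppl = math.exp(moe_loss)
dense_ppl = math.exp(dense_loss)
ppl_ratio = math.exp(gap)        # the ratio depends only on the gap

print(f"MoE perplexity   ~ {moe_ppl:.1f}")
print(f"Dense perplexity ~ {dense_ppl:.1f}")
print(f"Ratio            ~ {ppl_ratio:.3f}")
```

Under these assumptions a 0.09 gap corresponds to roughly 9% higher perplexity for the MoE model.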

Zero-Shot Benchmarks

Benchmark    MoE (383.5M)    Dense (384.5M)
ARC-Easy     43.6%           45.9%
HellaSwag    28.7%           30.1%
MMLU         23.0%           23.1%

Inference (vLLM 0.18, RTX PRO 6000 96GB)

  • Sustained throughput: 4,900 tok/s
  • Peak throughput: 5,700 tok/s
  • Median TTFT: 39.6 ms
  • Median ITL: 16.0 ms
  • CUDA graph: natively compatible (deterministic routing)
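
The latency figures above translate into per-request numbers with simple arithmetic. This is a rough sketch, assuming the median TTFT and ITL describe a single typical request and that ITL stays constant across the response. The helper name `request_latency_ms` is hypothetical, and the concurrency estimate is only meaningful if the throughput and latency figures were measured at the same load, which the card does not state.

```python
TTFT_MS = 39.6            # median time to first token, from the card
ITL_MS = 16.0             # median inter-token latency, from the card
SUSTAINED_TOK_S = 4_900   # aggregate sustained throughput, from the card


def request_latency_ms(num_output_tokens):
    """Estimated end-to-end latency for one request of the given length."""
    if num_output_tokens < 1:
        raise ValueError("need at least one output token")
    # The first token costs TTFT; each later token costs one ITL.
    return TTFT_MS + (num_output_tokens - 1) * ITL_MS


per_stream_tok_s = 1000.0 / ITL_MS                       # decode rate of one request
implied_concurrency = SUSTAINED_TOK_S / per_stream_tok_s

print(f"256-token response: {request_latency_ms(256):.1f} ms")
print(f"Per-request decode rate: {per_stream_tok_s:.1f} tok/s")
print(f"Implied concurrent requests: {implied_concurrency:.1f}")
```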

Generation Example

No supervised fine-tuning. Raw base model output:

Prompt: "The meaning of life is"

Output: "very much the same. The same thing happens to all living things. They live in a constant state of flux. The single cell of a living cell, in this case a cell nucleus, constantly changes to become an organism, and that organism is the organism. The human body is a system of interconnected cells. Each cell is made up of a set of parts, which are connected by a network of specialized cells. The set of all the parts is composed of different cells. These cells themselves are"

Files

  • model.safetensors - Model weights
  • model_config.yaml - Architecture configuration
  • config.json - HuggingFace-compatible config

Usage

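from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Build the model from the architecture config shipped with the weights.
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load the weights on CPU first; the model is moved to the GPU afterwards.
state = load_file("model.safetensors", device="cpu")

# strict=False silently skips mismatched keys, so report what was skipped.
# (Assumes ComplexityModel is a torch.nn.Module, as the calls here imply.)
result = model.load_state_dict(state, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
model.eval().cuda()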

Paper

Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6

License

CC-BY-NC-4.0

Complexity-ML -- 2026
