# COMPLEXITY-DEEP Token-Routed MoE (383.5M)

## Model Details
- Architecture: Token-Routed MLP + Mu-Guidance + Shared Lexical Expert
- Parameters: 383.5M total, ~105M active per token
- Hidden size: 1024
- Layers: 20
- Attention heads: 16 (GQA, 4 KV heads)
- Intermediate size: 3200 (800 per expert)
- Experts: 4 (deterministic Zipf-balanced routing)
- Shared expert: 800 intermediate
- Vocabulary: 32,000
- Max context: 4,096
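Routing here is deterministic: each token's expert is a fixed function of its vocabulary id rather than a learned gate. As an illustration of how a Zipf-balanced static table could be built (the `zipf_balanced_assignment` helper and its greedy least-loaded rule are assumptions for this sketch, not the released implementation):

```python
def zipf_balanced_assignment(vocab_size: int, num_experts: int) -> list[int]:
    """Map each token id to one expert so that expected expert loads are
    balanced under a Zipf frequency model (freq ~ 1/rank).

    Assumes token ids are ordered by frequency rank (id 0 = most frequent),
    as is common for BPE vocabularies. Illustrative sketch only.
    """
    loads = [0.0] * num_experts
    assignment = [0] * vocab_size
    for rank in range(vocab_size):
        freq = 1.0 / (rank + 1)  # Zipf's law: frequency proportional to 1/rank
        expert = min(range(num_experts), key=loads.__getitem__)
        assignment[rank] = expert  # greedy: send token to least-loaded expert
        loads[expert] += freq
    return assignment

# Routing a token at runtime is then a single table lookup:
table = zipf_balanced_assignment(32_000, 4)
```

Because the table is static, routing introduces no data-dependent control flow at inference time, which is what makes the model natively compatible with CUDA graphs.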
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 8B (15,259 steps)
- Batch size: 128 per GPU x 2 GPUs = 256 effective
- Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
- Scheduler: Cosine with 5% warmup (762 steps)
- Precision: BF16
- Hardware: 2x NVIDIA RTX PRO 6000 (96GB each)
- Training time: ~30 hours
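The schedule is easy to reproduce: 5% of 15,259 steps gives the 762 warmup steps quoted above. A minimal sketch, assuming linear warmup and a cosine decay to zero:

```python
import math

TOTAL_STEPS = 15_259
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% warmup -> 762 steps
PEAK_LR = 4.2e-4                        # auto-scaled from the 2.1e-4 base

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The exact warmup shape and the decay floor (zero here) are assumptions; only the step counts and peak learning rate come from the card.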
## Results

### Loss
- Final loss: ~2.96
- vs Dense baseline (384.5M): +0.09 gap (stable from step 5K)
- Gap trend: 0.28 (step 1K) -> 0.09 (step 5K+)
### Zero-Shot Benchmarks
| Benchmark | MoE (383.5M) | Dense (384.5M) |
|---|---|---|
| ARC-Easy | 43.6% | 45.9% |
| HellaSwag | 28.7% | 30.1% |
| MMLU | 23.0% | 23.1% |
### Inference (vLLM 0.18, RTX PRO 6000 96GB)
- Sustained throughput: 4,900 tok/s
- Peak throughput: 5,700 tok/s
- Median TTFT: 39.6 ms
- Median ITL: 16.0 ms
- CUDA graph: natively compatible (deterministic routing)
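TTFT (time to first token) and ITL (inter-token latency) are standard serving metrics. Given per-token arrival timestamps for a request, they can be computed as below (a generic sketch, not the benchmark harness used for these numbers):

```python
import statistics

def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) for one request, in the timestamps' units.

    TTFT = delay from request submission to the first generated token.
    ITL  = gap between consecutive generated tokens.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0
    return ttft, itl
```

Across many requests, the card reports the median of each per-request metric.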
## Generation Example
No supervised fine-tuning. Raw base model output:
Prompt: "The meaning of life is"
Output: "very much the same. The same thing happens to all living things. They live in a constant state of flux. The single cell of a living cell, in this case a cell nucleus, constantly changes to become an organism, and that organism is the organism. The human body is a system of interconnected cells. Each cell is made up of a set of parts, which are connected by a network of specialized cells. The set of all the parts is composed of different cells. These cells themselves are"
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
## Usage
```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Build the model from the architecture config shipped with the weights
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load weights on CPU first, then move the model to GPU for inference
state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026