COMPLEXITY-DEEP Token-Routed MoE (383.5M) – Training Checkpoint (Step 15,259)
Resumable training checkpoint with full optimizer state, saved at the end of the 8B-token training run.
Note: This model was trained with a Chinchilla-like token budget (8B tokens for 383.5M parameters, ~21 tokens/param). The model may benefit from continued training beyond this point.
Contents
- `checkpoint.pt` – Model weights + training state
- `model.safetensors` – Model weights (safetensors format)
- `optimizer_rank0.pt` – AdamW optimizer state (GPU 0)
- `optimizer_rank1.pt` – AdamW optimizer state (GPU 1)
- `training_state.json` – Step counter, LR, etc.
Model Config
- Parameters: 383.5M total, ~105M active per token
- Hidden: 1024, Layers: 20, Heads: 16, KV Heads: 4
- Experts: 4, Intermediate: 3200 (800/expert), Shared: 800
- Training: 8B tokens (15,259 steps), AdamW lr=2.1e-4, cosine LR schedule with 5% warmup
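The "~21 tokens/param" figure above follows directly from the config numbers; a quick sanity check in plain Python (no external dependencies):

```python
# Sanity-check the Chinchilla-style token budget quoted above.
total_params = 383.5e6   # total parameters
train_tokens = 8e9       # training tokens

tokens_per_param = train_tokens / total_params
print(round(tokens_per_param, 1))  # ~20.9, i.e. the "~21 tokens/param" above
```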
Resume Training
```python
import torch
import torch.distributed

# `model` and `optimizer` must already be constructed to match the
# architecture and hyperparameters in the Model Config above.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Load the optimizer shard for this GPU rank (0 or 1)
rank = torch.distributed.get_rank()
optimizer_state = torch.load(f"optimizer_rank{rank}.pt", map_location="cpu")
optimizer.load_state_dict(optimizer_state)

# Resume from step 15,259 (see training_state.json)
```
Pretrained Weights (inference)
For inference, use the safetensors checkpoint in ../final/ instead.
License
CC-BY-NC-4.0
Complexity-ML -- 2026