# Dense SwiGLU Baseline (384.5M)

## Model Details
- Architecture: Standard dense transformer (Llama-style SwiGLU)
- Parameters: 384.5M (all active per token)
- Hidden size: 1024
- Layers: 20
- Attention heads: 16 (GQA, 4 KV heads)
- Intermediate size: 4,358
- Vocabulary: 32,000
- Max context: 4,096
- Mu-Guidance: No
- MoE: No (single dense MLP)
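The architecture above can be sanity-checked with a short sketch: a plain SwiGLU MLP forward pass plus a rough parameter count from the listed sizes. This is an illustration only; the actual `ComplexityModel` internals, weight tying, and norm placement may differ, so the tally is approximate.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: down( silu(gate(x)) * up(x) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Sizes from the model card
hidden, inter, layers, vocab = 1024, 4358, 20, 32000
heads, kv_heads = 16, 4
head_dim = hidden // heads           # 64
kv_dim = kv_heads * head_dim         # 256 (GQA: K/V are narrower than Q)

mlp = 3 * hidden * inter                          # gate + up + down projections
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q,o full-width; k,v GQA-width
total = layers * (mlp + attn) + 2 * vocab * hidden  # + embeddings, assuming an untied lm_head
print(f"{total / 1e6:.1f}M")  # ~385.7M, close to the reported 384.5M
```

The small gap to 384.5M is expected: norm parameters and the exact embedding/head tying are assumptions here.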
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 8B (15,259 steps)
- Batch size: 128 per GPU x 2 GPUs = 256 effective
- Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
- Scheduler: Cosine with 5% warmup (762 steps)
- Precision: BF16
- Hardware: 2x NVIDIA RTX PRO 6000 WS (96GB each)
- Training time: ~30 hours
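The scheduler settings above are internally consistent: 5% of 15,259 steps is 762 warmup steps. A minimal sketch of the schedule, assuming standard cosine decay with linear warmup and a decay floor of 0 (the floor is an assumption, not stated in the card):

```python
import math

TOTAL_STEPS = 15_259
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 762, matching the card
PEAK_LR = 4.2e-4                        # auto-scaled peak from the card

def lr_at(step, total=TOTAL_STEPS, warmup=WARMUP_STEPS,
          peak=PEAK_LR, floor=0.0):
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate
        return peak * step / warmup
    # Cosine decay from peak down to floor over the remaining steps
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```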
## Results

### Loss

- Final loss: ~2.87

### Zero-Shot Benchmarks
| Benchmark | Dense (384.5M) | MoE (383.5M) |
|---|---|---|
| ARC-Easy | 45.9% | 43.6% |
| HellaSwag | 30.1% | 28.7% |
| MMLU | 23.1% | 23.0% |
## Purpose

This model is the iso-parameter dense baseline for the COMPLEXITY-DEEP Token-Routed MoE comparison: same hidden size, layer count, attention configuration, training data, optimizer, and hyperparameters. The only difference is a single dense SwiGLU MLP in place of the 4 routed experts plus a shared expert.
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
## Usage

```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026