# Dense SwiGLU Baseline (384.5M)

## Model Details
- Architecture: Standard dense transformer (Llama-style SwiGLU)
- Parameters: 384.5M (all active per token)
- Hidden size: 1024
- Layers: 20
- Attention heads: 16 (GQA, 4 KV heads)
- Intermediate size: 4,358
- Vocabulary: 32,000
- Max context: 4,096
- Mu-Guidance: No
- MoE: No (single dense MLP)
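The architecture above can be sanity-checked with a short sketch: a plain SwiGLU MLP forward pass plus a rough parameter count from the listed sizes. This is an illustration only; the actual `ComplexityModel` internals, weight tying, and norm placement may differ, so the tally is approximate.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: down( silu(gate(x)) * up(x) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Sizes from the model card
hidden, inter, layers, vocab = 1024, 4358, 20, 32000
heads, kv_heads = 16, 4
head_dim = hidden // heads           # 64
kv_dim = kv_heads * head_dim         # 256 (GQA: K/V are narrower than Q)

mlp = 3 * hidden * inter                          # gate + up + down projections
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q,o full-width; k,v GQA-width
total = layers * (mlp + attn) + 2 * vocab * hidden  # + embeddings, assuming an untied lm_head
print(f"{total / 1e6:.1f}M")  # ~385.7M, close to the reported 384.5M
```

The small gap to 384.5M is expected: norm parameters and the exact embedding/head tying are assumptions here.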
## Training
- Dataset: FineWeb-Edu (streaming)
- Tokens: 8B (15,259 steps)
- Batch size: 128 per GPU x 2 GPUs = 256 effective
- Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
- Scheduler: Cosine with 5% warmup (762 steps)
- Precision: BF16
- Hardware: 2x NVIDIA RTX PRO 6000 WS (96GB each)
- Training time: ~30 hours
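The scheduler settings above are internally consistent: 5% of 15,259 steps is 762 warmup steps. A minimal sketch of the schedule, assuming standard cosine decay with linear warmup and a decay floor of 0 (the floor is an assumption, not stated in the card):

```python
import math

TOTAL_STEPS = 15_259
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 762, matching the card
PEAK_LR = 4.2e-4                        # auto-scaled peak from the card

def lr_at(step, total=TOTAL_STEPS, warmup=WARMUP_STEPS,
          peak=PEAK_LR, floor=0.0):
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate
        return peak * step / warmup
    # Cosine decay from peak down to floor over the remaining steps
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```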
## Results

### Loss

- Final loss: ~2.87

### Zero-Shot Benchmarks
| Benchmark | Dense (384.5M) | MoE (383.5M) |
|---|---|---|
| ARC-Easy | 45.9% | 43.6% |
| HellaSwag | 30.1% | 28.7% |
| MMLU | 23.1% | 23.0% |
## Purpose

This model is the iso-parameter dense baseline for the COMPLEXITY-DEEP Token-Routed MoE comparison: same hidden size, layer count, attention configuration, training data, optimizer, and hyperparameters. The only difference is a single dense SwiGLU MLP in place of the 4 routed experts plus a shared expert.
## Files

- `model.safetensors` - Model weights
- `model_config.yaml` - Architecture configuration
- `config.json` - HuggingFace-compatible config
## Usage

```python
from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()
```
## Paper
Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6
## License
CC-BY-NC-4.0
Complexity-ML -- 2026