Dense SwiGLU Baseline (384.5M)

Model Details

  • Architecture: Standard dense transformer (Llama-style SwiGLU)
  • Parameters: 384.5M (all active per token)
  • Hidden size: 1024
  • Layers: 20
  • Attention heads: 16 (GQA, 4 KV heads)
  • Intermediate size: 4,358
  • Vocabulary: 32,000
  • Max context: 4,096
  • Mu-Guidance: No
  • MoE: No (single dense MLP)
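As a rough sanity check, the parameter count can be reconstructed from the sizes above. This is a sketch under stated assumptions: untied input/output embeddings, head_dim = 1024/16 = 64, and the small RMSNorm terms ignored; none of these details are stated in the card.

```python
hidden, layers, heads, kv_heads = 1024, 20, 16, 4
inter, vocab = 4358, 32000
head_dim = hidden // heads                  # 64

# Attention: Q and O projections are full-rank; K and V are shrunk by GQA (4 KV heads)
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU MLP: gate, up, and down projections
mlp = 3 * hidden * inter
# Token embeddings + LM head (untied is an assumption, not stated in the card)
embed = 2 * vocab * hidden

total = layers * (attn + mlp) + embed
print(f"{total / 1e6:.1f}M")                # ≈ 385.7M, close to the reported 384.5M
```

The small gap to the reported 384.5M is consistent with differing assumptions about embedding tying or omitted norm parameters.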

Training

  • Dataset: FineWeb-Edu (streaming)
  • Tokens: 8B (15,259 steps)
  • Batch size: 128 per GPU x 2 GPUs = 256 effective
  • Optimizer: AdamW (lr=2.1e-4, auto-scaled to 4.2e-4)
  • Scheduler: Cosine with 5% warmup (762 steps)
  • Precision: BF16
  • Hardware: 2x NVIDIA RTX PRO 6000 WS (96GB each)
  • Training time: ~30 hours
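The schedule above can be sketched as a linear warmup into a cosine decay. This is a minimal sketch, not the training code; the 2,048-token sequence length is inferred from 8B tokens / 15,259 steps / 256 sequences and is an assumption, as is the decay-to-zero floor.

```python
import math

TOTAL_STEPS = 15_259    # 8B tokens / (256 seqs x 2048 tokens/seq, assumed)
WARMUP_STEPS = 762      # ~5% of total steps, as stated above
PEAK_LR = 4.2e-4        # the auto-scaled learning rate

def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to min_lr."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(WARMUP_STEPS))   # peak: 4.2e-4
print(lr_at(TOTAL_STEPS))    # fully decayed: 0.0
```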

Results

Loss

  • Final loss: ~2.87

Zero-Shot Benchmarks

Benchmark    Dense (384.5M)    MoE (383.5M)
ARC-Easy     45.9%             43.6%
HellaSwag    30.1%             28.7%
MMLU         23.1%             23.0%

Purpose

This model is the iso-parameter dense baseline for the COMPLEXITY-DEEP Token-Routed MoE comparison. It shares the same hidden size, layer count, attention configuration, training data, optimizer, and hyperparameters; the only difference is the MLP, where a single dense SwiGLU block replaces the 4 routed experts plus shared expert.

Files

  • model.safetensors - Model weights
  • model_config.yaml - Architecture configuration
  • config.json - HuggingFace-compatible config

Usage

from complexity.config import ModelConfig
from complexity.models import ComplexityModel
from safetensors.torch import load_file

# Instantiate the architecture from the shipped config
config = ModelConfig.load("model_config.yaml")
model = ComplexityModel(config)

# Load the checkpoint (strict=False tolerates keys absent from the weights file)
state = load_file("model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model.eval().cuda()

Paper

Under review at TMLR: https://openreview.net/forum?id=jZq6EVboC6

License

CC-BY-NC-4.0

Complexity-ML -- 2026
