
# cayley-10b-k8-2l-mlp_in

A 205M-parameter GPT with a 2-level CayleySAE at `mlp_in`. Trained on FineWeb-Edu-10B for 16k iterations (~10.5B tokens), reaching a validation cross-entropy loss of 3.228.
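The ~10.5B token figure follows directly from the training config: an effective batch of 640 sequences of 1024 tokens each, for 16,000 iterations. A quick arithmetic check:

```python
# Sanity-check the reported token count from the training config:
# effective batch = per-GPU batch x GPUs x gradient accumulation steps.
effective_batch = 40 * 2 * 8   # = 640 sequences per optimizer step
seq_len = 1024
max_iters = 16_000

total_tokens = effective_batch * seq_len * max_iters
print(f"{total_tokens / 1e9:.2f}B tokens")  # 10.49B
```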

## CayleySAE config

| Level | n  | k  | delta    | Features |
|-------|----|----|----------|----------|
| L0    | 10 | 8  | 0 (root) | 1,024    |
| L1    | 13 | 16 | 64       | 8,192    |
- Total features per layer: 9,216
- Active per token: 24
- `per_parent_budget=True`, `score_standardize=True`
- Location: `mlp_in`
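The numbers in the table are mutually consistent: each level holds 2^n features, and the 24 active features per token split as k=8 root plus k=16 leaf selections. Below is a minimal sketch of how a two-level top-k selection with a per-parent budget could work; the names and mechanics here are illustrative assumptions, not the actual CayleySAE implementation.

```python
# Hypothetical two-level top-k feature selection with a per-parent budget
# (per_parent_budget=True). This is a sketch, not the CayleySAE code.
import numpy as np

rng = np.random.default_rng(0)

n_root, k_root = 2**10, 8    # L0: 1,024 features, top-8 active
n_leaf, k_leaf = 2**13, 16   # L1: 8,192 features, top-16 active
children_per_parent = n_leaf // n_root  # 8 leaf features per root feature

# Stand-in for the (standardized) feature scores at each level.
scores_root = rng.standard_normal(n_root)
scores_leaf = rng.standard_normal(n_leaf)

# Level 0: pick the top-8 root features globally.
active_root = np.argsort(scores_root)[-k_root:]

# Level 1: per-parent budget -- each active root keeps its best
# k_leaf // k_root = 2 children, giving 16 active leaf features total.
budget = k_leaf // k_root
active_leaf = []
for p in active_root:
    child_idx = np.arange(p * children_per_parent, (p + 1) * children_per_parent)
    best = child_idx[np.argsort(scores_leaf[child_idx])[-budget:]]
    active_leaf.extend(best.tolist())

print(len(active_root) + len(active_leaf))  # 24 active features per token
```

With `per_parent_budget=True`, leaf capacity is spread evenly across active parents rather than awarded to the globally highest-scoring leaves.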

## Training config

- optimizer: Muon (lr = 0.006) and AdamW (lr = 0.006)
- lr_schedule: `linear_warmdown`, warmdown_frac = 0.2
- batch_size: 40 per GPU × 2 GPUs, gradient_accumulation_steps: 8 (effective batch: 640)
- seq_len: 1024, max_iters: 16,000
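A `linear_warmdown` schedule with `warmdown_frac=0.2` typically means the learning rate stays constant and then decays linearly to zero over the final 20% of training (here, steps 12,800–16,000). A sketch under that assumed semantics:

```python
def linear_warmdown_lr(step, max_iters=16_000, base_lr=0.006, warmdown_frac=0.2):
    """Constant LR, then linear decay to zero over the final warmdown_frac
    of training. Assumed semantics of linear_warmdown, not the exact code."""
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # step 12,800
    if step < warmdown_start:
        return base_lr
    remaining = max_iters - step
    return base_lr * remaining / (max_iters - warmdown_start)

print(linear_warmdown_lr(0))       # 0.006 (constant phase)
print(linear_warmdown_lr(14_400))  # halfway through warmdown
print(linear_warmdown_lr(16_000))  # 0.0
```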