vanilla_24L2048_parity_hotstop

A 24-layer, d_model=2048 vanilla GPT (no SAE), trained as the dense baseline for the 24L/2048 CayleySAE alignment-tax campaign. Training was stopped early at val loss 2.7808 to reach loss parity with markhenry/cayley-24L2048-32k-2L-mlp_in-20b (val loss 2.7933); 2.7808 was the closest single eval below that target.

Architecture

Params: 1,313,083,392 (1.313B) — identical trainable count to cayley-24L2048-131k-3L-mlp_in-* since CayleySAE's algebraic dictionary is parameter-free.
Backbone: 24 layers, 16 heads, d_model=2048, seq_len=1024
Sparsity: none (sparsity_mode=none)
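
The parameter count above can be reproduced from the backbone dimensions with a short calculation. This is a minimal sketch, not the model's own code. It assumes a nanoGPT-style layout that the card does not state: a GPT-2 vocabulary of 50,257 tokens, token embeddings tied to the output head, learned positional embeddings, a 4x MLP expansion, no bias terms, and weight-only LayerNorm. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(n_layer=24, d_model=2048, seq_len=1024, vocab_size=50_257):
    """Parameter count for a bias-free GPT with tied embeddings (assumed layout)."""
    attn = 4 * d_model * d_model      # fused QKV (3*d^2) plus output projection (d^2)
    mlp = 8 * d_model * d_model       # d -> 4d -> d, two weight matrices
    norms = 2 * d_model               # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms

    tok_emb = vocab_size * d_model    # shared with the output head (assumed tied)
    pos_emb = seq_len * d_model       # learned positional embeddings (assumed)
    final_norm = d_model

    return n_layer * per_block + tok_emb + pos_emb + final_norm


print(f"{gpt_param_count():,}")  # 1,313,083,392
```

Under these assumptions the total matches the 1,313,083,392 figure exactly. That is consistent with the assumed layout but does not confirm it; config.json remains the authoritative source.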

Training

Data: FineWeb-Edu-100B
Tokens seen: ~6.82B (iter 6500 × 1,048,576 tok/iter), well short of the 26B Chinchilla budget the run was scheduled for
Hardware: 4× B200 (Alabama, $14.545/hr); ~459k tok/s aggregate
Optimizer: Muon (peak 6e-3, momentum 0.95, NS=5) + AdamW (peak 6e-3) for non-Muon params
Schedule: 500 warmup iters, then cosine decay from 6e-3 to a 6e-4 floor over 24,796 scheduled iters
Batch: bs=64 × seq=1024 × ga=16 → 1.05M tok/iter (4 microsteps/rank)
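
The token figures in this section follow from the batch settings. The sketch below redoes that arithmetic. It assumes bs=64 is the per-microstep batch and ga=16 is the total number of microsteps per optimizer step across all four GPUs, which is the reading consistent with "4 microsteps/rank". The wall-clock estimate further assumes the quoted throughput held constant and ignores eval and checkpoint overhead, so treat it as a rough lower bound.

```python
# Values copied from the Training section of this card.
micro_batch = 64          # sequences per microstep (bs)
seq_len = 1024
grad_accum = 16           # microsteps per optimizer step, summed over ranks (assumed)
n_gpus = 4
stop_iter = 6_500
scheduled_iters = 24_796
tok_per_sec = 459_000     # aggregate throughput

tokens_per_iter = micro_batch * seq_len * grad_accum
microsteps_per_rank = grad_accum // n_gpus
tokens_seen = stop_iter * tokens_per_iter
tokens_scheduled = scheduled_iters * tokens_per_iter
hours_lower_bound = tokens_seen / tok_per_sec / 3600

print(f"tokens/iter:       {tokens_per_iter:,}")
print(f"microsteps/rank:   {microsteps_per_rank}")
print(f"tokens seen:       {tokens_seen / 1e9:.2f}B")
print(f"tokens scheduled:  {tokens_scheduled / 1e9:.1f}B")
print(f"fraction of plan:  {tokens_seen / tokens_scheduled:.0%}")
print(f"compute time (h):  {hours_lower_bound:.1f}")
```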

⚠️ Hot-stop caveat

This checkpoint was interrupted mid-cosine-descent at iter 6500, with the LR still at ~5.3e-3 (88% of peak). It was not warmed down to the schedule floor. Compared to the same model cold-stopped at the same val_loss, expect:

Slightly noisier per-feature activation norms
Potentially lower sparse-probe accuracy (rough guess: 0.5–2% absolute, based on the warmdown discount in Hoffmann et al. and Hu et al.)
Sharper loss landscape at the final weights

A separate cold-stopped parity checkpoint will be uploaded later for interpretability work that is sensitive to feature stability. Use this checkpoint for fast iteration / loss-parity sanity checks; reach for the cold version for probe / circuit work.
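
To see how far from the floor the learning rate was at the stop point, the schedule listed under Training can be evaluated directly. This is a sketch under assumptions the card does not confirm: linear warmup over the first 500 iters, and a cosine decay window that runs from the end of warmup to iter 24,796. The function name `lr_at` is hypothetical and is not taken from the training code.

```python
import math


def lr_at(it, peak=6e-3, floor=6e-4, warmup=500, total=24_796):
    """Learning rate at iteration `it` for warmup followed by cosine decay (assumed form)."""
    if it < warmup:
        return peak * it / warmup                      # linear warmup (assumption)
    progress = min(1.0, (it - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + cosine * (peak - floor)


lr_stop = lr_at(6_500)
print(f"LR at iter 6500: {lr_stop:.2e} ({lr_stop / 6e-3:.0%} of peak)")
print(f"LR at schedule end: {lr_at(24_796):.2e}")
```

With these assumptions the stop-point value comes out near 5.2e-3, close to the ~5.3e-3 quoted above. The small difference depends on how the training script defines the decay window. Either way, the rate was still almost an order of magnitude above the 6e-4 floor.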

Comparison target

Model	Val loss	Tokens	Stop
`cayley-24L2048-32k-2L-mlp_in-20b`	2.7933	20B	cold (warmdown to floor)
`vanilla_24L2048_parity_hotstop` (this)	2.7808	6.82B	hot (mid-cosine)

The 0.0125-nat gap (2.7933 vs. 2.7808) sits at or just above the eval-noise floor (~0.005 to 0.01 nats) on 40M-token evals, so the two checkpoints are treated as loss-matched.

Files

ckpt.pt — full training checkpoint (model + optim state + iter)
config.json — DeepTopKGPTConfig to instantiate the model
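
A minimal sketch for inspecting the two files with PyTorch follows. The checkpoint's internal layout is not documented here, so the `"model"` key and the `_orig_mod.` key prefix (which `torch.compile` adds to state-dict keys) are assumptions to verify against the printed key list. Both function names are hypothetical. Building the network also needs the `DeepTopKGPTConfig` class from the training code, which this sketch does not import.

```python
import json


def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with a torch.compile-style key prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def load_training_checkpoint(ckpt_path="ckpt.pt", config_path="config.json"):
    """Load config.json and ckpt.pt onto the CPU and return (config, weights, raw_ckpt)."""
    import torch  # imported lazily so the helper above works without PyTorch

    with open(config_path) as f:
        config = json.load(f)

    # weights_only=False because a full training checkpoint typically holds
    # optimizer state and other non-tensor objects. Only use it on trusted files.
    ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    print("top-level keys:", list(ckpt.keys()))

    # "model" as the weights key is an assumption; fall back to the whole dict.
    weights = strip_compile_prefix(ckpt.get("model", ckpt))
    return config, weights, ckpt
```

Because the file is a full training checkpoint, it includes optimizer state and is larger than a weights-only export. Only the returned `weights` dict is needed for inference or probing.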

Provenance

Trained by Claude (Opus 4.6) under markhenry's direction as part of the 24L/2048 CayleySAE campaign. Companion dense baseline to the 1.3B Cayley runs.
