vanilla_24L2048_parity_hotstop
A 24-layer / d=2048 vanilla GPT (no SAE), trained as the dense baseline for the
24L/2048 CayleySAE alignment-tax campaign. Stopped early at val 2.7808 to
hit loss-parity with markhenry/cayley-24L2048-32k-2L-mlp_in-20b
(val 2.7933) — the closest single eval below target.
Architecture
- Params: 1,313,083,392 (1.313B), identical trainable count to `cayley-24L2048-131k-3L-mlp_in-*` since CayleySAE's algebraic dictionary is parameter-free.
- Backbone: 24 layers, 16 heads, d_model=2048, seq_len=1024
- Sparsity: none (`sparsity_mode=none`)
Training
- Data: FineWeb-Edu-100B
- Tokens seen: ~6.82B (iter 6500 × 1,048,576 tok/iter), well short of the 26B Chinchilla budget the run was scheduled for
- Hardware: 4× B200 (Alabama, $14.545/hr); ~459k tok/s aggregate
- Optimizer: Muon (peak 6e-3, momentum 0.95, NS=5) + AdamW (peak 6e-3) for non-Muon params
- Schedule: 500-iter warmup, then cosine decay from 6e-3 to a 6e-4 floor over 24,796 scheduled iters
- Batch: bs=64 × seq=1024 × ga=16 → 1.05M tok/iter (4 microsteps/rank)
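The token counts quoted above follow from the batch settings, and a few lines of arithmetic make that easy to check. This is a minimal sketch, not taken from the training script: the constant names are illustrative, and it assumes `ga=16` is the total number of microsteps per optimizer step, split evenly across the 4 GPUs (which is what "4 microsteps/rank" suggests).

```python
# Batch and token-budget arithmetic for this run.
# Constant names are illustrative; values come from the Training list above.
SEQ_LEN = 1024
MICRO_BATCH = 64          # sequences per microstep
GRAD_ACCUM = 16           # assumed: total microsteps per optimizer step
NUM_GPUS = 4
ITERS_DONE = 6_500
ITERS_SCHEDULED = 24_796

tokens_per_iter = MICRO_BATCH * SEQ_LEN * GRAD_ACCUM
microsteps_per_rank = GRAD_ACCUM // NUM_GPUS
tokens_seen = ITERS_DONE * tokens_per_iter
tokens_scheduled = ITERS_SCHEDULED * tokens_per_iter
fraction_done = tokens_seen / tokens_scheduled

print(f"tokens/iter:       {tokens_per_iter:,}")
print(f"microsteps/rank:   {microsteps_per_rank}")
print(f"tokens seen:       {tokens_seen / 1e9:.2f}B")
print(f"tokens scheduled:  {tokens_scheduled / 1e9:.2f}B")
print(f"fraction of run:   {fraction_done:.1%}")
```

Under these assumptions the run stopped a little over a quarter of the way through its scheduled token budget.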
⚠️ Hot-stop caveat
This checkpoint was interrupted mid-cosine-descent at iter 6500, with the LR still at ~5.3e-3 (88% of peak). It was not warmed down to the schedule floor. Compared to the same model cold-stopped at the same val_loss, expect:
- Slightly noisier per-feature activation norms
- Potentially lower sparse-probe accuracy (rough guess: 0.5–2% absolute, based on the warmdown discount in Hoffmann et al. and Hu et al.)
- Sharper loss landscape at the final weights
A separate cold-stopped parity checkpoint will be uploaded later for interpretability work that is sensitive to feature stability. Use this checkpoint for fast iteration / loss-parity sanity checks; reach for the cold version for probe / circuit work.
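To see roughly how far above the floor the learning rate was at the stop point, the schedule can be sketched as below. The exact schedule code is not part of this card, so this assumes linear warmup followed by cosine decay measured from the end of warmup to the last scheduled iter; `lr_at` is a hypothetical helper, not a function from the training code.

```python
import math

# Values from the Training section; the schedule shape itself is an assumption.
PEAK_LR = 6e-3
MIN_LR = 6e-4
WARMUP_ITERS = 500
TOTAL_ITERS = 24_796

def lr_at(it: int) -> float:
    """Assumed schedule: linear warmup, then cosine decay down to MIN_LR."""
    if it < WARMUP_ITERS:
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    if it >= TOTAL_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at start, 0 at end
    return MIN_LR + coeff * (PEAK_LR - MIN_LR)

lr_stop = lr_at(6_500)
print(f"LR at iter 6500: {lr_stop:.2e} ({lr_stop / PEAK_LR:.0%} of peak)")
print(f"Distance to floor: {lr_stop / MIN_LR:.1f}x the floor LR")
```

With this form the stop-point LR lands in the low 5e-3 range, in line with the figure quoted in the caveat; the last digit depends on how the training script defines the decay window.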
Comparison target
| Model | Val loss | Tokens | Stop |
|---|---|---|---|
| cayley-24L2048-32k-2L-mlp_in-20b | 2.7933 | 20B | cold (warmdown to floor) |
| vanilla_24L2048_parity_hotstop (this) | 2.7808 | 6.82B | hot (mid-cosine) |
The 0.0125 nat gap is close to the eval-noise floor (~0.005-0.01) on 40M-token evals.
Files
- `ckpt.pt`: full training checkpoint (model + optimizer state + iter)
- `config.json`: `DeepTopKGPTConfig` used to instantiate the model
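A cautious way to inspect the two files before wiring them into model code is sketched below. It assumes both files have already been downloaded to the working directory and that `ckpt.pt` is a pickled dictionary; the top-level key names are not documented here, so the sketch lists them rather than guessing. `load_config`, `load_checkpoint`, and `summarize_checkpoint` are hypothetical helpers, and instantiating `DeepTopKGPTConfig` itself requires the campaign's own code, which is not shown.

```python
import json
from pathlib import Path

def load_config(path):
    """Read config.json as a plain dict (fields feed DeepTopKGPTConfig)."""
    return json.loads(Path(path).read_text())

def summarize_checkpoint(ckpt: dict) -> dict:
    """Map each top-level key to its value's type name, without assuming key names."""
    return {key: type(value).__name__ for key, value in ckpt.items()}

def load_checkpoint(path, device="cpu"):
    """Load the full training checkpoint onto `device`."""
    import torch  # imported lazily so the helpers above work without PyTorch
    # weights_only=False: training checkpoints usually pickle optimizer state
    # and other Python objects. Only do this for files you trust.
    return torch.load(path, map_location=device, weights_only=False)

# Only runs when the files are present locally (assumed paths).
if Path("ckpt.pt").exists() and Path("config.json").exists():
    config = load_config("config.json")
    ckpt = load_checkpoint("ckpt.pt")
    print("config fields:", sorted(config))
    print("checkpoint contents:", summarize_checkpoint(ckpt))
```

Listing the keys first avoids hard-coding names such as the model or optimizer entries, which may differ from the common conventions.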
Provenance
Trained by Claude (Opus 4.6) under markhenry's direction as part of the 24L/2048 CayleySAE campaign. Companion dense baseline to the 1.3B Cayley runs.