vanilla_24L2048_parity_hotstop

A 24-layer, d_model=2048 vanilla GPT (no SAE), trained as the dense baseline for the 24L/2048 CayleySAE alignment-tax campaign. Training was stopped early at val loss 2.7808 to reach loss parity with markhenry/cayley-24L2048-32k-2L-mlp_in-20b (val loss 2.7933); 2.7808 was the closest single eval below that target.

Architecture

  • Params: 1,313,083,392 (1.313B) — identical trainable count to cayley-24L2048-131k-3L-mlp_in-* since CayleySAE's algebraic dictionary is parameter-free.
  • Backbone: 24 layers, 16 heads, d_model=2048, seq_len=1024
  • Sparsity: none (sparsity_mode=none)
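
The quoted parameter count can be cross-checked with a short calculation. The sketch below assumes a nanoGPT-style layout that the card does not state: a GPT-2 vocabulary of 50,257 tokens, learned positional embeddings, input and output embeddings tied, a 4x MLP, and no bias terms in linear or norm layers. Under those assumptions the total comes out to exactly 1,313,083,392; the function name and the vocabulary size are illustrative, not taken from the repo.

```python
def dense_gpt_params(n_layer, d_model, vocab_size, block_size):
    """Count parameters for a bias-free GPT with tied embeddings (assumed layout)."""
    attn = 4 * d_model * d_model        # QKV projection (3 d^2) + output projection (d^2)
    mlp = 8 * d_model * d_model         # up-projection (4 d^2) + down-projection (4 d^2)
    norms_per_block = 2 * d_model       # two weight-only norms per block
    blocks = n_layer * (attn + mlp + norms_per_block)
    embeddings = (vocab_size + block_size) * d_model  # token + learned position tables
    final_norm = d_model
    return blocks + embeddings + final_norm


# vocab_size=50257 is an assumption (GPT-2 BPE); the other values come from the card.
total = dense_gpt_params(n_layer=24, d_model=2048, vocab_size=50257, block_size=1024)
print(f"{total:,}")
```

The transformer blocks account for about 1.208B of the total, and the embedding tables for about 0.105B.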

Training

  • Data: FineWeb-Edu-100B
  • Tokens seen: ~6.82B (iter 6500 × 1,048,576 tok/iter), well short of the 26B Chinchilla budget the run was scheduled for
  • Hardware: 4× B200 (Alabama, $14.545/hr); ~459k tok/s aggregate
  • Optimizer: Muon (peak 6e-3, momentum 0.95, NS=5) + AdamW (peak 6e-3) for non-Muon params
  • Schedule: 500 warmup iters, then cosine decay from 6e-3 to a 6e-4 floor over 24,796 scheduled iters
  • Batch: bs=64 × seq=1024 × ga=16 → 1.05M tok/iter (4 microsteps/rank)
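
The token and batch figures above follow from simple arithmetic, reproduced here as a sanity check. The wall-clock and cost estimates at the end are derived, not reported in the card, and assume the quoted throughput held for the entire run with no eval or checkpointing overhead.

```python
MICRO_BATCH = 64          # sequences per microstep
SEQ_LEN = 1024            # tokens per sequence
GRAD_ACCUM = 16           # microsteps per optimizer step, summed over all ranks
N_GPUS = 4
STOP_ITER = 6500
SCHEDULED_ITERS = 24_796
THROUGHPUT_TOK_S = 459_000   # aggregate tokens/second quoted in the card
HOURLY_RATE_USD = 14.545     # quoted rate for the 4-GPU node

tokens_per_iter = MICRO_BATCH * SEQ_LEN * GRAD_ACCUM
microsteps_per_rank = GRAD_ACCUM // N_GPUS
tokens_seen = STOP_ITER * tokens_per_iter
tokens_scheduled = SCHEDULED_ITERS * tokens_per_iter

# Derived estimates (assumption: constant throughput, no overhead).
train_hours = tokens_seen / THROUGHPUT_TOK_S / 3600
train_cost_usd = train_hours * HOURLY_RATE_USD

print(f"tokens/iter:      {tokens_per_iter:,}")
print(f"microsteps/rank:  {microsteps_per_rank}")
print(f"tokens seen:      {tokens_seen / 1e9:.2f}B")
print(f"tokens scheduled: {tokens_scheduled / 1e9:.2f}B")
print(f"est. train time:  {train_hours:.2f} h")
print(f"est. train cost:  ${train_cost_usd:.0f}")
```

On these numbers the run stopped after roughly a quarter of its scheduled token budget.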

⚠️ Hot-stop caveat

This checkpoint was interrupted mid-cosine-descent at iter 6500, with the LR still at ~5.3e-3 (88% of peak). It was not warmed down to the schedule floor. Compared to the same model cold-stopped at the same val_loss, expect:

  • Slightly noisier per-feature activation norms
  • Potentially lower sparse-probe accuracy (rough guess: 0.5–2% absolute, based on the warmdown discount in Hoffmann et al. and Hu et al.)
  • Sharper loss landscape at the final weights

A separate cold-stopped parity checkpoint will be uploaded later for interpretability work that is sensitive to feature stability. Use this checkpoint for fast iteration / loss-parity sanity checks; reach for the cold version for probe / circuit work.
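
The learning rate at the stop point can be estimated from the schedule parameters listed under Training. This is a minimal sketch, assuming linear warmup and a cosine decay measured from the end of warmup to the last scheduled iter; the card does not specify either detail, and `scheduled_lr` is a hypothetical helper, not the training code.

```python
import math


def scheduled_lr(it, peak=6e-3, floor=6e-4, warmup=500, total=24_796):
    """Warmup-then-cosine LR schedule (assumed form, see lead-in)."""
    if it < warmup:
        # Linear warmup is an assumption; the card only says "500 warmup".
        return peak * (it + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((it - warmup) / (total - warmup), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return floor + coeff * (peak - floor)


lr_at_stop = scheduled_lr(6500)
print(f"LR at iter 6500: {lr_at_stop:.2e} ({lr_at_stop / 6e-3:.0%} of peak)")
```

Under these assumptions the LR at iter 6500 is roughly 5.2e-3, about 87% of peak, in line with the approximate figure quoted in the caveat. A schedule that measures decay progress differently would shift the value slightly.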

Comparison target

| Model | Val loss | Tokens | Stop |
| --- | --- | --- | --- |
| cayley-24L2048-32k-2L-mlp_in-20b | 2.7933 | 20B | cold (warmdown to floor) |
| vanilla_24L2048_parity_hotstop (this) | 2.7808 | 6.82B | hot (mid-cosine) |

The gap of about 0.0125 nats (2.7933 − 2.7808) is on the order of the eval-noise floor (~0.005 to 0.01 nats) on 40M-token evals.

Files

  • ckpt.pt: full training checkpoint (model + optim state + iter)
  • config.json: DeepTopKGPTConfig to instantiate the model
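
A minimal sketch for inspecting these two files after downloading them locally. It assumes `ckpt.pt` unpickles to a Python dict; the card does not document the key names, so the helper reports whatever top-level keys it finds instead of assuming them. The function names are hypothetical, and `DeepTopKGPTConfig` itself lives in the training code, which is not part of this repo card.

```python
import json
from pathlib import Path


def load_config(path):
    """Read config.json into a plain dict of config fields."""
    return json.loads(Path(path).read_text())


def describe_checkpoint(ckpt):
    """Summarize the top-level entries of an already-loaded checkpoint dict."""
    summary = {}
    for key, value in ckpt.items():
        if isinstance(value, dict):
            summary[key] = f"dict with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


def load_checkpoint(path):
    """Load ckpt.pt onto CPU. Requires PyTorch."""
    import torch  # imported lazily so the helpers above work without it

    # weights_only=False because the file also holds optimizer state and
    # counters; only unpickle checkpoints from sources you trust.
    return torch.load(path, map_location="cpu", weights_only=False)
```

Typical use would be `describe_checkpoint(load_checkpoint("ckpt.pt"))` to find the model state dict before passing the fields from `load_config("config.json")` to the model constructor.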

Provenance

Trained by Claude (Opus 4.6) under markhenry's direction as part of the 24L/2048 CayleySAE campaign. Companion dense baseline to the 1.3B Cayley runs.
