vanilla-24L2048-parity-cold

A 1.3B-parameter vanilla GPT (no SAE bottleneck) trained to val-loss parity with markhenry/cayley-24L2048-32k-2L-mlp_in-20b, the CayleySAE variant of the same backbone. This is the primary baseline for alignment-tax comparisons: same architecture minus the sparsity-enforcing bottleneck, cold-stopped at matching loss.

Numbers

metric	cayley-24L2048-32k-2L-mlp_in-20b	vanilla-24L2048-parity-cold
val_loss (CE)	2.7933	2.7926
tokens seen	20.0B	3.8B
pile_ppl	20.32	18.89
hellaswag_acc	0.383	0.379
lambada_acc	0.304	0.304

Val-loss delta: 0.0007 nats, which is within the ~0.005-nat eval noise floor, so the two runs are at parity.
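
The parity claim can be checked directly from the table. The following is a minimal sketch using the values above; the variable names are illustrative, and the 0.005 threshold is the approximate noise floor quoted in this card, not a measured constant.

```python
# Values copied from the Numbers table; variable names are illustrative.
cayley_val_loss = 2.7933
vanilla_val_loss = 2.7926
eval_noise_floor = 0.005  # approximate figure quoted in this card

# Absolute gap in cross-entropy (nats) between the two runs.
delta = abs(cayley_val_loss - vanilla_val_loss)
at_parity = delta < eval_noise_floor

# Ratio of training tokens each run consumed to reach that loss.
token_ratio = 20.0e9 / 3.8e9

print(f"delta = {delta:.4f} nats, parity = {at_parity}")
print(f"token ratio (cayley / vanilla) = {token_ratio:.2f}x")
```

By these numbers the Cayley run consumed roughly five times as many tokens as the vanilla run to reach the same validation loss.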

Architecture

24 layers, 16 heads, d_model = 2048 (~1.3B params)
sparsity_mode = none (standard GPT, no CayleySAE)
Trained with Muon + AdamW (peak LR 1e-2, min LR 1e-3)
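
As a sanity check on the ~1.3B figure, the parameter count can be estimated from the layer count and width. This is a rough sketch under assumptions that this card does not state: a standard 4x MLP expansion, a GPT-2 vocabulary of 50,257 tokens with tied input and output embeddings, and biases and LayerNorm weights ignored.

```python
# Rough parameter estimate for a standard GPT block stack.
n_layer = 24
d_model = 2048
vocab_size = 50_257  # assumption: GPT-2 vocabulary, not stated in the card

attn_params = 4 * d_model * d_model   # Q, K, V and output projections
mlp_params = 8 * d_model * d_model    # up- and down-projection at 4x width (assumed)
block_params = n_layer * (attn_params + mlp_params)
embed_params = vocab_size * d_model   # token embedding, assumed tied with the LM head

total = block_params + embed_params
print(f"blocks: {block_params / 1e9:.2f}B, total: {total / 1e9:.2f}B")
```

Under these assumptions the transformer blocks account for about 1.21B parameters and the embedding for about 0.10B, which is consistent with the ~1.3B quoted above.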

Training

Dataset: FineWeb-Edu-100B (same shard ordering as the Cayley run)
Schedule: parity_adaptive -- flat at peak LR until val enters target band, then 895-iter linear warmdown to min LR
Trigger: val <= 2.7933 + 0.207 = 3.0003 (fired at iter 1950)
Cold stop: iter 2900, LR = 1e-3
Hardware: 4x NVIDIA B200, ~32 min wall clock
Wandb: vanilla-24L2048-parity-fresh-v2-extend
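
The schedule and trigger described above can be sketched as follows. This is an illustration, not the project's implementation: the function names are hypothetical, and the warmdown is assumed to begin at the iteration where the trigger fires.

```python
def trigger_fired(val_loss, target=2.7933, band=0.207):
    """True once validation loss enters the target band (val <= target + band)."""
    return val_loss <= target + band


def parity_adaptive_lr(it, trigger_iter, peak_lr=1e-2, min_lr=1e-3, warmdown_iters=895):
    """Flat at peak_lr until the trigger fires, then linear warmdown to min_lr.

    trigger_iter is None while validation loss is still above the target band.
    Hypothetical helper: the warmdown is assumed to start at trigger_iter.
    """
    if trigger_iter is None or it <= trigger_iter:
        return peak_lr
    # Fraction of the warmdown completed, clamped so the LR holds at min_lr afterwards.
    progress = min((it - trigger_iter) / warmdown_iters, 1.0)
    return peak_lr + progress * (min_lr - peak_lr)


# Before the trigger the LR stays at its peak; after the warmdown it holds at min_lr.
print(parity_adaptive_lr(1000, None))
print(parity_adaptive_lr(2900, 1950))
```

Under these assumptions, a trigger at iter 1950 leaves the learning rate at min_lr by the cold stop at iter 2900, consistent with the LR = 1e-3 reported above.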

Files

ckpt.pt -- PyTorch checkpoint (5.1 GB). Contains model, optimizer_states, config, model_config, iter_num, best_val_loss, wandb_step_offset, parity_trigger_iter.
config.json -- training config snapshot.
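
Before relying on a downloaded checkpoint, it can help to confirm that the documented top-level keys are present. The helper below is a hypothetical sketch: `missing_checkpoint_keys` is not part of the Sparse NanoGPT codebase, and the stand-in dict only mimics the shape of a loaded checkpoint.

```python
# Top-level keys this card lists for ckpt.pt.
EXPECTED_KEYS = {
    "model", "optimizer_states", "config", "model_config",
    "iter_num", "best_val_loss", "wandb_step_offset", "parity_trigger_iter",
}


def missing_checkpoint_keys(ckpt):
    """Return the documented keys that are absent from a checkpoint dict, sorted."""
    return sorted(EXPECTED_KEYS - set(ckpt))


# Stand-in for a loaded checkpoint, with one documented key left out on purpose.
fake_ckpt = {key: None for key in EXPECTED_KEYS if key != "parity_trigger_iter"}
print(missing_checkpoint_keys(fake_ckpt))
```

With a real checkpoint loaded as in the Loading section, `missing_checkpoint_keys(ckpt)` should return an empty list if the file matches this card.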

Loading

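import torch
from sparse_nanogpt.model import GPT
from sparse_nanogpt.config import DeepTopKGPTConfig

# weights_only=False unpickles arbitrary Python objects; load only checkpoints you trust.
ckpt = torch.load("ckpt.pt", map_location="cpu", weights_only=False)
# Rebuild the model from the config stored in the checkpoint, then restore the weights.
model_config = DeepTopKGPTConfig(**ckpt["model_config"])
model = GPT(model_config)
model.load_state_dict(ckpt["model"])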

Citation

Part of the Sparse NanoGPT project.
