HuggingFaceFW/fineweb-edu
A 1.3B-parameter vanilla GPT (no SAE bottleneck) trained to val-loss parity
with markhenry/cayley-24L2048-32k-2L-mlp_in-20b,
the CayleySAE variant of the same backbone. This is the primary baseline for
alignment-tax comparisons: same architecture minus the sparsity-enforcing
bottleneck, cold-stopped at matching loss.
| | cayley-24L2048-32k-2L-mlp_in-20b | vanilla-24L2048-parity-cold |
|---|---|---|
| val_loss (CE) | 2.7933 | 2.7926 |
| tokens seen | 20.0B | 3.8B |
| pile_ppl | 20.32 | 18.89 |
| hellaswag_acc | 0.383 | 0.379 |
| lambada_acc | 0.304 | 0.304 |
Val-loss delta: 0.0007 nats (inside ~0.005 eval noise floor). This is parity.
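The parity claim above is simple arithmetic on the two val_loss entries in the table. A minimal sketch of that check follows; the function name `within_parity` is a hypothetical helper for illustration, and the tolerance is the approximate 0.005-nat noise floor quoted above, not a measured constant.

```python
# Values copied from the comparison table above (cross-entropy, in nats).
CAYLEY_VAL_LOSS = 2.7933
VANILLA_VAL_LOSS = 2.7926

# Approximate eval noise floor quoted in the text (assumed, not re-measured here).
EVAL_NOISE_FLOOR = 0.005


def within_parity(loss_a: float, loss_b: float, tol: float = EVAL_NOISE_FLOOR) -> bool:
    """Return True if two validation losses differ by less than the tolerance."""
    return abs(loss_a - loss_b) < tol


delta = abs(CAYLEY_VAL_LOSS - VANILLA_VAL_LOSS)
print(f"val-loss delta: {delta:.4f} nats")
print("parity:", within_parity(CAYLEY_VAL_LOSS, VANILLA_VAL_LOSS))
```

Comparing against a noise floor rather than testing for equality matters here: two evaluations of the same checkpoint can already differ by a few thousandths of a nat.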
Training setup:

- sparsity_mode = none (standard GPT, no CayleySAE)
- parity_adaptive -- flat at peak LR until val loss enters the target band, then 895-iter linear warmdown to min LR

Files:

- ckpt.pt -- PyTorch checkpoint (5.1 GB). Contains model, optimizer_states, config, model_config, iter_num, best_val_loss, wandb_step_offset, parity_trigger_iter.
- config.json -- training config snapshot.

Loading the checkpoint:

```python
import torch
from sparse_nanogpt.model import GPT
from sparse_nanogpt.config import DeepTopKGPTConfig

# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("ckpt.pt", map_location="cpu", weights_only=False)

# Rebuild the architecture from the stored model config, then restore the weights.
model_config = DeepTopKGPTConfig(**ckpt["model_config"])
model = GPT(model_config)
model.load_state_dict(ckpt["model"])
```
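The parity_adaptive schedule described above (flat at peak LR, then an 895-iteration linear warmdown once val loss enters the target band) can be sketched as a pure function of the iteration number. This is an illustrative reconstruction from the one-line description, not the project's actual scheduler. The function name, its arguments, and the example learning rates are assumptions, and any warmup phase is omitted because the description does not mention one. The `trigger_iter` argument stands in for whatever the checkpoint's parity_trigger_iter field records.

```python
from typing import Optional


def parity_adaptive_lr(
    it: int,
    peak_lr: float,
    min_lr: float,
    trigger_iter: Optional[int] = None,
    warmdown_iters: int = 895,
) -> float:
    """Learning rate at iteration `it` (hypothetical sketch).

    trigger_iter is the iteration at which val loss first entered the
    target band, or None if that has not happened yet.
    """
    # Before the trigger, hold the learning rate flat at its peak.
    if trigger_iter is None or it < trigger_iter:
        return peak_lr
    # After the trigger, interpolate linearly from peak_lr down to min_lr,
    # clamping at min_lr once the warmdown window has elapsed.
    progress = min(1.0, (it - trigger_iter) / warmdown_iters)
    return peak_lr + progress * (min_lr - peak_lr)


# Example with placeholder learning rates (not taken from the training config).
for it in (0, 1000, 1447, 1895, 3000):
    print(it, parity_adaptive_lr(it, peak_lr=3e-4, min_lr=3e-5, trigger_iter=1000))
```

Because the trigger depends on measured val loss rather than a fixed step count, the total run length is not known in advance; only the length of the warmdown is fixed.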
Part of the Sparse NanoGPT project.