# vanilla-v5-parity
A vanilla GPT baseline trained for direct comparison against aemack-org/cayley-10b.
## Architecture
| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
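The table above corresponds to a nanoGPT-style model config. A minimal sketch, assuming nanoGPT-style field names (not taken from this repo's `config.json`):

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values from the Architecture table; field names are assumptions.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024       # per-head dim = 1024 / 8 = 128
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab padded up for efficiency
    bias: bool = False       # no bias terms in linear layers


cfg = GPTConfig()
```

With these values each attention head is 128-dimensional and the 4x-expansion MLP hidden size is 4 * 1024 = 4096.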
## Training
| Parameter | Value |
|---|---|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.4) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~2.79B (early-stopped; see below) |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 3.1728 |
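The `linear_warmdown` schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_frac` of training. A sketch of that shape, using the values from the table; the function name and exact formula are assumptions, not taken from the training code:

```python
def linear_warmdown_lr(it, max_iters=16000, base_lr=0.006, warmdown_frac=0.4):
    """Hold base_lr, then ramp linearly to 0 over the last warmdown_frac of steps.

    Shape inferred from the config names; check the repo for the exact schedule.
    """
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # iter 9600 here
    if it < warmdown_start:
        return base_lr
    # Linear ramp: base_lr at warmdown_start, 0 at max_iters.
    return base_lr * (max_iters - it) / (max_iters - warmdown_start)
```

Note that with early stopping at iter 4250, training never left the constant-LR phase of this schedule.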
## Purpose
Interpretability comparison baseline. Trained with identical hyperparameters to
cayley-10b but without the CayleySAE bottleneck at mlp_in. Enables direct
comparison of residual stream representations.
## Early stopping (cayley val parity)
Training stopped at the first evaluation where validation CE was at or below 3.173 (published validation loss for aemack-org/cayley-10b). This checkpoint is from iter 4250 (~2.79B tokens seen), not the full max_iters budget in config.json.
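The stopping rule amounts to a threshold check on the evaluation history. A sketch, with the target value taken from the card and the helper name purely illustrative:

```python
CAYLEY_VAL_LOSS = 3.173  # published validation loss for aemack-org/cayley-10b


def first_parity_iter(eval_history):
    """Return the first eval iteration whose val CE is at or below the target.

    eval_history maps iteration -> validation cross-entropy. Returns None if
    parity was never reached. Illustrative helper, not from the training code.
    """
    for it, val_ce in sorted(eval_history.items()):
        if val_ce <= CAYLEY_VAL_LOSS:
            return it
    return None
```

For example, a history like `{4000: 3.201, 4250: 3.1728, 4500: 3.150}` stops at iter 4250, matching the checkpoint described above.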