# vanilla-10b

A vanilla GPT baseline trained for comparison against aemack-org/cayley-10b.
## Architecture
| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
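For concreteness, the table maps onto a nanoGPT-style configuration such as the sketch below. The `GPTConfig` name and field names are assumptions following nanoGPT conventions, not this repo's actual code; only the values come from the table above.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values taken from the Architecture table; names are assumed.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab (50257) padded up for efficiency
    bias: bool = False       # no bias terms in linear layers or norms
```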
## Training
| Parameter | Value |
|---|---|
| optimizer | Muon (2D hidden weights) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
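The token count follows from the batch geometry: 80 × 1024 × 16000 ≈ 1.31B tokens, matching the ~1.3B figure. The warmdown schedule can be sketched as below; the exact shape (flat, then decaying linearly to zero) is an assumption, since the table only fixes `warmdown_frac=0.2`.

```python
def linear_warmdown(it, max_iters=16000, base_lr=0.006, warmdown_frac=0.2):
    """Assumed schedule: constant LR, then linear decay to zero over the
    final warmdown_frac of training. Only max_iters, base_lr, and
    warmdown_frac come from the table above."""
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # iter 12800
    if it < warmdown_start:
        return base_lr
    progress = (it - warmdown_start) / (max_iters - warmdown_start)
    return base_lr * (1 - progress)
```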
## Purpose

An interpretability comparison baseline: trained with hyperparameters identical
to cayley-10b but without the CayleySAE bottleneck at mlp_in, enabling direct
comparison of residual stream representations.
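A hypothetical sketch of how such a comparison might capture the residual stream entering an MLP with a PyTorch forward pre-hook. The module path `model.transformer.h[i].mlp` follows nanoGPT conventions and is an assumption about this repo's layout.

```python
import torch

def capture_mlp_in(model, tokens, layer_idx=6):
    """Capture the residual stream at mlp_in for one block via a
    forward pre-hook. Module names are assumed, not from this repo."""
    captured = {}

    def hook(module, inputs):
        # inputs[0] is the tensor entering the MLP (post-norm residual)
        captured["mlp_in"] = inputs[0].detach()

    handle = model.transformer.h[layer_idx].mlp.register_forward_pre_hook(hook)
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return captured["mlp_in"]
```

Running this on both vanilla-10b and cayley-10b with the same input tokens yields activations at the same site, which can then be compared directly.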