---
tags:
- pytorch
- language-model
- gpt
license: mit
---
# vanilla-10b

A vanilla GPT baseline trained as the control for [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).
## Architecture

| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
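The table above maps onto a nanoGPT-style config. A minimal sketch follows; the `GPTConfig` name and field names are assumptions about the training code, not taken from this repo. Note that `vocab_size` (50304) is the 50,257-token GPT-2 vocabulary padded up to a multiple of 64:

```python
import math
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values copied from the Architecture table; field names follow the
    # common nanoGPT convention and are an assumption, not the repo's code.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304   # GPT-2's 50,257 tokens padded to a multiple of 64
    bias: bool = False        # no bias in linear layers

# 50304 is the smallest multiple of 64 that holds the 50,257-token
# GPT-2 (tiktoken) vocabulary, keeping matmul shapes GPU-friendly.
padded_vocab = 64 * math.ceil(50257 / 64)
assert padded_vocab == GPTConfig().vocab_size
```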
## Training

| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
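Under these settings the token budget works out exactly to the ~1.3B in the table, and the schedule can be sketched as below. The `linear_warmdown` semantics here — hold the base LR, then decay linearly to zero over the final `warmdown_frac` of training — are an assumption inferred from the name, not confirmed by the training code:

```python
# Token budget: batch_size x seq_len x max_iters
tokens = 80 * 1024 * 16000   # = 1,310,720,000 ~ 1.3B, matching the table

def linear_warmdown(it, max_iters=16000, base_lr=0.006, warmdown_frac=0.2):
    """Assumed schedule: constant base_lr for the first (1 - warmdown_frac)
    of training, then a linear decay to 0 over the final warmdown_frac."""
    warmdown_start = max_iters * (1 - warmdown_frac)  # iteration 12800 here
    if it < warmdown_start:
        return base_lr
    return base_lr * (max_iters - it) / (max_iters - warmdown_start)
```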
## Purpose

An interpretability comparison baseline: trained with hyperparameters identical to `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`, enabling direct comparison of residual stream representations between the two models.