---
tags:
- pytorch
- language-model
- gpt
license: mit
---
# vanilla-10b
A vanilla GPT baseline trained for comparison against [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).
## Architecture
| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
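
The values in the table can be collected into a nanoGPT-style config object. This is a minimal sketch, not the training code: the `GPTConfig` name, the `mlp_expansion` field, and the helper methods are hypothetical. The weight counts assume standard attention (four `n_embd x n_embd` projections per layer) and a two-matrix MLP with no biases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    # Values copied from the architecture table; the class itself is illustrative.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304
    bias: bool = False
    mlp_expansion: int = 4  # hypothetical field name for the "4x expansion" row

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.n_embd // self.n_head

    def hidden_matrix_params(self) -> int:
        # Assumed layout: Q, K, V, and output projections plus MLP up/down matrices.
        attn = 4 * self.n_embd * self.n_embd
        mlp = 2 * self.mlp_expansion * self.n_embd * self.n_embd
        return self.n_layer * (attn + mlp)

    def token_embedding_params(self) -> int:
        return self.vocab_size * self.n_embd


cfg = GPTConfig()
print(cfg.head_dim)                  # 128
print(cfg.hidden_matrix_params())    # 150994944
print(cfg.token_embedding_params())  # 51511296
```

Under these assumptions the hidden matrices hold about 151M weights and the token embedding about 51.5M. Norm gains, any positional embedding, and whether the output head shares weights with the embedding are not specified in the table and are left out of the count.
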
## Training
| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
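
The token count and the learning-rate schedule can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `batch_size` is taken to be sequences per optimizer step with no extra multiplier, and `linear_warmdown` is taken to mean a constant learning rate followed by a linear decay to zero over the final `warmdown_frac` of training. Both function names are hypothetical, and no warmup phase is modeled because none is listed.

```python
def tokens_seen(batch_size: int, seq_len: int, iters: int) -> int:
    # Assumes every step consumes batch_size full-length sequences.
    return batch_size * seq_len * iters


def linear_warmdown_lr(it: int, max_iters: int, base_lr: float,
                       warmdown_frac: float = 0.2) -> float:
    # Assumed shape: hold base_lr, then decay linearly to 0 at max_iters.
    warmdown_iters = round(max_iters * warmdown_frac)
    warmdown_start = max_iters - warmdown_iters
    if warmdown_iters == 0 or it < warmdown_start:
        return base_lr
    return base_lr * (max_iters - it) / warmdown_iters


total = tokens_seen(batch_size=80, seq_len=1024, iters=16000)
print(total)  # 1310720000, consistent with the "~1.3B" row

for step in (0, 12800, 14400, 16000):
    print(step, linear_warmdown_lr(step, max_iters=16000, base_lr=0.006))
```

With these assumptions the decay starts at iteration 12800 and the learning rate reaches half of its base value at iteration 14400.
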
## Purpose
Baseline for interpretability comparisons. It is trained with the same
hyperparameters as `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`,
so residual stream representations of the two models can be compared directly.
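
This card does not prescribe a comparison metric. As one possible illustration, linear CKA (centered kernel alignment) scores the similarity of two activation matrices and is invariant to rotations and uniform rescaling of either one. The sketch below is pure Python on toy data; the function names are hypothetical, and in practice each matrix would hold one row per token of residual stream activations taken from both models on the same inputs at the same layer.

```python
import math


def _center(rows):
    # Subtract the per-feature mean from every row.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _cross_sq_norm(a, b):
    # Squared Frobenius norm of a^T b, where a is n x p and b is n x q.
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    # x and y: lists of rows with the same number of rows (one per token).
    xc, yc = _center(x), _center(y)
    num = _cross_sq_norm(xc, yc)
    den = math.sqrt(_cross_sq_norm(xc, xc) * _cross_sq_norm(yc, yc))
    return num / den


# Toy activations: acts_b is acts_a with its two features swapped and doubled.
acts_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
acts_b = [[0.0, 2.0], [2.0, 0.0], [2.0, 4.0], [6.0, 2.0]]
score = linear_cka(acts_a, acts_b)
print(round(score, 6))  # 1.0
```

The nested loops are only meant to make the formula readable. For activations of width 1024, an array library would be the practical choice.
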