markhenry/vanilla-10b

+---
+tags:
+  - pytorch
+  - language-model
+  - gpt
+license: mit
+---
+# vanilla-10b
+A vanilla GPT baseline, trained for comparison against [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).
+## Architecture
+| Parameter | Value |
+|-----------|-------|
+| n_layer | 12 |
+| n_head | 8 |
+| n_embd | 1024 |
+| block_size | 1024 |
+| vocab_size | 50304 |
+| bias | False |
+| norm | RMSNorm (affine) |
+| MLP | GELU, 4x expansion |
+| tokenizer | GPT-2 (tiktoken) |
+| dtype | bfloat16 |
+| sparsity | none (vanilla) |
+## Training
+| Parameter | Value |
+|-----------|-------|
+| optimizer | Muon (hidden 2D weight matrices) + AdamW (embeddings) |
+| muon_lr | 0.006 |
+| adamw_lr | 0.006 |
+| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
+| batch_size | 80 |
+| seq_len | 1024 |
+| max_iters | 16000 |
+| tokens seen | ~1.3B |
+| dataset | FineWeb-Edu-10B |
+| best_val_loss | 6.2834 |
+## Purpose
+This model is a baseline for interpretability comparisons. It was trained with the same
+hyperparameters as `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`, so the
+residual stream representations of the two models can be compared directly.
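
The derived figures in the Architecture and Training tables can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the training code. It assumes that `batch_size` counts sequences per optimizer step (no gradient accumulation), that the warmdown occupies the final `warmdown_frac` of training, and, for the rough parameter estimate, that each block has four bias-free `n_embd x n_embd` attention projections, a two-matrix MLP, two affine RMSNorm layers, tied input/output embeddings, and no learned positional embeddings. None of these assumptions is stated in the card.

```python
# Values copied from the Architecture and Training tables.
n_layer, n_head, n_embd = 12, 8, 1024
vocab_size = 50304
mlp_expansion = 4
batch_size, seq_len, max_iters = 80, 1024, 16000
warmdown_frac = 0.2

# Per-head width of the attention layers.
head_dim = n_embd // n_head

# Token budget. Assumption: batch_size is sequences per optimizer step.
tokens_per_step = batch_size * seq_len
tokens_seen = tokens_per_step * max_iters

# Assumption: the warmdown is the final warmdown_frac of training.
warmdown_iters = round(max_iters * warmdown_frac)
warmdown_start = max_iters - warmdown_iters

# Rough parameter estimate under the assumptions listed in the lead-in.
attn_params = 4 * n_embd * n_embd                 # Q, K, V and output projections
mlp_params = 2 * mlp_expansion * n_embd * n_embd  # up- and down-projection
norm_params = 2 * n_embd                          # two affine RMSNorm layers per block
block_params = attn_params + mlp_params + norm_params
embed_params = vocab_size * n_embd                # assumed tied with the output head
total_params = n_layer * block_params + embed_params + n_embd  # + final norm

print(f"head_dim        = {head_dim}")
print(f"tokens per step = {tokens_per_step:,}")
print(f"tokens seen     = {tokens_seen:,}")
print(f"warmdown starts = iteration {warmdown_start}")
print(f"approx. params  = {total_params:,}")
```

Under these assumptions the token count comes to 1,310,720,000, which matches the `~1.3B` entry in the Training table. The parameter estimate changes if the embeddings are untied or if learned positional embeddings are used.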