---
tags:
- pytorch
- language-model
- gpt
license: mit
---

# vanilla-10b

Vanilla GPT baseline trained to compare against [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).

## Architecture

| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |

## Training

| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |

## Purpose

Interpretability comparison baseline. Trained with identical hyperparameters to `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`. Enables direct comparison of residual stream representations.
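The figures in the tables above can be sanity-checked with a short sketch. The token count follows directly from `batch_size * seq_len * max_iters`; the parameter estimate assumes a standard GPT block (fused QKV plus output projection in attention, 4x GELU MLP, `bias=False`), ignores the small RMSNorm affine parameters, and leaves head weight tying unspecified, since none of those details are stated beyond the table.

```python
# Sanity-check sketch for the config tables above.
# Assumptions (not confirmed by the repo): fused QKV + out projection,
# 4x MLP, no biases, norm params ignored, single embedding matrix.

n_layer, n_head, n_embd = 12, 8, 1024
vocab_size = 50304

# Tokens seen = batch_size * seq_len * max_iters
tokens_seen = 80 * 1024 * 16000
print(f"tokens seen: {tokens_seen:,}")  # 1,310,720,000 ~= 1.3B, matching the table

# Rough weight count
attn = 4 * n_embd * n_embd        # QKV (3x) + output projection (1x)
mlp = 2 * (4 * n_embd) * n_embd   # up- and down-projection at 4x expansion
per_layer = attn + mlp
embed = vocab_size * n_embd
total = n_layer * per_layer + embed
print(f"~{total / 1e6:.0f}M parameters under these assumptions")
```

Note that under these assumptions the model is on the order of 200M parameters; the "10b" in the name refers to the FineWeb-Edu-10B dataset, not the parameter count.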