---
tags:
  - pytorch
  - language-model
  - gpt
license: mit
---

# vanilla-10b

A vanilla GPT baseline trained as a comparison point for aemack-org/cayley-10b.

## Architecture

| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
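For reference, an approximate parameter count can be derived from the configuration above. This is a sketch, not the training code: it assumes tied input/output embeddings and no learned positional embeddings, neither of which is stated in this card.

```python
def approx_param_count(n_layer=12, n_embd=1024, vocab_size=50304):
    """Rough parameter count for a bias-free GPT with tied embeddings."""
    emb = vocab_size * n_embd            # token embedding (tied with lm_head)
    attn = 4 * n_embd * n_embd           # Q, K, V, and output projections
    mlp = 2 * (4 * n_embd) * n_embd      # 4x-expansion GELU MLP (up + down)
    norms = 2 * n_embd                   # two affine RMSNorms per block
    return emb + n_layer * (attn + mlp + norms) + n_embd  # + final norm

print(f"~{approx_param_count() / 1e6:.0f}M parameters")
```

Note that the "10b" in the model name refers to the FineWeb-Edu-10B dataset, not the parameter count; at n_layer=12 and n_embd=1024 the model is in the ~200M-parameter range.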

## Training

| Parameter | Value |
|---|---|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
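The token budget follows directly from batch size, sequence length, and iteration count, and the `linear_warmdown` schedule can be sketched from the `warmdown_frac` parameter. The schedule shape below (constant, then linear decay to zero over the final 20% of steps) is an assumption; any warmup phase is not described in this card.

```python
def tokens_seen(batch_size=80, seq_len=1024, max_iters=16000):
    """Total training tokens = tokens per step * number of steps."""
    return batch_size * seq_len * max_iters

def linear_warmdown_lr(step, base_lr=0.006, max_iters=16000, warmdown_frac=0.2):
    """Constant LR, then linear decay to 0 over the final warmdown_frac of training."""
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # step 12800 here
    if step < warmdown_start:
        return base_lr
    return base_lr * (max_iters - step) / (max_iters - warmdown_start)

print(f"{tokens_seen() / 1e9:.2f}B tokens")  # 1.31B, matching the table
```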

## Purpose

Interpretability comparison baseline. Trained with hyperparameters identical to cayley-10b but without the CayleySAE bottleneck at mlp_in, enabling direct comparison of residual-stream representations.