---
tags:
- pytorch
- language-model
- gpt
license: mit
---

# vanilla-10b

Vanilla GPT baseline trained to compare against [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).

## Architecture

| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |

## Training

| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |

## Purpose

Interpretability comparison baseline. Trained with identical hyperparameters to `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`. Enables direct comparison of residual stream representations.
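The figures in the tables above can be sanity-checked with a short sketch. The token count follows directly from `batch_size * seq_len * max_iters`; the parameter estimate assumes a standard GPT block (fused QKV plus output projection in attention, 4x GELU MLP, `bias=False`), ignores the small RMSNorm affine parameters, and leaves head weight tying unspecified, since none of those details are stated beyond the table.

```python
# Sanity-check sketch for the config tables above.
# Assumptions (not confirmed by the repo): fused QKV + out projection,
# 4x MLP, no biases, norm params ignored, single embedding matrix.

n_layer, n_head, n_embd = 12, 8, 1024
vocab_size = 50304

# Tokens seen = batch_size * seq_len * max_iters
tokens_seen = 80 * 1024 * 16000
print(f"tokens seen: {tokens_seen:,}")  # 1,310,720,000 ~= 1.3B, matching the table

# Rough weight count
attn = 4 * n_embd * n_embd        # QKV (3x) + output projection (1x)
mlp = 2 * (4 * n_embd) * n_embd   # up- and down-projection at 4x expansion
per_layer = attn + mlp
embed = vocab_size * n_embd
total = n_layer * per_layer + embed
print(f"~{total / 1e6:.0f}M parameters under these assumptions")
```

Note that under these assumptions the model is on the order of 200M parameters; the "10b" in the name refers to the FineWeb-Edu-10B dataset, not the parameter count.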