---
tags:
- pytorch
- language-model
- gpt
license: mit
---
# vanilla-10b

A vanilla GPT baseline trained as the control for [aemack-org/cayley-10b](https://huggingface.co/aemack-org/cayley-10b).
## Architecture

| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
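The table above maps onto a nanoGPT-style config. A minimal sketch follows; the `GPTConfig` name and field names are assumptions about the training code, not taken from this repo. Note that `vocab_size` (50304) is the 50,257-token GPT-2 vocabulary padded up to a multiple of 64:

```python
import math
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values copied from the Architecture table; field names follow the
    # common nanoGPT convention and are an assumption, not the repo's code.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304   # GPT-2's 50,257 tokens padded to a multiple of 64
    bias: bool = False        # no bias in linear layers

# 50304 is the smallest multiple of 64 that holds the 50,257-token
# GPT-2 (tiktoken) vocabulary, keeping matmul shapes GPU-friendly.
padded_vocab = 64 * math.ceil(50257 / 64)
assert padded_vocab == GPTConfig().vocab_size
```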
## Training

| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
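Under these settings the token budget works out exactly to the ~1.3B in the table, and the schedule can be sketched as below. The `linear_warmdown` semantics here — hold the base LR, then decay linearly to zero over the final `warmdown_frac` of training — are an assumption inferred from the name, not confirmed by the training code:

```python
# Token budget: batch_size x seq_len x max_iters
tokens = 80 * 1024 * 16000   # = 1,310,720,000 ~ 1.3B, matching the table

def linear_warmdown(it, max_iters=16000, base_lr=0.006, warmdown_frac=0.2):
    """Assumed schedule: constant base_lr for the first (1 - warmdown_frac)
    of training, then a linear decay to 0 over the final warmdown_frac."""
    warmdown_start = max_iters * (1 - warmdown_frac)  # iteration 12800 here
    if it < warmdown_start:
        return base_lr
    return base_lr * (max_iters - it) / (max_iters - warmdown_start)
```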
## Purpose

An interpretability comparison baseline: trained with hyperparameters identical to `cayley-10b` but without the CayleySAE bottleneck at `mlp_in`, enabling direct comparison of residual stream representations between the two models.