---
tags:
  - pytorch
  - language-model
  - gpt
license: mit
---

# vanilla-10b

A vanilla GPT baseline trained as a comparison point for aemack-org/cayley-10b.

## Architecture

| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
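For reference, an approximate parameter count can be derived from the configuration above. This is a sketch, not the training code: it assumes tied input/output embeddings and no learned positional embeddings, neither of which is stated in this card.

```python
def approx_param_count(n_layer=12, n_embd=1024, vocab_size=50304):
    """Rough parameter count for a bias-free GPT with tied embeddings."""
    emb = vocab_size * n_embd            # token embedding (tied with lm_head)
    attn = 4 * n_embd * n_embd           # Q, K, V, and output projections
    mlp = 2 * (4 * n_embd) * n_embd      # 4x-expansion GELU MLP (up + down)
    norms = 2 * n_embd                   # two affine RMSNorms per block
    return emb + n_layer * (attn + mlp + norms) + n_embd  # + final norm

print(f"~{approx_param_count() / 1e6:.0f}M parameters")
```

Note that the "10b" in the model name refers to the FineWeb-Edu-10B dataset, not the parameter count; at n_layer=12 and n_embd=1024 the model is in the ~200M-parameter range.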

## Training

| Parameter | Value |
|---|---|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
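The token budget follows directly from batch size, sequence length, and iteration count, and the `linear_warmdown` schedule can be sketched from the `warmdown_frac` parameter. The schedule shape below (constant, then linear decay to zero over the final 20% of steps) is an assumption; any warmup phase is not described in this card.

```python
def tokens_seen(batch_size=80, seq_len=1024, max_iters=16000):
    """Total training tokens = tokens per step * number of steps."""
    return batch_size * seq_len * max_iters

def linear_warmdown_lr(step, base_lr=0.006, max_iters=16000, warmdown_frac=0.2):
    """Constant LR, then linear decay to 0 over the final warmdown_frac of training."""
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # step 12800 here
    if step < warmdown_start:
        return base_lr
    return base_lr * (max_iters - step) / (max_iters - warmdown_start)

print(f"{tokens_seen() / 1e9:.2f}B tokens")  # 1.31B, matching the table
```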

## Purpose

Interpretability comparison baseline. Trained with hyperparameters identical to cayley-10b but without the CayleySAE bottleneck at mlp_in, enabling direct comparison of residual-stream representations.