# vanilla-v5-parity
A vanilla GPT baseline trained for direct comparison against aemack-org/cayley-10b.
## Architecture
| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
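The table above corresponds to a nanoGPT-style model config. A minimal sketch, assuming nanoGPT-style field names (not taken from this repo's `config.json`):

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values from the Architecture table; field names are assumptions.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024       # per-head dim = 1024 / 8 = 128
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab padded up for efficiency
    bias: bool = False       # no bias terms in linear layers


cfg = GPTConfig()
```

With these values each attention head is 128-dimensional and the 4x-expansion MLP hidden size is 4 * 1024 = 4096.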
## Training
| Parameter | Value |
|---|---|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.4) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~2.79B (early-stopped; see below) |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 3.1728 |
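The `linear_warmdown` schedule holds the learning rate constant and then decays it linearly to zero over the final `warmdown_frac` of training. A sketch of that shape, using the values from the table; the function name and exact formula are assumptions, not taken from the training code:

```python
def linear_warmdown_lr(it, max_iters=16000, base_lr=0.006, warmdown_frac=0.4):
    """Hold base_lr, then ramp linearly to 0 over the last warmdown_frac of steps.

    Shape inferred from the config names; check the repo for the exact schedule.
    """
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # iter 9600 here
    if it < warmdown_start:
        return base_lr
    # Linear ramp: base_lr at warmdown_start, 0 at max_iters.
    return base_lr * (max_iters - it) / (max_iters - warmdown_start)
```

Note that with early stopping at iter 4250, training never left the constant-LR phase of this schedule.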
## Purpose
Interpretability comparison baseline. Trained with identical hyperparameters to
cayley-10b but without the CayleySAE bottleneck at mlp_in. Enables direct
comparison of residual stream representations.
## Early stopping (cayley val parity)
Training stopped at the first evaluation where validation CE was at or below 3.173 (published validation loss for aemack-org/cayley-10b). This checkpoint is from iter 4250 (~2.79B tokens seen), not the full max_iters budget in config.json.
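The stopping rule amounts to a threshold check on the evaluation history. A sketch, with the target value taken from the card and the helper name purely illustrative:

```python
CAYLEY_VAL_LOSS = 3.173  # published validation loss for aemack-org/cayley-10b


def first_parity_iter(eval_history):
    """Return the first eval iteration whose val CE is at or below the target.

    eval_history maps iteration -> validation cross-entropy. Returns None if
    parity was never reached. Illustrative helper, not from the training code.
    """
    for it, val_ce in sorted(eval_history.items()):
        if val_ce <= CAYLEY_VAL_LOSS:
            return it
    return None
```

For example, a history like `{4000: 3.201, 4250: 3.1728, 4500: 3.150}` stops at iter 4250, matching the checkpoint described above.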