# vanilla-v5-parity

A vanilla GPT baseline trained for comparison against aemack-org/cayley-10b.

## Architecture

| Parameter | Value |
|-----------|-------|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
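The shape arithmetic implied by the table above can be sketched as a small config object. This is a minimal sketch, assuming nanoGPT-style field names (`GPTConfig`, `head_dim`, `mlp_hidden` are illustrative, not the repo's actual identifiers):

```python
from dataclasses import dataclass

# Hypothetical config mirroring the architecture table; field names are
# assumptions in the style of nanoGPT-family codebases.
@dataclass
class GPTConfig:
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304
    bias: bool = False

    @property
    def head_dim(self) -> int:
        # per-head dimension: 1024 / 8 = 128
        return self.n_embd // self.n_head

    @property
    def mlp_hidden(self) -> int:
        # GELU MLP with 4x expansion: 4 * 1024 = 4096
        return 4 * self.n_embd

cfg = GPTConfig()
```

With these values, each attention head is 128-dimensional and the MLP hidden width is 4096.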

## Training

| Parameter | Value |
|-----------|-------|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.4) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~2.79B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 3.1728 |
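The `linear_warmdown` schedule holds the learning rate constant and then decays it linearly over the final fraction of training given by `warmdown_frac`. A minimal sketch, assuming decay to zero at `max_iters` (the card only specifies the schedule name and `warmdown_frac=0.4`, so the exact floor is an assumption):

```python
def linear_warmdown_lr(step: int,
                       max_steps: int = 16000,
                       base_lr: float = 0.006,
                       warmdown_frac: float = 0.4) -> float:
    """Hold base_lr, then decay linearly toward 0 over the last
    warmdown_frac of training. Assumed final LR of 0; the model card
    does not state a floor."""
    warmdown_start = int(max_steps * (1 - warmdown_frac))  # step 9600 here
    if step < warmdown_start:
        return base_lr
    # fraction of the warmdown window still remaining
    remaining = (max_steps - step) / (max_steps - warmdown_start)
    return base_lr * remaining
```

For example, the LR stays at 0.006 until step 9600, is halved (0.003) at step 12800, and reaches 0 at step 16000.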

## Purpose

Interpretability comparison baseline. Trained with identical hyperparameters to cayley-10b but without the CayleySAE bottleneck at mlp_in, enabling direct comparison of residual stream representations.

## Early stopping (cayley val parity)

Training was stopped at the first evaluation where validation cross-entropy reached 3.173 or below (the published validation loss for aemack-org/cayley-10b). This checkpoint is therefore from iter 4250 (~2.79B tokens seen), not the full max_iters budget listed in config.json.
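The parity rule above amounts to a simple check inside the eval loop. A minimal sketch (the `(iter, val_loss)` pairs and helper names are illustrative, not from the training code):

```python
TARGET_VAL_LOSS = 3.173  # published val loss for aemack-org/cayley-10b

def should_stop(val_loss: float, target: float = TARGET_VAL_LOSS) -> bool:
    """Stop at the first evaluation whose validation CE is at or below target."""
    return val_loss <= target

def first_parity_iter(eval_history):
    """Return the first iter whose val loss hits the target, else None.

    eval_history: iterable of (iter, val_loss) pairs in training order.
    """
    for it, val_loss in eval_history:
        if should_stop(val_loss):
            return it
    return None
```

Under this rule, an evaluation logging 3.1728 at iter 4250 triggers the stop, since 3.1728 <= 3.173.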
