# vanilla-10b

A vanilla GPT baseline trained for comparison against aemack-org/cayley-10b.

## Architecture

| Parameter | Value |
|---|---|
| n_layer | 12 |
| n_head | 8 |
| n_embd | 1024 |
| block_size | 1024 |
| vocab_size | 50304 |
| bias | False |
| norm | RMSNorm (affine) |
| MLP | GELU, 4x expansion |
| tokenizer | GPT-2 (tiktoken) |
| dtype | bfloat16 |
| sparsity | none (vanilla) |
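The table above can be expressed as a small config object. This is a minimal sketch using nanoGPT-style field names; the `GPTConfig` class and the derived quantities below are illustrative, not code from the repo:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values mirror the architecture table above.
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 1024
    block_size: int = 1024
    vocab_size: int = 50304
    bias: bool = False

cfg = GPTConfig()

# Quantities implied by the table (rough, non-embedding param count):
head_dim = cfg.n_embd // cfg.n_head           # 1024 / 8 = 128
params_per_block = 12 * cfg.n_embd ** 2       # attn ~4*d^2 + 4x-MLP ~8*d^2
total_block_params = cfg.n_layer * params_per_block
print(head_dim, total_block_params)  # 128 150994944 (~151M)
```

The "10b" in the model name refers to the training dataset (FineWeb-Edu-10B), not the parameter count; the backbone is on the order of 150M non-embedding parameters.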

## Training

| Parameter | Value |
|---|---|
| optimizer | Muon (hidden 2D) + AdamW (embeddings) |
| muon_lr | 0.006 |
| adamw_lr | 0.006 |
| lr_schedule | linear_warmdown (warmdown_frac=0.2) |
| batch_size | 80 |
| seq_len | 1024 |
| max_iters | 16000 |
| tokens seen | ~1.3B |
| dataset | FineWeb-Edu-10B |
| best_val_loss | 6.2834 |
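The token count follows directly from batch_size × seq_len × max_iters, and the schedule can be sketched as below. The exact shape of `linear_warmdown` is an assumption (constant LR, then linear decay to zero over the final warmdown_frac of steps, as in common warmdown schedules), not verified against the repo:

```python
def linear_warmdown_lr(it: int, base_lr: float = 0.006,
                       max_iters: int = 16000,
                       warmdown_frac: float = 0.2) -> float:
    """Assumed schedule: hold base_lr, then decay linearly to 0
    over the final warmdown_frac of training."""
    warmdown_start = int(max_iters * (1 - warmdown_frac))  # 12800
    if it < warmdown_start:
        return base_lr
    return base_lr * (max_iters - it) / (max_iters - warmdown_start)

# Tokens seen = batch_size * seq_len * max_iters
tokens = 80 * 1024 * 16000
print(tokens)  # 1310720000, i.e. ~1.3B as in the table
print(linear_warmdown_lr(0), linear_warmdown_lr(16000))  # 0.006 0.0
```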

## Purpose

Interpretability comparison baseline. This model was trained with hyperparameters identical to cayley-10b but without the CayleySAE bottleneck at mlp_in, enabling direct comparison of residual stream representations between the two models.
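Comparing residual streams typically means capturing activations at matching sites in both models with forward hooks. A toy PyTorch sketch of the hook mechanics (the toy module below stands in for the real checkpoints, which this card does not show how to load):

```python
import torch
import torch.nn as nn

def capture_output(model: nn.Module, layer: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run model(x) and record the output of `layer` via a forward hook."""
    captured = {}
    def hook(module, inputs, output):
        captured["act"] = output.detach()
    handle = layer.register_forward_hook(hook)
    model(x)
    handle.remove()  # always detach hooks after use
    return captured["act"]

# Toy stand-in for a transformer block; the real comparison would hook the
# residual stream at mlp_in in both vanilla-10b and cayley-10b.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
x = torch.randn(2, 8)
act = capture_output(toy, toy[0], x)
print(tuple(act.shape))  # (2, 8)
```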
