nanochat-d12-step87k

Architecture: 12 layers, 768 dim, 6 heads, RoPE, GQA, ReLU² MLP
Context: 2048 tokens, full attention (window_pattern=L)
Training: 87,000 steps, ~5.7B tokens, Chinchilla-optimal (ratio=12)
Val bpb: 0.8658
GPU: RTX 4070 12GB, bf16, 28.4 hours
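
A few quantities follow directly from the figures above. The sketch below is only arithmetic on the numbers stated in this card. The reading of the per-step token count as 32 sequences of 2048 tokens is an inference from that arithmetic, not something the card states, and the number of key/value heads used by GQA is not given, so it is left out.

```python
# Figures copied from the model card; total_tokens is approximate ("~5.7B").
n_embd = 768
n_head = 6
context_len = 2048
steps = 87_000
total_tokens = 5.7e9
train_hours = 28.4

# Per-head width of the attention projections.
head_dim = n_embd // n_head

# Average tokens consumed per optimizer step, and how many
# full-context sequences that corresponds to.
tokens_per_step = total_tokens / steps
seqs_per_step = tokens_per_step / context_len

# Average training throughput over the whole run.
tokens_per_sec = total_tokens / (train_hours * 3600)

print(f"head_dim        = {head_dim}")
print(f"tokens per step = {tokens_per_step:,.0f}")
print(f"sequences/step  = {seqs_per_step:.2f}")
print(f"tokens per sec  = {tokens_per_sec:,.0f}")
```

The per-step figure comes out close to 65,536 tokens, which would match 32 sequences at the full 2048-token context, and the average throughput works out to roughly 56K tokens per second on the single GPU.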

A 286M-parameter GPT model trained with the nanochat framework.
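
As a rough sanity check on the training budget, the token count can be divided by the parameter count. This is a minimal sketch using only the two figures stated in this card. How the result relates to the `ratio=12` setting depends on which parameters the framework counts toward that ratio, which the card does not say.

```python
# Figures copied from the model card; both are rounded values.
total_params = 286e6   # "286M parameter"
total_tokens = 5.7e9   # "~5.7B tokens"

# Training tokens seen per model parameter, counting all parameters.
tokens_per_param = total_tokens / total_params

print(f"tokens per parameter = {tokens_per_param:.1f}")
```

Counting all 286M parameters, this gives a little under 20 tokens per parameter.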
