MathNano — from-scratch math LM (nanochat depth-16, ~200M)

A ~200M-parameter transformer pretrained from random initialization on mathematical text using Karpathy's nanochat. It is the deep-learning artifact of the MathNano project and was trained on a single RTX 4090.

Details

Architecture: nanochat with depth=16, i.e. 16 layers, d_model=1024, 8 heads (MHA), vocab 32,768, seq_len 2048; relu² MLP, untied embeddings, QK-norm, RoPE, value embeddings (as in modded-nanoGPT).
Data: ~2.5B tokens, sampled in a balanced way across the MathPile sources (arXiv, textbooks, ProofWiki, StackExchange, Wikipedia, Common Crawl).
Compute: ~11.8 h of training at 72% MFU (model FLOPs utilization), run with --window-pattern=L and the SDPA fallback on Ada.
Result: minimum validation bits-per-byte ≈ 0.731.
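
As a rough check on how the architecture numbers above relate to the ~200M figure, here is a minimal parameter-count sketch. It assumes a 4x MLP expansion and bias-free linear layers (the usual nanochat setup, but not stated in this card), and it ignores value embeddings and any norm parameters. The function name estimate_params is made up for illustration.

```python
def estimate_params(n_layer=16, d_model=1024, vocab_size=32768, mlp_ratio=4):
    """Back-of-envelope weight count; mlp_ratio=4 and no biases are assumptions."""
    # Attention: Q, K, V and output projections, each d_model x d_model (MHA).
    attn = 4 * d_model * d_model
    # MLP: up-projection to mlp_ratio * d_model, then down-projection back.
    mlp = 2 * mlp_ratio * d_model * d_model
    blocks = n_layer * (attn + mlp)
    # Untied embeddings: separate input embedding and output head matrices.
    embeddings = 2 * vocab_size * d_model
    return {"blocks": blocks, "embeddings": embeddings, "total": blocks + embeddings}


counts = estimate_params()
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.1f}M")
# blocks: 201.3M
# embeddings: 67.1M
# total: 268.4M
```

Under these assumptions the transformer blocks alone come to about 201M weights, which is consistent with the ~200M label if the embedding tables are not counted. That reading is an inference from the arithmetic, not something the card states.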

What it can and can't do

Generates coherent mathematical/English text with real factual recall (e.g. "the chemical symbol of gold is Au", "if yesterday was Friday, then tomorrow will be Saturday"). It does not reliably solve problems or follow instructions — that requires SFT/RL (see the project's Track B). This is a base model for study and further fine-tuning, not an instruct model.

Files

model_005376.pt — final model weights (nanochat checkpoint format).
meta_005376.json — config/metadata.
tokenizer/ — the trained nanochat (RustBPE) tokenizer.

Loading requires the nanochat codebase. See the project repo for details: https://github.com/adimunot21/mathnano
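
Before setting up nanochat, the metadata file can be inspected with the standard library alone. The sketch below is only an illustration: the local directory name and the helper names checkpoint_files and load_meta are hypothetical, and since the keys inside the JSON are not documented here, it simply prints whatever it finds (assuming the top level is a JSON object). Loading the weights themselves still goes through the nanochat codebase.

```python
import json
from pathlib import Path


def checkpoint_files(directory, step):
    """Return the expected weight and metadata paths for a training step."""
    base = Path(directory)
    tag = f"{step:06d}"  # the listed file names zero-pad the step to six digits
    return base / f"model_{tag}.pt", base / f"meta_{tag}.json"


def load_meta(meta_path):
    """Parse the metadata JSON without assuming which keys it contains."""
    with open(meta_path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # "mathnano-pretrain-d16" is a hypothetical local download directory.
    model_path, meta_path = checkpoint_files("mathnano-pretrain-d16", 5376)
    print(model_path.name, meta_path.name)
    if meta_path.exists():
        for key, value in load_meta(meta_path).items():
            print(f"{key}: {value}")
```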

License

Trained on MathPile (CC BY-NC-SA 4.0); this model inherits the non-commercial restriction.


