MathNano: from-scratch math LM (nanochat depth-16, ~200M)

A ~200M-parameter transformer pretrained from random initialization on mathematical text, using Karpathy's nanochat. It is the deep-learning artifact of the MathNano project and was trained on a single RTX 4090.

Details

  • Architecture: nanochat depth=16, i.e. 16 layers, d_model=1024, 8 heads (MHA), vocab 32,768, seq_len 2048; relu² MLP, untied embeddings, QK-norm, RoPE, value embeddings (modded-nanoGPT).
  • Data: ~2.5B tokens, sampled in balanced proportions across the MathPile sources (arXiv, textbooks, proofwiki, stackexchange, wikipedia, commoncrawl).
  • Compute: ~11.8 h, 72% MFU (--window-pattern=L, SDPA fallback on Ada).
  • Result: minimum validation bits-per-byte ≈ 0.731.
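
Two of the figures above can be sanity-checked with a few lines of arithmetic. The sketch below estimates the weight count from the listed configuration and shows the usual conversion from summed cross-entropy to bits per byte. It assumes a 4x MLP expansion and no bias terms, and it leaves out the value embeddings and any norm parameters, so treat the totals as rough estimates and not as the exact size of the checkpoint. The function names are illustrative and do not come from nanochat.

```python
import math

# Values as listed in the model card.
N_LAYER = 16
D_MODEL = 1024
VOCAB = 32_768

# Assumption (not stated in the card): the MLP widens by 4x and has no biases.
MLP_EXPANSION = 4


def block_params(d_model: int, expansion: int = MLP_EXPANSION) -> int:
    """Weights in one block: Q, K, V, output projections plus a two-matrix MLP."""
    attention = 4 * d_model * d_model
    mlp = 2 * expansion * d_model * d_model
    return attention + mlp


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert cross-entropy summed over a text (in nats) to bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


blocks = N_LAYER * block_params(D_MODEL)
embeddings = 2 * VOCAB * D_MODEL  # untied: input table plus output head

print(f"transformer blocks:      {blocks / 1e6:.1f}M")
print(f"token embeddings + head: {embeddings / 1e6:.1f}M")
```

Under these assumptions the blocks alone come to about 201M weights, with roughly 67M more in the untied token embedding and output head. Whether the ~200M headline figure counts the embeddings is not stated here. Bits per byte normalizes by the byte length of the text instead of the token count, which makes the 0.731 figure comparable across tokenizers.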

What it can and can't do

Generates coherent mathematical/English text with real factual recall (e.g. "the chemical symbol of gold is Au", "if yesterday was Friday, then tomorrow will be Saturday"). It does not reliably solve problems or follow instructions; that requires SFT/RL (see the project's Track B). This is a base model for study and further fine-tuning, not an instruct model.
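
Because this is a base model, it continues text instead of obeying requests, so a prompt is usually better written as the start of a document than as an instruction. A minimal sketch of a few-shot completion prompt follows; the helper name and the "Problem:/Solution:" layout are illustrative assumptions, not a format the model was trained on.

```python
def completion_prompt(examples, query):
    """Format worked examples plus one unfinished item for the model to continue."""
    parts = [f"Problem: {q}\nSolution: {a}\n" for q, a in examples]
    # Leave the final solution open so the continuation fills it in.
    parts.append(f"Problem: {query}\nSolution:")
    return "\n".join(parts)


prompt = completion_prompt(
    [("What is 2 + 3?", "5"), ("What is 7 - 4?", "3")],
    "What is 6 + 1?",
)
print(prompt)
```

This only shapes the input as a pattern to extend. It does not make the answers reliable.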

Files

  • model_005376.pt: final model weights (nanochat checkpoint format).
  • meta_005376.json: config/metadata.
  • tokenizer/: the trained nanochat (RustBPE) tokenizer.

Loading requires the nanochat codebase. See the project repo for details: https://github.com/adimunot21/mathnano
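
Before setting up nanochat, the downloaded files can be inspected with plain Python and PyTorch. The sketch below reads the metadata JSON and lists tensor shapes from the weights file. It assumes the .pt file is an ordinary dictionary mapping parameter names to tensors, which should be checked against the project repo. The helper names are hypothetical, and building a runnable model still requires the nanochat code.

```python
import json
from pathlib import Path


def load_meta(ckpt_dir, name="meta_005376.json"):
    """Read the config/metadata JSON that sits next to the weights."""
    with open(Path(ckpt_dir) / name, encoding="utf-8") as f:
        return json.load(f)


def tensor_shapes(ckpt_dir, name="model_005376.pt"):
    """Map each entry in the weights file to its shape, without building a model.

    Assumption: the file is a plain dict of parameter name -> tensor.
    """
    import torch  # imported lazily so load_meta works without PyTorch installed

    state = torch.load(Path(ckpt_dir) / name, map_location="cpu", weights_only=True)
    return {k: tuple(v.shape) for k, v in state.items() if hasattr(v, "shape")}
```

Calling load_meta on the download directory and printing the result is a quick way to see which configuration fields the checkpoint records. The field names are not documented in this card, so inspect them instead of assuming specific keys.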

License

Trained on MathPile (CC BY-NC-SA 4.0); this model inherits the non-commercial restriction.


Dataset used to train adimunot/mathnano-pretrain-d16