---
license: mit
library_name: transformers
tags:
- nanochat
- causal-lm
- long-context
- rope
datasets:
- nvidia/ClimbMix
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- openai/gsm8k
- allenai/tulu-v2-sft-long-mixture
pipeline_tag: text-generation
---

# nanochat miniseries

This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase. The series varies three axes:

- **depth** (model size)
- **tokens-per-parameter** (pretraining horizon)
- **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it is dropped for the remainder; this axis is used to study the role of positional encoding in long-context generalization)

A subset of the SFT models is additionally fine-tuned on a long-context mixture (`_long` variants).

All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters of the pretraining corpus.

## Training pipeline

Each model goes through the following stages:

1. **Tokenizer training**: 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
2. **Pretraining (base)**: Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at [`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle). The horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon optimizer.
3. **Supervised fine-tuning (SFT)**: Instruction tuning on a mixture of:
   - [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk): 460K general conversations
   - Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)): 1K rows × 2 epochs
   - [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train`: 100K rows × 3 epochs (multiple choice)
   - [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main`: 8K rows × 4 epochs (math + tool use)
   - SimpleSpelling: 200K synthetic spelling examples
   - SpellingBee: 80K synthetic letter-counting examples
4. **Long-context SFT (`_long` variants only)**: Same mixture plus 100K rows of [`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture), with sequence length extended to 8,192.

## RoPE removal (drope) experiment

Model names containing `drope_XX` follow the recipe from [*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167): the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the model without positional encodings.

For example, `drope_50` means 50% of the token budget was spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve the optimization benefits of RoPE early in training while producing a NoPE-style model that generalizes better to long contexts at inference time.

Models without `drope` in the name keep RoPE in every layer for the full pretraining budget (theta = 100,000).
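
The budget split described above can be turned into optimizer-step counts. The following is a minimal sketch, not the nanochat implementation: it assumes the switch happens at a whole optimizer step, uses the 1,048,576-token batch size from the pretraining stage, and the names `drope_schedule` and `rope_enabled` are hypothetical helpers introduced only for illustration. How the actual training script rounds the switch point is not stated in this card.

```python
import math

# Tokens consumed per optimizer step, as stated in the pretraining stage.
BATCH_TOKENS = 1_048_576


def drope_schedule(total_tokens: float, rope_percent: int,
                   batch_tokens: int = BATCH_TOKENS) -> dict:
    """Split a pretraining token budget into RoPE-on and RoPE-off steps.

    Hypothetical helper: the rounding rule is an assumption, not taken
    from the nanochat codebase.
    """
    if not 0 <= rope_percent <= 100:
        raise ValueError("rope_percent must be between 0 and 100")
    total_steps = math.ceil(total_tokens / batch_tokens)
    rope_steps = round(total_steps * rope_percent / 100)
    return {
        "total_steps": total_steps,
        "rope_steps": rope_steps,                 # steps trained with RoPE
        "nope_steps": total_steps - rope_steps,   # steps trained after removal
    }


def rope_enabled(step: int, schedule: dict) -> bool:
    """Return True while the given 0-indexed step is in the RoPE-on phase."""
    return step < schedule["rope_steps"]


if __name__ == "__main__":
    # Illustrative budget of exactly 1000 steps with a drope_25 schedule.
    schedule = drope_schedule(1000 * BATCH_TOKENS, 25)
    print(schedule)
    print(rope_enabled(249, schedule), rope_enabled(250, schedule))
```

With this convention, a `drope_25` run keeps RoPE for the first quarter of its steps and trains the remaining three quarters without it; the total step count is unchanged relative to the non-drope run.
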
## Model sizes

| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
|-------|--------|--------|-------|--------------|---------------|
| d18   | 18     | 1152   | 9     | 3072         | ~360M         |
| d20   | 20     | 1280   | 10    | 3456         | ~480M         |

All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.

## Released checkpoints

RoPE schedule column: `none (always on)` means RoPE is never removed and stays on for the full pretraining budget. A percentage (e.g. `50% then removed`) means RoPE is kept on for the first `XX%` of the token budget and then removed for the remaining `(100 − XX)%` of pretraining, per the drope recipe above.

| Model tag                 | Depth | tpp | RoPE schedule    | Long-ctx SFT |
|---------------------------|-------|-----|------------------|--------------|
| d18_9tpp                  | 18    | 9   | none (always on) | no           |
| d18_9tpp_drope_25         | 18    | 9   | 25% then removed | no           |
| d18_9tpp_drope_50         | 18    | 9   | 50% then removed | no           |
| d18_9tpp_drope_75         | 18    | 9   | 75% then removed | no           |
| d18_20tpp                 | 18    | 20  | none (always on) | no           |
| d18_20tpp_long            | 18    | 20  | none (always on) | yes          |
| d18_20tpp_drope_50        | 18    | 20  | 50% then removed | no           |
| d18_20tpp_drope_50_long   | 18    | 20  | 50% then removed | yes          |
| d20_9tpp                  | 20    | 9   | none (always on) | no           |
| d20_9tpp_drope_25         | 20    | 9   | 25% then removed | no           |
| d20_9tpp_drope_50         | 20    | 9   | 50% then removed | no           |
| d20_9tpp_drope_75         | 20    | 9   | 75% then removed | no           |
| d20_20tpp                 | 20    | 20  | none (always on) | no           |
| d20_20tpp_long            | 20    | 20  | none (always on) | yes          |
| d20_20tpp_drope_50        | 20    | 20  | 50% then removed | no           |
| d20_20tpp_drope_50_long   | 20    | 20  | 50% then removed | yes          |
| d20_40tpp                 | 20    | 40  | none (always on) | no           |
| d20_40tpp_long            | 20    | 40  | none (always on) | yes          |
| d20_40tpp_drope_50        | 20    | 40  | 50% then removed | no           |
| d20_40tpp_drope_50_long   | 20    | 40  | 50% then removed | yes          |

`tpp` = tokens-per-parameter pretraining horizon.
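
The tag naming scheme in the table above is regular enough to decode mechanically. The sketch below is one way to do that, assuming every tag follows the pattern `d<depth>_<tpp>tpp[_drope_<XX>][_long]` seen in the released checkpoints; `ModelTag` and `parse_model_tag` are hypothetical names introduced here and are not part of nanochat or `transformers`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the released tags, e.g. "d20_20tpp_drope_50_long".
TAG_PATTERN = re.compile(
    r"^d(?P<depth>\d+)_(?P<tpp>\d+)tpp"
    r"(?:_drope_(?P<drope>\d+))?"
    r"(?P<long>_long)?$"
)


@dataclass(frozen=True)
class ModelTag:
    depth: int                    # number of layers (18 or 20 in this series)
    tpp: int                      # tokens-per-parameter pretraining horizon
    rope_percent: Optional[int]   # None means RoPE stayed on throughout
    long_sft: bool                # True for the "_long" SFT variants


def parse_model_tag(tag: str) -> ModelTag:
    """Decode a model tag into its experimental axes (hypothetical helper)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized model tag: {tag!r}")
    drope = match.group("drope")
    return ModelTag(
        depth=int(match.group("depth")),
        tpp=int(match.group("tpp")),
        rope_percent=int(drope) if drope is not None else None,
        long_sft=match.group("long") is not None,
    )


if __name__ == "__main__":
    print(parse_model_tag("d18_9tpp"))
    print(parse_model_tag("d20_40tpp_drope_50_long"))
```

Representing "RoPE always on" as `None` rather than `100` keeps the two cases distinct, since a `drope_100` tag does not appear among the released checkpoints.
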
Total pretraining token budgets:

| Depth | tpp | Total pretraining tokens |
|-------|-----|--------------------------|
| d18   | 9   | ≈ 2.92 B                 |
| d18   | 20  | ≈ 6.49 B                 |
| d20   | 9   | ≈ 3.95 B                 |
| d20   | 20  | ≈ 8.77 B                 |
| d20   | 40  | ≈ 17.54 B                |

`drope` variants use the same total token budget as their non-drope counterpart; the budget is split between the RoPE-on and RoPE-removed phases as described above.

## Checkpoint format: which repo should I download?

For each model tag we publish **four** Hugging Face repositories:

| Repo suffix   | Stage            | Format | Use case |
|---------------|------------------|--------|----------|
| `...-base`    | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-sft`     | post-SFT         | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-hf-base` | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
| `...-hf-sft`  | post-SFT         | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |

- The `-base` and `-sft` repos hold the raw nanochat outputs (the **`base_checkpoints`** and **`chatsft_checkpoints`** artifacts). They include the optimizer state (`optim_*_rank0.pt`) and metadata (`meta_*.json` with training config, val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts exactly as produced by `scripts.base_train` and `scripts.chat_sft`.
- The `-hf-base` and `-hf-sft` repos hold the **`hf_base`** and **`hf_sft`** artifacts: conversions of those same weights into the Hugging Face `transformers` layout (architecture name `NanoChatForCausalLM`, `model_type` `nanochat`).
Load them with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
```

`use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are not applied at inference time.

Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue training inside the nanochat codebase.

## Inference sketch (HF format, SFT)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-sft"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# Load in bfloat16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

messages = [{"role": "user", "content": "Why is the sky blue?"}]
# Render the chat template, ending with the assistant turn prefix, as token IDs.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").cuda()
out = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```

Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.

## Training compute

All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock time ranges from ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on the variant.
## Citation / acknowledgements

- Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
- RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)

## License

MIT (inherits from the nanochat repository).