---
license: mit
library_name: transformers
tags:
  - nanochat
  - causal-lm
  - long-context
  - rope
datasets:
  - nvidia/ClimbMix
  - HuggingFaceTB/smol-smoltalk
  - cais/mmlu
  - openai/gsm8k
  - allenai/tulu-v2-sft-long-mixture
pipeline_tag: text-generation
---

# nanochat miniseries

This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase.
The series varies three axes: **depth** (model size), **tokens-per-parameter** (pretraining horizon),
and **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it
is dropped for the remainder). The RoPE axis is used to study how positional encoding affects
long-context generalization. A subset of the SFT models is additionally fine-tuned on a long-context
mixture (`_long` variants).

All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters
of the pretraining corpus.

## Training pipeline

Each model goes through the following stages:

1. **Tokenizer training** — 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
2. **Pretraining (base)** — Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at
   [`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
   Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens
   trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon
   optimizer.
3. **Supervised fine-tuning (SFT)** — Instruction tuning on a mixture of:
    - [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) — 460K general conversations
    - Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)) — 1K rows × 2 epochs
    - [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train` — 100K rows × 3 epochs (multiple choice)
    - [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main` — 8K rows × 4 epochs (math + tool use)
    - SimpleSpelling — 200K synthetic spelling examples
    - SpellingBee — 80K synthetic letter-counting examples
4. **Long-context SFT (`_long` variants only)** — Same mixture plus 100K rows of
   [`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture),
   with sequence length extended to 8,192.
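
The mixture sizes listed above can be tallied with a few lines of Python. This is a minimal sketch
using the rounded row counts and epoch multipliers from the list; `SFT_MIXTURE`, `LONG_CONTEXT_ROWS`,
and `effective_rows` are illustrative names that are not part of nanochat, sources with no stated
epoch count are assumed to be seen once, and the real dataset splits may differ slightly from these
rounded figures.

```python
# Rounded (rows, epochs) per source, copied from the list above.
# Assumption: sources with no stated epoch count are seen once.
SFT_MIXTURE = {
    "smol-smoltalk": (460_000, 1),
    "identity_conversations": (1_000, 2),
    "mmlu_auxiliary_train": (100_000, 3),
    "gsm8k_main": (8_000, 4),
    "simple_spelling": (200_000, 1),
    "spelling_bee": (80_000, 1),
}

# Extra rows used only by the `_long` variants.
LONG_CONTEXT_ROWS = 100_000


def effective_rows(mixture, extra_rows=0):
    """Total examples seen during SFT: sum of rows * epochs, plus any extra rows."""
    return sum(rows * epochs for rows, epochs in mixture.values()) + extra_rows


standard_total = effective_rows(SFT_MIXTURE)
long_total = effective_rows(SFT_MIXTURE, extra_rows=LONG_CONTEXT_ROWS)
print(f"standard SFT: ~{standard_total:,} examples")
print(f"long-context SFT: ~{long_total:,} examples")
```

With these rounded inputs, general conversation data makes up a little under half of the examples
seen, and the long-context rows add roughly 9% on top of the standard mixture.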

## RoPE removal (drope) experiment

Model names containing `drope_XX` follow the recipe from
[*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167):
the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then
removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the
model without positional encodings. For example, `drope_50` means 50% of the token budget was
spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
the optimization benefits of RoPE early in training while producing a NoPE-style model that
generalizes better to long contexts at inference time. Models without `drope` in the name keep
RoPE in every layer for the full pretraining budget (theta = 100,000).
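
To make "removing RoPE" concrete, the sketch below rotates a query or key vector by
position-dependent angles and falls back to the identity when rotary embeddings are disabled. It is
a plain-Python illustration, not the nanochat implementation: the function name `apply_rope`, its
`use_rope` argument, and the half-split pairing of dimensions are assumptions made for the example.
Only `theta = 100,000` and `head_dim = 128` come from this card.

```python
import math


def apply_rope(vec, position, theta=100_000.0, use_rope=True):
    """Rotate one head vector by its position; return a copy unchanged if RoPE is off.

    Assumed convention: dimension i is paired with dimension i + head_dim // 2.
    """
    if not use_rope:
        # drope / NoPE phase: no positional signal is injected into q or k.
        return list(vec)

    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        # Low pair indices rotate quickly with position, high ones slowly.
        angle = position * theta ** (-2.0 * i / len(vec))
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x1 * sin + x2 * cos
    return out


head_dim = 128
q = [1.0] * head_dim

with_rope = apply_rope(q, position=5)
without_rope = apply_rope(q, position=5, use_rope=False)

print(with_rope[:2])     # depends on the position
print(without_rope[:2])  # identical to the input, whatever the position
```

The rotation preserves the vector norm, so switching it off changes only the positional information
carried by queries and keys. The second pretraining phase is what lets the model adapt to that change.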

## Model sizes

| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
|-------|--------|--------|-------|--------------|---------------|
| d18   | 18     | 1152   | 9     | 3072         | ~360M         |
| d20   | 20     | 1280   | 10    | 3456         | ~480M         |

All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
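
The approximate parameter counts in the table can be reproduced from the listed dimensions. The
sketch below is a back-of-the-envelope estimate, not nanochat's own accounting. It assumes full
multi-head attention (four `hidden × hidden` projections per layer), a three-matrix SwiGLU MLP, no
bias terms, negligible norm parameters, and a token embedding that is not tied to the output head.
None of these assumptions is stated in this card; they are chosen because they match its figures.
`count_params` is a hypothetical helper.

```python
VOCAB_SIZE = 32_768


def count_params(layers, hidden, intermediate, vocab=VOCAB_SIZE):
    """Estimate parameter counts under the assumptions stated above."""
    attention = 4 * hidden * hidden       # q, k, v and output projections
    mlp = 3 * hidden * intermediate       # SwiGLU: gate, up and down projections
    blocks = layers * (attention + mlp)
    embedding = vocab * hidden            # one (vocab x hidden) matrix
    return {
        "total": blocks + 2 * embedding,  # token embedding + separate LM head
        "without_token_embedding": blocks + embedding,
    }


for name, dims in {"d18": (18, 1152, 3072), "d20": (20, 1280, 3456)}.items():
    counts = count_params(*dims)
    print(f"{name}: total ~{counts['total'] / 1e6:.0f}M, "
          f"without token embedding ~{counts['without_token_embedding'] / 1e6:.0f}M")
```

Under these assumptions the totals come out at about 362M and 480M. The count without the token
embedding (about 324M and 438M), multiplied by `tpp`, matches the token budgets listed under
Released checkpoints. This suggests, but does not confirm, that the tokens-per-parameter ratio is
computed on that subset of parameters.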

## Released checkpoints

RoPE schedule column: `none (always on)` means RoPE is never removed and stays on for the full
pretraining budget. `XX% then removed` means RoPE is kept on for the first `XX%` of the token budget
and removed for the remaining `(100 − XX)%` of pretraining, per the drope recipe above.

| Model tag                     | Depth | tpp  | RoPE schedule | Long-ctx SFT |
|-------------------------------|-------|------|---------------|--------------|
| d18_9tpp                      | 18    | 9    | none (always on) | no        |
| d18_9tpp_drope_25             | 18    | 9    | 25% then removed | no        |
| d18_9tpp_drope_50             | 18    | 9    | 50% then removed | no        |
| d18_9tpp_drope_75             | 18    | 9    | 75% then removed | no        |
| d18_20tpp                     | 18    | 20   | none (always on) | no        |
| d18_20tpp_long                | 18    | 20   | none (always on) | yes       |
| d18_20tpp_drope_50            | 18    | 20   | 50% then removed | no        |
| d18_20tpp_drope_50_long       | 18    | 20   | 50% then removed | yes       |
| d20_9tpp                      | 20    | 9    | none (always on) | no        |
| d20_9tpp_drope_25             | 20    | 9    | 25% then removed | no        |
| d20_9tpp_drope_50             | 20    | 9    | 50% then removed | no        |
| d20_9tpp_drope_75             | 20    | 9    | 75% then removed | no        |
| d20_20tpp                     | 20    | 20   | none (always on) | no        |
| d20_20tpp_long                | 20    | 20   | none (always on) | yes       |
| d20_20tpp_drope_50            | 20    | 20   | 50% then removed | no        |
| d20_20tpp_drope_50_long       | 20    | 20   | 50% then removed | yes       |
| d20_40tpp                     | 20    | 40   | none (always on) | no        |
| d20_40tpp_long                | 20    | 40   | none (always on) | yes       |
| d20_40tpp_drope_50            | 20    | 40   | 50% then removed | no        |
| d20_40tpp_drope_50_long       | 20    | 40   | 50% then removed | yes       |
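
The tag naming convention in the table is regular enough to parse mechanically. The following is a
small sketch; `parse_model_tag`, `TAG_PATTERN`, and the `ModelTag` fields are illustrative names, and
the regular expression only covers the tag shapes listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# d<depth>_<tpp>tpp, optionally followed by _drope_<percent> and/or _long.
TAG_PATTERN = re.compile(
    r"d(?P<depth>\d+)_(?P<tpp>\d+)tpp(?:_drope_(?P<drope>\d+))?(?P<long>_long)?"
)


@dataclass(frozen=True)
class ModelTag:
    depth: int                   # number of transformer layers
    tpp: int                     # pretraining tokens per parameter
    rope_percent: Optional[int]  # % of budget trained with RoPE; None = always on
    long_ctx_sft: bool           # True for `_long` variants


def parse_model_tag(tag: str) -> ModelTag:
    """Split a model tag such as 'd20_40tpp_drope_50_long' into its fields."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized model tag: {tag!r}")
    drope = match.group("drope")
    return ModelTag(
        depth=int(match.group("depth")),
        tpp=int(match.group("tpp")),
        rope_percent=int(drope) if drope is not None else None,
        long_ctx_sft=match.group("long") is not None,
    )


print(parse_model_tag("d18_9tpp"))
print(parse_model_tag("d20_40tpp_drope_50_long"))
```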

`tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets:

| Depth | tpp | Total pretraining tokens |
|-------|-----|--------------------------|
| d18   | 9   | ≈ 2.92 B |
| d18   | 20  | ≈ 6.49 B |
| d20   | 9   | ≈ 3.95 B |
| d20   | 20  | ≈ 8.77 B |
| d20   | 40  | ≈ 17.54 B |

`drope` variants use the same total token budget as their non-drope counterpart; the budget is
split between the RoPE-on and RoPE-removed phases as described above.
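
With the batch size of 1,048,576 tokens from the training pipeline, a token budget translates into
a number of optimizer steps, and a `drope_XX` tag translates into the step at which RoPE is removed.
The sketch below is simple arithmetic on the rounded budgets in the table. `steps_for_budget` and
`drope_switch_step` are hypothetical helpers, and both the rounding and the idea that the split is
made at a step boundary (with a constant batch size) are assumptions rather than documented
behavior of the training scripts.

```python
BATCH_TOKENS = 1_048_576   # tokens per optimizer step (from the pretraining stage)
SEQ_LEN = 4_096            # pretraining sequence length


def steps_for_budget(total_tokens, batch_tokens=BATCH_TOKENS):
    """Approximate number of optimizer steps needed to consume the budget."""
    return round(total_tokens / batch_tokens)


def drope_switch_step(total_steps, rope_percent):
    """Step after which RoPE is dropped, assuming the split is made by step count."""
    return total_steps * rope_percent // 100


sequences_per_batch = BATCH_TOKENS // SEQ_LEN   # sequences of 4,096 tokens per step

# Rounded budgets from the table above, in tokens.
budgets = {
    ("d18", 9): 2.92e9,
    ("d18", 20): 6.49e9,
    ("d20", 9): 3.95e9,
    ("d20", 20): 8.77e9,
    ("d20", 40): 17.54e9,
}

print(f"{sequences_per_batch} sequences per batch")
for (depth, tpp), tokens in budgets.items():
    steps = steps_for_budget(tokens)
    switch = drope_switch_step(steps, 50)
    print(f"{depth} @ {tpp}tpp: ~{steps} steps, drope_50 would switch near step {switch}")
```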

## Checkpoint format: which repo should I download?

For each model tag we publish **four** Hugging Face repositories:

| Repo suffix          | Stage            | Format                                          | Use case |
|----------------------|------------------|-------------------------------------------------|----------|
| `...-base`           | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-sft`            | post-SFT         | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-hf-base`        | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
| `...-hf-sft`         | post-SFT         | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |

- The **`-base`** and **`-sft`** repos hold the raw nanochat outputs (the `base_checkpoints` and
  `chatsft_checkpoints` artifacts). They include the optimizer state (`optim_*_rank0.pt`) and
  metadata (`meta_*.json` with training config, validation bits per byte (BPB), step number, etc.),
  so you can resume training or evaluate with the nanochat scripts exactly as produced by
  `scripts.base_train` and `scripts.chat_sft`.
- The **`-hf-base`** and **`-hf-sft`** repos (the `hf_base` and `hf_sft` artifacts) are conversions
  of those same weights into the Hugging Face `transformers` layout (architecture name
  `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
  ```

  `use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the
  entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through
  pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are
  not applied at inference time.

Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
training inside the nanochat codebase.
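
The loading example above uses the repository `crellis/nanochat-d20-20tpp-hf-sft` for the tag
`d20_20tpp`. Assuming the other repositories follow the same pattern (namespace `crellis`, prefix
`nanochat-`, underscores in the tag replaced by hyphens, then the suffix from the table), the four
repository ids for a tag can be generated as sketched below. This pattern is inferred from that
single example and is not confirmed for every tag; `repo_ids_for_tag` is a hypothetical helper.

```python
SUFFIXES = ("base", "sft", "hf-base", "hf-sft")


def repo_ids_for_tag(tag, namespace="crellis"):
    """Build candidate repo ids for a model tag (the naming pattern is an assumption)."""
    slug = tag.replace("_", "-")
    return {suffix: f"{namespace}/nanochat-{slug}-{suffix}" for suffix in SUFFIXES}


repos = repo_ids_for_tag("d20_20tpp")
print(repos["hf-sft"])   # crellis/nanochat-d20-20tpp-hf-sft
```

Check that a generated id actually exists on the Hub before relying on it.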

## Inference sketch (HF format, SFT)

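```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "crellis/nanochat-d20-20tpp-hf-sft"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)

# Render the conversation with the chat template, ending with the assistant prompt.
messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(device)
out = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```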

Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.

## Training compute

All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from
~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.

## Citation / acknowledgements

- Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
- RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)

## License

MIT (inherits from the nanochat repository).