---
license: mit
library_name: transformers
tags:
- nanochat
- causal-lm
- long-context
- rope
datasets:
- nvidia/ClimbMix
- HuggingFaceTB/smol-smoltalk
- cais/mmlu
- openai/gsm8k
- allenai/tulu-v2-sft-long-mixture
pipeline_tag: text-generation
---
# nanochat miniseries
This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase.
The series varies three axes: **depth** (model size), **tokens-per-parameter** (pretraining horizon),
and **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it
is dropped for the remainder), which is used to study the role of positional encoding in long-context
generalization. A subset of the SFT models is additionally fine-tuned on a long-context mixture (`_long` variants).
All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters
of the pretraining corpus.
## Training pipeline
Each model goes through the following stages:
1. **Tokenizer training** — 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
2. **Pretraining (base)** — Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at
[`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens
trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon
optimizer.
3. **Supervised fine-tuning (SFT)** — Instruction tuning on a mixture of:
- [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) — 460K general conversations
- Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)) — 1K rows × 2 epochs
- [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train` — 100K rows × 3 epochs (multiple choice)
- [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main` — 8K rows × 4 epochs (math + tool use)
- SimpleSpelling — 200K synthetic spelling examples
- SpellingBee — 80K synthetic letter-counting examples
4. **Long-context SFT (`_long` variants only)** — Same mixture plus 100K rows of
[`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture),
with sequence length extended to 8,192.
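The batch and mixture sizes above imply a few derived quantities. The sketch below works them out with plain arithmetic. The row counts are the rounded figures quoted in the list (for example "460K"), so the totals are approximate, and the variable names are illustrative rather than taken from the nanochat code.

```python
# Pretraining batch geometry, using the figures quoted in the pipeline list.
SEQ_LEN = 4096
BATCH_TOKENS = 1_048_576
sequences_per_batch = BATCH_TOKENS // SEQ_LEN

# SFT mixture as (rows, epochs); rows are the rounded counts from the list.
sft_mixture = {
    "smol-smoltalk": (460_000, 1),
    "identity": (1_000, 2),
    "mmlu_auxiliary_train": (100_000, 3),
    "gsm8k_main": (8_000, 4),
    "simple_spelling": (200_000, 1),
    "spelling_bee": (80_000, 1),
}
sft_rows = sum(rows * epochs for rows, epochs in sft_mixture.values())

# The `_long` variants add 100K long-context rows to the same mixture.
long_sft_rows = sft_rows + 100_000

print(f"sequences per pretraining batch: {sequences_per_batch}")
print(f"approx. SFT rows seen:           {sft_rows:,}")
print(f"approx. long-context SFT rows:   {long_sft_rows:,}")
```

With these inputs each pretraining batch holds 256 sequences, and the SFT stage sees roughly 1.07M rows (about 1.17M for the `_long` variants).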
## RoPE removal (drope) experiment
Model names containing `drope_XX` follow the recipe from
[*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167):
the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then
removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the
model without positional encodings. For example, `drope_50` means 50% of the token budget was
spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
the optimization benefits of RoPE early in training while producing a NoPE-style model that
generalizes better to long contexts at inference time. Models without `drope` in the name keep
RoPE in every layer for the full pretraining budget (theta = 100,000).
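As a rough illustration of the schedule, the sketch below splits a step budget into a RoPE-on phase and a RoPE-removed phase. It assumes the switch happens at a step boundary and that the percentage applies to optimizer steps, which is equivalent to tokens when the batch size is fixed. The function names `drope_split` and `rope_enabled` are hypothetical and do not come from the nanochat code or the paper.

```python
def drope_split(total_steps: int, rope_pct: int) -> tuple[int, int]:
    """Split a step budget into (rope_on_steps, rope_removed_steps)."""
    if not 0 <= rope_pct <= 100:
        raise ValueError("rope_pct must be between 0 and 100")
    rope_on = total_steps * rope_pct // 100
    return rope_on, total_steps - rope_on


def rope_enabled(step: int, total_steps: int, rope_pct: int) -> bool:
    """True while a 0-indexed step is still in the RoPE-on phase."""
    rope_on, _ = drope_split(total_steps, rope_pct)
    return step < rope_on


# Example: a hypothetical 1,000-step run under each released schedule.
for pct in (25, 50, 75):
    on, off = drope_split(1000, pct)
    print(f"drope_{pct}: {on} steps with RoPE, {off} steps without")
```

For `drope_50` on this hypothetical run, steps 0 to 499 use RoPE and steps 500 to 999 train without positional encodings.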
## Model sizes
| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
|-------|--------|--------|-------|--------------|---------------|
| d18 | 18 | 1152 | 9 | 3072 | ~360M |
| d20 | 20 | 1280 | 10 | 3456 | ~480M |
All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
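The approximate parameter counts can be sanity-checked from the table. The sketch below is a back-of-the-envelope estimate, not the authors' counting script: it assumes full multi-head attention (four `hidden × hidden` projections), a three-matrix SwiGLU MLP, untied input and output embeddings, no biases, and it ignores normalization parameters.

```python
VOCAB = 32_768
HEAD_DIM = 128

# (layers, hidden, heads, intermediate) copied from the table above.
CONFIGS = {
    "d18": (18, 1152, 9, 3072),
    "d20": (20, 1280, 10, 3456),
}


def approx_params(layers: int, hidden: int, intermediate: int) -> int:
    attn = 4 * hidden * hidden        # Q, K, V and output projections
    mlp = 3 * hidden * intermediate   # SwiGLU: gate, up and down projections
    embeddings = 2 * VOCAB * hidden   # assumption: untied input + output embeddings
    return layers * (attn + mlp) + embeddings


for name, (layers, hidden, heads, intermediate) in CONFIGS.items():
    # The hidden size should equal heads * head_dim.
    assert hidden == heads * HEAD_DIM
    total = approx_params(layers, hidden, intermediate)
    print(f"{name}: {total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate comes out at about 362M for d18 and 480M for d20, in line with the "Approx params" column.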
## Released checkpoints
RoPE schedule column: `none (always on)` means RoPE is never removed and stays on for the full
pretraining budget. `XX% then removed` (e.g. `50% then removed`) means RoPE is kept on for the first
`XX%` of the token budget and then removed for the remaining `(100 − XX)%` of pretraining, per the
drope recipe above.
| Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT |
|-------------------------------|-------|------|---------------|--------------|
| d18_9tpp | 18 | 9 | none (always on) | no |
| d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no |
| d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no |
| d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no |
| d18_20tpp | 18 | 20 | none (always on) | no |
| d18_20tpp_long | 18 | 20 | none (always on) | yes |
| d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no |
| d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes |
| d20_9tpp | 20 | 9 | none (always on) | no |
| d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no |
| d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no |
| d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no |
| d20_20tpp | 20 | 20 | none (always on) | no |
| d20_20tpp_long | 20 | 20 | none (always on) | yes |
| d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no |
| d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes |
| d20_40tpp | 20 | 40 | none (always on) | no |
| d20_40tpp_long | 20 | 40 | none (always on) | yes |
| d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no |
| d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes |
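The model tags in the table follow a regular pattern, so they can be decoded programmatically. The helper below is an illustrative sketch; `ModelTag` and `parse_tag` are hypothetical names, and the regular expression covers only the tag shapes listed in the table.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    depth: int
    tpp: int
    drope_pct: Optional[int]  # None means RoPE stayed on for all of pretraining
    long_sft: bool


_TAG_RE = re.compile(r"^d(\d+)_(\d+)tpp(?:_drope_(\d+))?(_long)?$")


def parse_tag(tag: str) -> ModelTag:
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized model tag: {tag!r}")
    depth, tpp, drope, long_flag = match.groups()
    return ModelTag(
        depth=int(depth),
        tpp=int(tpp),
        drope_pct=int(drope) if drope is not None else None,
        long_sft=long_flag is not None,
    )


print(parse_tag("d18_9tpp"))
print(parse_tag("d20_40tpp_drope_50_long"))
```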
`tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets:
| Depth | tpp | Total pretraining tokens |
|-------|-----|--------------------------|
| d18 | 9 | ≈ 2.92 B |
| d18 | 20 | ≈ 6.49 B |
| d20 | 9 | ≈ 3.95 B |
| d20 | 20 | ≈ 8.77 B |
| d20 | 40 | ≈ 17.54 B |
`drope` variants use the same total token budget as their non-drope counterpart; the budget is
split between the RoPE-on and RoPE-removed phases as described above.
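Each budget is `tpp × parameter count`. The listed values are reproduced when the parameter count covers the per-layer attention and SwiGLU matrices plus a single `vocab × hidden` matrix (about 324M for d18 and 438M for d20), rather than the ~360M / ~480M totals from the sizes table. That counting convention is inferred here from the numbers alone and is not stated by the source, so treat the sketch below as a consistency check rather than the authors' formula.

```python
VOCAB = 32_768

# (layers, hidden, intermediate) from the model-size table.
SHAPES = {"d18": (18, 1152, 3072), "d20": (20, 1280, 3456)}


def scaling_params(layers: int, hidden: int, intermediate: int) -> int:
    # Assumption: 4 attention projections + 3 SwiGLU matrices per layer,
    # plus ONE vocab x hidden matrix (the other embedding is not counted).
    per_layer = 4 * hidden * hidden + 3 * hidden * intermediate
    return layers * per_layer + VOCAB * hidden


def token_budget(depth: str, tpp: int) -> int:
    return tpp * scaling_params(*SHAPES[depth])


for depth, tpp in [("d18", 9), ("d18", 20), ("d20", 9), ("d20", 20), ("d20", 40)]:
    print(f"{depth} @ {tpp}tpp: {token_budget(depth, tpp) / 1e9:.2f} B tokens")
```

Under this assumption the five printed budgets round to 2.92, 6.49, 3.95, 8.77 and 17.54 B, matching the table.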
## Checkpoint format: which repo should I download?
For each model tag we publish **four** Hugging Face repositories:
| Repo suffix | Stage | Format | Use case |
|----------------------|------------------|-------------------------------------------------|----------|
| `...-base` | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-sft` | post-SFT | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
| `...-hf-base` | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
| `...-hf-sft` | post-SFT | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
- The **`-base`** and **`-sft`** repositories hold the raw nanochat outputs (the `base_checkpoints` and
`chatsft_checkpoints` artifacts), exactly as produced by `scripts.base_train` and `scripts.chat_sft`.
They include the optimizer state (`optim_*_rank0.pt`) and metadata (`meta_*.json` with training config,
val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts.
- The **`-hf-base`** and **`-hf-sft`** repositories (the `hf_base` and `hf_sft` artifacts) are
conversions of those same weights into the Hugging Face `transformers` layout (architecture name
`NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
```
`use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the
entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through
pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are
not applied at inference time.
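A downloaded `config.json` can be checked for this flag without loading the model. The sketch below is a minimal illustration: `positional_encoding` is a hypothetical helper, and the two inline JSON strings stand in for the relevant slice of real config files, with all other keys omitted.

```python
import json


def positional_encoding(config: dict) -> str:
    """Describe the positional-encoding setting recorded in a config dict."""
    use_rope = config.get("use_rope")
    if use_rope is True:
        return "rope"
    if use_rope is False:
        return "no positional encoding (drope variant)"
    # A missing key is reported as unknown rather than guessed.
    return "unknown"


# Stand-ins for two config.json files; in practice use json.load on the real file.
rope_cfg = json.loads('{"model_type": "nanochat", "use_rope": true}')
drope_cfg = json.loads('{"model_type": "nanochat", "use_rope": false}')

print(positional_encoding(rope_cfg))
print(positional_encoding(drope_cfg))
```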
Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
training inside the nanochat codebase.
## Inference sketch (HF format, SFT)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "crellis/nanochat-d20-20tpp-hf-sft"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# Load the weights in bfloat16 and move the model to the GPU (requires a CUDA device).
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

messages = [{"role": "user", "content": "Why is the sky blue?"}]
# Render the chat template to input ids, ending with the assistant generation prompt.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").cuda()
out = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
## Training compute
All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from
~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.
## Citation / acknowledgements
- Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
- RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)
## License
MIT (inherits from the nanochat repository).