| license: mit | |
| library_name: transformers | |
| tags: | |
| - nanochat | |
| - causal-lm | |
| - long-context | |
| - rope | |
| datasets: | |
| - nvidia/ClimbMix | |
| - HuggingFaceTB/smol-smoltalk | |
| - cais/mmlu | |
| - openai/gsm8k | |
| - allenai/tulu-v2-sft-long-mixture | |
| pipeline_tag: text-generation | |
| # nanochat miniseries | |
| This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers | |
| trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase. | |
| The series varies three axes: **depth** (model size), **tokens-per-parameter** (pretraining horizon), | |
| and **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it | |
| is dropped for the remainder, used to study the role of positional encoding in long-context generalization). A | |
| subset of the SFT models is additionally fine-tuned on a long-context mixture (`_long` variants). | |
| All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters | |
| of the pretraining corpus. | |
| ## Training pipeline | |
| Each model goes through the following stages: | |
| 1. **Tokenizer training** — 32,768-vocab BPE trained on ~2B characters of the pretraining dataset. | |
| 2. **Pretraining (base)** — Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at | |
| [`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle). | |
| Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens | |
| trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon | |
| optimizer. | |
| 3. **Supervised fine-tuning (SFT)** — Instruction tuning on a mixture of: | |
| - [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) — 460K general conversations | |
| - Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)) — 1K rows × 2 epochs | |
| - [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train` — 100K rows × 3 epochs (multiple choice) | |
| - [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main` — 8K rows × 4 epochs (math + tool use) | |
| - SimpleSpelling — 200K synthetic spelling examples | |
| - SpellingBee — 80K synthetic letter-counting examples | |
| 4. **Long-context SFT (`_long` variants only)** — Same mixture plus 100K rows of | |
| [`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture), | |
| with sequence length extended to 8,192. | |
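As a rough worked example of the mixture sizes listed above, the sketch below adds up rows × epochs per source. The counts are the rounded figures from this card (e.g. "460K"), so the totals are approximate. Sources with no stated epoch count are assumed to be seen once, and the dictionary keys and the helper name `examples_seen` are labels chosen for illustration only.

```python
# Rounded (rows, epochs) per SFT source, copied from the list above.
# Epochs = 1 is an assumption wherever the card gives no epoch count.
SFT_MIXTURE = {
    "smol-smoltalk": (460_000, 1),
    "identity": (1_000, 2),
    "mmlu-auxiliary_train": (100_000, 3),
    "gsm8k-main": (8_000, 4),
    "SimpleSpelling": (200_000, 1),
    "SpellingBee": (80_000, 1),
}

# Extra rows from tulu-v2-sft-long-mixture (`_long` variants only, assumed one pass).
LONG_CONTEXT_ROWS = 100_000


def examples_seen(mixture, extra_rows=0):
    """Sum rows * epochs over all sources, plus any extra single-pass rows."""
    return sum(rows * epochs for rows, epochs in mixture.values()) + extra_rows


standard_total = examples_seen(SFT_MIXTURE)
long_total = examples_seen(SFT_MIXTURE, LONG_CONTEXT_ROWS)
print(f"standard SFT: ~{standard_total:,} examples")
print(f"long-context SFT: ~{long_total:,} examples")
```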
| ## RoPE removal (drope) experiment | |
| Model names containing `drope_XX` follow the recipe from | |
| [*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167): | |
| the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then | |
| removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the | |
| model without positional encodings. For example, `drope_50` means 50% of the token budget was | |
| spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve | |
| the optimization benefits of RoPE early in training while producing a NoPE-style model that | |
| generalizes better to long contexts at inference time. Models without `drope` in the name keep | |
| RoPE in every layer for the full pretraining budget (theta = 100,000). | |
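The naming convention described so far (`dNN` for depth, `NNtpp` for the horizon, an optional `drope_XX`, and an optional `_long` suffix) can be unpacked mechanically. The parser below is a minimal sketch of that convention; `ModelTag` and `parse_model_tag` are hypothetical helpers and are not part of nanochat or this repository.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    depth: int
    tpp: int
    rope_percent: Optional[int]  # None means RoPE was kept for the whole budget
    long_sft: bool


# d<depth>_<tpp>tpp, then optional _drope_<XX>, then optional _long
_TAG_RE = re.compile(r"^d(\d+)_(\d+)tpp(?:_drope_(\d+))?(_long)?$")


def parse_model_tag(tag: str) -> ModelTag:
    """Split a model tag such as 'd20_20tpp_drope_50_long' into its parts."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized model tag: {tag!r}")
    depth, tpp, drope, long_suffix = match.groups()
    return ModelTag(
        depth=int(depth),
        tpp=int(tpp),
        rope_percent=int(drope) if drope is not None else None,
        long_sft=long_suffix is not None,
    )


print(parse_model_tag("d20_20tpp_drope_50_long"))
print(parse_model_tag("d18_9tpp"))
```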
| ## Model sizes | |
| | Depth | Layers | Hidden | Heads | Intermediate | Approx params | | |
| |-------|--------|--------|-------|--------------|---------------| | |
| | d18 | 18 | 1152 | 9 | 3072 | ~360M | | |
| | d20 | 20 | 1280 | 10 | 3456 | ~480M | | |
| All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0. | |
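The approximate parameter counts can be sanity-checked from the table's dimensions. The sketch below is a back-of-the-envelope estimate under assumptions the card does not state: four square attention projections per layer (no grouped-query attention), no bias terms, untied input embedding and output head, and negligible norm parameters. `approx_params` is a hypothetical helper.

```python
VOCAB = 32_768


def approx_params(layers, hidden, intermediate, vocab=VOCAB):
    """Rough decoder-only parameter count under the assumptions listed above."""
    attn = 4 * hidden * hidden        # Q, K, V, O projections (assumed square, no biases)
    mlp = 3 * hidden * intermediate   # SwiGLU: gate, up, and down projections
    embeddings = 2 * vocab * hidden   # assumed untied token embedding + output head
    return layers * (attn + mlp) + embeddings


SIZES = {"d18": (18, 1152, 3072), "d20": (20, 1280, 3456)}
for name, dims in SIZES.items():
    print(name, f"~{approx_params(*dims) / 1e6:.0f}M")
```

Under these assumptions the estimates come out close to the ~360M and ~480M figures in the table.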
| ## Released checkpoints | |
| RoPE schedule column: `none (always on)` means RoPE is kept on for the full pretraining budget. An entry | |
| such as `50% then removed` means RoPE is kept on for the first 50% of the token budget and then removed for | |
| the remaining 50% of pretraining, per the drope recipe above. | |
| | Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT | | |
| |-------------------------------|-------|------|---------------|--------------| | |
| | d18_9tpp | 18 | 9 | none (always on) | no | | |
| | d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no | | |
| | d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no | | |
| | d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no | | |
| | d18_20tpp | 18 | 20 | none (always on) | no | | |
| | d18_20tpp_long | 18 | 20 | none (always on) | yes | | |
| | d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no | | |
| | d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes | | |
| | d20_9tpp | 20 | 9 | none (always on) | no | | |
| | d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no | | |
| | d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no | | |
| | d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no | | |
| | d20_20tpp | 20 | 20 | none (always on) | no | | |
| | d20_20tpp_long | 20 | 20 | none (always on) | yes | | |
| | d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no | | |
| | d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes | | |
| | d20_40tpp | 20 | 40 | none (always on) | no | | |
| | d20_40tpp_long | 20 | 40 | none (always on) | yes | | |
| | d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no | | |
| | d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes | | |
| `tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets: | |
| | Depth | tpp | Total pretraining tokens | | |
| |-------|-----|--------------------------| | |
| | d18 | 9 | ≈ 2.92 B | | |
| | d18 | 20 | ≈ 6.49 B | | |
| | d20 | 9 | ≈ 3.95 B | | |
| | d20 | 20 | ≈ 8.77 B | | |
| | d20 | 40 | ≈ 17.54 B | | |
| `drope` variants use the same total token budget as their non-drope counterpart; the budget is | |
| split between the RoPE-on and RoPE-removed phases as described above. | |
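To make the budget arithmetic concrete, the sketch below splits a total token budget into its RoPE-on and RoPE-removed phases and estimates the number of optimizer steps from the 1,048,576-token batch size given in the pretraining description. The token total is the rounded table value, so the results are approximate. `split_budget` and `approx_steps` are hypothetical helpers.

```python
BATCH_TOKENS = 1_048_576  # tokens per optimizer step, from the pretraining description


def split_budget(total_tokens, rope_percent):
    """Return (tokens trained with RoPE, tokens trained after RoPE removal)."""
    rope_tokens = total_tokens * rope_percent / 100
    return rope_tokens, total_tokens - rope_tokens


def approx_steps(total_tokens, batch_tokens=BATCH_TOKENS):
    """Approximate optimizer step count for a given token budget."""
    return round(total_tokens / batch_tokens)


total = 8.77e9  # d20 @ 20 tpp, rounded figure from the table above
rope_tokens, nope_tokens = split_budget(total, 50)  # e.g. d20_20tpp_drope_50
print(f"{rope_tokens / 1e9:.3f}B tokens with RoPE, {nope_tokens / 1e9:.3f}B without")
print(f"~{approx_steps(total)} optimizer steps")
print(f"implied parameters at 20 tpp: ~{total / 20 / 1e6:.1f}M")
```

Dividing the budget by `tpp` gives roughly 438M, somewhat below the ~480M total listed for d20, which suggests the ratio is computed over a subset of the parameters. The card does not say which parameters are counted.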
| ## Checkpoint format: which repo should I download? | |
| For each model tag we publish **four** Hugging Face repositories: | |
| | Repo suffix | Stage | Format | Use case | | |
| |----------------------|------------------|-------------------------------------------------|----------| | |
| | `...-base` | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo | | |
| | `...-sft` | post-SFT | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo | | |
| | `...-hf-base` | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading | | |
| | `...-hf-sft` | post-SFT | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading | | |
| - The **`-base`** and **`-sft`** repositories hold the raw nanochat outputs (the `base_checkpoints` and | |
| `chatsft_checkpoints` artifacts), exactly as produced by `scripts.base_train` and `scripts.chat_sft`. | |
| They include the optimizer state (`optim_*_rank0.pt`) and metadata (`meta_*.json` with training config, | |
| val BPB, step number, etc.), so you can resume training or evaluate with the nanochat scripts. | |
| - The **`-hf-base`** and **`-hf-sft`** repositories (the `hf_base` and `hf_sft` artifacts) are conversions | |
| of those same weights into the Hugging Face `transformers` layout (architecture name | |
| `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with: | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True) | |
| tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True) | |
| ``` | |
| `use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the | |
| entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through | |
| pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are | |
| not applied at inference time. | |
| Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue | |
| training inside the nanochat codebase. | |
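If you have already downloaded a `config.json`, a quick way to tell a drope variant from an always-RoPE model is to read the `use_rope` field described above. This is a minimal sketch using only the standard library. `describe_rope` is a hypothetical helper, and the path you pass in is whatever local location you saved the file to.

```python
import json
from pathlib import Path


def describe_rope(config_path):
    """Report whether a downloaded HF-format config applies rotary embeddings."""
    cfg = json.loads(Path(config_path).read_text())
    # `use_rope` is the field named in this card; a missing key raises KeyError on purpose.
    if cfg["use_rope"]:
        return "RoPE kept for the full pretraining budget"
    return "drope variant: rotary embeddings are not applied at inference"
```

For example, `describe_rope("config.json")` on a drope checkpoint would return the second message.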
| ## Inference sketch (HF format, SFT) | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| repo = "crellis/nanochat-d20-20tpp-hf-sft" | |
| tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True) | |
| model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda() | |
| messages = [{"role": "user", "content": "Why is the sky blue?"}] | |
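| # return_dict=True yields input_ids plus attention_mask; .to() moves both to the model's device | |
| inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device) | |
| out = model.generate(**inputs, max_new_tokens=256) | |
| # Decode only the newly generated tokens, skipping the prompt | |
| print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)) | |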
| ``` | |
| Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat | |
| template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat. | |
| ## Training compute | |
| Every run used a single H100 GPU, scheduled via Slurm. Pretraining wall-clock ranges from | |
| ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant. | |
| ## Citation / acknowledgements | |
| - Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat) | |
| - Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`) | |
| - SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture | |
| - RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167) | |
| ## License | |
| MIT (inherited from the nanochat repository). | |