	---
	license: mit
	library_name: transformers
	tags:
	- nanochat
	- causal-lm
	- long-context
	- rope
	datasets:
	- nvidia/ClimbMix
	- HuggingFaceTB/smol-smoltalk
	- cais/mmlu
	- openai/gsm8k
	- allenai/tulu-v2-sft-long-mixture
	pipeline_tag: text-generation
	---

	# nanochat miniseries

	This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
	trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase.
	The series varies three axes: depth (model size), tokens-per-parameter (pretraining horizon),
	and RoPE removal schedule (the fraction of the pretraining token budget spent with RoPE before it
	is dropped for the remainder, used to study the role of positional encoding in long-context
	generalization). A subset of the SFT models is additionally fine-tuned on a long-context mixture
	(`_long` variants).

	All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters
	of the pretraining corpus.

	## Training pipeline

	Each model goes through the following stages:

	1. Tokenizer training — 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
	2. Pretraining (base) — Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at
	[`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
	Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens
	trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon
	optimizer.
	3. Supervised fine-tuning (SFT) — Instruction tuning on a mixture of:
	- [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) — 460K general conversations
	- Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)) — 1K rows × 2 epochs
	- [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train` — 100K rows × 3 epochs (multiple choice)
	- [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main` — 8K rows × 4 epochs (math + tool use)
	- SimpleSpelling — 200K synthetic spelling examples
	- SpellingBee — 80K synthetic letter-counting examples
	4. Long-context SFT (`_long` variants only) — Same mixture plus 100K rows of
	[`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture),
	with sequence length extended to 8,192.
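
To get a feel for the relative weight of each source, the mixture above can be tallied as rows × epochs. This is a minimal sketch using only the approximate counts listed in this README; the dictionary keys and helper name are illustrative, and it assumes that sources listed without an explicit epoch count are seen once.

```python
# Approximate SFT mixture from the list above: source -> (rows, epochs).
# Sources without an explicit epoch count are ASSUMED to be used for 1 epoch.
SFT_MIXTURE = {
    "smol-smoltalk": (460_000, 1),
    "identity": (1_000, 2),
    "mmlu_auxiliary_train": (100_000, 3),
    "gsm8k_main": (8_000, 4),
    "simple_spelling": (200_000, 1),
    "spelling_bee": (80_000, 1),
}
# Extra source used only by the `_long` variants.
LONG_EXTRA = {"tulu-v2-sft-long-mixture": (100_000, 1)}


def effective_rows(mixture):
    """Total number of training rows seen, counting repeated epochs."""
    return sum(rows * epochs for rows, epochs in mixture.values())


base_rows = effective_rows(SFT_MIXTURE)
long_rows = effective_rows({**SFT_MIXTURE, **LONG_EXTRA})
print(base_rows)  # 1074000
print(long_rows)  # 1174000

# Share of each source in the standard (non-long) mixture.
for name, (rows, epochs) in SFT_MIXTURE.items():
    print(f"{name:22s} {rows * epochs / base_rows:6.1%}")
```

These are row counts, not token counts, so they only indicate how often each source is sampled, not how much of the SFT token budget it consumes.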

	## RoPE removal (drope) experiment

	Model names containing `drope_XX` follow the recipe from
	["Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"](https://arxiv.org/pdf/2512.12167):
	the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then
	removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the
	model without positional encodings. For example, `drope_50` means 50% of the token budget was
	spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
	the optimization benefits of RoPE early in training while producing a NoPE-style model that
	generalizes better to long contexts at inference time. Models without `drope` in the name keep
	RoPE in every layer for the full pretraining budget (theta = 100,000).
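
The split is simple arithmetic on the token budget. The sketch below is illustrative only: `drope_split` is a hypothetical helper, not part of nanochat. It assumes the switch lands on an optimizer-step boundary with the 1,048,576-token batch size from the training pipeline section, and it uses the approximate d20 @ 40tpp budget (17.54 B tokens) listed later in this README.

```python
def drope_split(total_tokens, rope_pct, batch_tokens=1_048_576):
    """Split a pretraining budget into RoPE-on and RoPE-off phases.

    Returns (tokens with RoPE, tokens without RoPE, approximate step at
    which RoPE is dropped). The step is an estimate: the actual training
    script may round differently.
    """
    if not 0 <= rope_pct <= 100:
        raise ValueError("rope_pct must be between 0 and 100")
    rope_tokens = total_tokens * rope_pct / 100
    nope_tokens = total_tokens - rope_tokens
    switch_step = round(rope_tokens / batch_tokens)
    return rope_tokens, nope_tokens, switch_step


# drope_50 on the d20 @ 40tpp budget (approximate figure from this README).
rope, nope, step = drope_split(17.54e9, 50)
print(f"with RoPE: {rope / 1e9:.2f} B tokens")
print(f"without RoPE: {nope / 1e9:.2f} B tokens")
print(f"RoPE dropped around step {step}")
```

For `drope_25` and `drope_75` the same call with `rope_pct=25` or `rope_pct=75` gives the corresponding split.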

	## Model sizes

	| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
	|-------|--------|--------|-------|--------------|---------------|
	| d18 | 18 | 1152 | 9 | 3072 | ~360M |
	| d20 | 20 | 1280 | 10 | 3456 | ~480M |

	All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
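
The approximate parameter counts can be reproduced from the table with a back-of-the-envelope count. This is a sketch under assumptions that the README does not state explicitly: bias-free linear layers, full multi-head attention (heads × head_dim = hidden), three SwiGLU matrices per MLP, untied input embedding and output head, and no learnable norm parameters. The class and function names are illustrative.

```python
from dataclasses import dataclass

VOCAB = 32_768


@dataclass(frozen=True)
class Shape:
    layers: int
    hidden: int
    intermediate: int


# Values copied from the table above.
SHAPES = {
    "d18": Shape(layers=18, hidden=1152, intermediate=3072),
    "d20": Shape(layers=20, hidden=1280, intermediate=3456),
}


def count_params(s, vocab=VOCAB):
    """Estimate parameter counts under the assumptions listed in the lead-in."""
    attn = 4 * s.hidden * s.hidden        # Q, K, V and output projections
    mlp = 3 * s.hidden * s.intermediate   # SwiGLU: gate, up and down projections
    blocks = s.layers * (attn + mlp)
    embed = vocab * s.hidden              # size of one vocab-by-hidden matrix
    return {
        "blocks": blocks,
        "blocks_plus_head": blocks + embed,   # transformer blocks + output head
        "total": blocks + 2 * embed,          # plus the input embedding
    }


for name, shape in SHAPES.items():
    p = count_params(shape)
    print(f"{name}: total ~{p['total'] / 1e6:.0f}M, "
          f"blocks + head ~{p['blocks_plus_head'] / 1e6:.0f}M")
```

Under these assumptions the totals come out at roughly 362M and 480M, in line with the table. The "blocks + head" figure (roughly 324M and 438M) is shown because, multiplied by the tpp values, it lands close to the token budgets listed in this README; that is an observation about the numbers, not a documented definition.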

	## Released checkpoints

	RoPE schedule column: `none` means RoPE is kept on for the full pretraining budget. A percentage
	`XX%` (e.g. `50%`) means RoPE is kept on for the first `XX%` of the token budget and then removed
	for the remaining `(100 − XX)%` of pretraining, per the drope recipe above.

	| Model tag | Depth | tpp | RoPE schedule | Long-ctx SFT |
	|-------------------------------|-------|------|---------------|--------------|
	| d18_9tpp | 18 | 9 | none (always on) | no |
	| d18_9tpp_drope_25 | 18 | 9 | 25% then removed | no |
	| d18_9tpp_drope_50 | 18 | 9 | 50% then removed | no |
	| d18_9tpp_drope_75 | 18 | 9 | 75% then removed | no |
	| d18_20tpp | 18 | 20 | none (always on) | no |
	| d18_20tpp_long | 18 | 20 | none (always on) | yes |
	| d18_20tpp_drope_50 | 18 | 20 | 50% then removed | no |
	| d18_20tpp_drope_50_long | 18 | 20 | 50% then removed | yes |
	| d20_9tpp | 20 | 9 | none (always on) | no |
	| d20_9tpp_drope_25 | 20 | 9 | 25% then removed | no |
	| d20_9tpp_drope_50 | 20 | 9 | 50% then removed | no |
	| d20_9tpp_drope_75 | 20 | 9 | 75% then removed | no |
	| d20_20tpp | 20 | 20 | none (always on) | no |
	| d20_20tpp_long | 20 | 20 | none (always on) | yes |
	| d20_20tpp_drope_50 | 20 | 20 | 50% then removed | no |
	| d20_20tpp_drope_50_long | 20 | 20 | 50% then removed | yes |
	| d20_40tpp | 20 | 40 | none (always on) | no |
	| d20_40tpp_long | 20 | 40 | none (always on) | yes |
	| d20_40tpp_drope_50 | 20 | 40 | 50% then removed | no |
	| d20_40tpp_drope_50_long | 20 | 40 | 50% then removed | yes |
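
The model tags in the table follow a regular pattern, so the columns can be recovered from the tag itself. The parser below is a small illustrative sketch (the `ModelTag` type and `parse_tag` function are hypothetical, not part of nanochat) and only covers the tag shapes that appear in this table.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    depth: int
    tpp: int
    rope_pct: Optional[int]  # None means RoPE stays on for all of pretraining
    long_sft: bool


# d<depth>_<tpp>tpp, optionally followed by _drope_<pct> and/or _long.
_TAG_RE = re.compile(
    r"^d(?P<depth>\d+)_(?P<tpp>\d+)tpp(?:_drope_(?P<pct>\d+))?(?P<long>_long)?$"
)


def parse_tag(tag):
    """Split a model tag such as 'd20_40tpp_drope_50_long' into its fields."""
    m = _TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized model tag: {tag!r}")
    pct = m.group("pct")
    return ModelTag(
        depth=int(m.group("depth")),
        tpp=int(m.group("tpp")),
        rope_pct=int(pct) if pct is not None else None,
        long_sft=m.group("long") is not None,
    )


print(parse_tag("d20_40tpp_drope_50_long"))
# ModelTag(depth=20, tpp=40, rope_pct=50, long_sft=True)
print(parse_tag("d18_9tpp"))
# ModelTag(depth=18, tpp=9, rope_pct=None, long_sft=False)
```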

	`tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets:

	| Depth | tpp | Total pretraining tokens |
	|-------|-----|--------------------------|
	| d18 | 9 | ≈ 2.92 B |
	| d18 | 20 | ≈ 6.49 B |
	| d20 | 9 | ≈ 3.95 B |
	| d20 | 20 | ≈ 8.77 B |
	| d20 | 40 | ≈ 17.54 B |

	`drope` variants use the same total token budget as their non-drope counterpart; the budget is
	split between the RoPE-on and RoPE-removed phases as described above.
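
Since the budget is defined as tokens per parameter, each row of the table implies a parameter count (tokens ÷ tpp) and, with the 1,048,576-token batch size from the training pipeline section, an approximate number of optimizer steps. The sketch below only rearranges the numbers already in this README; the helper names are illustrative and the step counts are estimates, not values read from the training logs.

```python
BATCH_TOKENS = 1_048_576  # batch size in tokens, from the pretraining stage

# (depth, tpp) -> approximate total pretraining tokens, from the table above.
BUDGETS = {
    ("d18", 9): 2.92e9,
    ("d18", 20): 6.49e9,
    ("d20", 9): 3.95e9,
    ("d20", 20): 8.77e9,
    ("d20", 40): 17.54e9,
}


def implied_params(tokens, tpp):
    """Parameter count implied by tokens = tpp * params."""
    return tokens / tpp


def optimizer_steps(tokens, batch_tokens=BATCH_TOKENS):
    """Approximate number of optimizer steps for a token budget."""
    return round(tokens / batch_tokens)


for (depth, tpp), tokens in BUDGETS.items():
    print(f"{depth} @ {tpp}tpp: implied params ~{implied_params(tokens, tpp) / 1e6:.1f}M, "
          f"~{optimizer_steps(tokens):,} steps")
```

The implied counts (about 324M for d18 and 438M to 439M for d20) are lower than the approximate totals in the model sizes table, which suggests the ratio is computed over a subset of the parameters. The README does not say which parameters are counted.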

	## Checkpoint format: which repo should I download?

	For each model tag we publish four Hugging Face repositories:

	| Repo suffix | Stage | Format | Use case |
	|----------------------|------------------|-------------------------------------------------|----------|
	| `...-base` | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
	| `...-sft` | post-SFT | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
	| `...-hf-base` | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
	| `...-hf-sft` | post-SFT | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |

	- The `-base` and `-sft` repositories hold the `base_checkpoints` and `chatsft_checkpoints`
	artifacts, i.e. the raw nanochat outputs. They include the optimizer state (`optim_*_rank0.pt`)
	and metadata (`meta_*.json` with training config, validation bits per byte (BPB), step number,
	etc.), so you can resume training or evaluate with the nanochat scripts exactly as produced by
	`scripts.base_train` and `scripts.chat_sft`.
	- The `-hf-base` and `-hf-sft` repositories hold the `hf_base` and `hf_sft` artifacts, which are
	conversions of those same weights into the Hugging Face `transformers` layout (architecture name
	`NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
	```

	`use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the
	entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through
	pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are
	not applied at inference time.
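
If you have a converted checkpoint on disk, the flag can be checked without loading the model. This is a minimal sketch using only the standard library; `rope_enabled` is a hypothetical helper, and treating a missing `use_rope` key as "RoPE on" is an assumption, not documented behavior.

```python
import json
import tempfile
from pathlib import Path


def rope_enabled(config_path):
    """Return True if the checkpoint's config.json says rotary embeddings are applied."""
    config = json.loads(Path(config_path).read_text())
    # ASSUMPTION: a config without the key is treated as a regular RoPE model.
    return bool(config.get("use_rope", True))


# Demonstration with a stand-in config file (not a real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({"model_type": "nanochat", "use_rope": False}))
    print(rope_enabled(path))  # False
```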

	Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
	training inside the nanochat codebase.

	## Inference sketch (HF format, SFT)

	```python
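	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	repo = "crellis/nanochat-d20-20tpp-hf-sft"
	tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

	# Render the conversation with the chat template and tokenize it.
	# return_dict=True makes the return type explicit (input_ids plus attention_mask).
	messages = [{"role": "user", "content": "Why is the sky blue?"}]
	inputs = tok.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
	).to(model.device)
	out = model.generate(
	    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=256
	)
	# Decode only the newly generated tokens, skipping the prompt.
	prompt_len = inputs["input_ids"].shape[1]
	print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))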
	```

	Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
	template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
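
For completion-style prompting, the prompt is tokenized as plain text and no chat template is applied. The function below is a hedged sketch, not a tested recipe for these checkpoints: `complete` is a hypothetical helper, the repo id is left as a parameter (pass the `-hf-base` repo for the tag you want), and it assumes the tokenizer adds any special tokens the model expects.

```python
def complete(repo, prompt, max_new_tokens=64):
    """Greedy text completion with a base (pretrained-only) checkpoint."""
    # Imports are local so the heavy dependencies load only when the function is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    model.eval()

    # Plain tokenization: no chat template, the model simply continues the text.
    input_ids = tok(prompt, return_tensors="pt")["input_ids"].to(model.device)
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the continuation, without the prompt tokens.
    return tok.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True)
```

A call would look like `complete(repo_id, "Once upon a time,")`, where `repo_id` is the `-hf-base` repository you downloaded. Greedy decoding is used here only to keep the sketch deterministic; sampling parameters can be passed to `generate` instead.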

	## Training compute

	All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from
	~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.

	## Citation / acknowledgements

	- Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
	- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
	- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
	- RoPE-removal recipe: [Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)

	## License

	MIT (inherits from the nanochat repository).