crellis committed on Apr 19

Commit b1af749 (verified) · 1 Parent(s): d665b85

Upload folder using huggingface_hub

Files changed (4)

README.md +180 -0
meta_006187.json +57 -0
model_006187.pt +3 -0
optim_006187_rank0.pt +3 -0

README.md ADDED

	@@ -0,0 +1,180 @@

+---
+license: mit
+library_name: transformers
+tags:
+  - nanochat
+  - causal-lm
+  - long-context
+  - rope
+datasets:
+  - nvidia/ClimbMix
+  - HuggingFaceTB/smol-smoltalk
+  - cais/mmlu
+  - openai/gsm8k
+  - allenai/tulu-v2-sft-long-mixture
+pipeline_tag: text-generation
+---
+# nanochat miniseries
+This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
+trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase.
+The series varies three axes: **depth** (model size), **tokens-per-parameter** (pretraining horizon),
+and **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it
+is dropped for the remainder; this axis probes how positional encoding affects long-context
+generalization). A subset of the SFT models is additionally fine-tuned on a long-context mixture
+(`_long` variants).
+All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters
+of the pretraining corpus.
+## Training pipeline
+Each model goes through the following stages:
+1. **Tokenizer training** — 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
+2. **Pretraining (base)** — Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at
+   [`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
+   Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens
+   trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon
+   optimizer.
+3. **Supervised fine-tuning (SFT)** — Instruction tuning on a mixture of:
+    - [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) — 460K general conversations
+    - Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)) — 1K rows × 2 epochs
+    - [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train` — 100K rows × 3 epochs (multiple choice)
+    - [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main` — 8K rows × 4 epochs (math + tool use)
+    - SimpleSpelling — 200K synthetic spelling examples
+    - SpellingBee — 80K synthetic letter-counting examples
+4. **Long-context SFT (`_long` variants only)** — Same mixture plus 100K rows of
+   [`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture),
+   with sequence length extended to 8,192.
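
As a rough check on the mixture sizes listed above, the following sketch adds up rows × epochs per source. The counts are the rounded figures from the list; the dictionary keys, the `effective_rows` helper, and the single-epoch treatment of the Tulu long-context rows are assumptions made for illustration and are not taken from the nanochat scripts.

```python
# Approximate row counts and epoch multipliers, copied from the SFT mixture list.
# Sources with no stated epoch count are assumed to be seen once.
SFT_MIXTURE = {
    "smol-smoltalk": (460_000, 1),
    "identity_conversations": (1_000, 2),
    "mmlu_auxiliary_train": (100_000, 3),
    "gsm8k_main": (8_000, 4),
    "SimpleSpelling": (200_000, 1),
    "SpellingBee": (80_000, 1),
}
# tulu-v2-sft-long-mixture rows, `_long` variants only (assumed single epoch).
LONG_CONTEXT_ROWS = 100_000


def effective_rows(mixture, extra_rows=0):
    """Sum rows * epochs over the mixture, plus any extra rows."""
    return sum(rows * epochs for rows, epochs in mixture.values()) + extra_rows


standard_total = effective_rows(SFT_MIXTURE)
long_total = effective_rows(SFT_MIXTURE, LONG_CONTEXT_ROWS)
print(f"standard SFT: ~{standard_total:,} rows")
print(f"long-context SFT: ~{long_total:,} rows")
```

Because the source figures are rounded, the totals are only indicative of the relative weight of each dataset.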
+## RoPE removal (drope) experiment
+Model names containing `drope_XX` follow the recipe from
+[*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167):
+the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then
+removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the
+model without positional encodings. For example, `drope_50` means 50% of the token budget was
+spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
+the optimization benefits of RoPE early in training while producing a NoPE-style model that
+generalizes better to long contexts at inference time. Models without `drope` in the name keep
+RoPE in every layer for the full pretraining budget (theta = 100,000).
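
To make the `drope_XX` naming concrete, here is a minimal sketch of how a token budget expressed in optimizer steps could be divided into a RoPE-on phase and a RoPE-removed phase. The `drope_split` function and its floor-based rounding are hypothetical; the README does not say at which exact step the training script switches.

```python
def drope_split(total_steps: int, rope_pct: float) -> tuple[int, int]:
    """Split optimizer steps into (RoPE-on, RoPE-removed) phases.

    Assumption: the switch happens at floor(total_steps * rope_pct / 100).
    """
    if not 0 <= rope_pct <= 100:
        raise ValueError("rope_pct must be between 0 and 100")
    rope_steps = int(total_steps * rope_pct / 100)
    return rope_steps, total_steps - rope_steps


# 6187 is the final step recorded in meta_006187.json for d18 @ 20tpp.
rope_on, rope_off = drope_split(6187, 50)
print(f"RoPE on: {rope_on} steps, RoPE removed: {rope_off} steps")
```

A `rope_pct` of 100 corresponds to the non-drope models in this sketch: every step runs with RoPE.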
+## Model sizes
+| Depth | Layers | Hidden | Heads | Intermediate | Approx params |
+|-------|--------|--------|-------|--------------|---------------|
+| d18   | 18     | 1152   | 9     | 3072         | ~360M         |
+| d20   | 20     | 1280   | 10    | 3456         | ~480M         |
+All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
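
The hidden sizes and head counts in the table are consistent with `aspect_ratio = 64` and `head_dim = 128`, the values recorded in the `user_config` of `meta_006187.json` in this commit. The sketch below reproduces them from depth alone; the rule `hidden = depth * aspect_ratio` is an assumption that matches the table, and `dims_for_depth` is a hypothetical helper.

```python
ASPECT_RATIO = 64   # user_config.aspect_ratio in meta_006187.json
HEAD_DIM = 128      # user_config.head_dim in meta_006187.json


def dims_for_depth(depth: int) -> dict:
    """Derive hidden size and head count from depth (assumed sizing rule)."""
    hidden = depth * ASPECT_RATIO
    if hidden % HEAD_DIM != 0:
        raise ValueError("hidden size must be a multiple of head_dim")
    return {"layers": depth, "hidden": hidden, "heads": hidden // HEAD_DIM}


for depth in (18, 20):
    print(dims_for_depth(depth))
```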
+## Released checkpoints
+RoPE schedule column: `none (always on)` means RoPE is never removed and stays on for the full
+pretraining budget. A percentage (e.g. `50% then removed`) means RoPE is kept on for the first `XX%`
+of the token budget and then removed for the remaining `(100 − XX)%` of pretraining, per the drope
+recipe above.
+| Model tag                     | Depth | tpp  | RoPE schedule | Long-ctx SFT |
+|-------------------------------|-------|------|---------------|--------------|
+| d18_9tpp                      | 18    | 9    | none (always on) | no        |
+| d18_9tpp_drope_25             | 18    | 9    | 25% then removed | no        |
+| d18_9tpp_drope_50             | 18    | 9    | 50% then removed | no        |
+| d18_9tpp_drope_75             | 18    | 9    | 75% then removed | no        |
+| d18_20tpp                     | 18    | 20   | none (always on) | no        |
+| d18_20tpp_long                | 18    | 20   | none (always on) | yes       |
+| d18_20tpp_drope_50            | 18    | 20   | 50% then removed | no        |
+| d18_20tpp_drope_50_long       | 18    | 20   | 50% then removed | yes       |
+| d20_9tpp                      | 20    | 9    | none (always on) | no        |
+| d20_9tpp_drope_25             | 20    | 9    | 25% then removed | no        |
+| d20_9tpp_drope_50             | 20    | 9    | 50% then removed | no        |
+| d20_9tpp_drope_75             | 20    | 9    | 75% then removed | no        |
+| d20_20tpp                     | 20    | 20   | none (always on) | no        |
+| d20_20tpp_long                | 20    | 20   | none (always on) | yes       |
+| d20_20tpp_drope_50            | 20    | 20   | 50% then removed | no        |
+| d20_20tpp_drope_50_long       | 20    | 20   | 50% then removed | yes       |
+| d20_40tpp                     | 20    | 40   | none (always on) | no        |
+| d20_40tpp_long                | 20    | 40   | none (always on) | yes       |
+| d20_40tpp_drope_50            | 20    | 40   | 50% then removed | no        |
+| d20_40tpp_drope_50_long       | 20    | 40   | 50% then removed | yes       |
+`tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets:
+| Depth | tpp | Total pretraining tokens |
+|-------|-----|--------------------------|
+| d18   | 9   | ≈ 2.92 B |
+| d18   | 20  | ≈ 6.49 B |
+| d20   | 9   | ≈ 3.95 B |
+| d20   | 20  | ≈ 8.77 B |
+| d20   | 40  | ≈ 17.54 B |
+`drope` variants use the same total token budget as their non-drope counterpart; the budget is
+split between the RoPE-on and RoPE-removed phases as described above.
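
The token budgets can be cross-checked against the step count and batch size. This sketch assumes that each optimizer step consumes exactly `total_batch_size` tokens (1,048,576, as stated in the pretraining section) and uses the final step 6187 from `meta_006187.json`; the helper names are hypothetical.

```python
TOTAL_BATCH_SIZE = 1_048_576  # tokens per optimizer step

# (depth, tpp) -> total pretraining tokens in billions, copied from the table.
BUDGETS_B = {
    ("d18", 9): 2.92, ("d18", 20): 6.49,
    ("d20", 9): 3.95, ("d20", 20): 8.77, ("d20", 40): 17.54,
}


def tokens_for_steps(steps: int) -> int:
    """Tokens consumed after a given number of optimizer steps."""
    return steps * TOTAL_BATCH_SIZE


def tokens_per_tpp_unit(depth: str, tpp: int) -> float:
    """Billions of tokens per unit of tpp; roughly constant within a depth."""
    return BUDGETS_B[(depth, tpp)] / tpp


print(f"{tokens_for_steps(6187) / 1e9:.2f} B tokens after 6187 steps")
for depth, tpp in BUDGETS_B:
    print(depth, tpp, round(tokens_per_tpp_unit(depth, tpp), 4))
```

The first line of output should agree with the d18 @ 20tpp row, and the per-tpp ratios should be nearly identical for rows sharing a depth.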
+## Checkpoint format: which repo should I download?
+For each model tag we publish **four** Hugging Face repositories:
+| Repo suffix          | Stage            | Format                                          | Use case |
+|----------------------|------------------|-------------------------------------------------|----------|
+| `...-base`           | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
+| `...-sft`            | post-SFT         | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
+| `...-hf-base`        | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
+| `...-hf-sft`         | post-SFT         | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
+- The **`base_checkpoints`** and **`chatsft_checkpoints`** artifacts (the `-base` and `-sft` rows
+  above) are the raw nanochat outputs. They include the optimizer state (`optim_*_rank0.pt`) and
+  metadata (`meta_*.json` with training config, val BPB, step number, etc.), so you can resume
+  training or evaluate with the nanochat scripts exactly as produced by `scripts.base_train` and
+  `scripts.chat_sft`.
+- The **`hf_base`** and **`hf_sft`** artifacts (the `-hf-base` and `-hf-sft` rows above) are
+  conversions of those same weights into the Hugging Face `transformers` layout (architecture name
+  `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:
+  ```python
+  from transformers import AutoModelForCausalLM, AutoTokenizer
+  model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
+  tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
+  ```
+  `use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the
+  entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through
+  pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are
+  not applied at inference time.
+Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
+training inside the nanochat codebase.
+## Inference sketch (HF format, SFT)
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+repo = "crellis/nanochat-d20-20tpp-hf-sft"
+tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
+
+# Render the conversation with the chat template and move the tensors to the model's device.
+messages = [{"role": "user", "content": "Why is the sky blue?"}]
+inputs = tok.apply_chat_template(
+    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+).to(model.device)
+
+# Generate, then decode only the newly generated tokens (the prompt is sliced off).
+out = model.generate(**inputs, max_new_tokens=256)
+print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```
+Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
+template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
+## Training compute
+All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from
+~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.
+## Citation / acknowledgements
+- Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
+- Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
+- SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
+- RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)
+## License
+MIT (inherits from the nanochat repository).

meta_006187.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "step": 6187,
+  "val_bpb": 0.7721254779551417,
+  "model_config": {
+    "sequence_len": 4096,
+    "vocab_size": 32768,
+    "n_layer": 18,
+    "n_head": 9,
+    "n_kv_head": 9,
+    "n_embd": 1152
+  },
+  "user_config": {
+    "run": "d18",
+    "device_type": "",
+    "fp8": true,
+    "fp8_recipe": "tensorwise",
+    "depth": 18,
+    "aspect_ratio": 64,
+    "head_dim": 128,
+    "max_seq_len": 4096,
+    "num_iterations": -1,
+    "target_flops": -1.0,
+    "target_param_data_ratio": 20.0,
+    "device_batch_size": 16,
+    "total_batch_size": -1,
+    "embedding_lr": 0.3,
+    "unembedding_lr": 0.008,
+    "weight_decay": 0.28,
+    "matrix_lr": 0.02,
+    "warmup_steps": 40,
+    "warmdown_ratio": 0.65,
+    "final_lr_frac": 0.05,
+    "resume_from_step": -1,
+    "eval_every": 250,
+    "eval_tokens": 41943040,
+    "core_metric_every": 2000,
+    "core_metric_max_per_task": 500,
+    "sample_every": 2000,
+    "save_every_pct": 25.0,
+    "model_tag": "d18",
+    "rope_removal_pct": -1
+  },
+  "device_batch_size": 16,
+  "max_seq_len": 4096,
+  "total_batch_size": 1048576,
+  "dataloader_state_dict": {
+    "pq_idx": 133,
+    "rg_idx": 6,
+    "epoch": 1
+  },
+  "loop_state": {
+    "min_val_bpb": 0.7721254779551417,
+    "smooth_train_loss": 2.535478062795301,
+    "total_training_time": 15976.807270765305,
+    "use_rope": true
+  }
+}
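
A few derived quantities make the metadata above easier to read. The sketch below parses an excerpt of the file and computes tokens seen, micro-batches per optimizer step, and wall-clock hours. It assumes `total_training_time` is in seconds and that one micro-batch is `device_batch_size * max_seq_len` tokens; neither is stated in the file, and `summarize` is a hypothetical helper.

```python
import json

# Excerpt of meta_006187.json, limited to the fields used below.
META_EXCERPT = """
{
  "step": 6187,
  "val_bpb": 0.7721254779551417,
  "device_batch_size": 16,
  "max_seq_len": 4096,
  "total_batch_size": 1048576,
  "loop_state": {"total_training_time": 15976.807270765305}
}
"""


def summarize(meta: dict) -> dict:
    """Derive headline numbers from a nanochat-style meta dictionary."""
    micro_batch_tokens = meta["device_batch_size"] * meta["max_seq_len"]
    return {
        "tokens_seen": meta["step"] * meta["total_batch_size"],
        # Micro-batches per optimizer step, summed over all ranks.
        "micro_batches_per_step": meta["total_batch_size"] // micro_batch_tokens,
        # Assumes total_training_time is measured in seconds.
        "training_hours": meta["loop_state"]["total_training_time"] / 3600,
    }


summary = summarize(json.loads(META_EXCERPT))
print(summary)
```

The file does not record the number of ranks, so the micro-batch count cannot be separated into data parallelism and gradient accumulation from this metadata alone.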

model_006187.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a99e443949e8bd838d6bdc80f9333255fb8fbfe64f26d431c1a80a2ae01bfa3d
+size 1373162077

optim_006187_rank0.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c8b6732960db8d0ffef63b8123c0219dbaefd136bc696e55f3be6d33e22e0503
+size 1600603109
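
The two `.pt` entries above are Git LFS pointer files, not the binaries themselves: each records a spec version, a SHA-256 object ID, and a size in bytes. As an illustration, the following sketch parses the `model_006187.pt` pointer and reports the download size; `parse_lfs_pointer` is a hypothetical helper, not part of `huggingface_hub` or Git LFS.

```python
# Pointer text for model_006187.pt, copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a99e443949e8bd838d6bdc80f9333255fb8fbfe64f26d431c1a80a2ae01bfa3d
size 1373162077
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['algo']} {pointer['digest'][:12]}... {pointer['size'] / 2**30:.2f} GiB")
```

After downloading, the file's SHA-256 digest and byte size can be compared against these fields to confirm the transfer was complete.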