Commit b1af749 by crellis (verified) · 1 Parent(s): d665b85

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +180 -0
  2. meta_006187.json +57 -0
  3. model_006187.pt +3 -0
  4. optim_006187_rank0.pt +3 -0
README.md ADDED
@@ -0,0 +1,180 @@
+ ---
+ license: mit
+ library_name: transformers
+ tags:
+ - nanochat
+ - causal-lm
+ - long-context
+ - rope
+ datasets:
+ - nvidia/ClimbMix
+ - HuggingFaceTB/smol-smoltalk
+ - cais/mmlu
+ - openai/gsm8k
+ - allenai/tulu-v2-sft-long-mixture
+ pipeline_tag: text-generation
+ ---
+
+ # nanochat miniseries
+
+ This repository is part of a miniseries of small (~360M–480M parameter) decoder-only transformers
+ trained on top of Andrej Karpathy's [`nanochat`](https://github.com/karpathy/nanochat) codebase.
+ The series varies three axes: **depth** (model size), **tokens-per-parameter** (pretraining horizon),
+ and **RoPE removal schedule** (the fraction of the pretraining token budget spent with RoPE before it
+ is dropped for the remainder, used to study the role of positional encoding in long-context
+ generalization). A subset of the SFT models is additionally fine-tuned on a long-context mixture
+ (`_long` variants).
+
+ All models share the same tokenizer: a BPE tokenizer with vocab size 32,768 trained on ~2B characters
+ of the pretraining corpus.
+
+ ## Training pipeline
+
+ Each model goes through the following stages:
+
+ 1. **Tokenizer training**: 32,768-vocab BPE trained on ~2B characters of the pretraining dataset.
+ 2. **Pretraining (base)**: Next-token prediction on NVIDIA's ClimbMix-400B corpus, hosted at
+    [`karpathy/climbmix-400b-shuffle`](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle).
+    Horizon is controlled by `target_param_data_ratio` (aka "tpp" in model names), i.e. tokens
+    trained per model parameter. Sequence length 4096, batch size 1,048,576 tokens, AdamW + Muon
+    optimizer.
+ 3. **Supervised fine-tuning (SFT)**: Instruction tuning on a mixture of:
+    - [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk): 460K general conversations
+    - Synthetic identity conversations (from [karpathy-public S3](https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl)): 1K rows × 2 epochs
+    - [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) `auxiliary_train`: 100K rows × 3 epochs (multiple choice)
+    - [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) `main`: 8K rows × 4 epochs (math + tool use)
+    - SimpleSpelling: 200K synthetic spelling examples
+    - SpellingBee: 80K synthetic letter-counting examples
+ 4. **Long-context SFT (`_long` variants only)**: Same mixture plus 100K rows of
+    [`allenai/tulu-v2-sft-long-mixture`](https://huggingface.co/datasets/allenai/tulu-v2-sft-long-mixture),
+    with sequence length extended to 8,192.
+
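As a rough sanity check on the SFT mixture listed in stage 3, the sketch below adds up the effective number of rows seen per pass over the mixture. The row counts are the rounded figures from the list; treating every source without a stated epoch count (smol-smoltalk, SimpleSpelling, SpellingBee, and the long-context rows) as a single epoch is an assumption, and the dictionary keys are illustrative labels rather than identifiers from the nanochat code.

```python
# Rounded row counts and epoch multipliers taken from the SFT mixture list.
# Sources with no stated epoch count are ASSUMED to be used for one epoch.
SFT_MIXTURE = {
    "smol-smoltalk": (460_000, 1),
    "identity_conversations": (1_000, 2),
    "mmlu_auxiliary_train": (100_000, 3),
    "gsm8k_main": (8_000, 4),
    "SimpleSpelling": (200_000, 1),
    "SpellingBee": (80_000, 1),
}

# Extra rows added for the `_long` variants (assumed one epoch).
LONG_CONTEXT_ROWS = 100_000


def effective_rows(mixture, extra_rows=0):
    """Sum rows * epochs over all sources, plus any extra rows."""
    return sum(rows * epochs for rows, epochs in mixture.values()) + extra_rows


standard_total = effective_rows(SFT_MIXTURE)
long_total = effective_rows(SFT_MIXTURE, LONG_CONTEXT_ROWS)

print(f"standard SFT mixture: {standard_total:,} rows")
print(f"long-context SFT mixture: {long_total:,} rows")
```

Under these assumptions, general conversations make up a bit under half of the standard mixture, with the synthetic spelling tasks and repeated MMLU rows accounting for most of the rest.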
+ ## RoPE removal (drope) experiment
+
+ Model names containing `drope_XX` follow the recipe from
+ [*"Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings"*](https://arxiv.org/pdf/2512.12167):
+ the model is pretrained normally with RoPE for the first `XX%` of its token budget, RoPE is then
+ removed, and the remaining `(100 − XX)%` of the pretraining budget is used to recalibrate the
+ model without positional encodings. For example, `drope_50` means 50% of the token budget was
+ spent with RoPE and the remaining 50% was spent with RoPE removed. This is intended to preserve
+ the optimization benefits of RoPE early in training while producing a NoPE-style model that
+ generalizes better to long contexts at inference time. Models without `drope` in the name keep
+ RoPE in every layer for the full pretraining budget (theta = 100,000).
+
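To make the `drope_XX` split concrete, here is a minimal sketch that divides a step budget into a RoPE-on phase and a RoPE-removed phase. The function name and the choice to round the switch point down are assumptions for illustration; the actual training script may place the boundary differently. The 6,187-step figure is the step count recorded in `meta_006187.json` for the d18 @ 20tpp run, and it is assumed here that the `drope_50` counterpart runs for the same number of steps.

```python
def drope_phase_split(total_steps: int, rope_pct: int) -> tuple[int, int]:
    """Return (steps with RoPE, steps without RoPE) for a drope_XX schedule.

    Hypothetical helper: the switch point is ASSUMED to be
    floor(total_steps * rope_pct / 100).
    """
    if not 0 <= rope_pct <= 100:
        raise ValueError("rope_pct must be between 0 and 100")
    rope_steps = (total_steps * rope_pct) // 100
    return rope_steps, total_steps - rope_steps


# d18 @ 20tpp: 6,187 optimizer steps of 1,048,576 tokens each.
with_rope, without_rope = drope_phase_split(6187, 50)
print(f"drope_50: {with_rope} steps with RoPE, {without_rope} steps without")
```

The two phases always sum to the full budget, which matches the statement that drope variants use the same total number of tokens as their always-on counterparts.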
+ ## Model sizes
+
+ | Depth | Layers | Hidden | Heads | Intermediate | Approx params |
+ |-------|--------|--------|-------|--------------|---------------|
+ | d18   | 18     | 1152   | 9     | 3072         | ~360M         |
+ | d20   | 20     | 1280   | 10    | 3456         | ~480M         |
+
+ All models use head_dim=128, vocab=32,768, RMSNorm (ε=1e-6), SwiGLU MLP, and final logit softcapping at 15.0.
+
+ ## Released checkpoints
+
+ RoPE schedule column: `none (always on)` means RoPE is kept on for the full pretraining budget.
+ `XX% then removed` (e.g. `50% then removed`) means RoPE is kept on for the first `XX%` of the token
+ budget and then removed for the remaining `(100 − XX)%` of pretraining, per the drope recipe above.
+
+ | Model tag                     | Depth | tpp  | RoPE schedule    | Long-ctx SFT |
+ |-------------------------------|-------|------|------------------|--------------|
+ | d18_9tpp                      | 18    | 9    | none (always on) | no           |
+ | d18_9tpp_drope_25             | 18    | 9    | 25% then removed | no           |
+ | d18_9tpp_drope_50             | 18    | 9    | 50% then removed | no           |
+ | d18_9tpp_drope_75             | 18    | 9    | 75% then removed | no           |
+ | d18_20tpp                     | 18    | 20   | none (always on) | no           |
+ | d18_20tpp_long                | 18    | 20   | none (always on) | yes          |
+ | d18_20tpp_drope_50            | 18    | 20   | 50% then removed | no           |
+ | d18_20tpp_drope_50_long       | 18    | 20   | 50% then removed | yes          |
+ | d20_9tpp                      | 20    | 9    | none (always on) | no           |
+ | d20_9tpp_drope_25             | 20    | 9    | 25% then removed | no           |
+ | d20_9tpp_drope_50             | 20    | 9    | 50% then removed | no           |
+ | d20_9tpp_drope_75             | 20    | 9    | 75% then removed | no           |
+ | d20_20tpp                     | 20    | 20   | none (always on) | no           |
+ | d20_20tpp_long                | 20    | 20   | none (always on) | yes          |
+ | d20_20tpp_drope_50            | 20    | 20   | 50% then removed | no           |
+ | d20_20tpp_drope_50_long       | 20    | 20   | 50% then removed | yes          |
+ | d20_40tpp                     | 20    | 40   | none (always on) | no           |
+ | d20_40tpp_long                | 20    | 40   | none (always on) | yes          |
+ | d20_40tpp_drope_50            | 20    | 40   | 50% then removed | no           |
+ | d20_40tpp_drope_50_long       | 20    | 40   | 50% then removed | yes          |
+
+ `tpp` = tokens-per-parameter pretraining horizon. Total pretraining token budgets:
+
+ | Depth | tpp | Total pretraining tokens |
+ |-------|-----|--------------------------|
+ | d18   | 9   | ≈ 2.92 B                 |
+ | d18   | 20  | ≈ 6.49 B                 |
+ | d20   | 9   | ≈ 3.95 B                 |
+ | d20   | 20  | ≈ 8.77 B                 |
+ | d20   | 40  | ≈ 17.54 B                |
+
+ `drope` variants use the same total token budget as their non-drope counterpart; the budget is
+ split between the RoPE-on and RoPE-removed phases as described above.
+
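The token budgets in the table can be cross-checked against the step count of a checkpoint. The sketch below does this for the d18 @ 20tpp run, using the 6,187 steps from the checkpoint filename and the 1,048,576-token batch size stated in the training pipeline; the helper name is illustrative only.

```python
TOKENS_PER_STEP = 1_048_576  # batch size in tokens (2**20), from the training pipeline


def tokens_for_steps(steps: int, tokens_per_step: int = TOKENS_PER_STEP) -> int:
    """Total tokens consumed after `steps` optimizer steps."""
    return steps * tokens_per_step


# d18 @ 20tpp: the checkpoint in this commit was saved at step 6187.
d18_20tpp_tokens = tokens_for_steps(6187)
print(f"{d18_20tpp_tokens:,} tokens = {d18_20tpp_tokens / 1e9:.2f} B")
```

The result rounds to the ≈ 6.49 B entry listed for d18 at 20 tpp.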
+ ## Checkpoint format: which repo should I download?
+
+ For each model tag we publish **four** Hugging Face repositories:
+
+ | Repo suffix   | Stage            | Format | Use case |
+ |---------------|------------------|--------|----------|
+ | `...-base`    | post-pretraining | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
+ | `...-sft`     | post-SFT         | nanochat native (`model_XXXXXX.pt`, `meta_*.json`, optimizer shard) | continue training / run with the `nanochat` repo |
+ | `...-hf-base` | post-pretraining | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
+ | `...-hf-sft`  | post-SFT         | Hugging Face `transformers` (`config.json`, `model.safetensors`, `tokenizer.json`) | drop-in `AutoModelForCausalLM` loading |
+
+ - The **`base_checkpoints`** and **`chatsft_checkpoints`** artifacts (the `-base` and `-sft` repos) are
+   the raw nanochat outputs. They include the optimizer state (`optim_*_rank0.pt`) and metadata
+   (`meta_*.json` with training config, val BPB, step number, etc.), so you can resume training or
+   evaluate with the nanochat scripts exactly as produced by `scripts.base_train` and `scripts.chat_sft`.
+ - The **`hf_base`** and **`hf_sft`** artifacts (the `-hf-base` and `-hf-sft` repos) are conversions of
+   those same weights into the Hugging Face `transformers` layout (architecture name
+   `NanoChatForCausalLM`, `model_type` `nanochat`). Load them with:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ model = AutoModelForCausalLM.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("crellis/nanochat-d20-20tpp-hf-sft", trust_remote_code=True)
+ ```
+
+ `use_rope` in `config.json` reflects the drope setting: `true` for models that kept RoPE for the
+ entire pretraining budget, and `false` for drope variants (where RoPE was removed partway through
+ pretraining and the model was recalibrated without it). In the drope case, rotary embeddings are
+ not applied at inference time.
+
+ Pick `-hf-base` / `-hf-sft` for inference. Pick `-base` / `-sft` only if you plan to continue
+ training inside the nanochat codebase.
+
+ ## Inference sketch (HF format, SFT)
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo = "crellis/nanochat-d20-20tpp-hf-sft"
+ tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+ # Load the weights in bfloat16 and move the model to the GPU.
+ model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
+
+ messages = [{"role": "user", "content": "Why is the sky blue?"}]
+ # Render the chat template to token ids and place them on the same device as the model.
+ inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
+ out = model.generate(inputs, max_new_tokens=256)
+ # Decode only the newly generated tokens, skipping the prompt.
+ print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
+ ```
+
+ Base (pretrained-only) checkpoints are next-token predictors and do not understand the chat
+ template; use `-hf-base` for completion-style prompting and `-hf-sft` for chat.
+
+ ## Training compute
+
+ All runs were trained on a single H100 GPU via Slurm. Pretraining wall-clock ranges from
+ ~4 hours (d18 @ 9tpp) to ~15 hours (d20 @ 40tpp); SFT adds ~30–90 minutes depending on variant.
+
+ ## Citation / acknowledgements
+
+ - Codebase: [`karpathy/nanochat`](https://github.com/karpathy/nanochat)
+ - Pretraining data: NVIDIA ClimbMix (via `karpathy/climbmix-400b-shuffle`)
+ - SFT data: HuggingFaceTB SmolTalk, CAIS MMLU, OpenAI GSM8K, AI2 Tulu-v2 long-mixture
+ - RoPE-removal recipe: [*Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings*](https://arxiv.org/pdf/2512.12167) (arXiv:2512.12167)
+
+ ## License
+
+ MIT (inherits from the nanochat repository).
meta_006187.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "step": 6187,
+   "val_bpb": 0.7721254779551417,
+   "model_config": {
+     "sequence_len": 4096,
+     "vocab_size": 32768,
+     "n_layer": 18,
+     "n_head": 9,
+     "n_kv_head": 9,
+     "n_embd": 1152
+   },
+   "user_config": {
+     "run": "d18",
+     "device_type": "",
+     "fp8": true,
+     "fp8_recipe": "tensorwise",
+     "depth": 18,
+     "aspect_ratio": 64,
+     "head_dim": 128,
+     "max_seq_len": 4096,
+     "num_iterations": -1,
+     "target_flops": -1.0,
+     "target_param_data_ratio": 20.0,
+     "device_batch_size": 16,
+     "total_batch_size": -1,
+     "embedding_lr": 0.3,
+     "unembedding_lr": 0.008,
+     "weight_decay": 0.28,
+     "matrix_lr": 0.02,
+     "warmup_steps": 40,
+     "warmdown_ratio": 0.65,
+     "final_lr_frac": 0.05,
+     "resume_from_step": -1,
+     "eval_every": 250,
+     "eval_tokens": 41943040,
+     "core_metric_every": 2000,
+     "core_metric_max_per_task": 500,
+     "sample_every": 2000,
+     "save_every_pct": 25.0,
+     "model_tag": "d18",
+     "rope_removal_pct": -1
+   },
+   "device_batch_size": 16,
+   "max_seq_len": 4096,
+   "total_batch_size": 1048576,
+   "dataloader_state_dict": {
+     "pq_idx": 133,
+     "rg_idx": 6,
+     "epoch": 1
+   },
+   "loop_state": {
+     "min_val_bpb": 0.7721254779551417,
+     "smooth_train_loss": 2.535478062795301,
+     "total_training_time": 15976.807270765305,
+     "use_rope": true
+   }
+ }
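A few derived quantities can be read straight off this metadata file. The sketch below parses an excerpt of the JSON above (inlined as a string so the example is self-contained; in practice one would `json.load` the downloaded `meta_006187.json`) and computes tokens seen, head dimension, wall-clock hours, and the gradient accumulation factor. The accumulation formula and `world_size = 1` are assumptions, the latter based on the README's statement that runs used a single GPU.

```python
import json

# Excerpt of meta_006187.json, limited to the fields used below.
META_EXCERPT = """
{
  "step": 6187,
  "val_bpb": 0.7721254779551417,
  "model_config": {"sequence_len": 4096, "n_layer": 18, "n_head": 9, "n_embd": 1152},
  "device_batch_size": 16,
  "max_seq_len": 4096,
  "total_batch_size": 1048576,
  "loop_state": {"total_training_time": 15976.807270765305, "use_rope": true}
}
"""

meta = json.loads(META_EXCERPT)

# Tokens consumed so far: one optimizer step processes total_batch_size tokens.
tokens_seen = meta["step"] * meta["total_batch_size"]

# Per-head width implied by the model config.
head_dim = meta["model_config"]["n_embd"] // meta["model_config"]["n_head"]

# ASSUMPTION: one GPU, and the global batch is reached by accumulating
# micro-batches of device_batch_size sequences of max_seq_len tokens.
world_size = 1
tokens_per_micro_batch = meta["device_batch_size"] * meta["max_seq_len"] * world_size
grad_accum_steps = meta["total_batch_size"] // tokens_per_micro_batch

# total_training_time is taken to be in seconds.
training_hours = meta["loop_state"]["total_training_time"] / 3600

print(f"tokens seen:      {tokens_seen:,}")
print(f"head_dim:         {head_dim}")
print(f"grad accum steps: {grad_accum_steps}")
print(f"training time:    {training_hours:.2f} h")
print(f"use_rope:         {meta['loop_state']['use_rope']}")
```

Here `rope_removal_pct` is `-1` and `use_rope` is `true`, which is consistent with an always-on RoPE run rather than a drope variant.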
model_006187.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a99e443949e8bd838d6bdc80f9333255fb8fbfe64f26d431c1a80a2ae01bfa3d
+ size 1373162077
optim_006187_rank0.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c8b6732960db8d0ffef63b8123c0219dbaefd136bc696e55f3be6d33e22e0503
+ size 1600603109
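The two `.pt` entries in this commit are Git LFS pointer files rather than the binaries themselves: each records a spec version, a SHA-256 object id, and the size in bytes. A minimal sketch of reading those fields, using a hypothetical `parse_lfs_pointer` helper written for this example (it is not part of `huggingface_hub` or Git LFS):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


MODEL_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a99e443949e8bd838d6bdc80f9333255fb8fbfe64f26d431c1a80a2ae01bfa3d
size 1373162077
"""

OPTIM_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c8b6732960db8d0ffef63b8123c0219dbaefd136bc696e55f3be6d33e22e0503
size 1600603109
"""

model_ptr = parse_lfs_pointer(MODEL_POINTER)
optim_ptr = parse_lfs_pointer(OPTIM_POINTER)
total_bytes = model_ptr["size_bytes"] + optim_ptr["size_bytes"]

print(f"model:     {model_ptr['size_bytes'] / 1e9:.2f} GB")
print(f"optimizer: {optim_ptr['size_bytes'] / 1e9:.2f} GB")
print(f"total:     {total_bytes / 1e9:.2f} GB")
```

After downloading, the digest can be compared against `hashlib.sha256` of the file contents to confirm the transfer was intact.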