sdobson committed
Commit df8dc5a · verified · 1 Parent(s): dee377b

Update README.md

Files changed (1)
  1. README.md +142 -148

README.md CHANGED
@@ -1,155 +1,149 @@
- ---
- language:
- - en
- license: mit
- tags:
- - text-generation
- - transformer
- - conversational
- datasets:
- - HuggingFaceFW/fineweb-edu
- - cais/mmlu
- - gsm8k
- model-index:
- - name: nanochat
-   results:
-   - task:
-       type: text-generation
-     dataset:
-       name: MMLU
-       type: cais/mmlu
-     metrics:
-     - type: accuracy
-       value: 31.51
-   - task:
-       type: text-generation
-     dataset:
-       name: GSM8K
-       type: gsm8k
-     metrics:
-     - type: accuracy
-       value: 4.55
-   - task:
-       type: text-generation
-     dataset:
-       name: HumanEval
-       type: openai_humaneval
-     metrics:
-     - type: pass@1
-       value: 8.54
- ---
-
- # nanochat
-
- **nanochat** is a 561M parameter transformer language model trained for conversational AI tasks. This model demonstrates that capable chat models
- can be trained efficiently on modest hardware budgets (~$100 on 8x H100 GPUs).
-
- ## Model Description
-
- - **Developed by:** Andrej Karpathy
- - **Model type:** Transformer-based causal language model
- - **Language(s):** English
- - **License:** MIT
- - **Parameters:** 560,988,160 (~561M)
-
- ### Architecture
-
- - **Layers:** 20
- - **Hidden size:** 1280 channels
- - **Attention heads:** 10
- - **Head dimension:** 128
- - **Vocabulary size:** 65,536 tokens
-
- ## Training Details
-
- ### Training Data
-
- nanochat was trained in multiple stages:
-
- 1. **Pretraining:** 100B token subset of FineWeb-EDU (11.2B tokens processed)
- 2. **Midtraining:** SmolTalk conversations, MMLU multiple choice questions, GSM8K math problems
- 3. **Supervised Fine-tuning (SFT):** Conversational adaptation data
-
- ### Training Procedure
-
- #### Tokenization
- - Custom Rust-based tokenizer
- - Vocabulary: 65,536 tokens
- - Compression ratio: 4.8 characters per token
-
- #### Training Infrastructure
- - **Hardware:** 8x H100 GPUs (Lambda GPU Cloud)
- - **Training time:** ~3 hours for pretraining stage
- - **Estimated compute:** ~4e19 FLOPs
- - **Total cost:** ~$100
-
- #### Training Stages
- The model was trained in three stages:
- 1. **Pretraining** on web text (FineWeb-EDU)
- 2. **Midtraining** on domain-specific datasets (reasoning, conversation, math)
- 3. **Supervised fine-tuning** for chat optimization
-
- ## Performance
-
- ### Benchmark Results
-
- | Benchmark | Score | Description |
- |-----------|-------|-------------|
- | **MMLU** | 31.51% | Multitask language understanding |
- | **GSM8K** | 4.55% | Grade school math problems |
- | **HumanEval** | 8.54% | Python code generation |
- | **ARC-Easy** | 38.76% | Science questions (easy) |
- | **ARC-Challenge** | 28.07% | Science questions (hard) |
- | **ChatCORE** | 8.84% | Conversational reasoning |
-
- ### Training Progress
-
- | Stage | CORE Score |
- |-------|-----------|
- | Base (after pretraining) | 22.19% |
- | After Midtraining | - |
- | After SFT | - |
-
- ## Intended Use
-
- ### Direct Use
-
- nanochat is designed for:
- - Conversational AI applications
- - Research on efficient language model training
- - Educational purposes for understanding LLM training pipelines
- - Low-resource deployment scenarios
-
- ### Downstream Use
-
- The model can be fine-tuned for specific conversational tasks or used as a base model for further domain adaptation.
-
- ### Out-of-Scope Use
-
- - Production-grade conversational AI (the model is relatively small and has limited capabilities)
- - Tasks requiring specialised knowledge or high accuracy
- - Critical applications where errors could cause harm
-
- ## Limitations and Bias
-
- - **Small scale:** At 561M parameters, this model has significantly fewer capabilities than larger models (1B+ parameters)
- - **Limited training:** Trained on only 11.2B tokens, which is modest by modern standards
- - **Performance:** Benchmark scores indicate limited reasoning and mathematical capabilities
- - **Bias:** Inherits biases from training data (FineWeb-EDU, SmolTalk, etc.)
- - **Language:** English-only
-
- ## Citation
-
- **Repository:** [github.com/karpathy/nanochat](https://github.com/karpathy/nanochat)
-
- ```bibtex
- @software{nanochat2025,
-   author = {Karpathy, Andrej},
-   title = {nanochat: A 561M parameter conversational language model},
-   year = {2025},
-   url = {https://github.com/karpathy/nanochat}
- }
-
- Model Card Author
-
- Sam Dobson
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - text-generation
+ - transformer
+ - conversational
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ - cais/mmlu
+ - gsm8k
+ model-index:
+ - name: nanochat
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: MMLU
+       type: cais/mmlu
+     metrics:
+     - type: accuracy
+       value: 31.51
+   - task:
+       type: text-generation
+     dataset:
+       name: GSM8K
+       type: gsm8k
+     metrics:
+     - type: accuracy
+       value: 4.55
+   - task:
+       type: text-generation
+     dataset:
+       name: HumanEval
+       type: openai_humaneval
+     metrics:
+     - type: pass@1
+       value: 8.54
+ ---
+
+ # nanochat
+
+ **nanochat** is a 561M parameter transformer language model trained for conversational AI tasks. This model demonstrates that capable chat models
+ can be trained efficiently on modest hardware budgets (~$100 on 8x H100 GPUs).
+
+ ## Model Description
+
+ - **Developed by:** Andrej Karpathy
+ - **Trained by:** Sam Dobson
+ - **Model type:** Transformer-based causal language model
+ - **Language(s):** English
+ - **License:** MIT
+ - **Parameters:** 560,988,160 (~561M)
+
+ ### Architecture
+
+ - **Layers:** 20
+ - **Hidden size:** 1280 channels
+ - **Attention heads:** 10
+ - **Head dimension:** 128
+ - **Vocabulary size:** 65,536 tokens
+
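As a quick cross-check of the parameter figure in the card, the listed architecture reproduces 560,988,160 exactly under one plausible layout; the layout details (untied input/output embeddings, bias-free projections, a 4x-wide MLP, no learnable norm parameters) are assumptions, since the card does not spell them out:

```python
# Back-of-the-envelope parameter count from the architecture listed above.
# Layout assumptions (not stated in the card): untied token embedding and
# LM head, no biases, a 4x-wide MLP, no learnable normalization parameters.
vocab, d_model, n_layers = 65_536, 1_280, 20

embeddings = 2 * vocab * d_model          # token embedding + LM head
attention = 4 * d_model * d_model         # Q, K, V and output projections
mlp = 2 * d_model * (4 * d_model)         # up- and down-projections
per_layer = attention + mlp

total = embeddings + n_layers * per_layer
print(f"{total:,}")                       # 560,988,160
```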
+ ## Training Details
+
+ ### Training Data
+
+ nanochat was trained in multiple stages:
+
+ 1. **Pretraining:** 100B token subset of FineWeb-EDU (11.2B tokens processed)
+ 2. **Midtraining:** SmolTalk conversations, MMLU multiple choice questions, GSM8K math problems
+ 3. **Supervised Fine-tuning (SFT):** Conversational adaptation data
+
+ ### Training Procedure
+
+ #### Tokenization
+ - Custom Rust-based tokenizer
+ - Vocabulary: 65,536 tokens
+ - Compression ratio: 4.8 characters per token
+
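For a sense of scale, the quoted compression ratio converts the pretraining token count back into raw text size; a rough, illustrative conversion:

```python
# Rough conversion using the ~4.8 characters/token ratio quoted above.
tokens = 11.2e9            # pretraining tokens processed (from the card)
chars_per_token = 4.8
print(f"~{tokens * chars_per_token / 1e9:.0f}B characters")  # ~54B characters of text
```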
+ #### Training Infrastructure
+ - **Hardware:** 8x H100 GPUs (Lambda GPU Cloud)
+ - **Training time:** ~3 hours for pretraining stage
+ - **Estimated compute:** ~4e19 FLOPs
+ - **Total cost:** ~$100
+
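The compute and cost figures are consistent with standard rules of thumb; a minimal sanity check, where the wall-clock hours and the hourly node rate are assumptions for illustration rather than figures from the card:

```python
# Sanity-check "~4e19 FLOPs" with the common C ≈ 6 * N * D approximation.
N = 560_988_160            # parameters
D = 11.2e9                 # pretraining tokens processed
print(f"{6 * N * D:.2e} FLOPs")               # ~3.77e+19, close to the quoted ~4e19

# Illustrative cost arithmetic; total hours and node rate are assumed values.
hours = 4                  # ~3 h pretraining plus midtraining and SFT
usd_per_node_hour = 24.0   # assumed rate for an 8x H100 node
print(f"~${hours * usd_per_node_hour:.0f}")   # ~$96, in line with the ~$100 budget
```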
+ #### Training Stages
+ The model was trained in three stages:
+ 1. **Pretraining** on web text (FineWeb-EDU)
+ 2. **Midtraining** on domain-specific datasets (reasoning, conversation, maths)
+ 3. **Supervised fine-tuning** for chat optimisation
+
+ ## Performance
+
+ ### Benchmark Results
+
+ | Benchmark | Score | Description |
+ |-----------|-------|-------------|
+ | **MMLU** | 23.99% | Multitask language understanding |
+ | **GSM8K** | 4.47% | Grade school math problems |
+ | **HumanEval** | 6.71% | Python code generation |
+ | **ARC-Easy** | 24.79% | Science questions (easy) |
+ | **ARC-Challenge** | 24.32% | Science questions (hard) |
+ | **ChatCORE** | 1.73% | Conversational reasoning |
+
+ ## Intended Use
+
+ ### Direct Use
+
+ nanochat is designed for:
+ - Conversational AI applications
+ - Research on efficient language model training
+ - Educational purposes for understanding LLM training pipelines
+ - Low-resource deployment scenarios
+
+ ### Downstream Use
+
+ The model can be fine-tuned for specific conversational tasks or used as a base model for further domain adaptation.
+
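A minimal sketch of fetching the checkpoint for local experimentation; the repository id is assumed from this model page (hypothetical, adjust as needed), and inference or fine-tuning is expected to go through the scripts in the nanochat GitHub repository rather than a standard `transformers` pipeline:

```python
# Minimal download sketch; the repo id below is an assumption, not confirmed by the card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="sdobson/nanochat")
print(local_dir)  # point the nanochat training/inference scripts at this directory
```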
+ ### Out-of-Scope Use
+
+ - Production-grade conversational AI (the model is relatively small and has limited capabilities)
+ - Tasks requiring specialised knowledge or high accuracy
+ - Critical applications where errors could cause harm
+
+ ## Limitations and Bias
+
+ - **Small scale:** At 561M parameters, this model has significantly fewer capabilities than larger models (1B+ parameters)
+ - **Limited training:** Trained on only 11.2B tokens, which is modest by modern standards
+ - **Performance:** Benchmark scores indicate limited reasoning and mathematical capabilities
+ - **Bias:** Inherits biases from training data (FineWeb-EDU, SmolTalk, etc.)
+ - **Language:** English-only
+
+ ## Citation
+
+ **Repository:** [github.com/karpathy/nanochat](https://github.com/karpathy/nanochat)
+
+ ```bibtex
+ @software{nanochat2025,
+   author = {Karpathy, Andrej},
+   title = {nanochat: A 561M parameter conversational language model},
+   year = {2025},
+   url = {https://github.com/karpathy/nanochat}
+ }
+ ```
+
+ ## Model Card Author
+
+ Sam Dobson