JonathanMiddleton committed on
Commit 3a8fbd1 · verified · 1 Parent(s): e1b7a80

Delete files token_bytes.pt tokenizer.pkl report.md meta_000650.json with huggingface_hub

Files changed (4)
  1. meta_000650.json +0 -14
  2. report.md +0 -269
  3. token_bytes.pt +0 -3
  4. tokenizer.pkl +0 -3
meta_000650.json DELETED
@@ -1,14 +0,0 @@
- {
- "step": 650,
- "val_loss": 1.012525200843811,
- "mmlu_acc": 0.328125,
- "arc_easy_acc": 0.4296875,
- "model_config": {
- "sequence_len": 2048,
- "vocab_size": 65536,
- "n_layer": 20,
- "n_head": 10,
- "n_kv_head": 10,
- "n_embd": 1280
- }
- }
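The deleted meta_000650.json is plain JSON, so it can be inspected with the standard library alone. A minimal sketch, with the file contents embedded from the diff above rather than read from disk:

```python
import json

# Contents of the deleted meta_000650.json, reconstructed from the diff above.
meta = json.loads("""
{
  "step": 650,
  "val_loss": 1.012525200843811,
  "mmlu_acc": 0.328125,
  "arc_easy_acc": 0.4296875,
  "model_config": {
    "sequence_len": 2048,
    "vocab_size": 65536,
    "n_layer": 20,
    "n_head": 10,
    "n_kv_head": 10,
    "n_embd": 1280
  }
}
""")

cfg = meta["model_config"]
# The token embedding table alone is vocab_size * n_embd parameters.
embedding_params = cfg["vocab_size"] * cfg["n_embd"]
print(meta["step"], meta["val_loss"], embedding_params)  # 650 1.012525200843811 83886080
```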
report.md DELETED
@@ -1,269 +0,0 @@
- # nanochat training report
-
- Generated: 2025-10-17 16:50:29
-
- ## Environment
-
- ### Git Information
- - Branch: master
- - Commit: d6d86cb (dirty)
- - Message: update readme with a link to the CPU|MPS branch
-
- ### Hardware
- - Platform: Linux
- - CPUs: 104 cores (104 logical)
- - Memory: 1007.4 GB
- - GPUs: 8x NVIDIA H100 80GB HBM3
- - GPU Memory: 633.7 GB total
- - CUDA Version: 12.8
- - Hourly Rate: $24.00/hour
-
- ### Software
- - Python: 3.12.12
- - PyTorch: 2.9.0+cu128
-
-
- ### Bloat
- - Characters: 351,931
- - Lines: 8,552
- - Files: 43
- - Tokens (approx): 87,982
- - Dependencies (uv.lock lines): 2,004
-
- Run started: 2025-10-17 16:50:31
-
- ---
-
- ## Tokenizer training
- timestamp: 2025-10-17 16:51:35
-
- - max_chars: 2,000,000,000
- - doc_cap: 10,000
- - vocab_size: 65,536
- - train_time: 55.2927
- - num_special_tokens: 9
- - token_bytes_min: 1
- - token_bytes_max: 32
- - token_bytes_mean: 6.9197
- - token_bytes_std: 2.8748
-
-
- ## Tokenizer evaluation
- timestamp: 2025-10-17 16:51:40
-
- ### Comparison with GPT-2
-
- | Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
- |-----------|-------|--------------|--------------|-------------|------------|-----------------|
- | news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
- | korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
- | code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
- | math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
- | science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
- | fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
- | fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% |
-
- ### Comparison with GPT-4
-
- | Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
- |-----------|-------|--------------|--------------|-------------|------------|-----------------|
- | news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
- | korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
- | code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
- | math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
- | science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
- | fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
- | fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% |
-
-
- ## Base model training
- timestamp: 2025-10-17 20:00:41
-
- - run: nanochat
- - depth: 20
- - max_seq_len: 2048
- - num_iterations: -1
- - target_flops: -1.0000
- - target_param_data_ratio: 20
- - device_batch_size: 32
- - total_batch_size: 524,288
- - embedding_lr: 0.2000
- - unembedding_lr: 0.0040
- - weight_decay: 0.0000
- - matrix_lr: 0.0200
- - grad_clip: 1.0000
- - eval_every: 250
- - eval_tokens: 10,485,760
- - core_metric_every: 2000
- - core_metric_max_per_task: 500
- - sample_every: 2000
- - model_tag:
- - Number of parameters: 560,988,160
- - Number of FLOPs per token: 3.491758e+09
- - Calculated number of iterations: 21,400
- - Number of training tokens: 11,219,763,200
- - Tokens : Params ratio: 20.0000
- - DDP world size: 8
- - warmup_ratio: 0.0000
- - warmdown_ratio: 0.2000
- - final_lr_frac: 0.0000
- - Minimum validation bpb: 0.8118
- - Final validation bpb: 0.8118
- - CORE metric estimate: 0.2232
- - MFU %: 47.92%
- - Total training flops: 3.917670e+19
- - Total training time: 173.10m
- - Peak memory usage: 75422.02MiB
-
-
- ## Base model loss
- timestamp: 2025-10-17 20:01:31
-
- - train bpb: 0.8146
- - val bpb: 0.8120
- - sample 0: <|bos|>The capital of France is Paris. Paris is the capital of France. Paris is the capital of France.
- - sample 1: <|bos|>The chemical symbol of gold is Au. The atomic number of gold is 79. The atomic mass of gold
- - sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Tuesday. The day after tomorrow will be Tuesday, and the day after that will
- - sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
- - sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
- - sample 5: <|bos|>My favorite color is red. I love it because it is so bright and it is so easy to
- - sample 6: <|bos|>If 5*x + 3 = 13, then x is the greatest common divisor of 5 and 3. 13 is the
-
-
- ## Base model evaluation
- timestamp: 2025-10-17 20:04:41
-
- - Model: base_model (step 21400)
- - CORE metric: 0.2152
- - hellaswag_zeroshot: 0.2618
- - jeopardy: 0.1181
- - bigbench_qa_wikidata: 0.5281
- - arc_easy: 0.5241
- - arc_challenge: 0.1183
- - copa: 0.2000
- - commonsense_qa: 0.1697
- - piqa: 0.3776
- - openbook_qa: 0.1413
- - lambada_openai: 0.3699
- - hellaswag: 0.2624
- - winograd: 0.3260
- - winogrande: 0.0624
- - bigbench_dyck_languages: 0.0990
- - agi_eval_lsat_ar: 0.0978
- - bigbench_cs_algorithms: 0.4265
- - bigbench_operators: 0.1667
- - bigbench_repeat_copy_logic: 0.0000
- - squad: 0.2375
- - coqa: 0.1950
- - boolq: -0.1259
- - bigbench_language_identification: 0.1774
-
-
- ## Midtraining
- timestamp: 2025-10-17 20:13:13
-
- - run: nanochat
- - dtype: bfloat16
- - max_seq_len: 2048
- - device_batch_size: 32
- - unembedding_lr: 0.0040
- - embedding_lr: 0.2000
- - matrix_lr: 0.0200
- - init_lr_frac: 1.0000
- - weight_decay: 0.0000
- - eval_every: 150
- - eval_tokens: 10,485,760
- - total_batch_size: 524,288
- - dry_run: 0
- - Number of iterations: 765
- - DDP world size: 8
- - Minimum validation bpb: 0.3952
-
-
- ## Chat evaluation mid
- timestamp: 2025-10-17 20:18:21
-
- - source: mid
- - task_name: None
- - dtype: bfloat16
- - temperature: 0.0000
- - max_new_tokens: 512
- - num_samples: 1
- - top_k: 50
- - batch_size: 8
- - model_tag: None
- - step: None
- - max_problems: None
- - ARC-Easy: 0.4116
- - ARC-Challenge: 0.3012
- - MMLU: 0.3284
- - GSM8K: 0.0417
- - HumanEval: 0.0305
- - ChatCORE metric: 0.0921
-
-
- ## Chat SFT
- timestamp: 2025-10-17 20:20:44
-
- - run: nanochat
- - source: mid
- - dtype: bfloat16
- - device_batch_size: 4
- - num_epochs: 1
- - max_iterations: -1
- - target_examples_per_step: 32
- - unembedding_lr: 0.0040
- - embedding_lr: 0.2000
- - matrix_lr: 0.0200
- - weight_decay: 0.0000
- - init_lr_frac: 0.0200
- - eval_every: 100
- - eval_steps: 100
- - eval_metrics_every: 200
- - Training rows: 20,843
- - Number of iterations: 651
- - Training loss: 1.1113
- - Validation loss: 1.0125
-
-
- ## Chat evaluation sft
- timestamp: 2025-10-17 20:25:29
-
- - source: sft
- - task_name: None
- - dtype: bfloat16
- - temperature: 0.0000
- - max_new_tokens: 512
- - num_samples: 1
- - top_k: 50
- - batch_size: 8
- - model_tag: None
- - step: None
- - max_problems: None
- - ARC-Easy: 0.4360
- - ARC-Challenge: 0.3012
- - MMLU: 0.3322
- - GSM8K: 0.0576
- - HumanEval: 0.0305
- - ChatCORE metric: 0.1028
-
-
- ## Summary
-
- - Characters: 351,931
- - Lines: 8,552
- - Files: 43
- - Tokens (approx): 87,982
- - Dependencies (uv.lock lines): 2,004
-
- | Metric | BASE | MID | SFT | RL |
- |-----------------|----------|----------|----------|----------|
- | CORE | 0.2152 | - | - | - |
- | ARC-Challenge | - | 0.3012 | 0.3012 | - |
- | ARC-Easy | - | 0.4116 | 0.4360 | - |
- | GSM8K | - | 0.0417 | 0.0576 | - |
- | HumanEval | - | 0.0305 | 0.0305 | - |
- | MMLU | - | 0.3284 | 0.3322 | - |
- | ChatCORE | - | 0.0921 | 0.1028 | - |
-
- Total wall clock time: 3h34m
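As a sanity check on the deleted report's arithmetic: the training token count, iteration count, total FLOPs, and approximate run cost all follow from a handful of numbers quoted above. A minimal sketch (the cost estimate is my own multiplication of the quoted wall-clock time by the quoted hourly rate, not a figure stated in the report):

```python
# Numbers copied from the deleted report.md above.
n_params = 560_988_160         # "Number of parameters"
ratio = 20                     # "target_param_data_ratio" (tokens : params)
total_batch_size = 524_288     # tokens per optimizer step
flops_per_token = 3.491758e9   # "Number of FLOPs per token"
wall_hours = 3 + 34 / 60       # "Total wall clock time: 3h34m"
hourly_rate = 24.00            # "Hourly Rate: $24.00/hour"

train_tokens = n_params * ratio                 # 11,219,763,200 tokens
iterations = train_tokens // total_batch_size   # 21,400 steps (divides exactly)
total_flops = flops_per_token * train_tokens    # ~3.9177e19, matches the report
cost = wall_hours * hourly_rate                 # ~$85.60 for the whole run

print(iterations, train_tokens, f"{total_flops:.4e}", f"${cost:.2f}")
```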
token_bytes.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:e280877820a90174f3b47bf797b67b9026cd859b7d6d5b7f78e64bcdaca126b4
- size 263721
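token_bytes.pt (and tokenizer.pkl below) were stored via Git LFS, so the diff shows only the three-line pointer file, not the binary payload. A pointer is just "key value" lines; a minimal sketch of parsing one:

```python
# Git LFS pointer text as it appears in the diff above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e280877820a90174f3b47bf797b67b9026cd859b7d6d5b7f78e64bcdaca126b4
size 263721
"""

# Each line is "key value"; split on the first space only,
# since the value itself may contain no further structure we care about.
pointer = dict(line.split(" ", 1) for line in pointer_text.splitlines())

algo, digest = pointer["oid"].split(":", 1)
print(algo, len(digest), int(pointer["size"]))  # sha256 64 263721
```

The `size` field is the byte length of the real object, so the actual token_bytes.pt was about 258 KiB before deletion.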
tokenizer.pkl DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:33f28610ffd37a57d6631f8d7bd91929bd877ae3f4a87dcbdff00b07f6bd7cc3
- size 846092
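One note on reading the deleted report's tokenizer comparison tables: the "Relative Diff %" column is consistent with token savings relative to the baseline tokenizer, i.e. `(baseline_tokens - ours_tokens) / baseline_tokens`, rather than a ratio of the compression ratios. That interpretation is my inference from the rows, not a formula stated in the report; a quick check against three quoted rows:

```python
def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Token savings relative to the baseline tokenizer, as a percentage."""
    return round((baseline_tokens - ours_tokens) / baseline_tokens * 100, 1)

# Rows copied from the report's tables: (baseline tokens, ours tokens).
print(relative_diff_pct(404, 375))  # GPT-2, news   -> 7.2, matches "+7.2%"
print(relative_diff_pct(745, 712))  # GPT-2, korean -> 4.4, matches "+4.4%"
print(relative_diff_pct(364, 712))  # GPT-4, korean -> -95.6, matches "-95.6%"
```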