dekangli committed · Commit 1770b82 · verified · 1 Parent(s): 80c798b

Model save

Files changed (5):
  1. README.md +58 -0
  2. all_results.json +8 -0
  3. generation_config.json +14 -0
  4. train_results.json +8 -0
  5. trainer_state.json +1211 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ base_model: Qwen/Qwen2.5-3B-Instruct
+ library_name: transformers
+ model_name: Qwen2.5-3B-R1-Distill
+ tags:
+ - generated_from_trainer
+ - trl
+ - sft
+ licence: license
+ ---
+
+ # Model Card for Qwen2.5-3B-R1-Distill
+
+ This model is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="dekangli/Qwen2.5-3B-R1-Distill", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/dekangli-bytedance/huggingface/runs/x7bsomd3)
+
+
+ This model was trained with SFT.
+
+ ### Framework versions
+
+ - TRL: 0.17.0
+ - Transformers: 4.49.0
+ - Pytorch: 2.6.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citations
+
+
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 1913806159609856.0,
+ "train_loss": 0.48002010657061983,
+ "train_runtime": 43121.1313,
+ "train_samples": 93733,
+ "train_samples_per_second": 2.174,
+ "train_steps_per_second": 0.017
+ }
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+ "bos_token_id": 151643,
+ "do_sample": true,
+ "eos_token_id": [
+ 151645,
+ 151643
+ ],
+ "pad_token_id": 151643,
+ "repetition_penalty": 1.05,
+ "temperature": 0.7,
+ "top_k": 20,
+ "top_p": 0.8,
+ "transformers_version": "4.49.0"
+ }
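The decoding settings above (`do_sample` with `temperature` 0.7, `top_k` 20, `top_p` 0.8) compose in the usual order: temperature scaling, then top-k truncation, then nucleus (top-p) truncation over the sorted probabilities. A minimal pure-Python sketch of that filtering on toy logits, as an illustration of the semantics rather than the library's actual implementation (`repetition_penalty` is omitted for brevity):

```python
import math

# Decoding settings copied from generation_config.json above.
TEMPERATURE, TOP_K, TOP_P = 0.7, 20, 0.8

def filter_logits(logits, temperature=TEMPERATURE, top_k=TOP_K, top_p=TOP_P):
    """Toy sketch of temperature + top-k + top-p (nucleus) filtering.

    Returns the (token_id, probability) pairs that remain sampleable,
    renormalized over the surviving tokens.
    """
    # 1. Temperature: divide logits before the softmax.
    scaled = [l / temperature for l in logits]
    # 2. Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 4. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 5. Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return [(i, probs[i] / mass) for i in kept]

# With these settings, a peaked toy distribution collapses to two candidates.
remaining = filter_logits([2.0, 1.0, 0.5, 0.1, -1.0])
```

Sampling then draws from `remaining`; lowering `top_p` or `temperature` shrinks this candidate set further.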
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 1913806159609856.0,
+ "train_loss": 0.48002010657061983,
+ "train_runtime": 43121.1313,
+ "train_samples": 93733,
+ "train_samples_per_second": 2.174,
+ "train_steps_per_second": 0.017
+ }
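The throughput figures above are internally consistent: `train_samples_per_second` equals `train_samples / train_runtime`, and, taking `global_step` = 733 from trainer_state.json in this same commit, `train_steps_per_second` equals steps divided by runtime. Together they imply an effective batch of roughly 93733 / 733 ≈ 128 samples per optimizer step (a derived figure, not stated in the files). A quick sanity-check sketch:

```python
# Figures copied from train_results.json and trainer_state.json above.
train_samples = 93733
train_runtime = 43121.1313   # seconds, roughly 12 hours
global_step = 733            # from trainer_state.json

samples_per_sec = train_samples / train_runtime   # reported as 2.174
steps_per_sec = global_step / train_runtime       # reported as 0.017
effective_batch = train_samples / global_step     # ~127.9, i.e. likely 128
```

Matching the reported values to three decimals confirms the numbers were all derived from the same run.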
trainer_state.json ADDED
@@ -0,0 +1,1211 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 1.0,
+ "eval_steps": 500,
+ "global_step": 733,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.0068212824010914054,
+ "grad_norm": 2.859520152533828,
+ "learning_rate": 6.7567567567567575e-06,
+ "loss": 0.8358,
+ "num_tokens": 3759146.0,
+ "step": 5
+ },
+ {
+ "epoch": 0.013642564802182811,
+ "grad_norm": 1.9018163630483922,
+ "learning_rate": 1.3513513513513515e-05,
+ "loss": 0.7639,
+ "num_tokens": 7668808.0,
+ "step": 10
+ },
+ {
+ "epoch": 0.020463847203274217,
+ "grad_norm": 0.5801403334832769,
+ "learning_rate": 2.0270270270270273e-05,
+ "loss": 0.6614,
+ "num_tokens": 11368873.0,
+ "step": 15
+ },
+ {
+ "epoch": 0.027285129604365622,
+ "grad_norm": 0.572102113014263,
+ "learning_rate": 2.702702702702703e-05,
+ "loss": 0.6119,
+ "num_tokens": 15118063.0,
+ "step": 20
+ },
+ {
+ "epoch": 0.034106412005457026,
+ "grad_norm": 0.4278039947946872,
+ "learning_rate": 3.3783783783783784e-05,
+ "loss": 0.5902,
+ "num_tokens": 18906839.0,
+ "step": 25
+ },
+ {
+ "epoch": 0.040927694406548434,
+ "grad_norm": 0.38997902119107797,
+ "learning_rate": 4.0540540540540545e-05,
+ "loss": 0.5683,
+ "num_tokens": 22641755.0,
+ "step": 30
+ },
+ {
+ "epoch": 0.047748976807639835,
+ "grad_norm": 0.3731809054939738,
+ "learning_rate": 4.72972972972973e-05,
+ "loss": 0.5642,
+ "num_tokens": 26636629.0,
+ "step": 35
+ },
+ {
+ "epoch": 0.054570259208731244,
+ "grad_norm": 0.3482748515886515,
+ "learning_rate": 4.999793714044176e-05,
+ "loss": 0.5359,
+ "num_tokens": 30417967.0,
+ "step": 40
+ },
+ {
+ "epoch": 0.061391541609822645,
+ "grad_norm": 0.37293778656884924,
+ "learning_rate": 4.9985332146267735e-05,
+ "loss": 0.5384,
+ "num_tokens": 34231333.0,
+ "step": 45
+ },
+ {
+ "epoch": 0.06821282401091405,
+ "grad_norm": 0.36563686554902025,
+ "learning_rate": 4.996127460337901e-05,
+ "loss": 0.539,
+ "num_tokens": 37961424.0,
+ "step": 50
+ },
+ {
+ "epoch": 0.07503410641200546,
+ "grad_norm": 0.49916502547033137,
+ "learning_rate": 4.992577676510502e-05,
+ "loss": 0.5403,
+ "num_tokens": 41826860.0,
+ "step": 55
+ },
+ {
+ "epoch": 0.08185538881309687,
+ "grad_norm": 0.4663794674995223,
+ "learning_rate": 4.987885671170889e-05,
+ "loss": 0.5286,
+ "num_tokens": 45543403.0,
+ "step": 60
+ },
+ {
+ "epoch": 0.08867667121418826,
+ "grad_norm": 0.47229935539155693,
+ "learning_rate": 4.9820538341178595e-05,
+ "loss": 0.5321,
+ "num_tokens": 49369486.0,
+ "step": 65
+ },
+ {
+ "epoch": 0.09549795361527967,
+ "grad_norm": 0.3986951551087616,
+ "learning_rate": 4.97508513570549e-05,
+ "loss": 0.5227,
+ "num_tokens": 53010874.0,
+ "step": 70
+ },
+ {
+ "epoch": 0.10231923601637108,
+ "grad_norm": 0.40140582830322363,
+ "learning_rate": 4.966983125330225e-05,
+ "loss": 0.5244,
+ "num_tokens": 56909889.0,
+ "step": 75
+ },
+ {
+ "epoch": 0.10914051841746249,
+ "grad_norm": 0.4349545422715694,
+ "learning_rate": 4.957751929623059e-05,
+ "loss": 0.5094,
+ "num_tokens": 60650570.0,
+ "step": 80
+ },
+ {
+ "epoch": 0.11596180081855388,
+ "grad_norm": 0.3803902727248126,
+ "learning_rate": 4.947396250347695e-05,
+ "loss": 0.5033,
+ "num_tokens": 64564660.0,
+ "step": 85
+ },
+ {
+ "epoch": 0.12278308321964529,
+ "grad_norm": 0.3736381760976187,
+ "learning_rate": 4.9359213620057766e-05,
+ "loss": 0.5192,
+ "num_tokens": 68426882.0,
+ "step": 90
+ },
+ {
+ "epoch": 0.1296043656207367,
+ "grad_norm": 0.33996053767832096,
+ "learning_rate": 4.9233331091504034e-05,
+ "loss": 0.5154,
+ "num_tokens": 72252819.0,
+ "step": 95
+ },
+ {
+ "epoch": 0.1364256480218281,
+ "grad_norm": 0.4698197228973068,
+ "learning_rate": 4.909637903409306e-05,
+ "loss": 0.504,
+ "num_tokens": 76160914.0,
+ "step": 100
+ },
+ {
+ "epoch": 0.1432469304229195,
+ "grad_norm": 0.3493439695910741,
+ "learning_rate": 4.8948427202191766e-05,
+ "loss": 0.5057,
+ "num_tokens": 80152955.0,
+ "step": 105
+ },
+ {
+ "epoch": 0.15006821282401092,
+ "grad_norm": 0.3516586136746583,
+ "learning_rate": 4.878955095272844e-05,
+ "loss": 0.5098,
+ "num_tokens": 83901702.0,
+ "step": 110
+ },
+ {
+ "epoch": 0.15688949522510232,
+ "grad_norm": 0.42408743833551077,
+ "learning_rate": 4.861983120681089e-05,
+ "loss": 0.5088,
+ "num_tokens": 87691018.0,
+ "step": 115
+ },
+ {
+ "epoch": 0.16371077762619374,
+ "grad_norm": 0.3475562440630165,
+ "learning_rate": 4.8439354408510536e-05,
+ "loss": 0.4976,
+ "num_tokens": 91542428.0,
+ "step": 120
+ },
+ {
+ "epoch": 0.17053206002728513,
+ "grad_norm": 0.36672881758295617,
+ "learning_rate": 4.82482124808335e-05,
+ "loss": 0.5047,
+ "num_tokens": 95438170.0,
+ "step": 125
+ },
+ {
+ "epoch": 0.17735334242837653,
+ "grad_norm": 0.3716213307471467,
+ "learning_rate": 4.804650277890105e-05,
+ "loss": 0.4993,
+ "num_tokens": 99383692.0,
+ "step": 130
+ },
+ {
+ "epoch": 0.18417462482946795,
+ "grad_norm": 0.4114376174123406,
+ "learning_rate": 4.783432804036335e-05,
+ "loss": 0.4997,
+ "num_tokens": 103223537.0,
+ "step": 135
+ },
+ {
+ "epoch": 0.19099590723055934,
+ "grad_norm": 0.417963130595086,
+ "learning_rate": 4.761179633307163e-05,
+ "loss": 0.511,
+ "num_tokens": 106901687.0,
+ "step": 140
+ },
+ {
+ "epoch": 0.19781718963165076,
+ "grad_norm": 0.3590365052314863,
+ "learning_rate": 4.737902100003552e-05,
+ "loss": 0.4863,
+ "num_tokens": 110840758.0,
+ "step": 145
+ },
+ {
+ "epoch": 0.20463847203274216,
+ "grad_norm": 0.3336081440893747,
+ "learning_rate": 4.713612060169362e-05,
+ "loss": 0.5005,
+ "num_tokens": 114521504.0,
+ "step": 150
+ },
+ {
+ "epoch": 0.21145975443383355,
+ "grad_norm": 0.3185236733259455,
+ "learning_rate": 4.688321885552659e-05,
+ "loss": 0.4875,
+ "num_tokens": 118445759.0,
+ "step": 155
+ },
+ {
+ "epoch": 0.21828103683492497,
+ "grad_norm": 0.3596258991525576,
+ "learning_rate": 4.662044457304359e-05,
+ "loss": 0.4952,
+ "num_tokens": 122311693.0,
+ "step": 160
+ },
+ {
+ "epoch": 0.22510231923601637,
+ "grad_norm": 0.31478964913575963,
+ "learning_rate": 4.634793159417421e-05,
+ "loss": 0.498,
+ "num_tokens": 126094524.0,
+ "step": 165
+ },
+ {
+ "epoch": 0.23192360163710776,
+ "grad_norm": 0.339405025728744,
+ "learning_rate": 4.606581871909919e-05,
+ "loss": 0.492,
+ "num_tokens": 129971056.0,
+ "step": 170
+ },
+ {
+ "epoch": 0.23874488403819918,
+ "grad_norm": 0.3817665665912771,
+ "learning_rate": 4.577424963755475e-05,
+ "loss": 0.5052,
+ "num_tokens": 133765982.0,
+ "step": 175
+ },
+ {
+ "epoch": 0.24556616643929058,
+ "grad_norm": 0.455088588103575,
+ "learning_rate": 4.547337285564649e-05,
+ "loss": 0.4874,
+ "num_tokens": 137522281.0,
+ "step": 180
+ },
+ {
+ "epoch": 0.252387448840382,
+ "grad_norm": 0.3863705096212881,
+ "learning_rate": 4.516334162021013e-05,
+ "loss": 0.4826,
+ "num_tokens": 141196084.0,
+ "step": 185
+ },
+ {
+ "epoch": 0.2592087312414734,
+ "grad_norm": 0.36548889219989195,
+ "learning_rate": 4.484431384075771e-05,
+ "loss": 0.4923,
+ "num_tokens": 145136704.0,
+ "step": 190
+ },
+ {
+ "epoch": 0.2660300136425648,
+ "grad_norm": 0.3282797740156995,
+ "learning_rate": 4.4516452009048814e-05,
+ "loss": 0.4813,
+ "num_tokens": 148940122.0,
+ "step": 195
+ },
+ {
+ "epoch": 0.2728512960436562,
+ "grad_norm": 0.3373945226676213,
+ "learning_rate": 4.4179923116328005e-05,
+ "loss": 0.4911,
+ "num_tokens": 152809678.0,
+ "step": 200
+ },
+ {
+ "epoch": 0.27967257844474763,
+ "grad_norm": 0.2988913432631945,
+ "learning_rate": 4.3834898568270444e-05,
+ "loss": 0.4848,
+ "num_tokens": 156683573.0,
+ "step": 205
+ },
+ {
+ "epoch": 0.286493860845839,
+ "grad_norm": 0.32252594879192,
+ "learning_rate": 4.348155409767913e-05,
+ "loss": 0.486,
+ "num_tokens": 160433326.0,
+ "step": 210
+ },
+ {
+ "epoch": 0.2933151432469304,
+ "grad_norm": 0.30855373605724573,
+ "learning_rate": 4.3120069674978156e-05,
+ "loss": 0.4883,
+ "num_tokens": 164374316.0,
+ "step": 215
+ },
+ {
+ "epoch": 0.30013642564802184,
+ "grad_norm": 0.30234329464930193,
+ "learning_rate": 4.275062941654767e-05,
+ "loss": 0.4702,
+ "num_tokens": 168223299.0,
+ "step": 220
+ },
+ {
+ "epoch": 0.3069577080491132,
+ "grad_norm": 0.2841369078220348,
+ "learning_rate": 4.237342149094701e-05,
+ "loss": 0.4815,
+ "num_tokens": 172132000.0,
+ "step": 225
+ },
+ {
+ "epoch": 0.31377899045020463,
+ "grad_norm": 0.32791055661062934,
+ "learning_rate": 4.1988638023074116e-05,
+ "loss": 0.4787,
+ "num_tokens": 176086331.0,
+ "step": 230
+ },
+ {
+ "epoch": 0.32060027285129605,
+ "grad_norm": 0.3356851421085584,
+ "learning_rate": 4.159647499630971e-05,
+ "loss": 0.4708,
+ "num_tokens": 179917640.0,
+ "step": 235
+ },
+ {
+ "epoch": 0.3274215552523875,
+ "grad_norm": 0.3037114872946393,
+ "learning_rate": 4.1197132152696215e-05,
+ "loss": 0.4822,
+ "num_tokens": 183746129.0,
+ "step": 240
+ },
+ {
+ "epoch": 0.33424283765347884,
+ "grad_norm": 0.30572141355536847,
+ "learning_rate": 4.07908128912024e-05,
+ "loss": 0.4895,
+ "num_tokens": 187699370.0,
+ "step": 245
+ },
+ {
+ "epoch": 0.34106412005457026,
+ "grad_norm": 0.3102442024034853,
+ "learning_rate": 4.037772416412524e-05,
+ "loss": 0.4739,
+ "num_tokens": 191512142.0,
+ "step": 250
+ },
+ {
+ "epoch": 0.3478854024556617,
+ "grad_norm": 0.3059715305515258,
+ "learning_rate": 3.995807637168205e-05,
+ "loss": 0.4751,
+ "num_tokens": 195367749.0,
+ "step": 255
+ },
+ {
+ "epoch": 0.35470668485675305,
+ "grad_norm": 0.2904211221393888,
+ "learning_rate": 3.9532083254846505e-05,
+ "loss": 0.4648,
+ "num_tokens": 199215371.0,
+ "step": 260
+ },
+ {
+ "epoch": 0.3615279672578445,
+ "grad_norm": 0.33681948143185453,
+ "learning_rate": 3.909996178648299e-05,
+ "loss": 0.4826,
+ "num_tokens": 202903048.0,
+ "step": 265
+ },
+ {
+ "epoch": 0.3683492496589359,
+ "grad_norm": 0.3285871766466283,
+ "learning_rate": 3.866193206083494e-05,
+ "loss": 0.4727,
+ "num_tokens": 206761213.0,
+ "step": 270
+ },
+ {
+ "epoch": 0.37517053206002726,
+ "grad_norm": 0.33670385009296144,
+ "learning_rate": 3.821821718142332e-05,
+ "loss": 0.4694,
+ "num_tokens": 210585632.0,
+ "step": 275
+ },
+ {
+ "epoch": 0.3819918144611187,
+ "grad_norm": 0.3305593264219332,
+ "learning_rate": 3.77690431474123e-05,
+ "loss": 0.4744,
+ "num_tokens": 214261738.0,
+ "step": 280
+ },
+ {
+ "epoch": 0.3888130968622101,
+ "grad_norm": 0.29054732905430486,
+ "learning_rate": 3.7314638738500265e-05,
+ "loss": 0.479,
+ "num_tokens": 218144024.0,
+ "step": 285
+ },
+ {
+ "epoch": 0.3956343792633015,
+ "grad_norm": 0.2919105758915809,
+ "learning_rate": 3.685523539839439e-05,
+ "loss": 0.4752,
+ "num_tokens": 222057295.0,
+ "step": 290
+ },
+ {
+ "epoch": 0.4024556616643929,
+ "grad_norm": 0.2954441471813468,
+ "learning_rate": 3.63910671169285e-05,
+ "loss": 0.4671,
+ "num_tokens": 225828573.0,
+ "step": 295
+ },
+ {
+ "epoch": 0.4092769440654843,
+ "grad_norm": 0.33726151653757314,
+ "learning_rate": 3.5922370310884014e-05,
+ "loss": 0.4664,
+ "num_tokens": 229710487.0,
+ "step": 300
+ },
+ {
+ "epoch": 0.41609822646657574,
+ "grad_norm": 0.2994069615286232,
+ "learning_rate": 3.5449383703574806e-05,
+ "loss": 0.4801,
+ "num_tokens": 233617525.0,
+ "step": 305
+ },
+ {
+ "epoch": 0.4229195088676671,
+ "grad_norm": 0.321770217300059,
+ "learning_rate": 3.4972348203257274e-05,
+ "loss": 0.4774,
+ "num_tokens": 237394471.0,
+ "step": 310
+ },
+ {
+ "epoch": 0.4297407912687585,
+ "grad_norm": 0.3342140614239043,
+ "learning_rate": 3.449150678042748e-05,
+ "loss": 0.4732,
+ "num_tokens": 241114261.0,
+ "step": 315
+ },
+ {
+ "epoch": 0.43656207366984995,
+ "grad_norm": 0.311787983749093,
+ "learning_rate": 3.400710434406803e-05,
+ "loss": 0.4727,
+ "num_tokens": 244967987.0,
+ "step": 320
+ },
+ {
+ "epoch": 0.4433833560709413,
+ "grad_norm": 0.3125973899622851,
+ "learning_rate": 3.351938761690748e-05,
+ "loss": 0.4789,
+ "num_tokens": 248751095.0,
+ "step": 325
+ },
+ {
+ "epoch": 0.45020463847203274,
+ "grad_norm": 0.3128243315361879,
+ "learning_rate": 3.302860500975605e-05,
+ "loss": 0.4678,
+ "num_tokens": 252607265.0,
+ "step": 330
+ },
+ {
+ "epoch": 0.45702592087312416,
+ "grad_norm": 0.2899975717949231,
+ "learning_rate": 3.253500649498153e-05,
+ "loss": 0.4736,
+ "num_tokens": 256417909.0,
+ "step": 335
+ },
+ {
+ "epoch": 0.4638472032742155,
+ "grad_norm": 0.2947121469603074,
+ "learning_rate": 3.203884347918975e-05,
+ "loss": 0.4663,
+ "num_tokens": 260456429.0,
+ "step": 340
+ },
+ {
+ "epoch": 0.47066848567530695,
+ "grad_norm": 0.2747115046593522,
+ "learning_rate": 3.154036867517462e-05,
+ "loss": 0.4601,
+ "num_tokens": 264287905.0,
+ "step": 345
+ },
+ {
+ "epoch": 0.47748976807639837,
+ "grad_norm": 0.2684844916597318,
+ "learning_rate": 3.1039835973202865e-05,
+ "loss": 0.4689,
+ "num_tokens": 268098790.0,
+ "step": 350
+ },
+ {
+ "epoch": 0.4843110504774898,
+ "grad_norm": 0.271171033537078,
+ "learning_rate": 3.053750031169903e-05,
+ "loss": 0.4769,
+ "num_tokens": 271974065.0,
+ "step": 355
+ },
+ {
+ "epoch": 0.49113233287858116,
+ "grad_norm": 0.29862281272985114,
+ "learning_rate": 3.0033617547396614e-05,
+ "loss": 0.4804,
+ "num_tokens": 275852045.0,
+ "step": 360
+ },
+ {
+ "epoch": 0.4979536152796726,
+ "grad_norm": 0.2791498620793116,
+ "learning_rate": 2.9528444325021477e-05,
+ "loss": 0.4603,
+ "num_tokens": 279504484.0,
+ "step": 365
+ },
+ {
+ "epoch": 0.504774897680764,
+ "grad_norm": 0.27773797940501777,
+ "learning_rate": 2.902223794657391e-05,
+ "loss": 0.4623,
+ "num_tokens": 283461546.0,
+ "step": 370
+ },
+ {
+ "epoch": 0.5115961800818554,
+ "grad_norm": 0.2501959816908535,
+ "learning_rate": 2.8515256240275946e-05,
+ "loss": 0.4692,
+ "num_tokens": 287371918.0,
+ "step": 375
+ },
+ {
+ "epoch": 0.5184174624829468,
+ "grad_norm": 0.2508499415881564,
+ "learning_rate": 2.8007757429250597e-05,
+ "loss": 0.4575,
+ "num_tokens": 291057738.0,
+ "step": 380
+ },
+ {
+ "epoch": 0.5252387448840382,
+ "grad_norm": 0.23841623530318584,
+ "learning_rate": 2.7500000000000004e-05,
+ "loss": 0.4657,
+ "num_tokens": 294881963.0,
+ "step": 385
+ },
+ {
+ "epoch": 0.5320600272851296,
+ "grad_norm": 0.27722631135406894,
+ "learning_rate": 2.699224257074941e-05,
+ "loss": 0.4666,
+ "num_tokens": 298677895.0,
+ "step": 390
+ },
+ {
+ "epoch": 0.538881309686221,
+ "grad_norm": 0.26944884620681187,
+ "learning_rate": 2.6484743759724062e-05,
+ "loss": 0.4528,
+ "num_tokens": 302387985.0,
+ "step": 395
+ },
+ {
+ "epoch": 0.5457025920873124,
+ "grad_norm": 0.2683881461913158,
+ "learning_rate": 2.5977762053426098e-05,
+ "loss": 0.4698,
+ "num_tokens": 306130593.0,
+ "step": 400
+ },
+ {
+ "epoch": 0.5525238744884038,
+ "grad_norm": 0.2292317979583899,
+ "learning_rate": 2.547155567497854e-05,
+ "loss": 0.4706,
+ "num_tokens": 309952008.0,
+ "step": 405
+ },
+ {
+ "epoch": 0.5593451568894953,
+ "grad_norm": 0.262653672464314,
+ "learning_rate": 2.496638245260339e-05,
+ "loss": 0.4576,
+ "num_tokens": 313725489.0,
+ "step": 410
+ },
+ {
+ "epoch": 0.5661664392905866,
+ "grad_norm": 0.31628174437057555,
+ "learning_rate": 2.446249968830097e-05,
+ "loss": 0.4621,
+ "num_tokens": 317525148.0,
+ "step": 415
+ },
+ {
+ "epoch": 0.572987721691678,
+ "grad_norm": 0.2971582407388678,
+ "learning_rate": 2.3960164026797137e-05,
+ "loss": 0.4625,
+ "num_tokens": 321410101.0,
+ "step": 420
+ },
+ {
+ "epoch": 0.5798090040927695,
+ "grad_norm": 0.2736258492313367,
+ "learning_rate": 2.3459631324825388e-05,
+ "loss": 0.4579,
+ "num_tokens": 325102278.0,
+ "step": 425
+ },
+ {
+ "epoch": 0.5866302864938608,
+ "grad_norm": 0.27230747471072914,
+ "learning_rate": 2.2961156520810255e-05,
+ "loss": 0.4623,
+ "num_tokens": 328831071.0,
+ "step": 430
+ },
+ {
+ "epoch": 0.5934515688949522,
+ "grad_norm": 0.2870356128439239,
+ "learning_rate": 2.246499350501848e-05,
+ "loss": 0.4527,
+ "num_tokens": 332767044.0,
+ "step": 435
+ },
+ {
+ "epoch": 0.6002728512960437,
+ "grad_norm": 0.23826692416074355,
+ "learning_rate": 2.197139499024396e-05,
+ "loss": 0.4503,
+ "num_tokens": 336595638.0,
+ "step": 440
+ },
+ {
+ "epoch": 0.607094133697135,
+ "grad_norm": 0.26903175563152104,
+ "learning_rate": 2.1480612383092536e-05,
+ "loss": 0.4621,
+ "num_tokens": 340358925.0,
+ "step": 445
+ },
+ {
+ "epoch": 0.6139154160982264,
+ "grad_norm": 0.23858645047531596,
+ "learning_rate": 2.0992895655931984e-05,
+ "loss": 0.4606,
+ "num_tokens": 344239058.0,
+ "step": 450
+ },
+ {
+ "epoch": 0.6207366984993179,
+ "grad_norm": 0.26224623620936377,
+ "learning_rate": 2.0508493219572522e-05,
+ "loss": 0.4638,
+ "num_tokens": 348080585.0,
+ "step": 455
+ },
+ {
+ "epoch": 0.6275579809004093,
+ "grad_norm": 0.22212001743500964,
+ "learning_rate": 2.0027651796742735e-05,
+ "loss": 0.4578,
+ "num_tokens": 351817695.0,
+ "step": 460
+ },
+ {
+ "epoch": 0.6343792633015006,
+ "grad_norm": 0.24324689513326436,
+ "learning_rate": 1.95506162964252e-05,
+ "loss": 0.4537,
+ "num_tokens": 355639183.0,
+ "step": 465
+ },
+ {
+ "epoch": 0.6412005457025921,
+ "grad_norm": 0.2377209139662622,
+ "learning_rate": 1.9077629689115995e-05,
+ "loss": 0.4697,
+ "num_tokens": 359437581.0,
+ "step": 470
+ },
+ {
+ "epoch": 0.6480218281036835,
+ "grad_norm": 0.2525718316267335,
+ "learning_rate": 1.8608932883071507e-05,
+ "loss": 0.4483,
+ "num_tokens": 363189983.0,
+ "step": 475
+ },
+ {
+ "epoch": 0.654843110504775,
+ "grad_norm": 0.24386193951321106,
+ "learning_rate": 1.8144764601605613e-05,
+ "loss": 0.4503,
+ "num_tokens": 366863209.0,
+ "step": 480
+ },
+ {
+ "epoch": 0.6616643929058663,
+ "grad_norm": 0.22335725928265934,
+ "learning_rate": 1.7685361261499733e-05,
+ "loss": 0.4631,
+ "num_tokens": 370725860.0,
+ "step": 485
+ },
+ {
+ "epoch": 0.6684856753069577,
+ "grad_norm": 0.22739365537182124,
+ "learning_rate": 1.72309568525877e-05,
+ "loss": 0.4493,
+ "num_tokens": 374566515.0,
+ "step": 490
+ },
+ {
+ "epoch": 0.6753069577080492,
+ "grad_norm": 0.2518810180466547,
+ "learning_rate": 1.6781782818576686e-05,
+ "loss": 0.4434,
+ "num_tokens": 378330489.0,
+ "step": 495
+ },
+ {
+ "epoch": 0.6821282401091405,
+ "grad_norm": 0.22437165048428956,
+ "learning_rate": 1.6338067939165058e-05,
+ "loss": 0.4475,
+ "num_tokens": 382103468.0,
+ "step": 500
+ },
+ {
+ "epoch": 0.6889495225102319,
+ "grad_norm": 0.24484654505585532,
+ "learning_rate": 1.590003821351701e-05,
+ "loss": 0.4558,
+ "num_tokens": 385837297.0,
+ "step": 505
+ },
+ {
+ "epoch": 0.6957708049113234,
+ "grad_norm": 0.25262167850764056,
+ "learning_rate": 1.54679167451535e-05,
+ "loss": 0.4522,
+ "num_tokens": 389586680.0,
+ "step": 510
+ },
+ {
+ "epoch": 0.7025920873124147,
+ "grad_norm": 0.25644327559623603,
+ "learning_rate": 1.5041923628317948e-05,
+ "loss": 0.4569,
+ "num_tokens": 393428760.0,
+ "step": 515
+ },
+ {
+ "epoch": 0.7094133697135061,
+ "grad_norm": 0.24914618478726863,
+ "learning_rate": 1.4622275835874766e-05,
+ "loss": 0.4677,
+ "num_tokens": 397158700.0,
+ "step": 520
+ },
+ {
+ "epoch": 0.7162346521145976,
+ "grad_norm": 0.2404286713923951,
+ "learning_rate": 1.4209187108797607e-05,
+ "loss": 0.4533,
+ "num_tokens": 400923938.0,
+ "step": 525
+ },
+ {
+ "epoch": 0.723055934515689,
+ "grad_norm": 0.22063525670425096,
+ "learning_rate": 1.3802867847303785e-05,
+ "loss": 0.4483,
+ "num_tokens": 404685655.0,
+ "step": 530
+ },
+ {
+ "epoch": 0.7298772169167803,
+ "grad_norm": 0.21801652381587655,
+ "learning_rate": 1.3403525003690304e-05,
+ "loss": 0.4532,
+ "num_tokens": 408582799.0,
+ "step": 535
+ },
+ {
+ "epoch": 0.7366984993178718,
+ "grad_norm": 0.20892765769673624,
+ "learning_rate": 1.3011361976925884e-05,
+ "loss": 0.4584,
+ "num_tokens": 412386009.0,
+ "step": 540
+ },
+ {
+ "epoch": 0.7435197817189632,
+ "grad_norm": 0.21186026658080837,
+ "learning_rate": 1.2626578509052997e-05,
+ "loss": 0.4603,
+ "num_tokens": 416372039.0,
+ "step": 545
+ },
+ {
+ "epoch": 0.7503410641200545,
+ "grad_norm": 0.21768800059481686,
+ "learning_rate": 1.2249370583452342e-05,
+ "loss": 0.4468,
+ "num_tokens": 420051975.0,
+ "step": 550
+ },
+ {
+ "epoch": 0.757162346521146,
+ "grad_norm": 0.2204584535704754,
+ "learning_rate": 1.1879930325021841e-05,
+ "loss": 0.447,
+ "num_tokens": 423685709.0,
+ "step": 555
+ },
+ {
+ "epoch": 0.7639836289222374,
+ "grad_norm": 0.24319632994282764,
+ "learning_rate": 1.1518445902320878e-05,
+ "loss": 0.4439,
+ "num_tokens": 427405904.0,
+ "step": 560
+ },
+ {
+ "epoch": 0.7708049113233287,
+ "grad_norm": 0.23395443194660598,
+ "learning_rate": 1.1165101431729561e-05,
+ "loss": 0.4442,
+ "num_tokens": 431288378.0,
+ "step": 565
+ },
+ {
+ "epoch": 0.7776261937244202,
+ "grad_norm": 0.2353220319536335,
+ "learning_rate": 1.0820076883671999e-05,
+ "loss": 0.4467,
+ "num_tokens": 435077995.0,
+ "step": 570
+ },
+ {
+ "epoch": 0.7844474761255116,
+ "grad_norm": 0.23920963211303678,
+ "learning_rate": 1.0483547990951195e-05,
+ "loss": 0.4464,
+ "num_tokens": 439006864.0,
+ "step": 575
+ },
+ {
+ "epoch": 0.791268758526603,
+ "grad_norm": 0.23845367222027455,
+ "learning_rate": 1.0155686159242297e-05,
+ "loss": 0.4602,
+ "num_tokens": 443045688.0,
+ "step": 580
+ },
+ {
+ "epoch": 0.7980900409276944,
+ "grad_norm": 0.23878766406558552,
+ "learning_rate": 9.836658379789875e-06,
+ "loss": 0.4487,
+ "num_tokens": 446852018.0,
+ "step": 585
+ },
+ {
+ "epoch": 0.8049113233287858,
+ "grad_norm": 0.21045206753906473,
+ "learning_rate": 9.52662714435352e-06,
+ "loss": 0.4619,
+ "num_tokens": 450687287.0,
+ "step": 590
+ },
+ {
+ "epoch": 0.8117326057298773,
+ "grad_norm": 0.24180115500464866,
+ "learning_rate": 9.225750362445255e-06,
+ "loss": 0.4478,
+ "num_tokens": 454483896.0,
+ "step": 595
+ },
+ {
+ "epoch": 0.8185538881309686,
+ "grad_norm": 0.22414629123508337,
+ "learning_rate": 8.93418128090081e-06,
+ "loss": 0.4464,
+ "num_tokens": 458325861.0,
+ "step": 600
+ },
+ {
+ "epoch": 0.82537517053206,
+ "grad_norm": 0.22622883948448075,
+ "learning_rate": 8.652068405825798e-06,
+ "loss": 0.4519,
+ "num_tokens": 462055868.0,
+ "step": 605
+ },
+ {
+ "epoch": 0.8321964529331515,
+ "grad_norm": 0.2002504054474251,
+ "learning_rate": 8.379555426956415e-06,
+ "loss": 0.4461,
+ "num_tokens": 465973876.0,
+ "step": 610
+ },
+ {
+ "epoch": 0.8390177353342428,
+ "grad_norm": 0.2196686538554551,
+ "learning_rate": 8.11678114447342e-06,
+ "loss": 0.4446,
+ "num_tokens": 469791498.0,
+ "step": 615
+ },
+ {
+ "epoch": 0.8458390177353342,
+ "grad_norm": 0.21537486730907285,
+ "learning_rate": 7.863879398306385e-06,
+ "loss": 0.4419,
+ "num_tokens": 473570581.0,
+ "step": 620
+ },
+ {
+ "epoch": 0.8526603001364257,
+ "grad_norm": 0.2386317506871887,
+ "learning_rate": 7.620978999964487e-06,
+ "loss": 0.4558,
+ "num_tokens": 477401353.0,
+ "step": 625
+ },
+ {
+ "epoch": 0.859481582537517,
+ "grad_norm": 0.20977878997335292,
+ "learning_rate": 7.3882036669283754e-06,
+ "loss": 0.4553,
+ "num_tokens": 481348892.0,
+ "step": 630
+ },
+ {
+ "epoch": 0.8663028649386084,
+ "grad_norm": 0.21563339045910554,
+ "learning_rate": 7.16567195963665e-06,
+ "loss": 0.4544,
+ "num_tokens": 485171910.0,
+ "step": 635
+ },
+ {
+ "epoch": 0.8731241473396999,
+ "grad_norm": 0.20119697622543456,
+ "learning_rate": 6.953497221098949e-06,
+ "loss": 0.4413,
+ "num_tokens": 489059548.0,
+ "step": 640
+ },
+ {
+ "epoch": 0.8799454297407913,
+ "grad_norm": 0.20245196883946875,
+ "learning_rate": 6.751787519166505e-06,
+ "loss": 0.4431,
+ "num_tokens": 492893029.0,
+ "step": 645
+ },
+ {
+ "epoch": 0.8867667121418826,
+ "grad_norm": 0.2144202772255573,
+ "learning_rate": 6.560645591489468e-06,
+ "loss": 0.45,
+ "num_tokens": 496905674.0,
+ "step": 650
+ },
+ {
+ "epoch": 0.8935879945429741,
+ "grad_norm": 0.1960337560239353,
+ "learning_rate": 6.380168793189115e-06,
+ "loss": 0.4464,
+ "num_tokens": 500864542.0,
+ "step": 655
+ },
+ {
+ "epoch": 0.9004092769440655,
+ "grad_norm": 0.19741690445074656,
+ "learning_rate": 6.210449047271566e-06,
+ "loss": 0.4492,
+ "num_tokens": 504810366.0,
+ "step": 660
+ },
+ {
+ "epoch": 0.9072305593451568,
+ "grad_norm": 0.203527801548371,
+ "learning_rate": 6.0515727978082415e-06,
+ "loss": 0.446,
+ "num_tokens": 508607232.0,
+ "step": 665
+ },
+ {
+ "epoch": 0.9140518417462483,
1077
+ "grad_norm": 0.2114927842592368,
1078
+ "learning_rate": 5.9036209659069404e-06,
1079
+ "loss": 0.4519,
1080
+ "num_tokens": 512474084.0,
1081
+ "step": 670
1082
+ },
1083
+ {
1084
+ "epoch": 0.9208731241473397,
1085
+ "grad_norm": 0.21111293452840946,
1086
+ "learning_rate": 5.766668908495966e-06,
1087
+ "loss": 0.4438,
1088
+ "num_tokens": 516216104.0,
1089
+ "step": 675
1090
+ },
1091
+ {
1092
+ "epoch": 0.927694406548431,
1093
+ "grad_norm": 0.2020840535619729,
1094
+ "learning_rate": 5.64078637994224e-06,
1095
+ "loss": 0.453,
1096
+ "num_tokens": 519966345.0,
1097
+ "step": 680
1098
+ },
1099
+ {
1100
+ "epoch": 0.9345156889495225,
1101
+ "grad_norm": 0.19990572835537956,
1102
+ "learning_rate": 5.526037496523051e-06,
1103
+ "loss": 0.4393,
1104
+ "num_tokens": 523793837.0,
1105
+ "step": 685
1106
+ },
1107
+ {
1108
+ "epoch": 0.9413369713506139,
1109
+ "grad_norm": 0.21501511467608375,
1110
+ "learning_rate": 5.422480703769408e-06,
1111
+ "loss": 0.4523,
1112
+ "num_tokens": 527666864.0,
1113
+ "step": 690
1114
+ },
1115
+ {
1116
+ "epoch": 0.9481582537517054,
1117
+ "grad_norm": 0.20500881616784905,
1118
+ "learning_rate": 5.330168746697747e-06,
1119
+ "loss": 0.4494,
1120
+ "num_tokens": 531466359.0,
1121
+ "step": 695
1122
+ },
1123
+ {
1124
+ "epoch": 0.9549795361527967,
1125
+ "grad_norm": 0.19384439762912464,
1126
+ "learning_rate": 5.249148642945106e-06,
1127
+ "loss": 0.4513,
1128
+ "num_tokens": 535253217.0,
1129
+ "step": 700
1130
+ },
1131
+ {
1132
+ "epoch": 0.9618008185538881,
1133
+ "grad_norm": 0.20677675163076573,
1134
+ "learning_rate": 5.179461658821403e-06,
1135
+ "loss": 0.4372,
1136
+ "num_tokens": 539137436.0,
1137
+ "step": 705
1138
+ },
1139
+ {
1140
+ "epoch": 0.9686221009549796,
1141
+ "grad_norm": 0.2176204558853245,
1142
+ "learning_rate": 5.121143288291119e-06,
1143
+ "loss": 0.447,
1144
+ "num_tokens": 542824098.0,
1145
+ "step": 710
1146
+ },
1147
+ {
1148
+ "epoch": 0.975443383356071,
1149
+ "grad_norm": 0.20886934132711027,
1150
+ "learning_rate": 5.07422323489499e-06,
1151
+ "loss": 0.4566,
1152
+ "num_tokens": 546578047.0,
1153
+ "step": 715
1154
+ },
1155
+ {
1156
+ "epoch": 0.9822646657571623,
1157
+ "grad_norm": 0.20458382129158925,
1158
+ "learning_rate": 5.03872539662099e-06,
1159
+ "loss": 0.4446,
1160
+ "num_tokens": 550377811.0,
1161
+ "step": 720
1162
+ },
1163
+ {
1164
+ "epoch": 0.9890859481582538,
1165
+ "grad_norm": 0.18980320361368563,
1166
+ "learning_rate": 5.014667853732269e-06,
1167
+ "loss": 0.4403,
1168
+ "num_tokens": 554224024.0,
1169
+ "step": 725
1170
+ },
1171
+ {
1172
+ "epoch": 0.9959072305593452,
1173
+ "grad_norm": 0.18951137695952713,
1174
+ "learning_rate": 5.00206285955824e-06,
1175
+ "loss": 0.4485,
1176
+ "num_tokens": 558001366.0,
1177
+ "step": 730
1178
+ },
1179
+ {
1180
+ "epoch": 1.0,
1181
+ "num_tokens": 560311272.0,
1182
+ "step": 733,
1183
+ "total_flos": 1913806159609856.0,
1184
+ "train_loss": 0.48002010657061983,
1185
+ "train_runtime": 43121.1313,
1186
+ "train_samples_per_second": 2.174,
1187
+ "train_steps_per_second": 0.017
1188
+ }
1189
+ ],
1190
+ "logging_steps": 5,
1191
+ "max_steps": 733,
1192
+ "num_input_tokens_seen": 0,
1193
+ "num_train_epochs": 1,
1194
+ "save_steps": 50,
1195
+ "stateful_callbacks": {
1196
+ "TrainerControl": {
1197
+ "args": {
1198
+ "should_epoch_stop": false,
1199
+ "should_evaluate": false,
1200
+ "should_log": false,
1201
+ "should_save": true,
1202
+ "should_training_stop": true
1203
+ },
1204
+ "attributes": {}
1205
+ }
1206
+ },
1207
+ "total_flos": 1913806159609856.0,
1208
+ "train_batch_size": 16,
1209
+ "trial_name": null,
1210
+ "trial_params": null
1211
+ }