Commit d3be871 (verified) by pl593 · 1 parent: 30f475f

upload trained GPN MSA model

README.md ADDED
@@ -0,0 +1,261 @@
+ ---
+ tags:
+ - generated_from_trainer
+ datasets:
+ - songlab/gpn-msa-sapiens-dataset
+ model-index:
+ - name: checkpoints
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # checkpoints
+
+ This model is a fine-tuned version of [](https://huggingface.co/) on the songlab/gpn-msa-sapiens-dataset dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.1593
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0001
+ - train_batch_size: 1024
+ - eval_batch_size: 1024
+ - seed: 42
+ - distributed_type: multi-GPU
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 2048
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 100
+ - training_steps: 10000
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:------:|:-----:|:---------------:|
+ | 0.6225 | 0.0232 | 50 | 0.1988 |
+ | 0.1508 | 0.0464 | 100 | 0.1836 |
+ | 0.1449 | 0.0696 | 150 | 0.1789 |
+ | 0.1421 | 0.0927 | 200 | 0.1762 |
+ | 0.141 | 0.1159 | 250 | 0.1764 |
+ | 0.1397 | 0.1391 | 300 | 0.1755 |
+ | 0.1393 | 0.1623 | 350 | 0.1741 |
+ | 0.1388 | 0.1855 | 400 | 0.1738 |
+ | 0.1394 | 0.2087 | 450 | 0.1725 |
+ | 0.1383 | 0.2319 | 500 | 0.1730 |
+ | 0.1376 | 0.2550 | 550 | 0.1717 |
+ | 0.1372 | 0.2782 | 600 | 0.1708 |
+ | 0.1361 | 0.3014 | 650 | 0.1728 |
+ | 0.1362 | 0.3246 | 700 | 0.1715 |
+ | 0.1364 | 0.3478 | 750 | 0.1713 |
+ | 0.1356 | 0.3710 | 800 | 0.1695 |
+ | 0.1353 | 0.3942 | 850 | 0.1687 |
+ | 0.1361 | 0.4173 | 900 | 0.1703 |
+ | 0.1354 | 0.4405 | 950 | 0.1697 |
+ | 0.1352 | 0.4637 | 1000 | 0.1695 |
+ | 0.1335 | 0.4869 | 1050 | 0.1683 |
+ | 0.1327 | 0.5101 | 1100 | 0.1686 |
+ | 0.1337 | 0.5333 | 1150 | 0.1692 |
+ | 0.134 | 0.5565 | 1200 | 0.1665 |
+ | 0.1341 | 0.5796 | 1250 | 0.1680 |
+ | 0.1347 | 0.6028 | 1300 | 0.1672 |
+ | 0.1335 | 0.6260 | 1350 | 0.1661 |
+ | 0.1338 | 0.6492 | 1400 | 0.1663 |
+ | 0.1335 | 0.6724 | 1450 | 0.1670 |
+ | 0.1332 | 0.6956 | 1500 | 0.1652 |
+ | 0.1336 | 0.7188 | 1550 | 0.1663 |
+ | 0.133 | 0.7419 | 1600 | 0.1656 |
+ | 0.1332 | 0.7651 | 1650 | 0.1661 |
+ | 0.1327 | 0.7883 | 1700 | 0.1656 |
+ | 0.1318 | 0.8115 | 1750 | 0.1662 |
+ | 0.1319 | 0.8347 | 1800 | 0.1652 |
+ | 0.1337 | 0.8579 | 1850 | 0.1639 |
+ | 0.1324 | 0.8811 | 1900 | 0.1648 |
+ | 0.1334 | 0.9042 | 1950 | 0.1651 |
+ | 0.1317 | 0.9274 | 2000 | 0.1638 |
+ | 0.1324 | 0.9506 | 2050 | 0.1649 |
+ | 0.1326 | 0.9738 | 2100 | 0.1660 |
+ | 0.1326 | 0.9970 | 2150 | 0.1640 |
+ | 0.132 | 1.0202 | 2200 | 0.1653 |
+ | 0.1319 | 1.0434 | 2250 | 0.1655 |
+ | 0.1326 | 1.0665 | 2300 | 0.1643 |
+ | 0.1321 | 1.0897 | 2350 | 0.1659 |
+ | 0.1317 | 1.1129 | 2400 | 0.1644 |
+ | 0.1322 | 1.1361 | 2450 | 0.1651 |
+ | 0.1325 | 1.1593 | 2500 | 0.1640 |
+ | 0.1311 | 1.1825 | 2550 | 0.1626 |
+ | 0.1323 | 1.2057 | 2600 | 0.1626 |
+ | 0.1316 | 1.2288 | 2650 | 0.1639 |
+ | 0.1314 | 1.2520 | 2700 | 0.1635 |
+ | 0.1314 | 1.2752 | 2750 | 0.1636 |
+ | 0.131 | 1.2984 | 2800 | 0.1626 |
+ | 0.1313 | 1.3216 | 2850 | 0.1632 |
+ | 0.1312 | 1.3448 | 2900 | 0.1637 |
+ | 0.1317 | 1.3680 | 2950 | 0.1640 |
+ | 0.1311 | 1.3911 | 3000 | 0.1621 |
+ | 0.1304 | 1.4143 | 3050 | 0.1631 |
+ | 0.1307 | 1.4375 | 3100 | 0.1624 |
+ | 0.1315 | 1.4607 | 3150 | 0.1642 |
+ | 0.1303 | 1.4839 | 3200 | 0.1636 |
+ | 0.1315 | 1.5071 | 3250 | 0.1622 |
+ | 0.1315 | 1.5303 | 3300 | 0.1629 |
+ | 0.1303 | 1.5534 | 3350 | 0.1642 |
+ | 0.1309 | 1.5766 | 3400 | 0.1618 |
+ | 0.1307 | 1.5998 | 3450 | 0.1631 |
+ | 0.1314 | 1.6230 | 3500 | 0.1629 |
+ | 0.1314 | 1.6462 | 3550 | 0.1628 |
+ | 0.1312 | 1.6694 | 3600 | 0.1631 |
+ | 0.1299 | 1.6926 | 3650 | 0.1618 |
+ | 0.1304 | 1.7157 | 3700 | 0.1624 |
+ | 0.1299 | 1.7389 | 3750 | 0.1632 |
+ | 0.1309 | 1.7621 | 3800 | 0.1623 |
+ | 0.1303 | 1.7853 | 3850 | 0.1631 |
+ | 0.1312 | 1.8085 | 3900 | 0.1616 |
+ | 0.1303 | 1.8317 | 3950 | 0.1622 |
+ | 0.1308 | 1.8549 | 4000 | 0.1632 |
+ | 0.1297 | 1.8780 | 4050 | 0.1620 |
+ | 0.1301 | 1.9012 | 4100 | 0.1617 |
+ | 0.131 | 1.9244 | 4150 | 0.1597 |
+ | 0.1296 | 1.9476 | 4200 | 0.1626 |
+ | 0.1299 | 1.9708 | 4250 | 0.1632 |
+ | 0.1299 | 1.9940 | 4300 | 0.1605 |
+ | 0.1296 | 2.0172 | 4350 | 0.1620 |
+ | 0.1302 | 2.0403 | 4400 | 0.1628 |
+ | 0.13 | 2.0635 | 4450 | 0.1621 |
+ | 0.1296 | 2.0867 | 4500 | 0.1616 |
+ | 0.1298 | 2.1099 | 4550 | 0.1613 |
+ | 0.1299 | 2.1331 | 4600 | 0.1603 |
+ | 0.1299 | 2.1563 | 4650 | 0.1621 |
+ | 0.1306 | 2.1795 | 4700 | 0.1614 |
+ | 0.1303 | 2.2026 | 4750 | 0.1625 |
+ | 0.13 | 2.2258 | 4800 | 0.1624 |
+ | 0.1295 | 2.2490 | 4850 | 0.1627 |
+ | 0.1299 | 2.2722 | 4900 | 0.1609 |
+ | 0.13 | 2.2954 | 4950 | 0.1622 |
+ | 0.1311 | 2.3186 | 5000 | 0.1602 |
+ | 0.1284 | 2.3418 | 5050 | 0.1616 |
+ | 0.13 | 2.3649 | 5100 | 0.1602 |
+ | 0.129 | 2.3881 | 5150 | 0.1605 |
+ | 0.129 | 2.4113 | 5200 | 0.1606 |
+ | 0.1297 | 2.4345 | 5250 | 0.1620 |
+ | 0.1293 | 2.4577 | 5300 | 0.1607 |
+ | 0.1288 | 2.4809 | 5350 | 0.1615 |
+ | 0.1294 | 2.5041 | 5400 | 0.1614 |
+ | 0.1285 | 2.5272 | 5450 | 0.1620 |
+ | 0.1303 | 2.5504 | 5500 | 0.1618 |
+ | 0.1291 | 2.5736 | 5550 | 0.1603 |
+ | 0.1298 | 2.5968 | 5600 | 0.1609 |
+ | 0.1288 | 2.6200 | 5650 | 0.1604 |
+ | 0.129 | 2.6432 | 5700 | 0.1600 |
+ | 0.1291 | 2.6664 | 5750 | 0.1597 |
+ | 0.1291 | 2.6895 | 5800 | 0.1609 |
+ | 0.129 | 2.7127 | 5850 | 0.1611 |
+ | 0.13 | 2.7359 | 5900 | 0.1600 |
+ | 0.1296 | 2.7591 | 5950 | 0.1603 |
+ | 0.1294 | 2.7823 | 6000 | 0.1592 |
+ | 0.1283 | 2.8055 | 6050 | 0.1618 |
+ | 0.1292 | 2.8287 | 6100 | 0.1612 |
+ | 0.128 | 2.8518 | 6150 | 0.1604 |
+ | 0.1288 | 2.8750 | 6200 | 0.1611 |
+ | 0.1283 | 2.8982 | 6250 | 0.1609 |
+ | 0.1292 | 2.9214 | 6300 | 0.1605 |
+ | 0.1302 | 2.9446 | 6350 | 0.1602 |
+ | 0.1285 | 2.9678 | 6400 | 0.1601 |
+ | 0.1286 | 2.9910 | 6450 | 0.1609 |
+ | 0.1301 | 3.0141 | 6500 | 0.1602 |
+ | 0.1296 | 3.0373 | 6550 | 0.1597 |
+ | 0.1291 | 3.0605 | 6600 | 0.1604 |
+ | 0.1288 | 3.0837 | 6650 | 0.1595 |
+ | 0.129 | 3.1069 | 6700 | 0.1593 |
+ | 0.1286 | 3.1301 | 6750 | 0.1600 |
+ | 0.1293 | 3.1533 | 6800 | 0.1599 |
+ | 0.1289 | 3.1764 | 6850 | 0.1599 |
+ | 0.1295 | 3.1996 | 6900 | 0.1601 |
+ | 0.1287 | 3.2228 | 6950 | 0.1592 |
+ | 0.1286 | 3.2460 | 7000 | 0.1600 |
+ | 0.1283 | 3.2692 | 7050 | 0.1598 |
+ | 0.1288 | 3.2924 | 7100 | 0.1612 |
+ | 0.1298 | 3.3156 | 7150 | 0.1597 |
+ | 0.1284 | 3.3387 | 7200 | 0.1605 |
+ | 0.1289 | 3.3619 | 7250 | 0.1605 |
+ | 0.1289 | 3.3851 | 7300 | 0.1600 |
+ | 0.1285 | 3.4083 | 7350 | 0.1605 |
+ | 0.1286 | 3.4315 | 7400 | 0.1610 |
+ | 0.1278 | 3.4547 | 7450 | 0.1598 |
+ | 0.1274 | 3.4779 | 7500 | 0.1598 |
+ | 0.1297 | 3.5010 | 7550 | 0.1599 |
+ | 0.1288 | 3.5242 | 7600 | 0.1591 |
+ | 0.1281 | 3.5474 | 7650 | 0.1598 |
+ | 0.1288 | 3.5706 | 7700 | 0.1600 |
+ | 0.128 | 3.5938 | 7750 | 0.1594 |
+ | 0.1287 | 3.6170 | 7800 | 0.1603 |
+ | 0.1291 | 3.6402 | 7850 | 0.1592 |
+ | 0.1287 | 3.6633 | 7900 | 0.1596 |
+ | 0.1283 | 3.6865 | 7950 | 0.1590 |
+ | 0.128 | 3.7097 | 8000 | 0.1584 |
+ | 0.1276 | 3.7329 | 8050 | 0.1602 |
+ | 0.1287 | 3.7561 | 8100 | 0.1602 |
+ | 0.1306 | 3.7793 | 8150 | 0.1595 |
+ | 0.1286 | 3.8025 | 8200 | 0.1587 |
+ | 0.1292 | 3.8256 | 8250 | 0.1593 |
+ | 0.1275 | 3.8488 | 8300 | 0.1590 |
+ | 0.1277 | 3.8720 | 8350 | 0.1600 |
+ | 0.129 | 3.8952 | 8400 | 0.1602 |
+ | 0.1286 | 3.9184 | 8450 | 0.1593 |
+ | 0.1281 | 3.9416 | 8500 | 0.1603 |
+ | 0.1285 | 3.9648 | 8550 | 0.1591 |
+ | 0.1293 | 3.9879 | 8600 | 0.1592 |
+ | 0.1283 | 4.0111 | 8650 | 0.1587 |
+ | 0.1277 | 4.0343 | 8700 | 0.1598 |
+ | 0.1283 | 4.0575 | 8750 | 0.1599 |
+ | 0.1288 | 4.0807 | 8800 | 0.1579 |
+ | 0.1287 | 4.1039 | 8850 | 0.1588 |
+ | 0.1294 | 4.1271 | 8900 | 0.1607 |
+ | 0.1277 | 4.1502 | 8950 | 0.1599 |
+ | 0.1285 | 4.1734 | 9000 | 0.1595 |
+ | 0.1289 | 4.1966 | 9050 | 0.1610 |
+ | 0.1289 | 4.2198 | 9100 | 0.1599 |
+ | 0.1283 | 4.2430 | 9150 | 0.1589 |
+ | 0.1282 | 4.2662 | 9200 | 0.1597 |
+ | 0.1286 | 4.2894 | 9250 | 0.1608 |
+ | 0.1287 | 4.3125 | 9300 | 0.1608 |
+ | 0.1287 | 4.3357 | 9350 | 0.1602 |
+ | 0.1286 | 4.3589 | 9400 | 0.1596 |
+ | 0.1289 | 4.3821 | 9450 | 0.1598 |
+ | 0.1286 | 4.4053 | 9500 | 0.1612 |
+ | 0.1281 | 4.4285 | 9550 | 0.1590 |
+ | 0.1276 | 4.4517 | 9600 | 0.1588 |
+ | 0.1289 | 4.4748 | 9650 | 0.1590 |
+ | 0.1284 | 4.4980 | 9700 | 0.1587 |
+ | 0.1284 | 4.5212 | 9750 | 0.1597 |
+ | 0.1297 | 4.5444 | 9800 | 0.1594 |
+ | 0.1276 | 4.5676 | 9850 | 0.1593 |
+ | 0.129 | 4.5908 | 9900 | 0.1592 |
+ | 0.1285 | 4.6140 | 9950 | 0.1603 |
+ | 0.1282 | 4.6371 | 10000 | 0.1601 |
+
+
+ ### Framework versions
+
+ - Transformers 4.40.2
+ - Pytorch 2.8.0+cu126
+ - Datasets 4.0.0
+ - Tokenizers 0.19.1
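
The README hyperparameters (learning_rate 0.0001, cosine scheduler, 100 warmup steps, 10000 training steps) imply a specific learning-rate curve. The sketch below is an illustration, not code from this commit: it assumes linear warmup followed by a half-cosine decay to zero, and the function name `lr_at` is hypothetical. Under that assumption it reproduces, to within floating-point rounding, the `learning_rate` values logged in `trainer_state.json` at steps 50, 100, 150 and 1750.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=100, total_steps=10_000):
    """Assumed schedule: linear warmup to base_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for step in (50, 100, 150, 1750, 10_000):
        print(step, lr_at(step))
```

Comparing the printed values against the logged ones is a quick way to confirm which scheduler a run actually used.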
all_results.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "epoch": 4.637143519591931,
+   "eval_loss": 0.15925397718542236,
+   "eval_runtime": 61.5353,
+   "eval_samples_per_second": 675.807,
+   "eval_steps_per_second": 0.666,
+   "perplexity": 1.1726357315864648,
+   "total_flos": 2.3231400526217216e+17,
+   "train_loss": 0.13326009378433226,
+   "train_runtime": 41368.3669,
+   "train_samples_per_second": 495.064,
+   "train_steps_per_second": 0.242
+ }
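
The figures in `all_results.json` can be cross-checked against each other and against the README hyperparameters. The sketch below is only a consistency check, under two assumptions: that `perplexity` is the exponential of `eval_loss`, and that the throughput figures follow from 10000 steps at a total batch size of 2048. The variable names are illustrative.

```python
import math

# Values copied from all_results.json and the README hyperparameters.
results = {
    "eval_loss": 0.15925397718542236,
    "perplexity": 1.1726357315864648,
    "train_runtime": 41368.3669,          # seconds
    "train_samples_per_second": 495.064,
    "train_steps_per_second": 0.242,
}
training_steps = 10_000
total_train_batch_size = 2048

# Assumption: perplexity is exp(cross-entropy loss in nats).
perplexity = math.exp(results["eval_loss"])

# Assumption: every optimizer step consumes one full effective batch.
samples_seen = training_steps * total_train_batch_size
samples_per_second = samples_seen / results["train_runtime"]
steps_per_second = training_steps / results["train_runtime"]

if __name__ == "__main__":
    print(perplexity, samples_seen, samples_per_second, steps_per_second)
```

Both assumptions agree with the reported numbers to the precision shown in the file.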
checkpoint-8800/config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "architectures": [
+     "GPNRoFormerForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "aux_features_vocab_size": 5,
+   "embedding_size": 768,
+   "group_tokens": 1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 1536,
+   "model_type": "GPNRoFormer",
+   "n_aux_features": 445,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 2,
+   "pad_token_id": 0,
+   "rotary_value": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.40.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 6
+ }
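
A few derived quantities follow directly from this config. The sketch below is a rough estimate only: it assumes each encoder layer has the standard BERT-style layout (four attention projections with biases, a two-layer feed-forward block, two LayerNorms), which the source does not confirm for `GPNRoFormer`. The helper name `encoder_layer_params` is hypothetical, and embeddings and the masked-LM head are deliberately left out.

```python
# Subset of checkpoint-8800/config.json used for the estimate.
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 2,
}

def encoder_layer_params(hidden, intermediate):
    """Parameter count of one layer under the assumed BERT-style layout."""
    attention = 4 * (hidden * hidden + hidden)      # Q, K, V, output (weights + biases)
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                  # scale and shift for two LayerNorms
    return attention + feed_forward + layer_norms

head_dim = config["hidden_size"] // config["num_attention_heads"]
per_layer = encoder_layer_params(config["hidden_size"], config["intermediate_size"])
encoder_total = per_layer * config["num_hidden_layers"]
encoder_bytes_fp32 = encoder_total * 4              # torch_dtype is float32

if __name__ == "__main__":
    print(head_dim, per_layer, encoder_total, encoder_bytes_fp32)
```

Under these assumptions the encoder layers account for roughly 56.7 MB in float32, a little less than the 59,497,084-byte `pytorch_model.bin` in this commit; the remainder would be embeddings and the prediction head.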
checkpoint-8800/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3fe36920de849781b2a7f361176544f9d0a1ee3f2ebe62c3627426ee9518250
+ size 118212491
checkpoint-8800/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0e79347487c61d8217106a3f8f05bbf42d7c2038dab7a7a461077975e6acff9
+ size 59497084
checkpoint-8800/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8637e509d3b3a1df7b3097e625b6e4859dba03dc277087294b9305e0298e9f05
+ size 14709
checkpoint-8800/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0294dd3bf153ec3af58682fe4ad6efb3b9c01758053c3cd8df76a3af27e88fb
+ size 1465
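
The four binary files above are stored as Git LFS pointer files: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. As a minimal sketch, assuming exactly the three-line layout shown in this diff, a pointer can be parsed as follows; the function name `parse_lfs_pointer` is hypothetical and not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file made of 'key value' lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

# Pointer for checkpoint-8800/pytorch_model.bin, copied from the diff.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:a0e79347487c61d8217106a3f8f05bbf42d7c2038dab7a7a461077975e6acff9
size 59497084
"""
pointer = parse_lfs_pointer(pointer_text)

if __name__ == "__main__":
    print(pointer["hash_algo"], pointer["size"])
```

The parsed `size` and `digest` can then be compared with the downloaded file to verify that the checkpoint arrived intact.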
checkpoint-8800/trainer_state.json ADDED
@@ -0,0 +1,2661 @@
+ {
+   "best_metric": 0.15788726458515429,
+   "best_model_checkpoint": "checkpoints/checkpoint-8800",
+   "epoch": 4.0806862972408995,
+   "eval_steps": 50,
+   "global_step": 8800,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.023185717597959656,
+       "grad_norm": 0.16052097082138062,
+       "learning_rate": 5e-05,
+       "loss": 0.6225,
+       "step": 50
+     },
+     {
+       "epoch": 0.023185717597959656,
+       "eval_loss": 0.1987911110084725,
+       "eval_runtime": 63.5433,
+       "eval_samples_per_second": 654.451,
+       "eval_steps_per_second": 0.645,
+       "step": 50
+     },
+     {
+       "epoch": 0.04637143519591931,
+       "grad_norm": 0.09532159566879272,
+       "learning_rate": 0.0001,
+       "loss": 0.1508,
+       "step": 100
+     },
+     {
+       "epoch": 0.04637143519591931,
+       "eval_loss": 0.18357936446787168,
+       "eval_runtime": 60.9844,
+       "eval_samples_per_second": 681.912,
+       "eval_steps_per_second": 0.672,
+       "step": 100
+     },
+     {
+       "epoch": 0.06955715279387897,
+       "grad_norm": 0.24056212604045868,
+       "learning_rate": 9.999370638369377e-05,
+       "loss": 0.1449,
+       "step": 150
+     },
+     {
+       "epoch": 0.06955715279387897,
+       "eval_loss": 0.17892658896642444,
+       "eval_runtime": 60.7834,
+       "eval_samples_per_second": 684.167,
+       "eval_steps_per_second": 0.675,
+       "step": 150
+     },
+     {
+       "epoch": 0.09274287039183862,
+       "grad_norm": 0.09350813180208206,
+       "learning_rate": 9.997482711915927e-05,
+       "loss": 0.1421,
+       "step": 200
+     },
+     {
+       "epoch": 0.09274287039183862,
+       "eval_loss": 0.17624869869175752,
+       "eval_runtime": 60.3826,
+       "eval_samples_per_second": 688.708,
+       "eval_steps_per_second": 0.679,
+       "step": 200
+     },
+     {
+       "epoch": 0.11592858798979828,
+       "grad_norm": 0.12230529636144638,
+       "learning_rate": 9.99433669591504e-05,
+       "loss": 0.141,
+       "step": 250
+     },
+     {
+       "epoch": 0.11592858798979828,
+       "eval_loss": 0.17641382363047173,
+       "eval_runtime": 60.4169,
+       "eval_samples_per_second": 688.317,
+       "eval_steps_per_second": 0.679,
+       "step": 250
+     },
+     {
+       "epoch": 0.13911430558775795,
+       "grad_norm": 0.14592748880386353,
+       "learning_rate": 9.989933382359422e-05,
+       "loss": 0.1397,
+       "step": 300
+     },
+     {
+       "epoch": 0.13911430558775795,
+       "eval_loss": 0.17552215792639078,
+       "eval_runtime": 61.6101,
+       "eval_samples_per_second": 674.987,
+       "eval_steps_per_second": 0.665,
+       "step": 300
+     },
+     {
+       "epoch": 0.1623000231857176,
+       "grad_norm": 0.10219988226890564,
+       "learning_rate": 9.984273879759713e-05,
+       "loss": 0.1393,
+       "step": 350
+     },
+     {
+       "epoch": 0.1623000231857176,
+       "eval_loss": 0.17414749172793012,
+       "eval_runtime": 61.4962,
+       "eval_samples_per_second": 676.237,
+       "eval_steps_per_second": 0.667,
+       "step": 350
+     },
+     {
+       "epoch": 0.18548574078367724,
+       "grad_norm": 0.12338168174028397,
+       "learning_rate": 9.977359612865423e-05,
+       "loss": 0.1388,
+       "step": 400
+     },
+     {
+       "epoch": 0.18548574078367724,
+       "eval_loss": 0.17378012638412807,
+       "eval_runtime": 61.0462,
+       "eval_samples_per_second": 681.221,
+       "eval_steps_per_second": 0.672,
+       "step": 400
+     },
+     {
+       "epoch": 0.20867145838163692,
+       "grad_norm": 0.09479879587888718,
+       "learning_rate": 9.969192322306271e-05,
+       "loss": 0.1394,
+       "step": 450
+     },
+     {
+       "epoch": 0.20867145838163692,
+       "eval_loss": 0.17252362204688398,
+       "eval_runtime": 60.963,
+       "eval_samples_per_second": 682.151,
+       "eval_steps_per_second": 0.673,
+       "step": 450
+     },
+     {
+       "epoch": 0.23185717597959657,
+       "grad_norm": 0.1108623668551445,
+       "learning_rate": 9.959774064153977e-05,
+       "loss": 0.1383,
+       "step": 500
+     },
+     {
+       "epoch": 0.23185717597959657,
+       "eval_loss": 0.17298176916843877,
+       "eval_runtime": 60.6546,
+       "eval_samples_per_second": 685.62,
+       "eval_steps_per_second": 0.676,
+       "step": 500
+     },
+     {
+       "epoch": 0.2550428935775562,
+       "grad_norm": 0.0725204199552536,
+       "learning_rate": 9.949107209404665e-05,
+       "loss": 0.1376,
+       "step": 550
+     },
+     {
+       "epoch": 0.2550428935775562,
+       "eval_loss": 0.17165218165539878,
+       "eval_runtime": 59.4888,
+       "eval_samples_per_second": 699.056,
+       "eval_steps_per_second": 0.689,
+       "step": 550
+     },
+     {
+       "epoch": 0.2782286111755159,
+       "grad_norm": 0.0955963134765625,
+       "learning_rate": 9.937194443381972e-05,
+       "loss": 0.1372,
+       "step": 600
+     },
+     {
+       "epoch": 0.2782286111755159,
+       "eval_loss": 0.17077083113718278,
+       "eval_runtime": 60.6021,
+       "eval_samples_per_second": 686.214,
+       "eval_steps_per_second": 0.677,
+       "step": 600
+     },
+     {
+       "epoch": 0.3014143287734755,
+       "grad_norm": 0.18736732006072998,
+       "learning_rate": 9.924038765061042e-05,
+       "loss": 0.1361,
+       "step": 650
+     },
+     {
+       "epoch": 0.3014143287734755,
+       "eval_loss": 0.1727738813183492,
+       "eval_runtime": 60.3343,
+       "eval_samples_per_second": 689.259,
+       "eval_steps_per_second": 0.68,
+       "step": 650
+     },
+     {
+       "epoch": 0.3246000463714352,
+       "grad_norm": 0.09572151303291321,
+       "learning_rate": 9.909643486313533e-05,
+       "loss": 0.1362,
+       "step": 700
+     },
+     {
+       "epoch": 0.3246000463714352,
+       "eval_loss": 0.17145407115151273,
+       "eval_runtime": 60.2732,
+       "eval_samples_per_second": 689.959,
+       "eval_steps_per_second": 0.68,
+       "step": 700
+     },
+     {
+       "epoch": 0.34778576396939487,
+       "grad_norm": 0.07214252650737762,
+       "learning_rate": 9.894012231073894e-05,
+       "loss": 0.1364,
+       "step": 750
+     },
+     {
+       "epoch": 0.34778576396939487,
+       "eval_loss": 0.17133199408489355,
+       "eval_runtime": 60.0148,
+       "eval_samples_per_second": 692.929,
+       "eval_steps_per_second": 0.683,
+       "step": 750
+     },
+     {
+       "epoch": 0.3709714815673545,
+       "grad_norm": 0.18224318325519562,
+       "learning_rate": 9.877148934427037e-05,
+       "loss": 0.1356,
+       "step": 800
+     },
+     {
+       "epoch": 0.3709714815673545,
+       "eval_loss": 0.16949569222888886,
+       "eval_runtime": 60.1491,
+       "eval_samples_per_second": 691.382,
+       "eval_steps_per_second": 0.682,
+       "step": 800
+     },
+     {
+       "epoch": 0.39415719916531416,
+       "grad_norm": 0.06306415796279907,
+       "learning_rate": 9.859057841617709e-05,
+       "loss": 0.1353,
+       "step": 850
+     },
+     {
+       "epoch": 0.39415719916531416,
+       "eval_loss": 0.1686952690798172,
+       "eval_runtime": 60.6447,
+       "eval_samples_per_second": 685.731,
+       "eval_steps_per_second": 0.676,
+       "step": 850
+     },
+     {
+       "epoch": 0.41734291676327384,
+       "grad_norm": 0.10090287029743195,
+       "learning_rate": 9.839743506981782e-05,
+       "loss": 0.1361,
+       "step": 900
+     },
+     {
+       "epoch": 0.41734291676327384,
+       "eval_loss": 0.17026100034088926,
+       "eval_runtime": 61.5224,
+       "eval_samples_per_second": 675.949,
+       "eval_steps_per_second": 0.666,
+       "step": 900
+     },
+     {
+       "epoch": 0.44052863436123346,
+       "grad_norm": 0.10061236470937729,
+       "learning_rate": 9.819210792799712e-05,
+       "loss": 0.1354,
+       "step": 950
+     },
+     {
+       "epoch": 0.44052863436123346,
+       "eval_loss": 0.16971544565694113,
+       "eval_runtime": 60.488,
+       "eval_samples_per_second": 687.508,
+       "eval_steps_per_second": 0.678,
+       "step": 950
+     },
+     {
+       "epoch": 0.46371435195919314,
+       "grad_norm": 0.06525534391403198,
+       "learning_rate": 9.797464868072488e-05,
+       "loss": 0.1352,
+       "step": 1000
+     },
+     {
+       "epoch": 0.46371435195919314,
+       "eval_loss": 0.16946903195393553,
+       "eval_runtime": 61.3558,
+       "eval_samples_per_second": 677.784,
+       "eval_steps_per_second": 0.668,
+       "step": 1000
+     },
+     {
+       "epoch": 0.4869000695571528,
+       "grad_norm": 0.06269507855176926,
+       "learning_rate": 9.77451120722037e-05,
+       "loss": 0.1335,
+       "step": 1050
+     },
+     {
+       "epoch": 0.4869000695571528,
+       "eval_loss": 0.16825352947444114,
+       "eval_runtime": 60.1971,
+       "eval_samples_per_second": 690.831,
+       "eval_steps_per_second": 0.681,
+       "step": 1050
+     },
+     {
+       "epoch": 0.5100857871551124,
+       "grad_norm": 0.08187470585107803,
+       "learning_rate": 9.750355588704727e-05,
+       "loss": 0.1327,
+       "step": 1100
+     },
+     {
+       "epoch": 0.5100857871551124,
+       "eval_loss": 0.16861737282523587,
+       "eval_runtime": 59.5715,
+       "eval_samples_per_second": 698.085,
+       "eval_steps_per_second": 0.688,
+       "step": 1100
+     },
+     {
+       "epoch": 0.5332715047530721,
+       "grad_norm": 0.06607680767774582,
+       "learning_rate": 9.725004093573342e-05,
+       "loss": 0.1337,
+       "step": 1150
+     },
+     {
+       "epoch": 0.5332715047530721,
+       "eval_loss": 0.1692070748762034,
+       "eval_runtime": 60.0637,
+       "eval_samples_per_second": 692.364,
+       "eval_steps_per_second": 0.683,
+       "step": 1150
+     },
+     {
+       "epoch": 0.5564572223510318,
+       "grad_norm": 0.09759815782308578,
+       "learning_rate": 9.698463103929542e-05,
+       "loss": 0.134,
+       "step": 1200
+     },
+     {
+       "epoch": 0.5564572223510318,
+       "eval_loss": 0.16649228385381692,
+       "eval_runtime": 60.4305,
+       "eval_samples_per_second": 688.162,
+       "eval_steps_per_second": 0.678,
+       "step": 1200
+     },
+     {
+       "epoch": 0.5796429399489914,
+       "grad_norm": 0.10353852063417435,
+       "learning_rate": 9.670739301325534e-05,
+       "loss": 0.1341,
+       "step": 1250
+     },
+     {
+       "epoch": 0.5796429399489914,
+       "eval_loss": 0.16802514322459206,
+       "eval_runtime": 60.0955,
+       "eval_samples_per_second": 691.999,
+       "eval_steps_per_second": 0.682,
+       "step": 1250
+     },
+     {
+       "epoch": 0.602828657546951,
+       "grad_norm": 0.11834366619586945,
+       "learning_rate": 9.641839665080363e-05,
+       "loss": 0.1347,
+       "step": 1300
+     },
+     {
+       "epoch": 0.602828657546951,
+       "eval_loss": 0.1672302417427292,
+       "eval_runtime": 60.3484,
+       "eval_samples_per_second": 689.098,
+       "eval_steps_per_second": 0.679,
+       "step": 1300
+     },
+     {
+       "epoch": 0.6260143751449108,
+       "grad_norm": 0.06963012367486954,
+       "learning_rate": 9.611771470522908e-05,
+       "loss": 0.1335,
+       "step": 1350
+     },
+     {
+       "epoch": 0.6260143751449108,
+       "eval_loss": 0.16607839684977216,
+       "eval_runtime": 60.3308,
+       "eval_samples_per_second": 689.3,
+       "eval_steps_per_second": 0.68,
+       "step": 1350
+     },
+     {
+       "epoch": 0.6492000927428704,
+       "grad_norm": 0.06842990219593048,
+       "learning_rate": 9.580542287160348e-05,
+       "loss": 0.1338,
+       "step": 1400
+     },
+     {
+       "epoch": 0.6492000927428704,
+       "eval_loss": 0.16628812684035693,
+       "eval_runtime": 59.9335,
+       "eval_samples_per_second": 693.87,
+       "eval_steps_per_second": 0.684,
+       "step": 1400
+     },
+     {
+       "epoch": 0.67238581034083,
+       "grad_norm": 0.07053674757480621,
+       "learning_rate": 9.548159976772592e-05,
+       "loss": 0.1335,
+       "step": 1450
+     },
+     {
+       "epoch": 0.67238581034083,
+       "eval_loss": 0.16696060882262428,
+       "eval_runtime": 59.8079,
+       "eval_samples_per_second": 695.326,
+       "eval_steps_per_second": 0.686,
+       "step": 1450
+     },
+     {
+       "epoch": 0.6955715279387897,
+       "grad_norm": 0.09175281971693039,
+       "learning_rate": 9.514632691433107e-05,
+       "loss": 0.1332,
+       "step": 1500
+     },
+     {
+       "epoch": 0.6955715279387897,
+       "eval_loss": 0.16521949465081834,
+       "eval_runtime": 60.1856,
+       "eval_samples_per_second": 690.963,
+       "eval_steps_per_second": 0.681,
+       "step": 1500
+     },
+     {
+       "epoch": 0.7187572455367494,
+       "grad_norm": 0.05836635082960129,
+       "learning_rate": 9.479968871456679e-05,
+       "loss": 0.1336,
+       "step": 1550
+     },
+     {
+       "epoch": 0.7187572455367494,
+       "eval_loss": 0.16626366255041727,
+       "eval_runtime": 60.6256,
+       "eval_samples_per_second": 685.948,
+       "eval_steps_per_second": 0.676,
+       "step": 1550
+     },
+     {
+       "epoch": 0.741942963134709,
+       "grad_norm": 0.07249301671981812,
+       "learning_rate": 9.444177243274618e-05,
+       "loss": 0.133,
+       "step": 1600
+     },
+     {
+       "epoch": 0.741942963134709,
+       "eval_loss": 0.1655649439629329,
+       "eval_runtime": 60.2447,
+       "eval_samples_per_second": 690.285,
+       "eval_steps_per_second": 0.681,
+       "step": 1600
+     },
+     {
+       "epoch": 0.7651286807326687,
+       "grad_norm": 0.07509302347898483,
+       "learning_rate": 9.407266817237911e-05,
+       "loss": 0.1332,
+       "step": 1650
+     },
+     {
+       "epoch": 0.7651286807326687,
+       "eval_loss": 0.16605371203296967,
+       "eval_runtime": 59.8196,
+       "eval_samples_per_second": 695.191,
+       "eval_steps_per_second": 0.685,
+       "step": 1650
+     },
+     {
+       "epoch": 0.7883143983306283,
+       "grad_norm": 0.07540406286716461,
+       "learning_rate": 9.369246885348926e-05,
+       "loss": 0.1327,
+       "step": 1700
+     },
+     {
+       "epoch": 0.7883143983306283,
+       "eval_loss": 0.16555590021301406,
+       "eval_runtime": 60.4119,
+       "eval_samples_per_second": 688.374,
+       "eval_steps_per_second": 0.679,
+       "step": 1700
+     },
+     {
+       "epoch": 0.811500115928588,
+       "grad_norm": 0.06061087176203728,
+       "learning_rate": 9.330127018922194e-05,
+       "loss": 0.1318,
+       "step": 1750
+     },
+     {
+       "epoch": 0.811500115928588,
+       "eval_loss": 0.16623179673527624,
+       "eval_runtime": 59.7807,
+       "eval_samples_per_second": 695.643,
+       "eval_steps_per_second": 0.686,
+       "step": 1750
+     },
+     {
+       "epoch": 0.8346858335265477,
+       "grad_norm": 0.05577518790960312,
+       "learning_rate": 9.289917066174886e-05,
+       "loss": 0.1319,
+       "step": 1800
+     },
+     {
+       "epoch": 0.8346858335265477,
+       "eval_loss": 0.16519989030959317,
+       "eval_runtime": 60.1508,
+       "eval_samples_per_second": 691.363,
+       "eval_steps_per_second": 0.682,
+       "step": 1800
+     },
+     {
+       "epoch": 0.8578715511245073,
+       "grad_norm": 0.06929640471935272,
+       "learning_rate": 9.248627149747573e-05,
+       "loss": 0.1337,
+       "step": 1850
+     },
+     {
+       "epoch": 0.8578715511245073,
+       "eval_loss": 0.16394849849125304,
+       "eval_runtime": 60.1044,
+       "eval_samples_per_second": 691.896,
+       "eval_steps_per_second": 0.682,
+       "step": 1850
+     },
+     {
+       "epoch": 0.8810572687224669,
+       "grad_norm": 0.07941466569900513,
+       "learning_rate": 9.206267664155907e-05,
+       "loss": 0.1324,
+       "step": 1900
+     },
+     {
+       "epoch": 0.8810572687224669,
+       "eval_loss": 0.1648257591054525,
+       "eval_runtime": 59.9818,
+       "eval_samples_per_second": 693.31,
+       "eval_steps_per_second": 0.684,
+       "step": 1900
+     },
+     {
+       "epoch": 0.9042429863204267,
+       "grad_norm": 0.09700328856706619,
+       "learning_rate": 9.162849273173857e-05,
+       "loss": 0.1334,
+       "step": 1950
+     },
+     {
+       "epoch": 0.9042429863204267,
+       "eval_loss": 0.16508159820082235,
+       "eval_runtime": 60.0956,
+       "eval_samples_per_second": 691.997,
+       "eval_steps_per_second": 0.682,
+       "step": 1950
+     },
+     {
+       "epoch": 0.9274287039183863,
598
+ "grad_norm": 0.09397923946380615,
599
+ "learning_rate": 9.118382907149165e-05,
600
+ "loss": 0.1317,
601
+ "step": 2000
602
+ },
603
+ {
604
+ "epoch": 0.9274287039183863,
605
+ "eval_loss": 0.16377645660046958,
606
+ "eval_runtime": 60.471,
607
+ "eval_samples_per_second": 687.701,
608
+ "eval_steps_per_second": 0.678,
609
+ "step": 2000
610
+ },
611
+ {
612
+ "epoch": 0.9506144215163459,
613
+ "grad_norm": 0.08097202330827713,
614
+ "learning_rate": 9.072879760251679e-05,
615
+ "loss": 0.1324,
616
+ "step": 2050
617
+ },
618
+ {
619
+ "epoch": 0.9506144215163459,
620
+ "eval_loss": 0.16491611914973717,
621
+ "eval_runtime": 60.6678,
622
+ "eval_samples_per_second": 685.471,
623
+ "eval_steps_per_second": 0.676,
624
+ "step": 2050
625
+ },
626
+ {
627
+ "epoch": 0.9738001391143056,
628
+ "grad_norm": 0.08455361425876617,
629
+ "learning_rate": 9.026351287655294e-05,
630
+ "loss": 0.1326,
631
+ "step": 2100
632
+ },
633
+ {
634
+ "epoch": 0.9738001391143056,
635
+ "eval_loss": 0.16602741997032858,
636
+ "eval_runtime": 60.5593,
637
+ "eval_samples_per_second": 686.698,
638
+ "eval_steps_per_second": 0.677,
639
+ "step": 2100
640
+ },
641
+ {
642
+ "epoch": 0.9969858567122653,
643
+ "grad_norm": 0.056316621601581573,
644
+ "learning_rate": 8.978809202654162e-05,
645
+ "loss": 0.1326,
646
+ "step": 2150
647
+ },
648
+ {
649
+ "epoch": 0.9969858567122653,
650
+ "eval_loss": 0.1640188218462461,
651
+ "eval_runtime": 60.9228,
652
+ "eval_samples_per_second": 682.602,
653
+ "eval_steps_per_second": 0.673,
654
+ "step": 2150
655
+ },
656
+ {
657
+ "epoch": 1.0201715743102249,
658
+ "grad_norm": 0.06686601787805557,
659
+ "learning_rate": 8.930265473713938e-05,
660
+ "loss": 0.132,
661
+ "step": 2200
662
+ },
663
+ {
664
+ "epoch": 1.0201715743102249,
665
+ "eval_loss": 0.1652621257294944,
666
+ "eval_runtime": 60.883,
667
+ "eval_samples_per_second": 683.048,
668
+ "eval_steps_per_second": 0.673,
669
+ "step": 2200
670
+ },
671
+ {
672
+ "epoch": 1.0433572919081846,
673
+ "grad_norm": 0.040202509611845016,
674
+ "learning_rate": 8.880732321458784e-05,
675
+ "loss": 0.1319,
676
+ "step": 2250
677
+ },
678
+ {
679
+ "epoch": 1.0433572919081846,
680
+ "eval_loss": 0.1655291575008717,
681
+ "eval_runtime": 60.3109,
682
+ "eval_samples_per_second": 689.527,
683
+ "eval_steps_per_second": 0.68,
684
+ "step": 2250
685
+ },
686
+ {
687
+ "epoch": 1.0665430095061441,
688
+ "grad_norm": 0.0656428411602974,
689
+ "learning_rate": 8.83022221559489e-05,
690
+ "loss": 0.1326,
691
+ "step": 2300
692
+ },
693
+ {
694
+ "epoch": 1.0665430095061441,
695
+ "eval_loss": 0.16431572407087935,
696
+ "eval_runtime": 60.1036,
697
+ "eval_samples_per_second": 691.906,
698
+ "eval_steps_per_second": 0.682,
699
+ "step": 2300
700
+ },
701
+ {
702
+ "epoch": 1.0897287271041038,
703
+ "grad_norm": 0.06945247948169708,
704
+ "learning_rate": 8.778747871771292e-05,
705
+ "loss": 0.1321,
706
+ "step": 2350
707
+ },
708
+ {
709
+ "epoch": 1.0897287271041038,
710
+ "eval_loss": 0.16585482329987242,
711
+ "eval_runtime": 60.6263,
712
+ "eval_samples_per_second": 685.94,
713
+ "eval_steps_per_second": 0.676,
714
+ "step": 2350
715
+ },
716
+ {
717
+ "epoch": 1.1129144447020636,
718
+ "grad_norm": 0.0523492731153965,
719
+ "learning_rate": 8.726322248378775e-05,
720
+ "loss": 0.1317,
721
+ "step": 2400
722
+ },
723
+ {
724
+ "epoch": 1.1129144447020636,
725
+ "eval_loss": 0.16438524923736036,
726
+ "eval_runtime": 60.317,
727
+ "eval_samples_per_second": 689.457,
728
+ "eval_steps_per_second": 0.68,
729
+ "step": 2400
730
+ },
731
+ {
732
+ "epoch": 1.136100162300023,
733
+ "grad_norm": 0.07777334004640579,
734
+ "learning_rate": 8.672958543287666e-05,
735
+ "loss": 0.1322,
736
+ "step": 2450
737
+ },
738
+ {
739
+ "epoch": 1.136100162300023,
740
+ "eval_loss": 0.16509696053644565,
741
+ "eval_runtime": 60.26,
742
+ "eval_samples_per_second": 690.109,
743
+ "eval_steps_per_second": 0.68,
744
+ "step": 2450
745
+ },
746
+ {
747
+ "epoch": 1.1592858798979828,
748
+ "grad_norm": 0.06430637836456299,
749
+ "learning_rate": 8.618670190525352e-05,
750
+ "loss": 0.1325,
751
+ "step": 2500
752
+ },
753
+ {
754
+ "epoch": 1.1592858798979828,
755
+ "eval_loss": 0.1639541608445008,
756
+ "eval_runtime": 60.5314,
757
+ "eval_samples_per_second": 687.015,
758
+ "eval_steps_per_second": 0.677,
759
+ "step": 2500
760
+ },
761
+ {
762
+ "epoch": 1.1824715974959426,
763
+ "grad_norm": 0.11194106936454773,
764
+ "learning_rate": 8.563470856894316e-05,
765
+ "loss": 0.1311,
766
+ "step": 2550
767
+ },
768
+ {
769
+ "epoch": 1.1824715974959426,
770
+ "eval_loss": 0.16260699934317355,
771
+ "eval_runtime": 60.3659,
772
+ "eval_samples_per_second": 688.899,
773
+ "eval_steps_per_second": 0.679,
774
+ "step": 2550
775
+ },
776
+ {
777
+ "epoch": 1.205657315093902,
778
+ "grad_norm": 0.06165901944041252,
779
+ "learning_rate": 8.507374438531607e-05,
780
+ "loss": 0.1323,
781
+ "step": 2600
782
+ },
783
+ {
784
+ "epoch": 1.205657315093902,
785
+ "eval_loss": 0.1626319663130242,
786
+ "eval_runtime": 59.9516,
787
+ "eval_samples_per_second": 693.66,
788
+ "eval_steps_per_second": 0.684,
789
+ "step": 2600
790
+ },
791
+ {
792
+ "epoch": 1.2288430326918618,
793
+ "grad_norm": 0.10654885321855545,
794
+ "learning_rate": 8.450395057410561e-05,
795
+ "loss": 0.1316,
796
+ "step": 2650
797
+ },
798
+ {
799
+ "epoch": 1.2288430326918618,
800
+ "eval_loss": 0.16393000041041636,
801
+ "eval_runtime": 59.576,
802
+ "eval_samples_per_second": 698.032,
803
+ "eval_steps_per_second": 0.688,
804
+ "step": 2650
805
+ },
806
+ {
807
+ "epoch": 1.2520287502898215,
808
+ "grad_norm": 0.04848140478134155,
809
+ "learning_rate": 8.392547057785661e-05,
810
+ "loss": 0.1314,
811
+ "step": 2700
812
+ },
813
+ {
814
+ "epoch": 1.2520287502898215,
815
+ "eval_loss": 0.16348152455768114,
816
+ "eval_runtime": 60.098,
817
+ "eval_samples_per_second": 691.97,
818
+ "eval_steps_per_second": 0.682,
819
+ "step": 2700
820
+ },
821
+ {
822
+ "epoch": 1.275214467887781,
823
+ "grad_norm": 0.0573604516685009,
824
+ "learning_rate": 8.333845002581458e-05,
825
+ "loss": 0.1314,
826
+ "step": 2750
827
+ },
828
+ {
829
+ "epoch": 1.275214467887781,
830
+ "eval_loss": 0.16364089140116167,
831
+ "eval_runtime": 60.1364,
832
+ "eval_samples_per_second": 691.528,
833
+ "eval_steps_per_second": 0.682,
834
+ "step": 2750
835
+ },
836
+ {
837
+ "epoch": 1.2984001854857408,
838
+ "grad_norm": 0.053159259259700775,
839
+ "learning_rate": 8.274303669726426e-05,
840
+ "loss": 0.131,
841
+ "step": 2800
842
+ },
843
+ {
844
+ "epoch": 1.2984001854857408,
845
+ "eval_loss": 0.16257415365129801,
846
+ "eval_runtime": 60.0025,
847
+ "eval_samples_per_second": 693.071,
848
+ "eval_steps_per_second": 0.683,
849
+ "step": 2800
850
+ },
851
+ {
852
+ "epoch": 1.3215859030837005,
853
+ "grad_norm": 0.09136148542165756,
854
+ "learning_rate": 8.213938048432697e-05,
855
+ "loss": 0.1313,
856
+ "step": 2850
857
+ },
858
+ {
859
+ "epoch": 1.3215859030837005,
860
+ "eval_loss": 0.16324665471619784,
861
+ "eval_runtime": 59.8429,
862
+ "eval_samples_per_second": 694.92,
863
+ "eval_steps_per_second": 0.685,
864
+ "step": 2850
865
+ },
866
+ {
867
+ "epoch": 1.34477162068166,
868
+ "grad_norm": 0.05825324356555939,
869
+ "learning_rate": 8.152763335422613e-05,
870
+ "loss": 0.1312,
871
+ "step": 2900
872
+ },
873
+ {
874
+ "epoch": 1.34477162068166,
875
+ "eval_loss": 0.16367374608121235,
876
+ "eval_runtime": 60.219,
877
+ "eval_samples_per_second": 690.579,
878
+ "eval_steps_per_second": 0.681,
879
+ "step": 2900
880
+ },
881
+ {
882
+ "epoch": 1.3679573382796197,
883
+ "grad_norm": 0.06379790604114532,
884
+ "learning_rate": 8.090794931103026e-05,
885
+ "loss": 0.1317,
886
+ "step": 2950
887
+ },
888
+ {
889
+ "epoch": 1.3679573382796197,
890
+ "eval_loss": 0.16400733758786312,
891
+ "eval_runtime": 59.9641,
892
+ "eval_samples_per_second": 693.515,
893
+ "eval_steps_per_second": 0.684,
894
+ "step": 2950
895
+ },
896
+ {
897
+ "epoch": 1.3911430558775795,
898
+ "grad_norm": 0.05361103266477585,
899
+ "learning_rate": 8.028048435688333e-05,
900
+ "loss": 0.1311,
901
+ "step": 3000
902
+ },
903
+ {
904
+ "epoch": 1.3911430558775795,
905
+ "eval_loss": 0.16210626991928834,
906
+ "eval_runtime": 59.5858,
907
+ "eval_samples_per_second": 697.919,
908
+ "eval_steps_per_second": 0.688,
909
+ "step": 3000
910
+ },
911
+ {
912
+ "epoch": 1.414328773475539,
913
+ "grad_norm": 0.04593402519822121,
914
+ "learning_rate": 7.964539645273204e-05,
915
+ "loss": 0.1304,
916
+ "step": 3050
917
+ },
918
+ {
919
+ "epoch": 1.414328773475539,
920
+ "eval_loss": 0.163067463275087,
921
+ "eval_runtime": 60.098,
922
+ "eval_samples_per_second": 691.97,
923
+ "eval_steps_per_second": 0.682,
924
+ "step": 3050
925
+ },
926
+ {
927
+ "epoch": 1.4375144910734987,
928
+ "grad_norm": 0.057480327785015106,
929
+ "learning_rate": 7.900284547855991e-05,
930
+ "loss": 0.1307,
931
+ "step": 3100
932
+ },
933
+ {
934
+ "epoch": 1.4375144910734987,
935
+ "eval_loss": 0.16243572043734797,
936
+ "eval_runtime": 59.5674,
937
+ "eval_samples_per_second": 698.133,
938
+ "eval_steps_per_second": 0.688,
939
+ "step": 3100
940
+ },
941
+ {
942
+ "epoch": 1.4607002086714584,
943
+ "grad_norm": 0.08223798871040344,
944
+ "learning_rate": 7.835299319313853e-05,
945
+ "loss": 0.1315,
946
+ "step": 3150
947
+ },
948
+ {
949
+ "epoch": 1.4607002086714584,
950
+ "eval_loss": 0.1641944734489707,
951
+ "eval_runtime": 59.5423,
952
+ "eval_samples_per_second": 698.428,
953
+ "eval_steps_per_second": 0.689,
954
+ "step": 3150
955
+ },
956
+ {
957
+ "epoch": 1.483885926269418,
958
+ "grad_norm": 0.09742949903011322,
959
+ "learning_rate": 7.769600319330552e-05,
960
+ "loss": 0.1303,
961
+ "step": 3200
962
+ },
963
+ {
964
+ "epoch": 1.483885926269418,
965
+ "eval_loss": 0.16355698856626613,
966
+ "eval_runtime": 60.1946,
967
+ "eval_samples_per_second": 690.859,
968
+ "eval_steps_per_second": 0.681,
969
+ "step": 3200
970
+ },
971
+ {
972
+ "epoch": 1.5070716438673777,
973
+ "grad_norm": 0.06401767581701279,
974
+ "learning_rate": 7.703204087277988e-05,
975
+ "loss": 0.1315,
976
+ "step": 3250
977
+ },
978
+ {
979
+ "epoch": 1.5070716438673777,
980
+ "eval_loss": 0.16215006705140952,
981
+ "eval_runtime": 59.7822,
982
+ "eval_samples_per_second": 695.625,
983
+ "eval_steps_per_second": 0.686,
984
+ "step": 3250
985
+ },
986
+ {
987
+ "epoch": 1.5302573614653374,
988
+ "grad_norm": 0.07916898280382156,
989
+ "learning_rate": 7.636127338052512e-05,
990
+ "loss": 0.1315,
991
+ "step": 3300
992
+ },
993
+ {
994
+ "epoch": 1.5302573614653374,
995
+ "eval_loss": 0.16288597734760557,
996
+ "eval_runtime": 59.2757,
997
+ "eval_samples_per_second": 701.57,
998
+ "eval_steps_per_second": 0.692,
999
+ "step": 3300
1000
+ },
1001
+ {
1002
+ "epoch": 1.553443079063297,
1003
+ "grad_norm": 0.06549016386270523,
1004
+ "learning_rate": 7.568386957867033e-05,
1005
+ "loss": 0.1303,
1006
+ "step": 3350
1007
+ },
1008
+ {
1009
+ "epoch": 1.553443079063297,
1010
+ "eval_loss": 0.16416664097655873,
1011
+ "eval_runtime": 59.84,
1012
+ "eval_samples_per_second": 694.953,
1013
+ "eval_steps_per_second": 0.685,
1014
+ "step": 3350
1015
+ },
1016
+ {
1017
+ "epoch": 1.5766287966612567,
1018
+ "grad_norm": 0.0709395632147789,
1019
+ "learning_rate": 7.500000000000001e-05,
1020
+ "loss": 0.1309,
1021
+ "step": 3400
1022
+ },
1023
+ {
1024
+ "epoch": 1.5766287966612567,
1025
+ "eval_loss": 0.16179194486424098,
1026
+ "eval_runtime": 59.8634,
1027
+ "eval_samples_per_second": 694.682,
1028
+ "eval_steps_per_second": 0.685,
1029
+ "step": 3400
1030
+ },
1031
+ {
1032
+ "epoch": 1.5998145142592164,
1033
+ "grad_norm": 0.05671363323926926,
1034
+ "learning_rate": 7.430983680502344e-05,
1035
+ "loss": 0.1307,
1036
+ "step": 3450
1037
+ },
1038
+ {
1039
+ "epoch": 1.5998145142592164,
1040
+ "eval_loss": 0.16309191886303373,
1041
+ "eval_runtime": 59.618,
1042
+ "eval_samples_per_second": 697.541,
1043
+ "eval_steps_per_second": 0.688,
1044
+ "step": 3450
1045
+ },
1046
+ {
1047
+ "epoch": 1.623000231857176,
1048
+ "grad_norm": 0.04889162629842758,
1049
+ "learning_rate": 7.361355373863414e-05,
1050
+ "loss": 0.1314,
1051
+ "step": 3500
1052
+ },
1053
+ {
1054
+ "epoch": 1.623000231857176,
1055
+ "eval_loss": 0.16290782983414598,
1056
+ "eval_runtime": 60.3904,
1057
+ "eval_samples_per_second": 688.619,
1058
+ "eval_steps_per_second": 0.679,
1059
+ "step": 3500
1060
+ },
1061
+ {
1062
+ "epoch": 1.6461859494551356,
1063
+ "grad_norm": 0.0970933735370636,
1064
+ "learning_rate": 7.291132608637052e-05,
1065
+ "loss": 0.1314,
1066
+ "step": 3550
1067
+ },
1068
+ {
1069
+ "epoch": 1.6461859494551356,
1070
+ "eval_loss": 0.16278222993823557,
1071
+ "eval_runtime": 59.8666,
1072
+ "eval_samples_per_second": 694.644,
1073
+ "eval_steps_per_second": 0.685,
1074
+ "step": 3550
1075
+ },
1076
+ {
1077
+ "epoch": 1.6693716670530954,
1078
+ "grad_norm": 0.056557025760412216,
1079
+ "learning_rate": 7.220333063028872e-05,
1080
+ "loss": 0.1312,
1081
+ "step": 3600
1082
+ },
1083
+ {
1084
+ "epoch": 1.6693716670530954,
1085
+ "eval_loss": 0.16313205291311117,
1086
+ "eval_runtime": 60.0092,
1087
+ "eval_samples_per_second": 692.993,
1088
+ "eval_steps_per_second": 0.683,
1089
+ "step": 3600
1090
+ },
1091
+ {
1092
+ "epoch": 1.6925573846510549,
1093
+ "grad_norm": 0.04870522394776344,
1094
+ "learning_rate": 7.148974560445859e-05,
1095
+ "loss": 0.1299,
1096
+ "step": 3650
1097
+ },
1098
+ {
1099
+ "epoch": 1.6925573846510549,
1100
+ "eval_loss": 0.1617941082289122,
1101
+ "eval_runtime": 60.1721,
1102
+ "eval_samples_per_second": 691.117,
1103
+ "eval_steps_per_second": 0.681,
1104
+ "step": 3650
1105
+ },
1106
+ {
1107
+ "epoch": 1.7157431022490146,
1108
+ "grad_norm": 0.0681833028793335,
1109
+ "learning_rate": 7.077075065009433e-05,
1110
+ "loss": 0.1304,
1111
+ "step": 3700
1112
+ },
1113
+ {
1114
+ "epoch": 1.7157431022490146,
1115
+ "eval_loss": 0.16243406602519425,
1116
+ "eval_runtime": 59.3626,
1117
+ "eval_samples_per_second": 700.542,
1118
+ "eval_steps_per_second": 0.691,
1119
+ "step": 3700
1120
+ },
1121
+ {
1122
+ "epoch": 1.7389288198469743,
1123
+ "grad_norm": 0.06506156921386719,
1124
+ "learning_rate": 7.004652677033068e-05,
1125
+ "loss": 0.1299,
1126
+ "step": 3750
1127
+ },
1128
+ {
1129
+ "epoch": 1.7389288198469743,
1130
+ "eval_loss": 0.16324780134312317,
1131
+ "eval_runtime": 59.6022,
1132
+ "eval_samples_per_second": 697.726,
1133
+ "eval_steps_per_second": 0.688,
1134
+ "step": 3750
1135
+ },
1136
+ {
1137
+ "epoch": 1.7621145374449338,
1138
+ "grad_norm": 0.06188170611858368,
1139
+ "learning_rate": 6.931725628465643e-05,
1140
+ "loss": 0.1309,
1141
+ "step": 3800
1142
+ },
1143
+ {
1144
+ "epoch": 1.7621145374449338,
1145
+ "eval_loss": 0.1623115342294882,
1146
+ "eval_runtime": 59.7694,
1147
+ "eval_samples_per_second": 695.774,
1148
+ "eval_steps_per_second": 0.686,
1149
+ "step": 3800
1150
+ },
1151
+ {
1152
+ "epoch": 1.7853002550428936,
1153
+ "grad_norm": 0.05675831064581871,
1154
+ "learning_rate": 6.858312278301637e-05,
1155
+ "loss": 0.1303,
1156
+ "step": 3850
1157
+ },
1158
+ {
1159
+ "epoch": 1.7853002550428936,
1160
+ "eval_loss": 0.1630547638293529,
1161
+ "eval_runtime": 59.779,
1162
+ "eval_samples_per_second": 695.662,
1163
+ "eval_steps_per_second": 0.686,
1164
+ "step": 3850
1165
+ },
1166
+ {
1167
+ "epoch": 1.8084859726408533,
1168
+ "grad_norm": 0.04727062210440636,
1169
+ "learning_rate": 6.784431107959359e-05,
1170
+ "loss": 0.1312,
1171
+ "step": 3900
1172
+ },
1173
+ {
1174
+ "epoch": 1.8084859726408533,
1175
+ "eval_loss": 0.1616409071893626,
1176
+ "eval_runtime": 59.6005,
1177
+ "eval_samples_per_second": 697.746,
1178
+ "eval_steps_per_second": 0.688,
1179
+ "step": 3900
1180
+ },
1181
+ {
1182
+ "epoch": 1.8316716902388128,
1183
+ "grad_norm": 0.06378892064094543,
1184
+ "learning_rate": 6.710100716628344e-05,
1185
+ "loss": 0.1303,
1186
+ "step": 3950
1187
+ },
1188
+ {
1189
+ "epoch": 1.8316716902388128,
1190
+ "eval_loss": 0.1622395658739077,
1191
+ "eval_runtime": 60.1499,
1192
+ "eval_samples_per_second": 691.373,
1193
+ "eval_steps_per_second": 0.682,
1194
+ "step": 3950
1195
+ },
1196
+ {
1197
+ "epoch": 1.8548574078367726,
1198
+ "grad_norm": 0.05470576509833336,
1199
+ "learning_rate": 6.635339816587109e-05,
1200
+ "loss": 0.1308,
1201
+ "step": 4000
1202
+ },
1203
+ {
1204
+ "epoch": 1.8548574078367726,
1205
+ "eval_loss": 0.16317236170181762,
1206
+ "eval_runtime": 60.014,
1207
+ "eval_samples_per_second": 692.939,
1208
+ "eval_steps_per_second": 0.683,
1209
+ "step": 4000
1210
+ },
1211
+ {
1212
+ "epoch": 1.8780431254347323,
1213
+ "grad_norm": 0.053886763751506805,
1214
+ "learning_rate": 6.560167228492436e-05,
1215
+ "loss": 0.1297,
1216
+ "step": 4050
1217
+ },
1218
+ {
1219
+ "epoch": 1.8780431254347323,
1220
+ "eval_loss": 0.16198886262197804,
1221
+ "eval_runtime": 60.8262,
1222
+ "eval_samples_per_second": 683.685,
1223
+ "eval_steps_per_second": 0.674,
1224
+ "step": 4050
1225
+ },
1226
+ {
1227
+ "epoch": 1.9012288430326918,
1228
+ "grad_norm": 0.054583676159381866,
1229
+ "learning_rate": 6.484601876641375e-05,
1230
+ "loss": 0.1301,
1231
+ "step": 4100
1232
+ },
1233
+ {
1234
+ "epoch": 1.9012288430326918,
1235
+ "eval_loss": 0.1616550050294764,
1236
+ "eval_runtime": 59.7779,
1237
+ "eval_samples_per_second": 695.675,
1238
+ "eval_steps_per_second": 0.686,
1239
+ "step": 4100
1240
+ },
1241
+ {
1242
+ "epoch": 1.9244145606306515,
1243
+ "grad_norm": 0.071171335875988,
1244
+ "learning_rate": 6.408662784207149e-05,
1245
+ "loss": 0.131,
1246
+ "step": 4150
1247
+ },
1248
+ {
1249
+ "epoch": 1.9244145606306515,
1250
+ "eval_loss": 0.15968682813566223,
1251
+ "eval_runtime": 60.227,
1252
+ "eval_samples_per_second": 690.487,
1253
+ "eval_steps_per_second": 0.681,
1254
+ "step": 4150
1255
+ },
1256
+ {
1257
+ "epoch": 1.9476002782286113,
1258
+ "grad_norm": 0.05775531381368637,
1259
+ "learning_rate": 6.332369068450174e-05,
1260
+ "loss": 0.1296,
1261
+ "step": 4200
1262
+ },
1263
+ {
1264
+ "epoch": 1.9476002782286113,
1265
+ "eval_loss": 0.16262199212265846,
1266
+ "eval_runtime": 60.3405,
1267
+ "eval_samples_per_second": 689.189,
1268
+ "eval_steps_per_second": 0.679,
1269
+ "step": 4200
1270
+ },
1271
+ {
1272
+ "epoch": 1.9707859958265708,
1273
+ "grad_norm": 0.06425776332616806,
1274
+ "learning_rate": 6.255739935905396e-05,
1275
+ "loss": 0.1299,
1276
+ "step": 4250
1277
+ },
1278
+ {
1279
+ "epoch": 1.9707859958265708,
1280
+ "eval_loss": 0.16324524366491053,
1281
+ "eval_runtime": 61.417,
1282
+ "eval_samples_per_second": 677.109,
1283
+ "eval_steps_per_second": 0.668,
1284
+ "step": 4250
1285
+ },
1286
+ {
1287
+ "epoch": 1.9939717134245305,
1288
+ "grad_norm": 0.045762140303850174,
1289
+ "learning_rate": 6.178794677547137e-05,
1290
+ "loss": 0.1299,
1291
+ "step": 4300
1292
+ },
1293
+ {
1294
+ "epoch": 1.9939717134245305,
1295
+ "eval_loss": 0.16053301797614244,
1296
+ "eval_runtime": 61.0801,
1297
+ "eval_samples_per_second": 680.844,
1298
+ "eval_steps_per_second": 0.671,
1299
+ "step": 4300
1300
+ },
1301
+ {
1302
+ "epoch": 2.0171574310224902,
1303
+ "grad_norm": 0.07060451060533524,
1304
+ "learning_rate": 6.1015526639327035e-05,
1305
+ "loss": 0.1296,
1306
+ "step": 4350
1307
+ },
1308
+ {
1309
+ "epoch": 2.0171574310224902,
1310
+ "eval_loss": 0.1620254674138633,
1311
+ "eval_runtime": 61.0829,
1312
+ "eval_samples_per_second": 680.812,
1313
+ "eval_steps_per_second": 0.671,
1314
+ "step": 4350
1315
+ },
1316
+ {
1317
+ "epoch": 2.0403431486204497,
1318
+ "grad_norm": 0.059919316321611404,
1319
+ "learning_rate": 6.024033340325954e-05,
1320
+ "loss": 0.1302,
1321
+ "step": 4400
1322
+ },
1323
+ {
1324
+ "epoch": 2.0403431486204497,
1325
+ "eval_loss": 0.16284223807997533,
1326
+ "eval_runtime": 61.5789,
1327
+ "eval_samples_per_second": 675.328,
1328
+ "eval_steps_per_second": 0.666,
1329
+ "step": 4400
1330
+ },
1331
+ {
1332
+ "epoch": 2.0635288662184093,
1333
+ "grad_norm": 0.07983385026454926,
1334
+ "learning_rate": 5.946256221802051e-05,
1335
+ "loss": 0.13,
1336
+ "step": 4450
1337
+ },
1338
+ {
1339
+ "epoch": 2.0635288662184093,
1340
+ "eval_loss": 0.16209282393788932,
1341
+ "eval_runtime": 61.6584,
1342
+ "eval_samples_per_second": 674.458,
1343
+ "eval_steps_per_second": 0.665,
1344
+ "step": 4450
1345
+ },
1346
+ {
1347
+ "epoch": 2.086714583816369,
1348
+ "grad_norm": 0.07582173496484756,
1349
+ "learning_rate": 5.868240888334653e-05,
1350
+ "loss": 0.1296,
1351
+ "step": 4500
1352
+ },
1353
+ {
1354
+ "epoch": 2.086714583816369,
1355
+ "eval_loss": 0.16158196377974565,
1356
+ "eval_runtime": 61.1826,
1357
+ "eval_samples_per_second": 679.703,
1358
+ "eval_steps_per_second": 0.67,
1359
+ "step": 4500
1360
+ },
1361
+ {
1362
+ "epoch": 2.1099003014143287,
1363
+ "grad_norm": 0.06049995869398117,
1364
+ "learning_rate": 5.79000697986675e-05,
1365
+ "loss": 0.1298,
1366
+ "step": 4550
1367
+ },
1368
+ {
1369
+ "epoch": 2.1099003014143287,
1370
+ "eval_loss": 0.16130609279963956,
1371
+ "eval_runtime": 61.0153,
1372
+ "eval_samples_per_second": 681.567,
1373
+ "eval_steps_per_second": 0.672,
1374
+ "step": 4550
1375
+ },
1376
+ {
1377
+ "epoch": 2.1330860190122882,
1378
+ "grad_norm": 0.0440148264169693,
1379
+ "learning_rate": 5.7115741913664264e-05,
1380
+ "loss": 0.1299,
1381
+ "step": 4600
1382
+ },
1383
+ {
1384
+ "epoch": 2.1330860190122882,
1385
+ "eval_loss": 0.16027799763638953,
1386
+ "eval_runtime": 61.1993,
1387
+ "eval_samples_per_second": 679.517,
1388
+ "eval_steps_per_second": 0.67,
1389
+ "step": 4600
1390
+ },
1391
+ {
1392
+ "epoch": 2.156271736610248,
1393
+ "grad_norm": 0.05254065990447998,
1394
+ "learning_rate": 5.6329622678687463e-05,
1395
+ "loss": 0.1299,
1396
+ "step": 4650
1397
+ },
1398
+ {
1399
+ "epoch": 2.156271736610248,
1400
+ "eval_loss": 0.16206274484291652,
1401
+ "eval_runtime": 61.4415,
1402
+ "eval_samples_per_second": 676.839,
1403
+ "eval_steps_per_second": 0.667,
1404
+ "step": 4650
1405
+ },
1406
+ {
1407
+ "epoch": 2.1794574542082077,
1408
+ "grad_norm": 0.06294432282447815,
1409
+ "learning_rate": 5.5541909995050554e-05,
1410
+ "loss": 0.1306,
1411
+ "step": 4700
1412
+ },
1413
+ {
1414
+ "epoch": 2.1794574542082077,
1415
+ "eval_loss": 0.16140170723024802,
1416
+ "eval_runtime": 60.8861,
1417
+ "eval_samples_per_second": 683.013,
1418
+ "eval_steps_per_second": 0.673,
1419
+ "step": 4700
1420
+ },
1421
+ {
1422
+ "epoch": 2.202643171806167,
1423
+ "grad_norm": 0.06710942089557648,
1424
+ "learning_rate": 5.475280216520913e-05,
1425
+ "loss": 0.1303,
1426
+ "step": 4750
1427
+ },
1428
+ {
1429
+ "epoch": 2.202643171806167,
1430
+ "eval_loss": 0.16245448075670843,
1431
+ "eval_runtime": 61.2839,
1432
+ "eval_samples_per_second": 678.58,
1433
+ "eval_steps_per_second": 0.669,
1434
+ "step": 4750
1435
+ },
1436
+ {
1437
+ "epoch": 2.225828889404127,
1438
+ "grad_norm": 0.05298132076859474,
1439
+ "learning_rate": 5.396249784283942e-05,
1440
+ "loss": 0.13,
1441
+ "step": 4800
1442
+ },
1443
+ {
1444
+ "epoch": 2.225828889404127,
1445
+ "eval_loss": 0.1623738898660767,
1446
+ "eval_runtime": 61.1531,
1447
+ "eval_samples_per_second": 680.031,
1448
+ "eval_steps_per_second": 0.67,
1449
+ "step": 4800
1450
+ },
1451
+ {
1452
+ "epoch": 2.2490146070020867,
1453
+ "grad_norm": 0.04066763445734978,
1454
+ "learning_rate": 5.317119598282823e-05,
1455
+ "loss": 0.1295,
1456
+ "step": 4850
1457
+ },
1458
+ {
1459
+ "epoch": 2.2490146070020867,
1460
+ "eval_loss": 0.1627438727811327,
1461
+ "eval_runtime": 61.0414,
1462
+ "eval_samples_per_second": 681.275,
1463
+ "eval_steps_per_second": 0.672,
1464
+ "step": 4850
1465
+ },
1466
+ {
1467
+ "epoch": 2.272200324600046,
1468
+ "grad_norm": 0.061821240931749344,
1469
+ "learning_rate": 5.2379095791187124e-05,
1470
+ "loss": 0.1299,
1471
+ "step": 4900
1472
+ },
1473
+ {
1474
+ "epoch": 2.272200324600046,
1475
+ "eval_loss": 0.16086717177928397,
1476
+ "eval_runtime": 60.7945,
1477
+ "eval_samples_per_second": 684.042,
1478
+ "eval_steps_per_second": 0.674,
1479
+ "step": 4900
1480
+ },
1481
+ {
1482
+ "epoch": 2.295386042198006,
1483
+ "grad_norm": 0.08038394153118134,
1484
+ "learning_rate": 5.158639667490339e-05,
1485
+ "loss": 0.13,
1486
+ "step": 4950
1487
+ },
1488
+ {
1489
+ "epoch": 2.295386042198006,
1490
+ "eval_loss": 0.16221664317086187,
1491
+ "eval_runtime": 61.6742,
1492
+ "eval_samples_per_second": 674.285,
1493
+ "eval_steps_per_second": 0.665,
1494
+ "step": 4950
1495
+ },
1496
+ {
1497
+ "epoch": 2.3185717597959656,
1498
+ "grad_norm": 0.0556926503777504,
1499
+ "learning_rate": 5.0793298191740404e-05,
1500
+ "loss": 0.1311,
1501
+ "step": 5000
1502
+ },
1503
+ {
1504
+ "epoch": 2.3185717597959656,
1505
+ "eval_loss": 0.16015339844546791,
1506
+ "eval_runtime": 61.3657,
1507
+ "eval_samples_per_second": 677.675,
1508
+ "eval_steps_per_second": 0.668,
1509
+ "step": 5000
1510
+ },
1511
+ {
1512
+ "epoch": 2.3417574773939256,
1513
+ "grad_norm": 0.06645477563142776,
1514
+ "learning_rate": 5e-05,
1515
+ "loss": 0.1284,
1516
+ "step": 5050
1517
+ },
1518
+ {
1519
+ "epoch": 2.3417574773939256,
1520
+ "eval_loss": 0.16160674186313023,
1521
+ "eval_runtime": 61.4737,
1522
+ "eval_samples_per_second": 676.484,
1523
+ "eval_steps_per_second": 0.667,
1524
+ "step": 5050
1525
+ },
1526
+ {
1527
+ "epoch": 2.364943194991885,
1528
+ "grad_norm": 0.05365500971674919,
1529
+ "learning_rate": 4.92067018082596e-05,
1530
+ "loss": 0.13,
1531
+ "step": 5100
1532
+ },
1533
+ {
1534
+ "epoch": 2.364943194991885,
1535
+ "eval_loss": 0.16016484459556096,
1536
+ "eval_runtime": 61.4058,
1537
+ "eval_samples_per_second": 677.232,
1538
+ "eval_steps_per_second": 0.668,
1539
+ "step": 5100
1540
+ },
1541
+ {
1542
+ "epoch": 2.3881289125898446,
1543
+ "grad_norm": 0.0499204620718956,
1544
+ "learning_rate": 4.841360332509663e-05,
1545
+ "loss": 0.129,
1546
+ "step": 5150
1547
+ },
1548
+ {
1549
+ "epoch": 2.3881289125898446,
1550
+ "eval_loss": 0.16054727464378063,
1551
+ "eval_runtime": 61.1539,
1552
+ "eval_samples_per_second": 680.023,
1553
+ "eval_steps_per_second": 0.67,
1554
+ "step": 5150
1555
+ },
1556
+ {
1557
+ "epoch": 2.411314630187804,
1558
+ "grad_norm": 0.07284457236528397,
1559
+ "learning_rate": 4.762090420881289e-05,
1560
+ "loss": 0.129,
1561
+ "step": 5200
1562
+ },
1563
+ {
1564
+ "epoch": 2.411314630187804,
1565
+ "eval_loss": 0.16057287778830004,
1566
+ "eval_runtime": 60.5785,
1567
+ "eval_samples_per_second": 686.481,
1568
+ "eval_steps_per_second": 0.677,
1569
+ "step": 5200
1570
+ },
1571
+ {
1572
+ "epoch": 2.434500347785764,
1573
+ "grad_norm": 0.06511891633272171,
1574
+ "learning_rate": 4.6828804017171776e-05,
1575
+ "loss": 0.1297,
1576
+ "step": 5250
1577
+ },
1578
+ {
1579
+ "epoch": 2.434500347785764,
1580
+ "eval_loss": 0.16202011190836896,
1581
+ "eval_runtime": 61.4053,
1582
+ "eval_samples_per_second": 677.238,
1583
+ "eval_steps_per_second": 0.668,
1584
+ "step": 5250
1585
+ },
1586
+ {
1587
+ "epoch": 2.4576860653837236,
1588
+ "grad_norm": 0.05936937406659126,
1589
+ "learning_rate": 4.603750215716057e-05,
1590
+ "loss": 0.1293,
1591
+ "step": 5300
1592
+ },
1593
+ {
1594
+ "epoch": 2.4576860653837236,
1595
+ "eval_loss": 0.16067086041480225,
1596
+ "eval_runtime": 60.4469,
1597
+ "eval_samples_per_second": 687.976,
1598
+ "eval_steps_per_second": 0.678,
1599
+ "step": 5300
1600
+ },
1601
+ {
1602
+ "epoch": 2.480871782981683,
1603
+ "grad_norm": 0.039836496114730835,
1604
+ "learning_rate": 4.5247197834790876e-05,
1605
+ "loss": 0.1288,
1606
+ "step": 5350
1607
+ },
1608
+ {
1609
+ "epoch": 2.480871782981683,
1610
+ "eval_loss": 0.1614640227625451,
1611
+ "eval_runtime": 60.9513,
1612
+ "eval_samples_per_second": 682.283,
1613
+ "eval_steps_per_second": 0.673,
1614
+ "step": 5350
1615
+ },
1616
+ {
1617
+ "epoch": 2.504057500579643,
1618
+ "grad_norm": 0.04305760934948921,
1619
+ "learning_rate": 4.445809000494946e-05,
1620
+ "loss": 0.1294,
1621
+ "step": 5400
1622
+ },
1623
+ {
1624
+ "epoch": 2.504057500579643,
1625
+ "eval_loss": 0.16139181990447046,
1626
+ "eval_runtime": 60.6766,
1627
+ "eval_samples_per_second": 685.371,
1628
+ "eval_steps_per_second": 0.676,
1629
+ "step": 5400
1630
+ },
1631
+ {
1632
+ "epoch": 2.5272432181776026,
1633
+ "grad_norm": 0.06780368089675903,
1634
+ "learning_rate": 4.3670377321312535e-05,
1635
+ "loss": 0.1285,
1636
+ "step": 5450
1637
+ },
1638
+ {
1639
+ "epoch": 2.5272432181776026,
1640
+ "eval_loss": 0.1619736397134425,
1641
+ "eval_runtime": 60.7281,
1642
+ "eval_samples_per_second": 684.79,
1643
+ "eval_steps_per_second": 0.675,
1644
+ "step": 5450
1645
+ },
1646
+ {
1647
+ "epoch": 2.550428935775562,
1648
+ "grad_norm": 0.052273835986852646,
1649
+ "learning_rate": 4.288425808633575e-05,
1650
+ "loss": 0.1303,
1651
+ "step": 5500
1652
+ },
1653
+ {
1654
+ "epoch": 2.550428935775562,
1655
+ "eval_loss": 0.16178818674979198,
1656
+ "eval_runtime": 60.8875,
1657
+ "eval_samples_per_second": 682.997,
1658
+ "eval_steps_per_second": 0.673,
1659
+ "step": 5500
1660
+ },
1661
+ {
1662
+ "epoch": 2.573614653373522,
1663
+ "grad_norm": 0.045574627816677094,
1664
+ "learning_rate": 4.20999302013325e-05,
1665
+ "loss": 0.1291,
1666
+ "step": 5550
1667
+ },
1668
+ {
1669
+ "epoch": 2.573614653373522,
1670
+ "eval_loss": 0.16034006952877458,
1671
+ "eval_runtime": 60.7378,
1672
+ "eval_samples_per_second": 684.681,
1673
+ "eval_steps_per_second": 0.675,
1674
+ "step": 5550
1675
+ },
1676
+ {
1677
+ "epoch": 2.5968003709714815,
1678
+ "grad_norm": 0.044092051684856415,
1679
+ "learning_rate": 4.131759111665349e-05,
1680
+ "loss": 0.1298,
1681
+ "step": 5600
1682
+ },
1683
+ {
1684
+ "epoch": 2.5968003709714815,
1685
+ "eval_loss": 0.16090484909780667,
1686
+ "eval_runtime": 60.4675,
1687
+ "eval_samples_per_second": 687.741,
1688
+ "eval_steps_per_second": 0.678,
1689
+ "step": 5600
1690
+ },
1691
+ {
1692
+ "epoch": 2.6199860885694415,
1693
+ "grad_norm": 0.05473971739411354,
1694
+ "learning_rate": 4.0537437781979506e-05,
1695
+ "loss": 0.1288,
1696
+ "step": 5650
1697
+ },
1698
+ {
1699
+ "epoch": 2.6199860885694415,
1700
+ "eval_loss": 0.1604315377337276,
1701
+ "eval_runtime": 62.8239,
1702
+ "eval_samples_per_second": 661.946,
1703
+ "eval_steps_per_second": 0.653,
1704
+ "step": 5650
1705
+ },
1706
+ {
1707
+ "epoch": 2.643171806167401,
1708
+ "grad_norm": 0.07100555300712585,
1709
+ "learning_rate": 3.9759666596740476e-05,
1710
+ "loss": 0.129,
1711
+ "step": 5700
1712
+ },
1713
+ {
1714
+ "epoch": 2.643171806167401,
1715
+ "eval_loss": 0.15997494100305837,
1716
+ "eval_runtime": 61.3008,
1717
+ "eval_samples_per_second": 678.392,
1718
+ "eval_steps_per_second": 0.669,
1719
+ "step": 5700
1720
+ },
1721
+ {
1722
+ "epoch": 2.6663575237653605,
1723
+ "grad_norm": 0.04020215570926666,
1724
+ "learning_rate": 3.898447336067297e-05,
1725
+ "loss": 0.1291,
1726
+ "step": 5750
1727
+ },
1728
+ {
1729
+ "epoch": 2.6663575237653605,
1730
+ "eval_loss": 0.1596748490832133,
1731
+ "eval_runtime": 60.6148,
1732
+ "eval_samples_per_second": 686.07,
1733
+ "eval_steps_per_second": 0.676,
1734
+ "step": 5750
1735
+ },
1736
+ {
1737
+ "epoch": 2.68954324136332,
1738
+ "grad_norm": 0.05526584014296532,
1739
+ "learning_rate": 3.821205322452863e-05,
1740
+ "loss": 0.1291,
1741
+ "step": 5800
1742
+ },
1743
+ {
1744
+ "epoch": 2.68954324136332,
1745
+ "eval_loss": 0.16091962633426782,
1746
+ "eval_runtime": 60.1717,
1747
+ "eval_samples_per_second": 691.122,
1748
+ "eval_steps_per_second": 0.681,
1749
+ "step": 5800
1750
+ },
1751
+ {
1752
+ "epoch": 2.71272895896128,
1753
+ "grad_norm": 0.052167922258377075,
1754
+ "learning_rate": 3.744260064094604e-05,
1755
+ "loss": 0.129,
1756
+ "step": 5850
1757
+ },
1758
+ {
1759
+ "epoch": 2.71272895896128,
1760
+ "eval_loss": 0.16112806362615253,
1761
+ "eval_runtime": 60.1273,
1762
+ "eval_samples_per_second": 691.633,
1763
+ "eval_steps_per_second": 0.682,
1764
+ "step": 5850
1765
+ },
1766
+ {
1767
+ "epoch": 2.7359146765592395,
1768
+ "grad_norm": 0.054320793598890305,
1769
+ "learning_rate": 3.6676309315498256e-05,
1770
+ "loss": 0.13,
1771
+ "step": 5900
1772
+ },
1773
+ {
1774
+ "epoch": 2.7359146765592395,
1775
+ "eval_loss": 0.15996250695505343,
1776
+ "eval_runtime": 60.655,
1777
+ "eval_samples_per_second": 685.616,
1778
+ "eval_steps_per_second": 0.676,
1779
+ "step": 5900
1780
+ },
1781
+ {
1782
+ "epoch": 2.7591003941571994,
1783
+ "grad_norm": 0.05470626428723335,
1784
+ "learning_rate": 3.591337215792852e-05,
1785
+ "loss": 0.1296,
1786
+ "step": 5950
1787
+ },
1788
+ {
1789
+ "epoch": 2.7591003941571994,
1790
+ "eval_loss": 0.16025288890609335,
1791
+ "eval_runtime": 60.826,
1792
+ "eval_samples_per_second": 683.688,
1793
+ "eval_steps_per_second": 0.674,
1794
+ "step": 5950
1795
+ },
1796
+ {
1797
+ "epoch": 2.782286111755159,
1798
+ "grad_norm": 0.04805810749530792,
1799
+ "learning_rate": 3.515398123358627e-05,
1800
+ "loss": 0.1294,
1801
+ "step": 6000
1802
+ },
1803
+ {
1804
+ "epoch": 2.782286111755159,
1805
+ "eval_loss": 0.15918263724182835,
1806
+ "eval_runtime": 60.2321,
1807
+ "eval_samples_per_second": 690.429,
1808
+ "eval_steps_per_second": 0.681,
1809
+ "step": 6000
1810
+ },
1811
+ {
1812
+ "epoch": 2.8054718293531185,
1813
+ "grad_norm": 0.04185302183032036,
1814
+ "learning_rate": 3.439832771507565e-05,
1815
+ "loss": 0.1283,
1816
+ "step": 6050
1817
+ },
1818
+ {
1819
+ "epoch": 2.8054718293531185,
1820
+ "eval_loss": 0.16179385240233157,
1821
+ "eval_runtime": 60.9176,
1822
+ "eval_samples_per_second": 682.66,
1823
+ "eval_steps_per_second": 0.673,
1824
+ "step": 6050
1825
+ },
1826
+ {
1827
+ "epoch": 2.828657546951078,
1828
+ "grad_norm": 0.04609336704015732,
1829
+ "learning_rate": 3.364660183412892e-05,
1830
+ "loss": 0.1292,
1831
+ "step": 6100
1832
+ },
1833
+ {
1834
+ "epoch": 2.828657546951078,
1835
+ "eval_loss": 0.1611929898635588,
1836
+ "eval_runtime": 60.5916,
1837
+ "eval_samples_per_second": 686.333,
1838
+ "eval_steps_per_second": 0.677,
1839
+ "step": 6100
1840
+ },
1841
+ {
1842
+ "epoch": 2.851843264549038,
1843
+ "grad_norm": 0.05404876172542572,
1844
+ "learning_rate": 3.289899283371657e-05,
1845
+ "loss": 0.128,
1846
+ "step": 6150
1847
+ },
1848
+ {
1849
+ "epoch": 2.851843264549038,
1850
+ "eval_loss": 0.16039360794951976,
1851
+ "eval_runtime": 60.5961,
1852
+ "eval_samples_per_second": 686.282,
1853
+ "eval_steps_per_second": 0.677,
1854
+ "step": 6150
1855
+ },
1856
+ {
1857
+ "epoch": 2.8750289821469974,
1858
+ "grad_norm": 0.06787659227848053,
1859
+ "learning_rate": 3.215568892040641e-05,
1860
+ "loss": 0.1288,
1861
+ "step": 6200
1862
+ },
1863
+ {
1864
+ "epoch": 2.8750289821469974,
1865
+ "eval_loss": 0.16113480515361805,
1866
+ "eval_runtime": 60.2775,
1867
+ "eval_samples_per_second": 689.909,
1868
+ "eval_steps_per_second": 0.68,
1869
+ "step": 6200
1870
+ },
1871
+ {
1872
+ "epoch": 2.8982146997449574,
1873
+ "grad_norm": 0.06937435269355774,
1874
+ "learning_rate": 3.141687721698363e-05,
1875
+ "loss": 0.1283,
1876
+ "step": 6250
1877
+ },
1878
+ {
1879
+ "epoch": 2.8982146997449574,
1880
+ "eval_loss": 0.16087572214972407,
1881
+ "eval_runtime": 60.6789,
1882
+ "eval_samples_per_second": 685.345,
1883
+ "eval_steps_per_second": 0.676,
1884
+ "step": 6250
1885
+ },
1886
+ {
1887
+ "epoch": 2.921400417342917,
1888
+ "grad_norm": 0.08074232190847397,
1889
+ "learning_rate": 3.0682743715343564e-05,
1890
+ "loss": 0.1292,
1891
+ "step": 6300
1892
+ },
1893
+ {
1894
+ "epoch": 2.921400417342917,
1895
+ "eval_loss": 0.16049740787316144,
1896
+ "eval_runtime": 60.3194,
1897
+ "eval_samples_per_second": 689.43,
1898
+ "eval_steps_per_second": 0.68,
1899
+ "step": 6300
1900
+ },
1901
+ {
1902
+ "epoch": 2.9445861349408764,
1903
+ "grad_norm": 0.03976515680551529,
1904
+ "learning_rate": 2.9953473229669328e-05,
1905
+ "loss": 0.1302,
1906
+ "step": 6350
1907
+ },
1908
+ {
1909
+ "epoch": 2.9445861349408764,
1910
+ "eval_loss": 0.16023700059761273,
1911
+ "eval_runtime": 60.8537,
1912
+ "eval_samples_per_second": 683.377,
1913
+ "eval_steps_per_second": 0.674,
1914
+ "step": 6350
1915
+ },
1916
+ {
1917
+ "epoch": 2.967771852538836,
1918
+ "grad_norm": 0.05303976684808731,
1919
+ "learning_rate": 2.9229249349905684e-05,
1920
+ "loss": 0.1285,
1921
+ "step": 6400
1922
+ },
1923
+ {
1924
+ "epoch": 2.967771852538836,
1925
+ "eval_loss": 0.1601465398516622,
1926
+ "eval_runtime": 60.6472,
1927
+ "eval_samples_per_second": 685.703,
1928
+ "eval_steps_per_second": 0.676,
1929
+ "step": 6400
1930
+ },
1931
+ {
1932
+ "epoch": 2.990957570136796,
1933
+ "grad_norm": 0.0519745759665966,
1934
+ "learning_rate": 2.851025439554142e-05,
1935
+ "loss": 0.1286,
1936
+ "step": 6450
1937
+ },
1938
+ {
1939
+ "epoch": 2.990957570136796,
1940
+ "eval_loss": 0.16085429229133483,
1941
+ "eval_runtime": 60.2507,
1942
+ "eval_samples_per_second": 690.216,
1943
+ "eval_steps_per_second": 0.68,
1944
+ "step": 6450
1945
+ },
1946
+ {
1947
+ "epoch": 3.0141432877347554,
1948
+ "grad_norm": 0.050518251955509186,
1949
+ "learning_rate": 2.7796669369711294e-05,
1950
+ "loss": 0.1301,
1951
+ "step": 6500
1952
+ },
1953
+ {
1954
+ "epoch": 3.0141432877347554,
1955
+ "eval_loss": 0.16015394660421692,
1956
+ "eval_runtime": 60.5015,
1957
+ "eval_samples_per_second": 687.355,
1958
+ "eval_steps_per_second": 0.678,
1959
+ "step": 6500
1960
+ },
1961
+ {
1962
+ "epoch": 3.037329005332715,
1963
+ "grad_norm": 0.04253960773348808,
1964
+ "learning_rate": 2.708867391362948e-05,
1965
+ "loss": 0.1296,
1966
+ "step": 6550
1967
+ },
1968
+ {
1969
+ "epoch": 3.037329005332715,
1970
+ "eval_loss": 0.1597283595131218,
1971
+ "eval_runtime": 60.13,
1972
+ "eval_samples_per_second": 691.601,
1973
+ "eval_steps_per_second": 0.682,
1974
+ "step": 6550
1975
+ },
1976
+ {
1977
+ "epoch": 3.060514722930675,
1978
+ "grad_norm": 0.06899340450763702,
1979
+ "learning_rate": 2.638644626136587e-05,
1980
+ "loss": 0.1291,
1981
+ "step": 6600
1982
+ },
1983
+ {
1984
+ "epoch": 3.060514722930675,
1985
+ "eval_loss": 0.1604277250117246,
1986
+ "eval_runtime": 60.4618,
1987
+ "eval_samples_per_second": 687.806,
1988
+ "eval_steps_per_second": 0.678,
1989
+ "step": 6600
1990
+ },
1991
+ {
1992
+ "epoch": 3.0837004405286343,
1993
+ "grad_norm": 0.06556117534637451,
1994
+ "learning_rate": 2.5690163194976575e-05,
1995
+ "loss": 0.1288,
1996
+ "step": 6650
1997
+ },
1998
+ {
1999
+ "epoch": 3.0837004405286343,
2000
+ "eval_loss": 0.15953636330193482,
2001
+ "eval_runtime": 60.2757,
2002
+ "eval_samples_per_second": 689.93,
2003
+ "eval_steps_per_second": 0.68,
2004
+ "step": 6650
2005
+ },
2006
+ {
2007
+ "epoch": 3.106886158126594,
2008
+ "grad_norm": 0.03685734421014786,
2009
+ "learning_rate": 2.500000000000001e-05,
2010
+ "loss": 0.129,
2011
+ "step": 6700
2012
+ },
2013
+ {
2014
+ "epoch": 3.106886158126594,
2015
+ "eval_loss": 0.159308270335797,
2016
+ "eval_runtime": 60.624,
2017
+ "eval_samples_per_second": 685.966,
2018
+ "eval_steps_per_second": 0.676,
2019
+ "step": 6700
2020
+ },
2021
+ {
2022
+ "epoch": 3.130071875724554,
2023
+ "grad_norm": 0.0451020672917366,
2024
+ "learning_rate": 2.4316130421329697e-05,
2025
+ "loss": 0.1286,
2026
+ "step": 6750
2027
+ },
2028
+ {
2029
+ "epoch": 3.130071875724554,
2030
+ "eval_loss": 0.15995884031774596,
2031
+ "eval_runtime": 60.3654,
2032
+ "eval_samples_per_second": 688.905,
2033
+ "eval_steps_per_second": 0.679,
2034
+ "step": 6750
2035
+ },
2036
+ {
2037
+ "epoch": 3.1532575933225133,
2038
+ "grad_norm": 0.0495733842253685,
2039
+ "learning_rate": 2.363872661947488e-05,
2040
+ "loss": 0.1293,
2041
+ "step": 6800
2042
+ },
2043
+ {
2044
+ "epoch": 3.1532575933225133,
2045
+ "eval_loss": 0.15987331824692497,
2046
+ "eval_runtime": 60.4636,
2047
+ "eval_samples_per_second": 687.786,
2048
+ "eval_steps_per_second": 0.678,
2049
+ "step": 6800
2050
+ },
2051
+ {
2052
+ "epoch": 3.176443310920473,
2053
+ "grad_norm": 0.05756652355194092,
2054
+ "learning_rate": 2.296795912722014e-05,
2055
+ "loss": 0.1289,
2056
+ "step": 6850
2057
+ },
2058
+ {
2059
+ "epoch": 3.176443310920473,
2060
+ "eval_loss": 0.15986134614331013,
2061
+ "eval_runtime": 61.0063,
2062
+ "eval_samples_per_second": 681.667,
2063
+ "eval_steps_per_second": 0.672,
2064
+ "step": 6850
2065
+ },
2066
+ {
2067
+ "epoch": 3.199629028518433,
2068
+ "grad_norm": 0.0467820018529892,
2069
+ "learning_rate": 2.2303996806694488e-05,
2070
+ "loss": 0.1295,
2071
+ "step": 6900
2072
+ },
2073
+ {
2074
+ "epoch": 3.199629028518433,
2075
+ "eval_loss": 0.16011030076900337,
2076
+ "eval_runtime": 60.1041,
2077
+ "eval_samples_per_second": 691.9,
2078
+ "eval_steps_per_second": 0.682,
2079
+ "step": 6900
2080
+ },
2081
+ {
2082
+ "epoch": 3.2228147461163923,
2083
+ "grad_norm": 0.04179982468485832,
2084
+ "learning_rate": 2.164700680686147e-05,
2085
+ "loss": 0.1287,
2086
+ "step": 6950
2087
+ },
2088
+ {
2089
+ "epoch": 3.2228147461163923,
2090
+ "eval_loss": 0.15917751068552838,
2091
+ "eval_runtime": 60.5321,
2092
+ "eval_samples_per_second": 687.007,
2093
+ "eval_steps_per_second": 0.677,
2094
+ "step": 6950
2095
+ },
2096
+ {
2097
+ "epoch": 3.246000463714352,
2098
+ "grad_norm": 0.053910572081804276,
2099
+ "learning_rate": 2.09971545214401e-05,
2100
+ "loss": 0.1286,
2101
+ "step": 7000
2102
+ },
2103
+ {
2104
+ "epoch": 3.246000463714352,
2105
+ "eval_loss": 0.15998092838627764,
2106
+ "eval_runtime": 60.4067,
2107
+ "eval_samples_per_second": 688.434,
2108
+ "eval_steps_per_second": 0.679,
2109
+ "step": 7000
2110
+ },
2111
+ {
2112
+ "epoch": 3.2691861813123118,
2113
+ "grad_norm": 0.04404950886964798,
2114
+ "learning_rate": 2.0354603547267985e-05,
2115
+ "loss": 0.1283,
2116
+ "step": 7050
2117
+ },
2118
+ {
2119
+ "epoch": 3.2691861813123118,
2120
+ "eval_loss": 0.1597617331551387,
2121
+ "eval_runtime": 60.4218,
2122
+ "eval_samples_per_second": 688.262,
2123
+ "eval_steps_per_second": 0.679,
2124
+ "step": 7050
2125
+ },
2126
+ {
2127
+ "epoch": 3.2923718989102713,
2128
+ "grad_norm": 0.04763752967119217,
2129
+ "learning_rate": 1.9719515643116674e-05,
2130
+ "loss": 0.1288,
2131
+ "step": 7100
2132
+ },
2133
+ {
2134
+ "epoch": 3.2923718989102713,
2135
+ "eval_loss": 0.16116006530852447,
2136
+ "eval_runtime": 60.2132,
2137
+ "eval_samples_per_second": 690.646,
2138
+ "eval_steps_per_second": 0.681,
2139
+ "step": 7100
2140
+ },
2141
+ {
2142
+ "epoch": 3.3155576165082308,
2143
+ "grad_norm": 0.049567196518182755,
2144
+ "learning_rate": 1.9092050688969738e-05,
2145
+ "loss": 0.1298,
2146
+ "step": 7150
2147
+ },
2148
+ {
2149
+ "epoch": 3.3155576165082308,
2150
+ "eval_loss": 0.15965543804361845,
2151
+ "eval_runtime": 60.3928,
2152
+ "eval_samples_per_second": 688.592,
2153
+ "eval_steps_per_second": 0.679,
2154
+ "step": 7150
2155
+ },
2156
+ {
2157
+ "epoch": 3.3387433341061907,
2158
+ "grad_norm": 0.05488676205277443,
2159
+ "learning_rate": 1.847236664577389e-05,
2160
+ "loss": 0.1284,
2161
+ "step": 7200
2162
+ },
2163
+ {
2164
+ "epoch": 3.3387433341061907,
2165
+ "eval_loss": 0.16050384662882064,
2166
+ "eval_runtime": 60.121,
2167
+ "eval_samples_per_second": 691.705,
2168
+ "eval_steps_per_second": 0.682,
2169
+ "step": 7200
2170
+ },
2171
+ {
2172
+ "epoch": 3.3619290517041502,
2173
+ "grad_norm": 0.04124298691749573,
2174
+ "learning_rate": 1.7860619515673033e-05,
2175
+ "loss": 0.1289,
2176
+ "step": 7250
2177
+ },
2178
+ {
2179
+ "epoch": 3.3619290517041502,
2180
+ "eval_loss": 0.16054145931691394,
2181
+ "eval_runtime": 60.2046,
2182
+ "eval_samples_per_second": 690.745,
2183
+ "eval_steps_per_second": 0.681,
2184
+ "step": 7250
2185
+ },
2186
+ {
2187
+ "epoch": 3.3851147693021097,
2188
+ "grad_norm": 0.04400424286723137,
2189
+ "learning_rate": 1.725696330273575e-05,
2190
+ "loss": 0.1289,
2191
+ "step": 7300
2192
+ },
2193
+ {
2194
+ "epoch": 3.3851147693021097,
2195
+ "eval_loss": 0.15999099129576416,
2196
+ "eval_runtime": 60.4869,
2197
+ "eval_samples_per_second": 687.52,
2198
+ "eval_steps_per_second": 0.678,
2199
+ "step": 7300
2200
+ },
2201
+ {
2202
+ "epoch": 3.4083004869000697,
2203
+ "grad_norm": 0.05488509312272072,
2204
+ "learning_rate": 1.6661549974185424e-05,
2205
+ "loss": 0.1285,
2206
+ "step": 7350
2207
+ },
2208
+ {
2209
+ "epoch": 3.4083004869000697,
2210
+ "eval_loss": 0.16051823730892306,
2211
+ "eval_runtime": 60.1981,
2212
+ "eval_samples_per_second": 690.819,
2213
+ "eval_steps_per_second": 0.681,
2214
+ "step": 7350
2215
+ },
2216
+ {
2217
+ "epoch": 3.431486204498029,
2218
+ "grad_norm": 0.06722457706928253,
2219
+ "learning_rate": 1.60745294221434e-05,
2220
+ "loss": 0.1286,
2221
+ "step": 7400
2222
+ },
2223
+ {
2224
+ "epoch": 3.431486204498029,
2225
+ "eval_loss": 0.1610307768591294,
2226
+ "eval_runtime": 60.7755,
2227
+ "eval_samples_per_second": 684.256,
2228
+ "eval_steps_per_second": 0.675,
2229
+ "step": 7400
2230
+ },
2231
+ {
2232
+ "epoch": 3.4546719220959887,
2233
+ "grad_norm": 0.04814394935965538,
2234
+ "learning_rate": 1.549604942589441e-05,
2235
+ "loss": 0.1278,
2236
+ "step": 7450
2237
+ },
2238
+ {
2239
+ "epoch": 3.4546719220959887,
2240
+ "eval_loss": 0.1598065741965525,
2241
+ "eval_runtime": 59.9968,
2242
+ "eval_samples_per_second": 693.136,
2243
+ "eval_steps_per_second": 0.683,
2244
+ "step": 7450
2245
+ },
2246
+ {
2247
+ "epoch": 3.4778576396939487,
2248
+ "grad_norm": 0.04934167116880417,
2249
+ "learning_rate": 1.4926255614683932e-05,
2250
+ "loss": 0.1274,
2251
+ "step": 7500
2252
+ },
2253
+ {
2254
+ "epoch": 3.4778576396939487,
2255
+ "eval_loss": 0.15982454893723042,
2256
+ "eval_runtime": 60.201,
2257
+ "eval_samples_per_second": 690.786,
2258
+ "eval_steps_per_second": 0.681,
2259
+ "step": 7500
2260
+ },
2261
+ {
2262
+ "epoch": 3.501043357291908,
2263
+ "grad_norm": 0.04529615864157677,
2264
+ "learning_rate": 1.4365291431056871e-05,
2265
+ "loss": 0.1297,
2266
+ "step": 7550
2267
+ },
2268
+ {
2269
+ "epoch": 3.501043357291908,
2270
+ "eval_loss": 0.15986133524024926,
2271
+ "eval_runtime": 59.95,
2272
+ "eval_samples_per_second": 693.678,
2273
+ "eval_steps_per_second": 0.684,
2274
+ "step": 7550
2275
+ },
2276
+ {
2277
+ "epoch": 3.5242290748898677,
2278
+ "grad_norm": 0.0399620421230793,
2279
+ "learning_rate": 1.3813298094746491e-05,
2280
+ "loss": 0.1288,
2281
+ "step": 7600
2282
+ },
2283
+ {
2284
+ "epoch": 3.5242290748898677,
2285
+ "eval_loss": 0.15905609221590689,
2286
+ "eval_runtime": 61.181,
2287
+ "eval_samples_per_second": 679.72,
2288
+ "eval_steps_per_second": 0.67,
2289
+ "step": 7600
2290
+ },
2291
+ {
2292
+ "epoch": 3.5474147924878277,
2293
+ "grad_norm": 0.05973295867443085,
2294
+ "learning_rate": 1.327041456712334e-05,
2295
+ "loss": 0.1281,
2296
+ "step": 7650
2297
+ },
2298
+ {
2299
+ "epoch": 3.5474147924878277,
2300
+ "eval_loss": 0.15981091550942805,
2301
+ "eval_runtime": 60.5605,
2302
+ "eval_samples_per_second": 686.685,
2303
+ "eval_steps_per_second": 0.677,
2304
+ "step": 7650
2305
+ },
2306
+ {
2307
+ "epoch": 3.570600510085787,
2308
+ "grad_norm": 0.04896661266684532,
2309
+ "learning_rate": 1.2736777516212266e-05,
2310
+ "loss": 0.1288,
2311
+ "step": 7700
2312
+ },
2313
+ {
2314
+ "epoch": 3.570600510085787,
2315
+ "eval_loss": 0.1599924400443614,
2316
+ "eval_runtime": 60.486,
2317
+ "eval_samples_per_second": 687.531,
2318
+ "eval_steps_per_second": 0.678,
2319
+ "step": 7700
2320
+ },
2321
+ {
2322
+ "epoch": 3.5937862276837467,
2323
+ "grad_norm": 0.07458525151014328,
2324
+ "learning_rate": 1.2212521282287092e-05,
2325
+ "loss": 0.128,
2326
+ "step": 7750
2327
+ },
2328
+ {
2329
+ "epoch": 3.5937862276837467,
2330
+ "eval_loss": 0.15936126278835275,
2331
+ "eval_runtime": 60.9341,
2332
+ "eval_samples_per_second": 682.475,
2333
+ "eval_steps_per_second": 0.673,
2334
+ "step": 7750
2335
+ },
2336
+ {
2337
+ "epoch": 3.6169719452817066,
2338
+ "grad_norm": 0.04200127348303795,
2339
+ "learning_rate": 1.1697777844051105e-05,
2340
+ "loss": 0.1287,
2341
+ "step": 7800
2342
+ },
2343
+ {
2344
+ "epoch": 3.6169719452817066,
2345
+ "eval_loss": 0.1603394617678833,
2346
+ "eval_runtime": 60.5155,
2347
+ "eval_samples_per_second": 687.195,
2348
+ "eval_steps_per_second": 0.678,
2349
+ "step": 7800
2350
+ },
2351
+ {
2352
+ "epoch": 3.640157662879666,
2353
+ "grad_norm": 0.06712640821933746,
2354
+ "learning_rate": 1.1192676785412154e-05,
2355
+ "loss": 0.1291,
2356
+ "step": 7850
2357
+ },
2358
+ {
2359
+ "epoch": 3.640157662879666,
2360
+ "eval_loss": 0.15920067938345067,
2361
+ "eval_runtime": 60.0225,
2362
+ "eval_samples_per_second": 692.84,
2363
+ "eval_steps_per_second": 0.683,
2364
+ "step": 7850
2365
+ },
2366
+ {
2367
+ "epoch": 3.6633433804776256,
2368
+ "grad_norm": 0.049462996423244476,
2369
+ "learning_rate": 1.0697345262860636e-05,
2370
+ "loss": 0.1287,
2371
+ "step": 7900
2372
+ },
2373
+ {
2374
+ "epoch": 3.6633433804776256,
2375
+ "eval_loss": 0.15964593569874527,
2376
+ "eval_runtime": 60.1965,
2377
+ "eval_samples_per_second": 690.837,
2378
+ "eval_steps_per_second": 0.681,
2379
+ "step": 7900
2380
+ },
2381
+ {
2382
+ "epoch": 3.6865290980755856,
2383
+ "grad_norm": 0.05148932337760925,
2384
+ "learning_rate": 1.021190797345839e-05,
2385
+ "loss": 0.1283,
2386
+ "step": 7950
2387
+ },
2388
+ {
2389
+ "epoch": 3.6865290980755856,
2390
+ "eval_loss": 0.15903354419354673,
2391
+ "eval_runtime": 60.0507,
2392
+ "eval_samples_per_second": 692.515,
2393
+ "eval_steps_per_second": 0.683,
2394
+ "step": 7950
2395
+ },
2396
+ {
2397
+ "epoch": 3.709714815673545,
2398
+ "grad_norm": 0.05164024233818054,
2399
+ "learning_rate": 9.73648712344707e-06,
2400
+ "loss": 0.128,
2401
+ "step": 8000
2402
+ },
2403
+ {
2404
+ "epoch": 3.709714815673545,
2405
+ "eval_loss": 0.15835035051131605,
2406
+ "eval_runtime": 60.5739,
2407
+ "eval_samples_per_second": 686.533,
2408
+ "eval_steps_per_second": 0.677,
2409
+ "step": 8000
2410
+ },
2411
+ {
2412
+ "epoch": 3.7329005332715046,
2413
+ "grad_norm": 0.04926716163754463,
2414
+ "learning_rate": 9.271202397483215e-06,
2415
+ "loss": 0.1276,
2416
+ "step": 8050
2417
+ },
2418
+ {
2419
+ "epoch": 3.7329005332715046,
2420
+ "eval_loss": 0.160225615529793,
2421
+ "eval_runtime": 60.4555,
2422
+ "eval_samples_per_second": 687.878,
2423
+ "eval_steps_per_second": 0.678,
2424
+ "step": 8050
2425
+ },
2426
+ {
2427
+ "epoch": 3.7560862508694646,
2428
+ "grad_norm": 0.04355842247605324,
2429
+ "learning_rate": 8.816170928508365e-06,
2430
+ "loss": 0.1287,
2431
+ "step": 8100
2432
+ },
2433
+ {
2434
+ "epoch": 3.7560862508694646,
2435
+ "eval_loss": 0.1601867779420742,
2436
+ "eval_runtime": 60.7386,
2437
+ "eval_samples_per_second": 684.672,
2438
+ "eval_steps_per_second": 0.675,
2439
+ "step": 8100
2440
+ },
2441
+ {
2442
+ "epoch": 3.779271968467424,
2443
+ "grad_norm": 0.039105553179979324,
2444
+ "learning_rate": 8.371507268261437e-06,
2445
+ "loss": 0.1306,
2446
+ "step": 8150
2447
+ },
2448
+ {
2449
+ "epoch": 3.779271968467424,
2450
+ "eval_loss": 0.15946348937187382,
2451
+ "eval_runtime": 60.9253,
2452
+ "eval_samples_per_second": 682.574,
2453
+ "eval_steps_per_second": 0.673,
2454
+ "step": 8150
2455
+ },
2456
+ {
2457
+ "epoch": 3.8024576860653836,
2458
+ "grad_norm": 0.04452899843454361,
2459
+ "learning_rate": 7.937323358440935e-06,
2460
+ "loss": 0.1286,
2461
+ "step": 8200
2462
+ },
2463
+ {
2464
+ "epoch": 3.8024576860653836,
2465
+ "eval_loss": 0.15871429728364056,
2466
+ "eval_runtime": 60.2776,
2467
+ "eval_samples_per_second": 689.908,
2468
+ "eval_steps_per_second": 0.68,
2469
+ "step": 8200
2470
+ },
2471
+ {
2472
+ "epoch": 3.8256434036633435,
2473
+ "grad_norm": 0.043075498193502426,
2474
+ "learning_rate": 7.513728502524286e-06,
2475
+ "loss": 0.1292,
2476
+ "step": 8250
2477
+ },
2478
+ {
2479
+ "epoch": 3.8256434036633435,
2480
+ "eval_loss": 0.1592580359542711,
2481
+ "eval_runtime": 60.7244,
2482
+ "eval_samples_per_second": 684.832,
2483
+ "eval_steps_per_second": 0.675,
2484
+ "step": 8250
2485
+ },
2486
+ {
2487
+ "epoch": 3.848829121261303,
2488
+ "grad_norm": 0.05848800390958786,
2489
+ "learning_rate": 7.100829338251147e-06,
2490
+ "loss": 0.1275,
2491
+ "step": 8300
2492
+ },
2493
+ {
2494
+ "epoch": 3.848829121261303,
2495
+ "eval_loss": 0.15895083163665807,
2496
+ "eval_runtime": 60.3677,
2497
+ "eval_samples_per_second": 688.878,
2498
+ "eval_steps_per_second": 0.679,
2499
+ "step": 8300
2500
+ },
2501
+ {
2502
+ "epoch": 3.8720148388592626,
2503
+ "grad_norm": 0.04980336129665375,
2504
+ "learning_rate": 6.698729810778065e-06,
2505
+ "loss": 0.1277,
2506
+ "step": 8350
2507
+ },
2508
+ {
2509
+ "epoch": 3.8720148388592626,
2510
+ "eval_loss": 0.16002303550437,
2511
+ "eval_runtime": 60.2742,
2512
+ "eval_samples_per_second": 689.947,
2513
+ "eval_steps_per_second": 0.68,
2514
+ "step": 8350
2515
+ },
2516
+ {
2517
+ "epoch": 3.8952005564572225,
2518
+ "grad_norm": 0.057385146617889404,
2519
+ "learning_rate": 6.3075311465107535e-06,
2520
+ "loss": 0.129,
2521
+ "step": 8400
2522
+ },
2523
+ {
2524
+ "epoch": 3.8952005564572225,
2525
+ "eval_loss": 0.1601535826416112,
2526
+ "eval_runtime": 60.4053,
2527
+ "eval_samples_per_second": 688.45,
2528
+ "eval_steps_per_second": 0.679,
2529
+ "step": 8400
2530
+ },
2531
+ {
2532
+ "epoch": 3.918386274055182,
2533
+ "grad_norm": 0.045788682997226715,
2534
+ "learning_rate": 5.927331827620903e-06,
2535
+ "loss": 0.1286,
2536
+ "step": 8450
2537
+ },
2538
+ {
2539
+ "epoch": 3.918386274055182,
2540
+ "eval_loss": 0.15926720973175468,
2541
+ "eval_runtime": 60.6783,
2542
+ "eval_samples_per_second": 685.352,
2543
+ "eval_steps_per_second": 0.676,
2544
+ "step": 8450
2545
+ },
2546
+ {
2547
+ "epoch": 3.9415719916531415,
2548
+ "grad_norm": 0.045575451105833054,
2549
+ "learning_rate": 5.558227567253832e-06,
2550
+ "loss": 0.1281,
2551
+ "step": 8500
2552
+ },
2553
+ {
2554
+ "epoch": 3.9415719916531415,
2555
+ "eval_loss": 0.16032033338606583,
2556
+ "eval_runtime": 60.4563,
2557
+ "eval_samples_per_second": 687.868,
2558
+ "eval_steps_per_second": 0.678,
2559
+ "step": 8500
2560
+ },
2561
+ {
2562
+ "epoch": 3.964757709251101,
2563
+ "grad_norm": 0.034972067922353745,
2564
+ "learning_rate": 5.200311285433213e-06,
2565
+ "loss": 0.1285,
2566
+ "step": 8550
2567
+ },
2568
+ {
2569
+ "epoch": 3.964757709251101,
2570
+ "eval_loss": 0.1590997571686103,
2571
+ "eval_runtime": 60.7642,
2572
+ "eval_samples_per_second": 684.384,
2573
+ "eval_steps_per_second": 0.675,
2574
+ "step": 8550
2575
+ },
2576
+ {
2577
+ "epoch": 3.987943426849061,
2578
+ "grad_norm": 0.05060684680938721,
2579
+ "learning_rate": 4.853673085668947e-06,
2580
+ "loss": 0.1293,
2581
+ "step": 8600
2582
+ },
2583
+ {
2584
+ "epoch": 3.987943426849061,
2585
+ "eval_loss": 0.15924322809570868,
2586
+ "eval_runtime": 60.0799,
2587
+ "eval_samples_per_second": 692.178,
2588
+ "eval_steps_per_second": 0.682,
2589
+ "step": 8600
2590
+ },
2591
+ {
2592
+ "epoch": 4.011129144447021,
2593
+ "grad_norm": 0.04898017644882202,
2594
+ "learning_rate": 4.5184002322740785e-06,
2595
+ "loss": 0.1283,
2596
+ "step": 8650
2597
+ },
2598
+ {
2599
+ "epoch": 4.011129144447021,
2600
+ "eval_loss": 0.1587491140112498,
2601
+ "eval_runtime": 60.6393,
2602
+ "eval_samples_per_second": 685.793,
2603
+ "eval_steps_per_second": 0.676,
2604
+ "step": 8650
2605
+ },
2606
+ {
2607
+ "epoch": 4.0343148620449805,
2608
+ "grad_norm": 0.058361586183309555,
2609
+ "learning_rate": 4.19457712839652e-06,
2610
+ "loss": 0.1277,
2611
+ "step": 8700
2612
+ },
2613
+ {
2614
+ "epoch": 4.0343148620449805,
2615
+ "eval_loss": 0.1597737118597627,
2616
+ "eval_runtime": 61.5486,
2617
+ "eval_samples_per_second": 675.661,
2618
+ "eval_steps_per_second": 0.666,
2619
+ "step": 8700
2620
+ },
2621
+ {
2622
+ "epoch": 4.05750057964294,
2623
+ "grad_norm": 0.05138258635997772,
2624
+ "learning_rate": 3.8822852947709375e-06,
2625
+ "loss": 0.1283,
2626
+ "step": 8750
2627
+ },
2628
+ {
2629
+ "epoch": 4.05750057964294,
2630
+ "eval_loss": 0.15985116115580386,
2631
+ "eval_runtime": 60.5634,
2632
+ "eval_samples_per_second": 686.652,
2633
+ "eval_steps_per_second": 0.677,
2634
+ "step": 8750
2635
+ },
2636
+ {
2637
+ "epoch": 4.0806862972408995,
2638
+ "grad_norm": 0.0461881086230278,
2639
+ "learning_rate": 3.581603349196372e-06,
2640
+ "loss": 0.1288,
2641
+ "step": 8800
2642
+ },
2643
+ {
2644
+ "epoch": 4.0806862972408995,
2645
+ "eval_loss": 0.15788726458515429,
2646
+ "eval_runtime": 60.6057,
2647
+ "eval_samples_per_second": 686.173,
2648
+ "eval_steps_per_second": 0.677,
2649
+ "step": 8800
2650
+ }
2651
+ ],
2652
+ "logging_steps": 50,
2653
+ "max_steps": 10000,
2654
+ "num_input_tokens_seen": 0,
2655
+ "num_train_epochs": 5,
2656
+ "save_steps": 50,
2657
+ "total_flos": 2.0443186999800627e+17,
2658
+ "train_batch_size": 1024,
2659
+ "trial_name": null,
2660
+ "trial_params": null
2661
+ }
checkpoint-8800/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3df77597c69632ab7d5a5d7987fd817d663727a57d77ca069bc92a72238da772
3
+ size 5393
config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "architectures": [
3
+ "GPNRoFormerForMaskedLM"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "aux_features_vocab_size": 5,
7
+ "embedding_size": 768,
8
+ "group_tokens": 1,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 1536,
16
+ "model_type": "GPNRoFormer",
17
+ "n_aux_features": 445,
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 2,
20
+ "pad_token_id": 0,
21
+ "rotary_value": false,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.40.2",
24
+ "type_vocab_size": 2,
25
+ "use_cache": true,
26
+ "vocab_size": 6
27
+ }
eval_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 4.637143519591931,
3
+ "eval_loss": 0.15925397718542236,
4
+ "eval_runtime": 61.5353,
5
+ "eval_samples_per_second": 675.807,
6
+ "eval_steps_per_second": 0.666,
7
+ "perplexity": 1.1726357315864648
8
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a0e79347487c61d8217106a3f8f05bbf42d7c2038dab7a7a461077975e6acff9
3
+ size 59497084
train_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 4.637143519591931,
3
+ "total_flos": 2.3231400526217216e+17,
4
+ "train_loss": 0.13326009378433226,
5
+ "train_runtime": 41368.3669,
6
+ "train_samples_per_second": 495.064,
7
+ "train_steps_per_second": 0.242
8
+ }
trainer_state.json ADDED
@@ -0,0 +1,3030 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
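The `learning_rate` values logged in the `log_history` entries that follow appear consistent with the hyperparameters on the model card: linear warmup over 100 steps to a peak of 1e-4, then cosine decay toward zero at step 10000. The sketch below is a minimal illustration under that assumption, not the training code itself. The function name `lr_at_step` and the constants are hypothetical, chosen only for this example.

```python
import math

# Assumed from the model card: learning_rate=1e-4, warmup_steps=100,
# training_steps=10000, lr_scheduler_type=cosine.
PEAK_LR = 1e-4
WARMUP_STEPS = 100
TOTAL_STEPS = 10_000


def lr_at_step(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
               total_steps=TOTAL_STEPS):
    """Learning rate after `step` optimizer steps (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from the peak down to zero.
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    # Compare a few steps against the "learning_rate" fields in the log.
    for step in (50, 100, 150, 1750, 3400):
        print(step, lr_at_step(step))
```

Comparing the printed values with the `learning_rate` fields at the same `step` is a quick way to check that a log was produced with the schedule you expect.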
+ {
+ "best_metric": 0.15788726458515429,
+ "best_model_checkpoint": "checkpoints/checkpoint-8800",
+ "epoch": 4.637143519591931,
+ "eval_steps": 50,
+ "global_step": 10000,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.023185717597959656,
+ "grad_norm": 0.16052097082138062,
+ "learning_rate": 5e-05,
+ "loss": 0.6225,
+ "step": 50
+ },
+ {
+ "epoch": 0.023185717597959656,
+ "eval_loss": 0.1987911110084725,
+ "eval_runtime": 63.5433,
+ "eval_samples_per_second": 654.451,
+ "eval_steps_per_second": 0.645,
+ "step": 50
+ },
+ {
+ "epoch": 0.04637143519591931,
+ "grad_norm": 0.09532159566879272,
+ "learning_rate": 0.0001,
+ "loss": 0.1508,
+ "step": 100
+ },
+ {
+ "epoch": 0.04637143519591931,
+ "eval_loss": 0.18357936446787168,
+ "eval_runtime": 60.9844,
+ "eval_samples_per_second": 681.912,
+ "eval_steps_per_second": 0.672,
+ "step": 100
+ },
+ {
+ "epoch": 0.06955715279387897,
+ "grad_norm": 0.24056212604045868,
+ "learning_rate": 9.999370638369377e-05,
+ "loss": 0.1449,
+ "step": 150
+ },
+ {
+ "epoch": 0.06955715279387897,
+ "eval_loss": 0.17892658896642444,
+ "eval_runtime": 60.7834,
+ "eval_samples_per_second": 684.167,
+ "eval_steps_per_second": 0.675,
+ "step": 150
+ },
+ {
+ "epoch": 0.09274287039183862,
+ "grad_norm": 0.09350813180208206,
+ "learning_rate": 9.997482711915927e-05,
+ "loss": 0.1421,
+ "step": 200
+ },
+ {
+ "epoch": 0.09274287039183862,
+ "eval_loss": 0.17624869869175752,
+ "eval_runtime": 60.3826,
+ "eval_samples_per_second": 688.708,
+ "eval_steps_per_second": 0.679,
+ "step": 200
+ },
+ {
+ "epoch": 0.11592858798979828,
+ "grad_norm": 0.12230529636144638,
+ "learning_rate": 9.99433669591504e-05,
+ "loss": 0.141,
+ "step": 250
+ },
+ {
+ "epoch": 0.11592858798979828,
+ "eval_loss": 0.17641382363047173,
+ "eval_runtime": 60.4169,
+ "eval_samples_per_second": 688.317,
+ "eval_steps_per_second": 0.679,
+ "step": 250
+ },
+ {
+ "epoch": 0.13911430558775795,
+ "grad_norm": 0.14592748880386353,
+ "learning_rate": 9.989933382359422e-05,
+ "loss": 0.1397,
+ "step": 300
+ },
+ {
+ "epoch": 0.13911430558775795,
+ "eval_loss": 0.17552215792639078,
+ "eval_runtime": 61.6101,
+ "eval_samples_per_second": 674.987,
+ "eval_steps_per_second": 0.665,
+ "step": 300
+ },
+ {
+ "epoch": 0.1623000231857176,
+ "grad_norm": 0.10219988226890564,
+ "learning_rate": 9.984273879759713e-05,
+ "loss": 0.1393,
+ "step": 350
+ },
+ {
+ "epoch": 0.1623000231857176,
+ "eval_loss": 0.17414749172793012,
+ "eval_runtime": 61.4962,
+ "eval_samples_per_second": 676.237,
+ "eval_steps_per_second": 0.667,
+ "step": 350
+ },
+ {
+ "epoch": 0.18548574078367724,
+ "grad_norm": 0.12338168174028397,
+ "learning_rate": 9.977359612865423e-05,
+ "loss": 0.1388,
+ "step": 400
+ },
+ {
+ "epoch": 0.18548574078367724,
+ "eval_loss": 0.17378012638412807,
+ "eval_runtime": 61.0462,
+ "eval_samples_per_second": 681.221,
+ "eval_steps_per_second": 0.672,
+ "step": 400
+ },
+ {
+ "epoch": 0.20867145838163692,
+ "grad_norm": 0.09479879587888718,
+ "learning_rate": 9.969192322306271e-05,
+ "loss": 0.1394,
+ "step": 450
+ },
+ {
+ "epoch": 0.20867145838163692,
+ "eval_loss": 0.17252362204688398,
+ "eval_runtime": 60.963,
+ "eval_samples_per_second": 682.151,
+ "eval_steps_per_second": 0.673,
+ "step": 450
+ },
+ {
+ "epoch": 0.23185717597959657,
+ "grad_norm": 0.1108623668551445,
+ "learning_rate": 9.959774064153977e-05,
+ "loss": 0.1383,
+ "step": 500
+ },
+ {
+ "epoch": 0.23185717597959657,
+ "eval_loss": 0.17298176916843877,
+ "eval_runtime": 60.6546,
+ "eval_samples_per_second": 685.62,
+ "eval_steps_per_second": 0.676,
+ "step": 500
+ },
+ {
+ "epoch": 0.2550428935775562,
+ "grad_norm": 0.0725204199552536,
+ "learning_rate": 9.949107209404665e-05,
+ "loss": 0.1376,
+ "step": 550
+ },
+ {
+ "epoch": 0.2550428935775562,
+ "eval_loss": 0.17165218165539878,
+ "eval_runtime": 59.4888,
+ "eval_samples_per_second": 699.056,
+ "eval_steps_per_second": 0.689,
+ "step": 550
+ },
+ {
+ "epoch": 0.2782286111755159,
+ "grad_norm": 0.0955963134765625,
+ "learning_rate": 9.937194443381972e-05,
+ "loss": 0.1372,
+ "step": 600
+ },
+ {
+ "epoch": 0.2782286111755159,
+ "eval_loss": 0.17077083113718278,
+ "eval_runtime": 60.6021,
+ "eval_samples_per_second": 686.214,
+ "eval_steps_per_second": 0.677,
+ "step": 600
+ },
+ {
+ "epoch": 0.3014143287734755,
+ "grad_norm": 0.18736732006072998,
+ "learning_rate": 9.924038765061042e-05,
+ "loss": 0.1361,
+ "step": 650
+ },
+ {
+ "epoch": 0.3014143287734755,
+ "eval_loss": 0.1727738813183492,
+ "eval_runtime": 60.3343,
+ "eval_samples_per_second": 689.259,
+ "eval_steps_per_second": 0.68,
+ "step": 650
+ },
+ {
+ "epoch": 0.3246000463714352,
+ "grad_norm": 0.09572151303291321,
+ "learning_rate": 9.909643486313533e-05,
+ "loss": 0.1362,
+ "step": 700
+ },
+ {
+ "epoch": 0.3246000463714352,
+ "eval_loss": 0.17145407115151273,
+ "eval_runtime": 60.2732,
+ "eval_samples_per_second": 689.959,
+ "eval_steps_per_second": 0.68,
+ "step": 700
+ },
+ {
+ "epoch": 0.34778576396939487,
+ "grad_norm": 0.07214252650737762,
+ "learning_rate": 9.894012231073894e-05,
+ "loss": 0.1364,
+ "step": 750
+ },
+ {
+ "epoch": 0.34778576396939487,
+ "eval_loss": 0.17133199408489355,
+ "eval_runtime": 60.0148,
+ "eval_samples_per_second": 692.929,
+ "eval_steps_per_second": 0.683,
+ "step": 750
+ },
+ {
+ "epoch": 0.3709714815673545,
+ "grad_norm": 0.18224318325519562,
+ "learning_rate": 9.877148934427037e-05,
+ "loss": 0.1356,
+ "step": 800
+ },
+ {
+ "epoch": 0.3709714815673545,
+ "eval_loss": 0.16949569222888886,
+ "eval_runtime": 60.1491,
+ "eval_samples_per_second": 691.382,
+ "eval_steps_per_second": 0.682,
+ "step": 800
+ },
+ {
+ "epoch": 0.39415719916531416,
+ "grad_norm": 0.06306415796279907,
+ "learning_rate": 9.859057841617709e-05,
+ "loss": 0.1353,
+ "step": 850
+ },
+ {
+ "epoch": 0.39415719916531416,
+ "eval_loss": 0.1686952690798172,
+ "eval_runtime": 60.6447,
+ "eval_samples_per_second": 685.731,
+ "eval_steps_per_second": 0.676,
+ "step": 850
+ },
+ {
+ "epoch": 0.41734291676327384,
+ "grad_norm": 0.10090287029743195,
+ "learning_rate": 9.839743506981782e-05,
+ "loss": 0.1361,
+ "step": 900
+ },
+ {
+ "epoch": 0.41734291676327384,
+ "eval_loss": 0.17026100034088926,
+ "eval_runtime": 61.5224,
+ "eval_samples_per_second": 675.949,
+ "eval_steps_per_second": 0.666,
+ "step": 900
+ },
+ {
+ "epoch": 0.44052863436123346,
+ "grad_norm": 0.10061236470937729,
+ "learning_rate": 9.819210792799712e-05,
+ "loss": 0.1354,
+ "step": 950
+ },
+ {
+ "epoch": 0.44052863436123346,
+ "eval_loss": 0.16971544565694113,
+ "eval_runtime": 60.488,
+ "eval_samples_per_second": 687.508,
+ "eval_steps_per_second": 0.678,
+ "step": 950
+ },
+ {
+ "epoch": 0.46371435195919314,
+ "grad_norm": 0.06525534391403198,
+ "learning_rate": 9.797464868072488e-05,
+ "loss": 0.1352,
+ "step": 1000
+ },
+ {
+ "epoch": 0.46371435195919314,
+ "eval_loss": 0.16946903195393553,
+ "eval_runtime": 61.3558,
+ "eval_samples_per_second": 677.784,
+ "eval_steps_per_second": 0.668,
+ "step": 1000
+ },
+ {
+ "epoch": 0.4869000695571528,
+ "grad_norm": 0.06269507855176926,
+ "learning_rate": 9.77451120722037e-05,
+ "loss": 0.1335,
+ "step": 1050
+ },
+ {
+ "epoch": 0.4869000695571528,
+ "eval_loss": 0.16825352947444114,
+ "eval_runtime": 60.1971,
+ "eval_samples_per_second": 690.831,
+ "eval_steps_per_second": 0.681,
+ "step": 1050
+ },
+ {
+ "epoch": 0.5100857871551124,
+ "grad_norm": 0.08187470585107803,
+ "learning_rate": 9.750355588704727e-05,
+ "loss": 0.1327,
+ "step": 1100
+ },
+ {
+ "epoch": 0.5100857871551124,
+ "eval_loss": 0.16861737282523587,
+ "eval_runtime": 59.5715,
+ "eval_samples_per_second": 698.085,
+ "eval_steps_per_second": 0.688,
+ "step": 1100
+ },
+ {
+ "epoch": 0.5332715047530721,
+ "grad_norm": 0.06607680767774582,
+ "learning_rate": 9.725004093573342e-05,
+ "loss": 0.1337,
+ "step": 1150
+ },
+ {
+ "epoch": 0.5332715047530721,
+ "eval_loss": 0.1692070748762034,
+ "eval_runtime": 60.0637,
+ "eval_samples_per_second": 692.364,
+ "eval_steps_per_second": 0.683,
+ "step": 1150
+ },
+ {
+ "epoch": 0.5564572223510318,
+ "grad_norm": 0.09759815782308578,
+ "learning_rate": 9.698463103929542e-05,
+ "loss": 0.134,
+ "step": 1200
+ },
+ {
+ "epoch": 0.5564572223510318,
+ "eval_loss": 0.16649228385381692,
+ "eval_runtime": 60.4305,
+ "eval_samples_per_second": 688.162,
+ "eval_steps_per_second": 0.678,
+ "step": 1200
+ },
+ {
+ "epoch": 0.5796429399489914,
+ "grad_norm": 0.10353852063417435,
+ "learning_rate": 9.670739301325534e-05,
+ "loss": 0.1341,
+ "step": 1250
+ },
+ {
+ "epoch": 0.5796429399489914,
+ "eval_loss": 0.16802514322459206,
+ "eval_runtime": 60.0955,
+ "eval_samples_per_second": 691.999,
+ "eval_steps_per_second": 0.682,
+ "step": 1250
+ },
+ {
+ "epoch": 0.602828657546951,
+ "grad_norm": 0.11834366619586945,
+ "learning_rate": 9.641839665080363e-05,
+ "loss": 0.1347,
+ "step": 1300
+ },
+ {
+ "epoch": 0.602828657546951,
+ "eval_loss": 0.1672302417427292,
+ "eval_runtime": 60.3484,
+ "eval_samples_per_second": 689.098,
+ "eval_steps_per_second": 0.679,
+ "step": 1300
+ },
+ {
+ "epoch": 0.6260143751449108,
+ "grad_norm": 0.06963012367486954,
+ "learning_rate": 9.611771470522908e-05,
+ "loss": 0.1335,
+ "step": 1350
+ },
+ {
+ "epoch": 0.6260143751449108,
+ "eval_loss": 0.16607839684977216,
+ "eval_runtime": 60.3308,
+ "eval_samples_per_second": 689.3,
+ "eval_steps_per_second": 0.68,
+ "step": 1350
+ },
+ {
+ "epoch": 0.6492000927428704,
+ "grad_norm": 0.06842990219593048,
+ "learning_rate": 9.580542287160348e-05,
+ "loss": 0.1338,
+ "step": 1400
+ },
+ {
+ "epoch": 0.6492000927428704,
+ "eval_loss": 0.16628812684035693,
+ "eval_runtime": 59.9335,
+ "eval_samples_per_second": 693.87,
+ "eval_steps_per_second": 0.684,
+ "step": 1400
+ },
+ {
+ "epoch": 0.67238581034083,
+ "grad_norm": 0.07053674757480621,
+ "learning_rate": 9.548159976772592e-05,
+ "loss": 0.1335,
+ "step": 1450
+ },
+ {
+ "epoch": 0.67238581034083,
+ "eval_loss": 0.16696060882262428,
+ "eval_runtime": 59.8079,
+ "eval_samples_per_second": 695.326,
+ "eval_steps_per_second": 0.686,
+ "step": 1450
+ },
+ {
+ "epoch": 0.6955715279387897,
+ "grad_norm": 0.09175281971693039,
+ "learning_rate": 9.514632691433107e-05,
+ "loss": 0.1332,
+ "step": 1500
+ },
+ {
+ "epoch": 0.6955715279387897,
+ "eval_loss": 0.16521949465081834,
+ "eval_runtime": 60.1856,
+ "eval_samples_per_second": 690.963,
+ "eval_steps_per_second": 0.681,
+ "step": 1500
+ },
+ {
+ "epoch": 0.7187572455367494,
+ "grad_norm": 0.05836635082960129,
+ "learning_rate": 9.479968871456679e-05,
+ "loss": 0.1336,
+ "step": 1550
+ },
+ {
+ "epoch": 0.7187572455367494,
+ "eval_loss": 0.16626366255041727,
+ "eval_runtime": 60.6256,
+ "eval_samples_per_second": 685.948,
+ "eval_steps_per_second": 0.676,
+ "step": 1550
+ },
+ {
+ "epoch": 0.741942963134709,
+ "grad_norm": 0.07249301671981812,
+ "learning_rate": 9.444177243274618e-05,
+ "loss": 0.133,
+ "step": 1600
+ },
+ {
+ "epoch": 0.741942963134709,
+ "eval_loss": 0.1655649439629329,
+ "eval_runtime": 60.2447,
+ "eval_samples_per_second": 690.285,
+ "eval_steps_per_second": 0.681,
+ "step": 1600
+ },
+ {
+ "epoch": 0.7651286807326687,
+ "grad_norm": 0.07509302347898483,
+ "learning_rate": 9.407266817237911e-05,
+ "loss": 0.1332,
+ "step": 1650
+ },
+ {
+ "epoch": 0.7651286807326687,
+ "eval_loss": 0.16605371203296967,
+ "eval_runtime": 59.8196,
+ "eval_samples_per_second": 695.191,
+ "eval_steps_per_second": 0.685,
+ "step": 1650
+ },
+ {
+ "epoch": 0.7883143983306283,
+ "grad_norm": 0.07540406286716461,
+ "learning_rate": 9.369246885348926e-05,
+ "loss": 0.1327,
+ "step": 1700
+ },
+ {
+ "epoch": 0.7883143983306283,
+ "eval_loss": 0.16555590021301406,
+ "eval_runtime": 60.4119,
+ "eval_samples_per_second": 688.374,
+ "eval_steps_per_second": 0.679,
+ "step": 1700
+ },
+ {
+ "epoch": 0.811500115928588,
+ "grad_norm": 0.06061087176203728,
+ "learning_rate": 9.330127018922194e-05,
+ "loss": 0.1318,
+ "step": 1750
+ },
+ {
+ "epoch": 0.811500115928588,
+ "eval_loss": 0.16623179673527624,
+ "eval_runtime": 59.7807,
+ "eval_samples_per_second": 695.643,
+ "eval_steps_per_second": 0.686,
+ "step": 1750
+ },
+ {
+ "epoch": 0.8346858335265477,
+ "grad_norm": 0.05577518790960312,
+ "learning_rate": 9.289917066174886e-05,
+ "loss": 0.1319,
+ "step": 1800
+ },
+ {
+ "epoch": 0.8346858335265477,
+ "eval_loss": 0.16519989030959317,
+ "eval_runtime": 60.1508,
+ "eval_samples_per_second": 691.363,
+ "eval_steps_per_second": 0.682,
+ "step": 1800
+ },
+ {
+ "epoch": 0.8578715511245073,
+ "grad_norm": 0.06929640471935272,
+ "learning_rate": 9.248627149747573e-05,
+ "loss": 0.1337,
+ "step": 1850
+ },
+ {
+ "epoch": 0.8578715511245073,
+ "eval_loss": 0.16394849849125304,
+ "eval_runtime": 60.1044,
+ "eval_samples_per_second": 691.896,
+ "eval_steps_per_second": 0.682,
+ "step": 1850
+ },
+ {
+ "epoch": 0.8810572687224669,
+ "grad_norm": 0.07941466569900513,
+ "learning_rate": 9.206267664155907e-05,
+ "loss": 0.1324,
+ "step": 1900
+ },
+ {
+ "epoch": 0.8810572687224669,
+ "eval_loss": 0.1648257591054525,
+ "eval_runtime": 59.9818,
+ "eval_samples_per_second": 693.31,
+ "eval_steps_per_second": 0.684,
+ "step": 1900
+ },
+ {
+ "epoch": 0.9042429863204267,
+ "grad_norm": 0.09700328856706619,
+ "learning_rate": 9.162849273173857e-05,
+ "loss": 0.1334,
+ "step": 1950
+ },
+ {
+ "epoch": 0.9042429863204267,
+ "eval_loss": 0.16508159820082235,
+ "eval_runtime": 60.0956,
+ "eval_samples_per_second": 691.997,
+ "eval_steps_per_second": 0.682,
+ "step": 1950
+ },
+ {
+ "epoch": 0.9274287039183863,
+ "grad_norm": 0.09397923946380615,
+ "learning_rate": 9.118382907149165e-05,
+ "loss": 0.1317,
+ "step": 2000
+ },
+ {
+ "epoch": 0.9274287039183863,
+ "eval_loss": 0.16377645660046958,
+ "eval_runtime": 60.471,
+ "eval_samples_per_second": 687.701,
+ "eval_steps_per_second": 0.678,
+ "step": 2000
+ },
+ {
+ "epoch": 0.9506144215163459,
+ "grad_norm": 0.08097202330827713,
+ "learning_rate": 9.072879760251679e-05,
+ "loss": 0.1324,
+ "step": 2050
+ },
+ {
+ "epoch": 0.9506144215163459,
+ "eval_loss": 0.16491611914973717,
+ "eval_runtime": 60.6678,
+ "eval_samples_per_second": 685.471,
+ "eval_steps_per_second": 0.676,
+ "step": 2050
+ },
+ {
+ "epoch": 0.9738001391143056,
+ "grad_norm": 0.08455361425876617,
+ "learning_rate": 9.026351287655294e-05,
+ "loss": 0.1326,
+ "step": 2100
+ },
+ {
+ "epoch": 0.9738001391143056,
+ "eval_loss": 0.16602741997032858,
+ "eval_runtime": 60.5593,
+ "eval_samples_per_second": 686.698,
+ "eval_steps_per_second": 0.677,
+ "step": 2100
+ },
+ {
+ "epoch": 0.9969858567122653,
+ "grad_norm": 0.056316621601581573,
+ "learning_rate": 8.978809202654162e-05,
+ "loss": 0.1326,
+ "step": 2150
+ },
+ {
+ "epoch": 0.9969858567122653,
+ "eval_loss": 0.1640188218462461,
+ "eval_runtime": 60.9228,
+ "eval_samples_per_second": 682.602,
+ "eval_steps_per_second": 0.673,
+ "step": 2150
+ },
+ {
+ "epoch": 1.0201715743102249,
+ "grad_norm": 0.06686601787805557,
+ "learning_rate": 8.930265473713938e-05,
+ "loss": 0.132,
+ "step": 2200
+ },
+ {
+ "epoch": 1.0201715743102249,
+ "eval_loss": 0.1652621257294944,
+ "eval_runtime": 60.883,
+ "eval_samples_per_second": 683.048,
+ "eval_steps_per_second": 0.673,
+ "step": 2200
+ },
+ {
+ "epoch": 1.0433572919081846,
+ "grad_norm": 0.040202509611845016,
+ "learning_rate": 8.880732321458784e-05,
+ "loss": 0.1319,
+ "step": 2250
+ },
+ {
+ "epoch": 1.0433572919081846,
+ "eval_loss": 0.1655291575008717,
+ "eval_runtime": 60.3109,
+ "eval_samples_per_second": 689.527,
+ "eval_steps_per_second": 0.68,
+ "step": 2250
+ },
+ {
+ "epoch": 1.0665430095061441,
+ "grad_norm": 0.0656428411602974,
+ "learning_rate": 8.83022221559489e-05,
+ "loss": 0.1326,
+ "step": 2300
+ },
+ {
+ "epoch": 1.0665430095061441,
+ "eval_loss": 0.16431572407087935,
+ "eval_runtime": 60.1036,
+ "eval_samples_per_second": 691.906,
+ "eval_steps_per_second": 0.682,
+ "step": 2300
+ },
+ {
+ "epoch": 1.0897287271041038,
+ "grad_norm": 0.06945247948169708,
+ "learning_rate": 8.778747871771292e-05,
+ "loss": 0.1321,
+ "step": 2350
+ },
+ {
+ "epoch": 1.0897287271041038,
+ "eval_loss": 0.16585482329987242,
+ "eval_runtime": 60.6263,
+ "eval_samples_per_second": 685.94,
+ "eval_steps_per_second": 0.676,
+ "step": 2350
+ },
+ {
+ "epoch": 1.1129144447020636,
+ "grad_norm": 0.0523492731153965,
+ "learning_rate": 8.726322248378775e-05,
+ "loss": 0.1317,
+ "step": 2400
+ },
+ {
+ "epoch": 1.1129144447020636,
+ "eval_loss": 0.16438524923736036,
+ "eval_runtime": 60.317,
+ "eval_samples_per_second": 689.457,
+ "eval_steps_per_second": 0.68,
+ "step": 2400
+ },
+ {
+ "epoch": 1.136100162300023,
+ "grad_norm": 0.07777334004640579,
+ "learning_rate": 8.672958543287666e-05,
+ "loss": 0.1322,
+ "step": 2450
+ },
+ {
+ "epoch": 1.136100162300023,
+ "eval_loss": 0.16509696053644565,
+ "eval_runtime": 60.26,
+ "eval_samples_per_second": 690.109,
+ "eval_steps_per_second": 0.68,
+ "step": 2450
+ },
+ {
+ "epoch": 1.1592858798979828,
+ "grad_norm": 0.06430637836456299,
+ "learning_rate": 8.618670190525352e-05,
+ "loss": 0.1325,
+ "step": 2500
+ },
+ {
+ "epoch": 1.1592858798979828,
+ "eval_loss": 0.1639541608445008,
+ "eval_runtime": 60.5314,
+ "eval_samples_per_second": 687.015,
+ "eval_steps_per_second": 0.677,
+ "step": 2500
+ },
+ {
+ "epoch": 1.1824715974959426,
+ "grad_norm": 0.11194106936454773,
+ "learning_rate": 8.563470856894316e-05,
+ "loss": 0.1311,
+ "step": 2550
+ },
+ {
+ "epoch": 1.1824715974959426,
+ "eval_loss": 0.16260699934317355,
+ "eval_runtime": 60.3659,
+ "eval_samples_per_second": 688.899,
+ "eval_steps_per_second": 0.679,
+ "step": 2550
+ },
+ {
+ "epoch": 1.205657315093902,
+ "grad_norm": 0.06165901944041252,
+ "learning_rate": 8.507374438531607e-05,
+ "loss": 0.1323,
+ "step": 2600
+ },
+ {
+ "epoch": 1.205657315093902,
+ "eval_loss": 0.1626319663130242,
+ "eval_runtime": 59.9516,
+ "eval_samples_per_second": 693.66,
+ "eval_steps_per_second": 0.684,
+ "step": 2600
+ },
+ {
+ "epoch": 1.2288430326918618,
+ "grad_norm": 0.10654885321855545,
+ "learning_rate": 8.450395057410561e-05,
+ "loss": 0.1316,
+ "step": 2650
+ },
+ {
+ "epoch": 1.2288430326918618,
+ "eval_loss": 0.16393000041041636,
+ "eval_runtime": 59.576,
+ "eval_samples_per_second": 698.032,
+ "eval_steps_per_second": 0.688,
+ "step": 2650
+ },
+ {
+ "epoch": 1.2520287502898215,
+ "grad_norm": 0.04848140478134155,
+ "learning_rate": 8.392547057785661e-05,
+ "loss": 0.1314,
+ "step": 2700
+ },
+ {
+ "epoch": 1.2520287502898215,
+ "eval_loss": 0.16348152455768114,
+ "eval_runtime": 60.098,
+ "eval_samples_per_second": 691.97,
+ "eval_steps_per_second": 0.682,
+ "step": 2700
+ },
+ {
+ "epoch": 1.275214467887781,
+ "grad_norm": 0.0573604516685009,
+ "learning_rate": 8.333845002581458e-05,
+ "loss": 0.1314,
+ "step": 2750
+ },
+ {
+ "epoch": 1.275214467887781,
+ "eval_loss": 0.16364089140116167,
+ "eval_runtime": 60.1364,
+ "eval_samples_per_second": 691.528,
+ "eval_steps_per_second": 0.682,
+ "step": 2750
+ },
+ {
+ "epoch": 1.2984001854857408,
+ "grad_norm": 0.053159259259700775,
+ "learning_rate": 8.274303669726426e-05,
+ "loss": 0.131,
+ "step": 2800
+ },
+ {
+ "epoch": 1.2984001854857408,
+ "eval_loss": 0.16257415365129801,
+ "eval_runtime": 60.0025,
+ "eval_samples_per_second": 693.071,
+ "eval_steps_per_second": 0.683,
+ "step": 2800
+ },
+ {
+ "epoch": 1.3215859030837005,
+ "grad_norm": 0.09136148542165756,
+ "learning_rate": 8.213938048432697e-05,
+ "loss": 0.1313,
+ "step": 2850
+ },
+ {
+ "epoch": 1.3215859030837005,
+ "eval_loss": 0.16324665471619784,
+ "eval_runtime": 59.8429,
+ "eval_samples_per_second": 694.92,
+ "eval_steps_per_second": 0.685,
+ "step": 2850
+ },
+ {
+ "epoch": 1.34477162068166,
+ "grad_norm": 0.05825324356555939,
+ "learning_rate": 8.152763335422613e-05,
+ "loss": 0.1312,
+ "step": 2900
+ },
+ {
+ "epoch": 1.34477162068166,
+ "eval_loss": 0.16367374608121235,
+ "eval_runtime": 60.219,
+ "eval_samples_per_second": 690.579,
+ "eval_steps_per_second": 0.681,
+ "step": 2900
+ },
+ {
+ "epoch": 1.3679573382796197,
+ "grad_norm": 0.06379790604114532,
+ "learning_rate": 8.090794931103026e-05,
+ "loss": 0.1317,
+ "step": 2950
+ },
+ {
+ "epoch": 1.3679573382796197,
+ "eval_loss": 0.16400733758786312,
+ "eval_runtime": 59.9641,
+ "eval_samples_per_second": 693.515,
+ "eval_steps_per_second": 0.684,
+ "step": 2950
+ },
+ {
+ "epoch": 1.3911430558775795,
+ "grad_norm": 0.05361103266477585,
+ "learning_rate": 8.028048435688333e-05,
+ "loss": 0.1311,
+ "step": 3000
+ },
+ {
+ "epoch": 1.3911430558775795,
+ "eval_loss": 0.16210626991928834,
+ "eval_runtime": 59.5858,
+ "eval_samples_per_second": 697.919,
+ "eval_steps_per_second": 0.688,
+ "step": 3000
+ },
+ {
+ "epoch": 1.414328773475539,
+ "grad_norm": 0.04593402519822121,
+ "learning_rate": 7.964539645273204e-05,
+ "loss": 0.1304,
+ "step": 3050
+ },
+ {
+ "epoch": 1.414328773475539,
+ "eval_loss": 0.163067463275087,
+ "eval_runtime": 60.098,
+ "eval_samples_per_second": 691.97,
+ "eval_steps_per_second": 0.682,
+ "step": 3050
+ },
+ {
+ "epoch": 1.4375144910734987,
+ "grad_norm": 0.057480327785015106,
+ "learning_rate": 7.900284547855991e-05,
+ "loss": 0.1307,
+ "step": 3100
+ },
+ {
+ "epoch": 1.4375144910734987,
+ "eval_loss": 0.16243572043734797,
+ "eval_runtime": 59.5674,
+ "eval_samples_per_second": 698.133,
+ "eval_steps_per_second": 0.688,
+ "step": 3100
+ },
+ {
+ "epoch": 1.4607002086714584,
+ "grad_norm": 0.08223798871040344,
+ "learning_rate": 7.835299319313853e-05,
+ "loss": 0.1315,
+ "step": 3150
+ },
+ {
+ "epoch": 1.4607002086714584,
+ "eval_loss": 0.1641944734489707,
+ "eval_runtime": 59.5423,
+ "eval_samples_per_second": 698.428,
+ "eval_steps_per_second": 0.689,
+ "step": 3150
+ },
+ {
+ "epoch": 1.483885926269418,
+ "grad_norm": 0.09742949903011322,
+ "learning_rate": 7.769600319330552e-05,
+ "loss": 0.1303,
+ "step": 3200
+ },
+ {
+ "epoch": 1.483885926269418,
+ "eval_loss": 0.16355698856626613,
+ "eval_runtime": 60.1946,
+ "eval_samples_per_second": 690.859,
+ "eval_steps_per_second": 0.681,
+ "step": 3200
+ },
+ {
+ "epoch": 1.5070716438673777,
+ "grad_norm": 0.06401767581701279,
+ "learning_rate": 7.703204087277988e-05,
+ "loss": 0.1315,
+ "step": 3250
+ },
+ {
+ "epoch": 1.5070716438673777,
+ "eval_loss": 0.16215006705140952,
+ "eval_runtime": 59.7822,
+ "eval_samples_per_second": 695.625,
+ "eval_steps_per_second": 0.686,
+ "step": 3250
+ },
+ {
+ "epoch": 1.5302573614653374,
+ "grad_norm": 0.07916898280382156,
+ "learning_rate": 7.636127338052512e-05,
+ "loss": 0.1315,
+ "step": 3300
+ },
+ {
+ "epoch": 1.5302573614653374,
+ "eval_loss": 0.16288597734760557,
+ "eval_runtime": 59.2757,
+ "eval_samples_per_second": 701.57,
+ "eval_steps_per_second": 0.692,
+ "step": 3300
+ },
+ {
+ "epoch": 1.553443079063297,
+ "grad_norm": 0.06549016386270523,
+ "learning_rate": 7.568386957867033e-05,
+ "loss": 0.1303,
+ "step": 3350
+ },
+ {
+ "epoch": 1.553443079063297,
+ "eval_loss": 0.16416664097655873,
+ "eval_runtime": 59.84,
+ "eval_samples_per_second": 694.953,
+ "eval_steps_per_second": 0.685,
+ "step": 3350
+ },
+ {
+ "epoch": 1.5766287966612567,
+ "grad_norm": 0.0709395632147789,
+ "learning_rate": 7.500000000000001e-05,
+ "loss": 0.1309,
+ "step": 3400
+ },
+ {
+ "epoch": 1.5766287966612567,
+ "eval_loss": 0.16179194486424098,
+ "eval_runtime": 59.8634,
+ "eval_samples_per_second": 694.682,
+ "eval_steps_per_second": 0.685,
+ "step": 3400
+ },
+ {
+ "epoch": 1.5998145142592164,
+ "grad_norm": 0.05671363323926926,
+ "learning_rate": 7.430983680502344e-05,
+ "loss": 0.1307,
+ "step": 3450
+ },
+ {
+ "epoch": 1.5998145142592164,
+ "eval_loss": 0.16309191886303373,
+ "eval_runtime": 59.618,
+ "eval_samples_per_second": 697.541,
+ "eval_steps_per_second": 0.688,
+ "step": 3450
+ },
+ {
+ "epoch": 1.623000231857176,
+ "grad_norm": 0.04889162629842758,
+ "learning_rate": 7.361355373863414e-05,
+ "loss": 0.1314,
+ "step": 3500
+ },
+ {
+ "epoch": 1.623000231857176,
+ "eval_loss": 0.16290782983414598,
+ "eval_runtime": 60.3904,
+ "eval_samples_per_second": 688.619,
+ "eval_steps_per_second": 0.679,
+ "step": 3500
+ },
+ {
+ "epoch": 1.6461859494551356,
+ "grad_norm": 0.0970933735370636,
+ "learning_rate": 7.291132608637052e-05,
+ "loss": 0.1314,
+ "step": 3550
+ },
+ {
+ "epoch": 1.6461859494551356,
+ "eval_loss": 0.16278222993823557,
+ "eval_runtime": 59.8666,
+ "eval_samples_per_second": 694.644,
+ "eval_steps_per_second": 0.685,
+ "step": 3550
+ },
+ {
+ "epoch": 1.6693716670530954,
+ "grad_norm": 0.056557025760412216,
+ "learning_rate": 7.220333063028872e-05,
+ "loss": 0.1312,
+ "step": 3600
+ },
+ {
+ "epoch": 1.6693716670530954,
+ "eval_loss": 0.16313205291311117,
+ "eval_runtime": 60.0092,
+ "eval_samples_per_second": 692.993,
+ "eval_steps_per_second": 0.683,
+ "step": 3600
+ },
+ {
+ "epoch": 1.6925573846510549,
+ "grad_norm": 0.04870522394776344,
+ "learning_rate": 7.148974560445859e-05,
+ "loss": 0.1299,
+ "step": 3650
+ },
+ {
+ "epoch": 1.6925573846510549,
+ "eval_loss": 0.1617941082289122,
+ "eval_runtime": 60.1721,
+ "eval_samples_per_second": 691.117,
+ "eval_steps_per_second": 0.681,
+ "step": 3650
+ },
+ {
+ "epoch": 1.7157431022490146,
+ "grad_norm": 0.0681833028793335,
+ "learning_rate": 7.077075065009433e-05,
+ "loss": 0.1304,
+ "step": 3700
+ },
+ {
+ "epoch": 1.7157431022490146,
+ "eval_loss": 0.16243406602519425,
+ "eval_runtime": 59.3626,
+ "eval_samples_per_second": 700.542,
+ "eval_steps_per_second": 0.691,
+ "step": 3700
+ },
+ {
+ "epoch": 1.7389288198469743,
+ "grad_norm": 0.06506156921386719,
+ "learning_rate": 7.004652677033068e-05,
+ "loss": 0.1299,
+ "step": 3750
+ },
+ {
+ "epoch": 1.7389288198469743,
+ "eval_loss": 0.16324780134312317,
+ "eval_runtime": 59.6022,
+ "eval_samples_per_second": 697.726,
+ "eval_steps_per_second": 0.688,
+ "step": 3750
+ },
+ {
+ "epoch": 1.7621145374449338,
+ "grad_norm": 0.06188170611858368,
+ "learning_rate": 6.931725628465643e-05,
+ "loss": 0.1309,
+ "step": 3800
+ },
+ {
+ "epoch": 1.7621145374449338,
+ "eval_loss": 0.1623115342294882,
+ "eval_runtime": 59.7694,
+ "eval_samples_per_second": 695.774,
+ "eval_steps_per_second": 0.686,
+ "step": 3800
+ },
+ {
+ "epoch": 1.7853002550428936,
+ "grad_norm": 0.05675831064581871,
+ "learning_rate": 6.858312278301637e-05,
+ "loss": 0.1303,
+ "step": 3850
+ },
+ {
+ "epoch": 1.7853002550428936,
+ "eval_loss": 0.1630547638293529,
+ "eval_runtime": 59.779,
+ "eval_samples_per_second": 695.662,
+ "eval_steps_per_second": 0.686,
+ "step": 3850
+ },
+ {
+ "epoch": 1.8084859726408533,
+ "grad_norm": 0.04727062210440636,
+ "learning_rate": 6.784431107959359e-05,
+ "loss": 0.1312,
+ "step": 3900
+ },
+ {
+ "epoch": 1.8084859726408533,
+ "eval_loss": 0.1616409071893626,
+ "eval_runtime": 59.6005,
+ "eval_samples_per_second": 697.746,
1178
+ "eval_steps_per_second": 0.688,
1179
+ "step": 3900
1180
+ },
1181
+ {
1182
+ "epoch": 1.8316716902388128,
1183
+ "grad_norm": 0.06378892064094543,
1184
+ "learning_rate": 6.710100716628344e-05,
1185
+ "loss": 0.1303,
1186
+ "step": 3950
1187
+ },
1188
+ {
1189
+ "epoch": 1.8316716902388128,
1190
+ "eval_loss": 0.1622395658739077,
1191
+ "eval_runtime": 60.1499,
1192
+ "eval_samples_per_second": 691.373,
1193
+ "eval_steps_per_second": 0.682,
1194
+ "step": 3950
1195
+ },
1196
+ {
1197
+ "epoch": 1.8548574078367726,
1198
+ "grad_norm": 0.05470576509833336,
1199
+ "learning_rate": 6.635339816587109e-05,
1200
+ "loss": 0.1308,
1201
+ "step": 4000
1202
+ },
1203
+ {
1204
+ "epoch": 1.8548574078367726,
1205
+ "eval_loss": 0.16317236170181762,
1206
+ "eval_runtime": 60.014,
1207
+ "eval_samples_per_second": 692.939,
1208
+ "eval_steps_per_second": 0.683,
1209
+ "step": 4000
1210
+ },
1211
+ {
1212
+ "epoch": 1.8780431254347323,
1213
+ "grad_norm": 0.053886763751506805,
1214
+ "learning_rate": 6.560167228492436e-05,
1215
+ "loss": 0.1297,
1216
+ "step": 4050
1217
+ },
1218
+ {
1219
+ "epoch": 1.8780431254347323,
1220
+ "eval_loss": 0.16198886262197804,
1221
+ "eval_runtime": 60.8262,
1222
+ "eval_samples_per_second": 683.685,
1223
+ "eval_steps_per_second": 0.674,
1224
+ "step": 4050
1225
+ },
1226
+ {
1227
+ "epoch": 1.9012288430326918,
1228
+ "grad_norm": 0.054583676159381866,
1229
+ "learning_rate": 6.484601876641375e-05,
1230
+ "loss": 0.1301,
1231
+ "step": 4100
1232
+ },
1233
+ {
1234
+ "epoch": 1.9012288430326918,
1235
+ "eval_loss": 0.1616550050294764,
1236
+ "eval_runtime": 59.7779,
1237
+ "eval_samples_per_second": 695.675,
1238
+ "eval_steps_per_second": 0.686,
1239
+ "step": 4100
1240
+ },
1241
+ {
1242
+ "epoch": 1.9244145606306515,
1243
+ "grad_norm": 0.071171335875988,
1244
+ "learning_rate": 6.408662784207149e-05,
1245
+ "loss": 0.131,
1246
+ "step": 4150
1247
+ },
1248
+ {
1249
+ "epoch": 1.9244145606306515,
1250
+ "eval_loss": 0.15968682813566223,
1251
+ "eval_runtime": 60.227,
1252
+ "eval_samples_per_second": 690.487,
1253
+ "eval_steps_per_second": 0.681,
1254
+ "step": 4150
1255
+ },
1256
+ {
1257
+ "epoch": 1.9476002782286113,
1258
+ "grad_norm": 0.05775531381368637,
1259
+ "learning_rate": 6.332369068450174e-05,
1260
+ "loss": 0.1296,
1261
+ "step": 4200
1262
+ },
1263
+ {
1264
+ "epoch": 1.9476002782286113,
1265
+ "eval_loss": 0.16262199212265846,
1266
+ "eval_runtime": 60.3405,
1267
+ "eval_samples_per_second": 689.189,
1268
+ "eval_steps_per_second": 0.679,
1269
+ "step": 4200
1270
+ },
1271
+ {
1272
+ "epoch": 1.9707859958265708,
1273
+ "grad_norm": 0.06425776332616806,
1274
+ "learning_rate": 6.255739935905396e-05,
1275
+ "loss": 0.1299,
1276
+ "step": 4250
1277
+ },
1278
+ {
1279
+ "epoch": 1.9707859958265708,
1280
+ "eval_loss": 0.16324524366491053,
1281
+ "eval_runtime": 61.417,
1282
+ "eval_samples_per_second": 677.109,
1283
+ "eval_steps_per_second": 0.668,
1284
+ "step": 4250
1285
+ },
1286
+ {
1287
+ "epoch": 1.9939717134245305,
1288
+ "grad_norm": 0.045762140303850174,
1289
+ "learning_rate": 6.178794677547137e-05,
1290
+ "loss": 0.1299,
1291
+ "step": 4300
1292
+ },
1293
+ {
1294
+ "epoch": 1.9939717134245305,
1295
+ "eval_loss": 0.16053301797614244,
1296
+ "eval_runtime": 61.0801,
1297
+ "eval_samples_per_second": 680.844,
1298
+ "eval_steps_per_second": 0.671,
1299
+ "step": 4300
1300
+ },
1301
+ {
1302
+ "epoch": 2.0171574310224902,
1303
+ "grad_norm": 0.07060451060533524,
1304
+ "learning_rate": 6.1015526639327035e-05,
1305
+ "loss": 0.1296,
1306
+ "step": 4350
1307
+ },
1308
+ {
1309
+ "epoch": 2.0171574310224902,
1310
+ "eval_loss": 0.1620254674138633,
1311
+ "eval_runtime": 61.0829,
1312
+ "eval_samples_per_second": 680.812,
1313
+ "eval_steps_per_second": 0.671,
1314
+ "step": 4350
1315
+ },
1316
+ {
1317
+ "epoch": 2.0403431486204497,
1318
+ "grad_norm": 0.059919316321611404,
1319
+ "learning_rate": 6.024033340325954e-05,
1320
+ "loss": 0.1302,
1321
+ "step": 4400
1322
+ },
1323
+ {
1324
+ "epoch": 2.0403431486204497,
1325
+ "eval_loss": 0.16284223807997533,
1326
+ "eval_runtime": 61.5789,
1327
+ "eval_samples_per_second": 675.328,
1328
+ "eval_steps_per_second": 0.666,
1329
+ "step": 4400
1330
+ },
1331
+ {
1332
+ "epoch": 2.0635288662184093,
1333
+ "grad_norm": 0.07983385026454926,
1334
+ "learning_rate": 5.946256221802051e-05,
1335
+ "loss": 0.13,
1336
+ "step": 4450
1337
+ },
1338
+ {
1339
+ "epoch": 2.0635288662184093,
1340
+ "eval_loss": 0.16209282393788932,
1341
+ "eval_runtime": 61.6584,
1342
+ "eval_samples_per_second": 674.458,
1343
+ "eval_steps_per_second": 0.665,
1344
+ "step": 4450
1345
+ },
1346
+ {
1347
+ "epoch": 2.086714583816369,
1348
+ "grad_norm": 0.07582173496484756,
1349
+ "learning_rate": 5.868240888334653e-05,
1350
+ "loss": 0.1296,
1351
+ "step": 4500
1352
+ },
1353
+ {
1354
+ "epoch": 2.086714583816369,
1355
+ "eval_loss": 0.16158196377974565,
1356
+ "eval_runtime": 61.1826,
1357
+ "eval_samples_per_second": 679.703,
1358
+ "eval_steps_per_second": 0.67,
1359
+ "step": 4500
1360
+ },
1361
+ {
1362
+ "epoch": 2.1099003014143287,
1363
+ "grad_norm": 0.06049995869398117,
1364
+ "learning_rate": 5.79000697986675e-05,
1365
+ "loss": 0.1298,
1366
+ "step": 4550
1367
+ },
1368
+ {
1369
+ "epoch": 2.1099003014143287,
1370
+ "eval_loss": 0.16130609279963956,
1371
+ "eval_runtime": 61.0153,
1372
+ "eval_samples_per_second": 681.567,
1373
+ "eval_steps_per_second": 0.672,
1374
+ "step": 4550
1375
+ },
1376
+ {
1377
+ "epoch": 2.1330860190122882,
1378
+ "grad_norm": 0.0440148264169693,
1379
+ "learning_rate": 5.7115741913664264e-05,
1380
+ "loss": 0.1299,
1381
+ "step": 4600
1382
+ },
1383
+ {
1384
+ "epoch": 2.1330860190122882,
1385
+ "eval_loss": 0.16027799763638953,
1386
+ "eval_runtime": 61.1993,
1387
+ "eval_samples_per_second": 679.517,
1388
+ "eval_steps_per_second": 0.67,
1389
+ "step": 4600
1390
+ },
1391
+ {
1392
+ "epoch": 2.156271736610248,
1393
+ "grad_norm": 0.05254065990447998,
1394
+ "learning_rate": 5.6329622678687463e-05,
1395
+ "loss": 0.1299,
1396
+ "step": 4650
1397
+ },
1398
+ {
1399
+ "epoch": 2.156271736610248,
1400
+ "eval_loss": 0.16206274484291652,
1401
+ "eval_runtime": 61.4415,
1402
+ "eval_samples_per_second": 676.839,
1403
+ "eval_steps_per_second": 0.667,
1404
+ "step": 4650
1405
+ },
1406
+ {
1407
+ "epoch": 2.1794574542082077,
1408
+ "grad_norm": 0.06294432282447815,
1409
+ "learning_rate": 5.5541909995050554e-05,
1410
+ "loss": 0.1306,
1411
+ "step": 4700
1412
+ },
1413
+ {
1414
+ "epoch": 2.1794574542082077,
1415
+ "eval_loss": 0.16140170723024802,
1416
+ "eval_runtime": 60.8861,
1417
+ "eval_samples_per_second": 683.013,
1418
+ "eval_steps_per_second": 0.673,
1419
+ "step": 4700
1420
+ },
1421
+ {
1422
+ "epoch": 2.202643171806167,
1423
+ "grad_norm": 0.06710942089557648,
1424
+ "learning_rate": 5.475280216520913e-05,
1425
+ "loss": 0.1303,
1426
+ "step": 4750
1427
+ },
1428
+ {
1429
+ "epoch": 2.202643171806167,
1430
+ "eval_loss": 0.16245448075670843,
1431
+ "eval_runtime": 61.2839,
1432
+ "eval_samples_per_second": 678.58,
1433
+ "eval_steps_per_second": 0.669,
1434
+ "step": 4750
1435
+ },
1436
+ {
1437
+ "epoch": 2.225828889404127,
1438
+ "grad_norm": 0.05298132076859474,
1439
+ "learning_rate": 5.396249784283942e-05,
1440
+ "loss": 0.13,
1441
+ "step": 4800
1442
+ },
1443
+ {
1444
+ "epoch": 2.225828889404127,
1445
+ "eval_loss": 0.1623738898660767,
1446
+ "eval_runtime": 61.1531,
1447
+ "eval_samples_per_second": 680.031,
1448
+ "eval_steps_per_second": 0.67,
1449
+ "step": 4800
1450
+ },
1451
+ {
1452
+ "epoch": 2.2490146070020867,
1453
+ "grad_norm": 0.04066763445734978,
1454
+ "learning_rate": 5.317119598282823e-05,
1455
+ "loss": 0.1295,
1456
+ "step": 4850
1457
+ },
1458
+ {
1459
+ "epoch": 2.2490146070020867,
1460
+ "eval_loss": 0.1627438727811327,
1461
+ "eval_runtime": 61.0414,
1462
+ "eval_samples_per_second": 681.275,
1463
+ "eval_steps_per_second": 0.672,
1464
+ "step": 4850
1465
+ },
1466
+ {
1467
+ "epoch": 2.272200324600046,
1468
+ "grad_norm": 0.061821240931749344,
1469
+ "learning_rate": 5.2379095791187124e-05,
1470
+ "loss": 0.1299,
1471
+ "step": 4900
1472
+ },
1473
+ {
1474
+ "epoch": 2.272200324600046,
1475
+ "eval_loss": 0.16086717177928397,
1476
+ "eval_runtime": 60.7945,
1477
+ "eval_samples_per_second": 684.042,
1478
+ "eval_steps_per_second": 0.674,
1479
+ "step": 4900
1480
+ },
1481
+ {
1482
+ "epoch": 2.295386042198006,
1483
+ "grad_norm": 0.08038394153118134,
1484
+ "learning_rate": 5.158639667490339e-05,
1485
+ "loss": 0.13,
1486
+ "step": 4950
1487
+ },
1488
+ {
1489
+ "epoch": 2.295386042198006,
1490
+ "eval_loss": 0.16221664317086187,
1491
+ "eval_runtime": 61.6742,
1492
+ "eval_samples_per_second": 674.285,
1493
+ "eval_steps_per_second": 0.665,
1494
+ "step": 4950
1495
+ },
1496
+ {
1497
+ "epoch": 2.3185717597959656,
1498
+ "grad_norm": 0.0556926503777504,
1499
+ "learning_rate": 5.0793298191740404e-05,
1500
+ "loss": 0.1311,
1501
+ "step": 5000
1502
+ },
1503
+ {
1504
+ "epoch": 2.3185717597959656,
1505
+ "eval_loss": 0.16015339844546791,
1506
+ "eval_runtime": 61.3657,
1507
+ "eval_samples_per_second": 677.675,
1508
+ "eval_steps_per_second": 0.668,
1509
+ "step": 5000
1510
+ },
1511
+ {
1512
+ "epoch": 2.3417574773939256,
1513
+ "grad_norm": 0.06645477563142776,
1514
+ "learning_rate": 5e-05,
1515
+ "loss": 0.1284,
1516
+ "step": 5050
1517
+ },
1518
+ {
1519
+ "epoch": 2.3417574773939256,
1520
+ "eval_loss": 0.16160674186313023,
1521
+ "eval_runtime": 61.4737,
1522
+ "eval_samples_per_second": 676.484,
1523
+ "eval_steps_per_second": 0.667,
1524
+ "step": 5050
1525
+ },
1526
+ {
1527
+ "epoch": 2.364943194991885,
1528
+ "grad_norm": 0.05365500971674919,
1529
+ "learning_rate": 4.92067018082596e-05,
1530
+ "loss": 0.13,
1531
+ "step": 5100
1532
+ },
1533
+ {
1534
+ "epoch": 2.364943194991885,
1535
+ "eval_loss": 0.16016484459556096,
1536
+ "eval_runtime": 61.4058,
1537
+ "eval_samples_per_second": 677.232,
1538
+ "eval_steps_per_second": 0.668,
1539
+ "step": 5100
1540
+ },
1541
+ {
1542
+ "epoch": 2.3881289125898446,
1543
+ "grad_norm": 0.0499204620718956,
1544
+ "learning_rate": 4.841360332509663e-05,
1545
+ "loss": 0.129,
1546
+ "step": 5150
1547
+ },
1548
+ {
1549
+ "epoch": 2.3881289125898446,
1550
+ "eval_loss": 0.16054727464378063,
1551
+ "eval_runtime": 61.1539,
1552
+ "eval_samples_per_second": 680.023,
1553
+ "eval_steps_per_second": 0.67,
1554
+ "step": 5150
1555
+ },
1556
+ {
1557
+ "epoch": 2.411314630187804,
1558
+ "grad_norm": 0.07284457236528397,
1559
+ "learning_rate": 4.762090420881289e-05,
1560
+ "loss": 0.129,
1561
+ "step": 5200
1562
+ },
1563
+ {
1564
+ "epoch": 2.411314630187804,
1565
+ "eval_loss": 0.16057287778830004,
1566
+ "eval_runtime": 60.5785,
1567
+ "eval_samples_per_second": 686.481,
1568
+ "eval_steps_per_second": 0.677,
1569
+ "step": 5200
1570
+ },
1571
+ {
1572
+ "epoch": 2.434500347785764,
1573
+ "grad_norm": 0.06511891633272171,
1574
+ "learning_rate": 4.6828804017171776e-05,
1575
+ "loss": 0.1297,
1576
+ "step": 5250
1577
+ },
1578
+ {
1579
+ "epoch": 2.434500347785764,
1580
+ "eval_loss": 0.16202011190836896,
1581
+ "eval_runtime": 61.4053,
1582
+ "eval_samples_per_second": 677.238,
1583
+ "eval_steps_per_second": 0.668,
1584
+ "step": 5250
1585
+ },
1586
+ {
1587
+ "epoch": 2.4576860653837236,
1588
+ "grad_norm": 0.05936937406659126,
1589
+ "learning_rate": 4.603750215716057e-05,
1590
+ "loss": 0.1293,
1591
+ "step": 5300
1592
+ },
1593
+ {
1594
+ "epoch": 2.4576860653837236,
1595
+ "eval_loss": 0.16067086041480225,
1596
+ "eval_runtime": 60.4469,
1597
+ "eval_samples_per_second": 687.976,
1598
+ "eval_steps_per_second": 0.678,
1599
+ "step": 5300
1600
+ },
1601
+ {
1602
+ "epoch": 2.480871782981683,
1603
+ "grad_norm": 0.039836496114730835,
1604
+ "learning_rate": 4.5247197834790876e-05,
1605
+ "loss": 0.1288,
1606
+ "step": 5350
1607
+ },
1608
+ {
1609
+ "epoch": 2.480871782981683,
1610
+ "eval_loss": 0.1614640227625451,
1611
+ "eval_runtime": 60.9513,
1612
+ "eval_samples_per_second": 682.283,
1613
+ "eval_steps_per_second": 0.673,
1614
+ "step": 5350
1615
+ },
1616
+ {
1617
+ "epoch": 2.504057500579643,
1618
+ "grad_norm": 0.04305760934948921,
1619
+ "learning_rate": 4.445809000494946e-05,
1620
+ "loss": 0.1294,
1621
+ "step": 5400
1622
+ },
1623
+ {
1624
+ "epoch": 2.504057500579643,
1625
+ "eval_loss": 0.16139181990447046,
1626
+ "eval_runtime": 60.6766,
1627
+ "eval_samples_per_second": 685.371,
1628
+ "eval_steps_per_second": 0.676,
1629
+ "step": 5400
1630
+ },
1631
+ {
1632
+ "epoch": 2.5272432181776026,
1633
+ "grad_norm": 0.06780368089675903,
1634
+ "learning_rate": 4.3670377321312535e-05,
1635
+ "loss": 0.1285,
1636
+ "step": 5450
1637
+ },
1638
+ {
1639
+ "epoch": 2.5272432181776026,
1640
+ "eval_loss": 0.1619736397134425,
1641
+ "eval_runtime": 60.7281,
1642
+ "eval_samples_per_second": 684.79,
1643
+ "eval_steps_per_second": 0.675,
1644
+ "step": 5450
1645
+ },
1646
+ {
1647
+ "epoch": 2.550428935775562,
1648
+ "grad_norm": 0.052273835986852646,
1649
+ "learning_rate": 4.288425808633575e-05,
1650
+ "loss": 0.1303,
1651
+ "step": 5500
1652
+ },
1653
+ {
1654
+ "epoch": 2.550428935775562,
1655
+ "eval_loss": 0.16178818674979198,
1656
+ "eval_runtime": 60.8875,
1657
+ "eval_samples_per_second": 682.997,
1658
+ "eval_steps_per_second": 0.673,
1659
+ "step": 5500
1660
+ },
1661
+ {
1662
+ "epoch": 2.573614653373522,
1663
+ "grad_norm": 0.045574627816677094,
1664
+ "learning_rate": 4.20999302013325e-05,
1665
+ "loss": 0.1291,
1666
+ "step": 5550
1667
+ },
1668
+ {
1669
+ "epoch": 2.573614653373522,
1670
+ "eval_loss": 0.16034006952877458,
1671
+ "eval_runtime": 60.7378,
1672
+ "eval_samples_per_second": 684.681,
1673
+ "eval_steps_per_second": 0.675,
1674
+ "step": 5550
1675
+ },
1676
+ {
1677
+ "epoch": 2.5968003709714815,
1678
+ "grad_norm": 0.044092051684856415,
1679
+ "learning_rate": 4.131759111665349e-05,
1680
+ "loss": 0.1298,
1681
+ "step": 5600
1682
+ },
1683
+ {
1684
+ "epoch": 2.5968003709714815,
1685
+ "eval_loss": 0.16090484909780667,
1686
+ "eval_runtime": 60.4675,
1687
+ "eval_samples_per_second": 687.741,
1688
+ "eval_steps_per_second": 0.678,
1689
+ "step": 5600
1690
+ },
1691
+ {
1692
+ "epoch": 2.6199860885694415,
1693
+ "grad_norm": 0.05473971739411354,
1694
+ "learning_rate": 4.0537437781979506e-05,
1695
+ "loss": 0.1288,
1696
+ "step": 5650
1697
+ },
1698
+ {
1699
+ "epoch": 2.6199860885694415,
1700
+ "eval_loss": 0.1604315377337276,
1701
+ "eval_runtime": 62.8239,
1702
+ "eval_samples_per_second": 661.946,
1703
+ "eval_steps_per_second": 0.653,
1704
+ "step": 5650
1705
+ },
1706
+ {
1707
+ "epoch": 2.643171806167401,
1708
+ "grad_norm": 0.07100555300712585,
1709
+ "learning_rate": 3.9759666596740476e-05,
1710
+ "loss": 0.129,
1711
+ "step": 5700
1712
+ },
1713
+ {
1714
+ "epoch": 2.643171806167401,
1715
+ "eval_loss": 0.15997494100305837,
1716
+ "eval_runtime": 61.3008,
1717
+ "eval_samples_per_second": 678.392,
1718
+ "eval_steps_per_second": 0.669,
1719
+ "step": 5700
1720
+ },
1721
+ {
1722
+ "epoch": 2.6663575237653605,
1723
+ "grad_norm": 0.04020215570926666,
1724
+ "learning_rate": 3.898447336067297e-05,
1725
+ "loss": 0.1291,
1726
+ "step": 5750
1727
+ },
1728
+ {
1729
+ "epoch": 2.6663575237653605,
1730
+ "eval_loss": 0.1596748490832133,
1731
+ "eval_runtime": 60.6148,
1732
+ "eval_samples_per_second": 686.07,
1733
+ "eval_steps_per_second": 0.676,
1734
+ "step": 5750
1735
+ },
1736
+ {
1737
+ "epoch": 2.68954324136332,
1738
+ "grad_norm": 0.05526584014296532,
1739
+ "learning_rate": 3.821205322452863e-05,
1740
+ "loss": 0.1291,
1741
+ "step": 5800
1742
+ },
1743
+ {
1744
+ "epoch": 2.68954324136332,
1745
+ "eval_loss": 0.16091962633426782,
1746
+ "eval_runtime": 60.1717,
1747
+ "eval_samples_per_second": 691.122,
1748
+ "eval_steps_per_second": 0.681,
1749
+ "step": 5800
1750
+ },
1751
+ {
1752
+ "epoch": 2.71272895896128,
1753
+ "grad_norm": 0.052167922258377075,
1754
+ "learning_rate": 3.744260064094604e-05,
1755
+ "loss": 0.129,
1756
+ "step": 5850
1757
+ },
1758
+ {
1759
+ "epoch": 2.71272895896128,
1760
+ "eval_loss": 0.16112806362615253,
1761
+ "eval_runtime": 60.1273,
1762
+ "eval_samples_per_second": 691.633,
1763
+ "eval_steps_per_second": 0.682,
1764
+ "step": 5850
1765
+ },
1766
+ {
1767
+ "epoch": 2.7359146765592395,
1768
+ "grad_norm": 0.054320793598890305,
1769
+ "learning_rate": 3.6676309315498256e-05,
1770
+ "loss": 0.13,
1771
+ "step": 5900
1772
+ },
1773
+ {
1774
+ "epoch": 2.7359146765592395,
1775
+ "eval_loss": 0.15996250695505343,
1776
+ "eval_runtime": 60.655,
1777
+ "eval_samples_per_second": 685.616,
1778
+ "eval_steps_per_second": 0.676,
1779
+ "step": 5900
1780
+ },
1781
+ {
1782
+ "epoch": 2.7591003941571994,
1783
+ "grad_norm": 0.05470626428723335,
1784
+ "learning_rate": 3.591337215792852e-05,
1785
+ "loss": 0.1296,
1786
+ "step": 5950
1787
+ },
1788
+ {
1789
+ "epoch": 2.7591003941571994,
1790
+ "eval_loss": 0.16025288890609335,
1791
+ "eval_runtime": 60.826,
1792
+ "eval_samples_per_second": 683.688,
1793
+ "eval_steps_per_second": 0.674,
1794
+ "step": 5950
1795
+ },
1796
+ {
1797
+ "epoch": 2.782286111755159,
1798
+ "grad_norm": 0.04805810749530792,
1799
+ "learning_rate": 3.515398123358627e-05,
1800
+ "loss": 0.1294,
1801
+ "step": 6000
1802
+ },
1803
+ {
1804
+ "epoch": 2.782286111755159,
1805
+ "eval_loss": 0.15918263724182835,
1806
+ "eval_runtime": 60.2321,
1807
+ "eval_samples_per_second": 690.429,
1808
+ "eval_steps_per_second": 0.681,
1809
+ "step": 6000
1810
+ },
1811
+ {
1812
+ "epoch": 2.8054718293531185,
1813
+ "grad_norm": 0.04185302183032036,
1814
+ "learning_rate": 3.439832771507565e-05,
1815
+ "loss": 0.1283,
1816
+ "step": 6050
1817
+ },
1818
+ {
1819
+ "epoch": 2.8054718293531185,
1820
+ "eval_loss": 0.16179385240233157,
1821
+ "eval_runtime": 60.9176,
1822
+ "eval_samples_per_second": 682.66,
1823
+ "eval_steps_per_second": 0.673,
1824
+ "step": 6050
1825
+ },
1826
+ {
1827
+ "epoch": 2.828657546951078,
1828
+ "grad_norm": 0.04609336704015732,
1829
+ "learning_rate": 3.364660183412892e-05,
1830
+ "loss": 0.1292,
1831
+ "step": 6100
1832
+ },
1833
+ {
1834
+ "epoch": 2.828657546951078,
1835
+ "eval_loss": 0.1611929898635588,
1836
+ "eval_runtime": 60.5916,
1837
+ "eval_samples_per_second": 686.333,
1838
+ "eval_steps_per_second": 0.677,
1839
+ "step": 6100
1840
+ },
1841
+ {
1842
+ "epoch": 2.851843264549038,
1843
+ "grad_norm": 0.05404876172542572,
1844
+ "learning_rate": 3.289899283371657e-05,
1845
+ "loss": 0.128,
1846
+ "step": 6150
1847
+ },
1848
+ {
1849
+ "epoch": 2.851843264549038,
1850
+ "eval_loss": 0.16039360794951976,
1851
+ "eval_runtime": 60.5961,
1852
+ "eval_samples_per_second": 686.282,
1853
+ "eval_steps_per_second": 0.677,
1854
+ "step": 6150
1855
+ },
1856
+ {
1857
+ "epoch": 2.8750289821469974,
1858
+ "grad_norm": 0.06787659227848053,
1859
+ "learning_rate": 3.215568892040641e-05,
1860
+ "loss": 0.1288,
1861
+ "step": 6200
1862
+ },
1863
+ {
1864
+ "epoch": 2.8750289821469974,
1865
+ "eval_loss": 0.16113480515361805,
1866
+ "eval_runtime": 60.2775,
1867
+ "eval_samples_per_second": 689.909,
1868
+ "eval_steps_per_second": 0.68,
1869
+ "step": 6200
1870
+ },
1871
+ {
1872
+ "epoch": 2.8982146997449574,
1873
+ "grad_norm": 0.06937435269355774,
1874
+ "learning_rate": 3.141687721698363e-05,
1875
+ "loss": 0.1283,
1876
+ "step": 6250
1877
+ },
1878
+ {
1879
+ "epoch": 2.8982146997449574,
1880
+ "eval_loss": 0.16087572214972407,
1881
+ "eval_runtime": 60.6789,
1882
+ "eval_samples_per_second": 685.345,
1883
+ "eval_steps_per_second": 0.676,
1884
+ "step": 6250
1885
+ },
1886
+ {
1887
+ "epoch": 2.921400417342917,
1888
+ "grad_norm": 0.08074232190847397,
1889
+ "learning_rate": 3.0682743715343564e-05,
1890
+ "loss": 0.1292,
1891
+ "step": 6300
1892
+ },
1893
+ {
1894
+ "epoch": 2.921400417342917,
1895
+ "eval_loss": 0.16049740787316144,
1896
+ "eval_runtime": 60.3194,
1897
+ "eval_samples_per_second": 689.43,
1898
+ "eval_steps_per_second": 0.68,
1899
+ "step": 6300
1900
+ },
1901
+ {
1902
+ "epoch": 2.9445861349408764,
1903
+ "grad_norm": 0.03976515680551529,
1904
+ "learning_rate": 2.9953473229669328e-05,
1905
+ "loss": 0.1302,
1906
+ "step": 6350
1907
+ },
1908
+ {
1909
+ "epoch": 2.9445861349408764,
1910
+ "eval_loss": 0.16023700059761273,
1911
+ "eval_runtime": 60.8537,
1912
+ "eval_samples_per_second": 683.377,
1913
+ "eval_steps_per_second": 0.674,
1914
+ "step": 6350
1915
+ },
1916
+ {
1917
+ "epoch": 2.967771852538836,
1918
+ "grad_norm": 0.05303976684808731,
1919
+ "learning_rate": 2.9229249349905684e-05,
1920
+ "loss": 0.1285,
1921
+ "step": 6400
1922
+ },
1923
+ {
1924
+ "epoch": 2.967771852538836,
1925
+ "eval_loss": 0.1601465398516622,
1926
+ "eval_runtime": 60.6472,
1927
+ "eval_samples_per_second": 685.703,
1928
+ "eval_steps_per_second": 0.676,
1929
+ "step": 6400
1930
+ },
1931
+ {
1932
+ "epoch": 2.990957570136796,
1933
+ "grad_norm": 0.0519745759665966,
1934
+ "learning_rate": 2.851025439554142e-05,
1935
+ "loss": 0.1286,
1936
+ "step": 6450
1937
+ },
1938
+ {
1939
+ "epoch": 2.990957570136796,
1940
+ "eval_loss": 0.16085429229133483,
1941
+ "eval_runtime": 60.2507,
1942
+ "eval_samples_per_second": 690.216,
1943
+ "eval_steps_per_second": 0.68,
1944
+ "step": 6450
1945
+ },
1946
+ {
1947
+ "epoch": 3.0141432877347554,
1948
+ "grad_norm": 0.050518251955509186,
1949
+ "learning_rate": 2.7796669369711294e-05,
1950
+ "loss": 0.1301,
1951
+ "step": 6500
1952
+ },
1953
+ {
1954
+ "epoch": 3.0141432877347554,
1955
+ "eval_loss": 0.16015394660421692,
1956
+ "eval_runtime": 60.5015,
1957
+ "eval_samples_per_second": 687.355,
1958
+ "eval_steps_per_second": 0.678,
1959
+ "step": 6500
1960
+ },
1961
+ {
1962
+ "epoch": 3.037329005332715,
1963
+ "grad_norm": 0.04253960773348808,
1964
+ "learning_rate": 2.708867391362948e-05,
1965
+ "loss": 0.1296,
1966
+ "step": 6550
1967
+ },
1968
+ {
1969
+ "epoch": 3.037329005332715,
1970
+ "eval_loss": 0.1597283595131218,
1971
+ "eval_runtime": 60.13,
1972
+ "eval_samples_per_second": 691.601,
1973
+ "eval_steps_per_second": 0.682,
1974
+ "step": 6550
1975
+ },
1976
+ {
1977
+ "epoch": 3.060514722930675,
1978
+ "grad_norm": 0.06899340450763702,
1979
+ "learning_rate": 2.638644626136587e-05,
1980
+ "loss": 0.1291,
1981
+ "step": 6600
1982
+ },
1983
+ {
1984
+ "epoch": 3.060514722930675,
1985
+ "eval_loss": 0.1604277250117246,
1986
+ "eval_runtime": 60.4618,
1987
+ "eval_samples_per_second": 687.806,
1988
+ "eval_steps_per_second": 0.678,
1989
+ "step": 6600
1990
+ },
1991
+ {
1992
+ "epoch": 3.0837004405286343,
1993
+ "grad_norm": 0.06556117534637451,
1994
+ "learning_rate": 2.5690163194976575e-05,
1995
+ "loss": 0.1288,
1996
+ "step": 6650
1997
+ },
1998
+ {
1999
+ "epoch": 3.0837004405286343,
2000
+ "eval_loss": 0.15953636330193482,
2001
+ "eval_runtime": 60.2757,
2002
+ "eval_samples_per_second": 689.93,
2003
+ "eval_steps_per_second": 0.68,
2004
+ "step": 6650
2005
+ },
2006
+ {
2007
+ "epoch": 3.106886158126594,
2008
+ "grad_norm": 0.03685734421014786,
2009
+ "learning_rate": 2.500000000000001e-05,
2010
+ "loss": 0.129,
2011
+ "step": 6700
2012
+ },
2013
+ {
2014
+ "epoch": 3.106886158126594,
2015
+ "eval_loss": 0.159308270335797,
2016
+ "eval_runtime": 60.624,
2017
+ "eval_samples_per_second": 685.966,
2018
+ "eval_steps_per_second": 0.676,
2019
+ "step": 6700
2020
+ },
2021
+ {
2022
+ "epoch": 3.130071875724554,
2023
+ "grad_norm": 0.0451020672917366,
2024
+ "learning_rate": 2.4316130421329697e-05,
2025
+ "loss": 0.1286,
2026
+ "step": 6750
2027
+ },
2028
+ {
2029
+ "epoch": 3.130071875724554,
2030
+ "eval_loss": 0.15995884031774596,
2031
+ "eval_runtime": 60.3654,
2032
+ "eval_samples_per_second": 688.905,
2033
+ "eval_steps_per_second": 0.679,
2034
+ "step": 6750
2035
+ },
2036
+ {
2037
+ "epoch": 3.1532575933225133,
2038
+ "grad_norm": 0.0495733842253685,
2039
+ "learning_rate": 2.363872661947488e-05,
2040
+ "loss": 0.1293,
2041
+ "step": 6800
2042
+ },
2043
+ {
2044
+ "epoch": 3.1532575933225133,
2045
+ "eval_loss": 0.15987331824692497,
2046
+ "eval_runtime": 60.4636,
2047
+ "eval_samples_per_second": 687.786,
2048
+ "eval_steps_per_second": 0.678,
2049
+ "step": 6800
2050
+ },
2051
+ {
2052
+ "epoch": 3.176443310920473,
2053
+ "grad_norm": 0.05756652355194092,
2054
+ "learning_rate": 2.296795912722014e-05,
2055
+ "loss": 0.1289,
2056
+ "step": 6850
2057
+ },
2058
+ {
2059
+ "epoch": 3.176443310920473,
2060
+ "eval_loss": 0.15986134614331013,
2061
+ "eval_runtime": 61.0063,
2062
+ "eval_samples_per_second": 681.667,
2063
+ "eval_steps_per_second": 0.672,
2064
+ "step": 6850
2065
+ },
2066
+ {
2067
+ "epoch": 3.199629028518433,
2068
+ "grad_norm": 0.0467820018529892,
2069
+ "learning_rate": 2.2303996806694488e-05,
2070
+ "loss": 0.1295,
2071
+ "step": 6900
2072
+ },
2073
+ {
2074
+ "epoch": 3.199629028518433,
2075
+ "eval_loss": 0.16011030076900337,
2076
+ "eval_runtime": 60.1041,
2077
+ "eval_samples_per_second": 691.9,
2078
+ "eval_steps_per_second": 0.682,
2079
+ "step": 6900
2080
+ },
2081
+ {
2082
+ "epoch": 3.2228147461163923,
2083
+ "grad_norm": 0.04179982468485832,
2084
+ "learning_rate": 2.164700680686147e-05,
2085
+ "loss": 0.1287,
2086
+ "step": 6950
2087
+ },
2088
+ {
2089
+ "epoch": 3.2228147461163923,
2090
+ "eval_loss": 0.15917751068552838,
2091
+ "eval_runtime": 60.5321,
2092
+ "eval_samples_per_second": 687.007,
2093
+ "eval_steps_per_second": 0.677,
2094
+ "step": 6950
2095
+ },
2096
+ {
2097
+ "epoch": 3.246000463714352,
2098
+ "grad_norm": 0.053910572081804276,
2099
+ "learning_rate": 2.09971545214401e-05,
2100
+ "loss": 0.1286,
2101
+ "step": 7000
2102
+ },
2103
+ {
2104
+ "epoch": 3.246000463714352,
2105
+ "eval_loss": 0.15998092838627764,
2106
+ "eval_runtime": 60.4067,
2107
+ "eval_samples_per_second": 688.434,
2108
+ "eval_steps_per_second": 0.679,
2109
+ "step": 7000
2110
+ },
2111
+ {
2112
+ "epoch": 3.2691861813123118,
2113
+ "grad_norm": 0.04404950886964798,
2114
+ "learning_rate": 2.0354603547267985e-05,
2115
+ "loss": 0.1283,
2116
+ "step": 7050
2117
+ },
2118
+ {
2119
+ "epoch": 3.2691861813123118,
2120
+ "eval_loss": 0.1597617331551387,
2121
+ "eval_runtime": 60.4218,
2122
+ "eval_samples_per_second": 688.262,
2123
+ "eval_steps_per_second": 0.679,
2124
+ "step": 7050
2125
+ },
2126
+ {
2127
+ "epoch": 3.2923718989102713,
2128
+ "grad_norm": 0.04763752967119217,
2129
+ "learning_rate": 1.9719515643116674e-05,
2130
+ "loss": 0.1288,
2131
+ "step": 7100
2132
+ },
2133
+ {
2134
+ "epoch": 3.2923718989102713,
2135
+ "eval_loss": 0.16116006530852447,
2136
+ "eval_runtime": 60.2132,
2137
+ "eval_samples_per_second": 690.646,
2138
+ "eval_steps_per_second": 0.681,
2139
+ "step": 7100
2140
+ },
2141
+ {
2142
+ "epoch": 3.3155576165082308,
2143
+ "grad_norm": 0.049567196518182755,
2144
+ "learning_rate": 1.9092050688969738e-05,
2145
+ "loss": 0.1298,
2146
+ "step": 7150
2147
+ },
2148
+ {
2149
+ "epoch": 3.3155576165082308,
2150
+ "eval_loss": 0.15965543804361845,
2151
+ "eval_runtime": 60.3928,
2152
+ "eval_samples_per_second": 688.592,
2153
+ "eval_steps_per_second": 0.679,
2154
+ "step": 7150
2155
+ },
2156
+ {
2157
+ "epoch": 3.3387433341061907,
2158
+ "grad_norm": 0.05488676205277443,
2159
+ "learning_rate": 1.847236664577389e-05,
2160
+ "loss": 0.1284,
2161
+ "step": 7200
2162
+ },
2163
+ {
2164
+ "epoch": 3.3387433341061907,
2165
+ "eval_loss": 0.16050384662882064,
2166
+ "eval_runtime": 60.121,
2167
+ "eval_samples_per_second": 691.705,
2168
+ "eval_steps_per_second": 0.682,
2169
+ "step": 7200
2170
+ },
2171
+ {
2172
+ "epoch": 3.3619290517041502,
2173
+ "grad_norm": 0.04124298691749573,
2174
+ "learning_rate": 1.7860619515673033e-05,
2175
+ "loss": 0.1289,
2176
+ "step": 7250
2177
+ },
2178
+ {
2179
+ "epoch": 3.3619290517041502,
2180
+ "eval_loss": 0.16054145931691394,
2181
+ "eval_runtime": 60.2046,
2182
+ "eval_samples_per_second": 690.745,
2183
+ "eval_steps_per_second": 0.681,
2184
+ "step": 7250
2185
+ },
2186
+ {
2187
+ "epoch": 3.3851147693021097,
2188
+ "grad_norm": 0.04400424286723137,
2189
+ "learning_rate": 1.725696330273575e-05,
2190
+ "loss": 0.1289,
2191
+ "step": 7300
2192
+ },
2193
+ {
2194
+ "epoch": 3.3851147693021097,
2195
+ "eval_loss": 0.15999099129576416,
2196
+ "eval_runtime": 60.4869,
2197
+ "eval_samples_per_second": 687.52,
2198
+ "eval_steps_per_second": 0.678,
2199
+ "step": 7300
2200
+ },
2201
+ {
2202
+ "epoch": 3.4083004869000697,
2203
+ "grad_norm": 0.05488509312272072,
2204
+ "learning_rate": 1.6661549974185424e-05,
2205
+ "loss": 0.1285,
2206
+ "step": 7350
2207
+ },
2208
+ {
2209
+ "epoch": 3.4083004869000697,
2210
+ "eval_loss": 0.16051823730892306,
2211
+ "eval_runtime": 60.1981,
2212
+ "eval_samples_per_second": 690.819,
2213
+ "eval_steps_per_second": 0.681,
2214
+ "step": 7350
2215
+ },
2216
+ {
2217
+ "epoch": 3.431486204498029,
2218
+ "grad_norm": 0.06722457706928253,
2219
+ "learning_rate": 1.60745294221434e-05,
2220
+ "loss": 0.1286,
2221
+ "step": 7400
2222
+ },
2223
+ {
2224
+ "epoch": 3.431486204498029,
2225
+ "eval_loss": 0.1610307768591294,
2226
+ "eval_runtime": 60.7755,
2227
+ "eval_samples_per_second": 684.256,
2228
+ "eval_steps_per_second": 0.675,
2229
+ "step": 7400
2230
+ },
2231
+ {
2232
+ "epoch": 3.4546719220959887,
2233
+ "grad_norm": 0.04814394935965538,
2234
+ "learning_rate": 1.549604942589441e-05,
2235
+ "loss": 0.1278,
2236
+ "step": 7450
2237
+ },
2238
+ {
2239
+ "epoch": 3.4546719220959887,
2240
+ "eval_loss": 0.1598065741965525,
2241
+ "eval_runtime": 59.9968,
2242
+ "eval_samples_per_second": 693.136,
2243
+ "eval_steps_per_second": 0.683,
2244
+ "step": 7450
2245
+ },
2246
+ {
2247
+ "epoch": 3.4778576396939487,
2248
+ "grad_norm": 0.04934167116880417,
2249
+ "learning_rate": 1.4926255614683932e-05,
2250
+ "loss": 0.1274,
2251
+ "step": 7500
2252
+ },
2253
+ {
2254
+ "epoch": 3.4778576396939487,
2255
+ "eval_loss": 0.15982454893723042,
2256
+ "eval_runtime": 60.201,
2257
+ "eval_samples_per_second": 690.786,
2258
+ "eval_steps_per_second": 0.681,
2259
+ "step": 7500
2260
+ },
2261
+ {
2262
+ "epoch": 3.501043357291908,
2263
+ "grad_norm": 0.04529615864157677,
2264
+ "learning_rate": 1.4365291431056871e-05,
2265
+ "loss": 0.1297,
2266
+ "step": 7550
2267
+ },
2268
+ {
2269
+ "epoch": 3.501043357291908,
2270
+ "eval_loss": 0.15986133524024926,
2271
+ "eval_runtime": 59.95,
2272
+ "eval_samples_per_second": 693.678,
2273
+ "eval_steps_per_second": 0.684,
2274
+ "step": 7550
2275
+ },
2276
+ {
2277
+ "epoch": 3.5242290748898677,
2278
+ "grad_norm": 0.0399620421230793,
2279
+ "learning_rate": 1.3813298094746491e-05,
2280
+ "loss": 0.1288,
2281
+ "step": 7600
2282
+ },
2283
+ {
2284
+ "epoch": 3.5242290748898677,
2285
+ "eval_loss": 0.15905609221590689,
2286
+ "eval_runtime": 61.181,
2287
+ "eval_samples_per_second": 679.72,
2288
+ "eval_steps_per_second": 0.67,
2289
+ "step": 7600
2290
+ },
2291
+ {
2292
+ "epoch": 3.5474147924878277,
2293
+ "grad_norm": 0.05973295867443085,
2294
+ "learning_rate": 1.327041456712334e-05,
2295
+ "loss": 0.1281,
2296
+ "step": 7650
2297
+ },
2298
+ {
2299
+ "epoch": 3.5474147924878277,
2300
+ "eval_loss": 0.15981091550942805,
2301
+ "eval_runtime": 60.5605,
2302
+ "eval_samples_per_second": 686.685,
2303
+ "eval_steps_per_second": 0.677,
2304
+ "step": 7650
2305
+ },
2306
+ {
2307
+ "epoch": 3.570600510085787,
2308
+ "grad_norm": 0.04896661266684532,
2309
+ "learning_rate": 1.2736777516212266e-05,
2310
+ "loss": 0.1288,
2311
+ "step": 7700
2312
+ },
2313
+ {
2314
+ "epoch": 3.570600510085787,
2315
+ "eval_loss": 0.1599924400443614,
2316
+ "eval_runtime": 60.486,
2317
+ "eval_samples_per_second": 687.531,
2318
+ "eval_steps_per_second": 0.678,
2319
+ "step": 7700
2320
+ },
2321
+ {
2322
+ "epoch": 3.5937862276837467,
2323
+ "grad_norm": 0.07458525151014328,
2324
+ "learning_rate": 1.2212521282287092e-05,
2325
+ "loss": 0.128,
2326
+ "step": 7750
2327
+ },
2328
+ {
2329
+ "epoch": 3.5937862276837467,
2330
+ "eval_loss": 0.15936126278835275,
2331
+ "eval_runtime": 60.9341,
2332
+ "eval_samples_per_second": 682.475,
2333
+ "eval_steps_per_second": 0.673,
2334
+ "step": 7750
2335
+ },
2336
+ {
2337
+ "epoch": 3.6169719452817066,
2338
+ "grad_norm": 0.04200127348303795,
2339
+ "learning_rate": 1.1697777844051105e-05,
2340
+ "loss": 0.1287,
2341
+ "step": 7800
2342
+ },
2343
+ {
2344
+ "epoch": 3.6169719452817066,
2345
+ "eval_loss": 0.1603394617678833,
2346
+ "eval_runtime": 60.5155,
2347
+ "eval_samples_per_second": 687.195,
2348
+ "eval_steps_per_second": 0.678,
2349
+ "step": 7800
2350
+ },
2351
+ {
2352
+ "epoch": 3.640157662879666,
2353
+ "grad_norm": 0.06712640821933746,
2354
+ "learning_rate": 1.1192676785412154e-05,
2355
+ "loss": 0.1291,
2356
+ "step": 7850
2357
+ },
2358
+ {
2359
+ "epoch": 3.640157662879666,
2360
+ "eval_loss": 0.15920067938345067,
2361
+ "eval_runtime": 60.0225,
2362
+ "eval_samples_per_second": 692.84,
2363
+ "eval_steps_per_second": 0.683,
2364
+ "step": 7850
2365
+ },
2366
+ {
2367
+ "epoch": 3.6633433804776256,
2368
+ "grad_norm": 0.049462996423244476,
2369
+ "learning_rate": 1.0697345262860636e-05,
2370
+ "loss": 0.1287,
2371
+ "step": 7900
2372
+ },
2373
+ {
2374
+ "epoch": 3.6633433804776256,
2375
+ "eval_loss": 0.15964593569874527,
2376
+ "eval_runtime": 60.1965,
2377
+ "eval_samples_per_second": 690.837,
2378
+ "eval_steps_per_second": 0.681,
2379
+ "step": 7900
2380
+ },
2381
+ {
2382
+ "epoch": 3.6865290980755856,
2383
+ "grad_norm": 0.05148932337760925,
2384
+ "learning_rate": 1.021190797345839e-05,
2385
+ "loss": 0.1283,
2386
+ "step": 7950
2387
+ },
2388
+ {
2389
+ "epoch": 3.6865290980755856,
2390
+ "eval_loss": 0.15903354419354673,
2391
+ "eval_runtime": 60.0507,
2392
+ "eval_samples_per_second": 692.515,
2393
+ "eval_steps_per_second": 0.683,
2394
+ "step": 7950
2395
+ },
2396
+ {
2397
+ "epoch": 3.709714815673545,
2398
+ "grad_norm": 0.05164024233818054,
2399
+ "learning_rate": 9.73648712344707e-06,
2400
+ "loss": 0.128,
2401
+ "step": 8000
2402
+ },
2403
+ {
2404
+ "epoch": 3.709714815673545,
2405
+ "eval_loss": 0.15835035051131605,
2406
+ "eval_runtime": 60.5739,
2407
+ "eval_samples_per_second": 686.533,
2408
+ "eval_steps_per_second": 0.677,
2409
+ "step": 8000
2410
+ },
2411
+ {
2412
+ "epoch": 3.7329005332715046,
2413
+ "grad_norm": 0.04926716163754463,
2414
+ "learning_rate": 9.271202397483215e-06,
2415
+ "loss": 0.1276,
2416
+ "step": 8050
2417
+ },
2418
+ {
2419
+ "epoch": 3.7329005332715046,
2420
+ "eval_loss": 0.160225615529793,
2421
+ "eval_runtime": 60.4555,
2422
+ "eval_samples_per_second": 687.878,
2423
+ "eval_steps_per_second": 0.678,
2424
+ "step": 8050
2425
+ },
2426
+ {
2427
+ "epoch": 3.7560862508694646,
2428
+ "grad_norm": 0.04355842247605324,
2429
+ "learning_rate": 8.816170928508365e-06,
2430
+ "loss": 0.1287,
2431
+ "step": 8100
2432
+ },
2433
+ {
2434
+ "epoch": 3.7560862508694646,
2435
+ "eval_loss": 0.1601867779420742,
2436
+ "eval_runtime": 60.7386,
2437
+ "eval_samples_per_second": 684.672,
2438
+ "eval_steps_per_second": 0.675,
2439
+ "step": 8100
2440
+ },
2441
+ {
2442
+ "epoch": 3.779271968467424,
2443
+ "grad_norm": 0.039105553179979324,
2444
+ "learning_rate": 8.371507268261437e-06,
2445
+ "loss": 0.1306,
2446
+ "step": 8150
2447
+ },
2448
+ {
2449
+ "epoch": 3.779271968467424,
2450
+ "eval_loss": 0.15946348937187382,
2451
+ "eval_runtime": 60.9253,
2452
+ "eval_samples_per_second": 682.574,
2453
+ "eval_steps_per_second": 0.673,
2454
+ "step": 8150
2455
+ },
2456
+ {
2457
+ "epoch": 3.8024576860653836,
2458
+ "grad_norm": 0.04452899843454361,
2459
+ "learning_rate": 7.937323358440935e-06,
2460
+ "loss": 0.1286,
2461
+ "step": 8200
2462
+ },
2463
+ {
2464
+ "epoch": 3.8024576860653836,
2465
+ "eval_loss": 0.15871429728364056,
2466
+ "eval_runtime": 60.2776,
2467
+ "eval_samples_per_second": 689.908,
2468
+ "eval_steps_per_second": 0.68,
2469
+ "step": 8200
2470
+ },
2471
+ {
2472
+ "epoch": 3.8256434036633435,
2473
+ "grad_norm": 0.043075498193502426,
2474
+ "learning_rate": 7.513728502524286e-06,
2475
+ "loss": 0.1292,
2476
+ "step": 8250
2477
+ },
2478
+ {
2479
+ "epoch": 3.8256434036633435,
2480
+ "eval_loss": 0.1592580359542711,
2481
+ "eval_runtime": 60.7244,
2482
+ "eval_samples_per_second": 684.832,
2483
+ "eval_steps_per_second": 0.675,
2484
+ "step": 8250
2485
+ },
2486
+ {
2487
+ "epoch": 3.848829121261303,
2488
+ "grad_norm": 0.05848800390958786,
2489
+ "learning_rate": 7.100829338251147e-06,
2490
+ "loss": 0.1275,
2491
+ "step": 8300
2492
+ },
2493
+ {
2494
+ "epoch": 3.848829121261303,
2495
+ "eval_loss": 0.15895083163665807,
2496
+ "eval_runtime": 60.3677,
2497
+ "eval_samples_per_second": 688.878,
2498
+ "eval_steps_per_second": 0.679,
2499
+ "step": 8300
2500
+ },
2501
+ {
2502
+ "epoch": 3.8720148388592626,
2503
+ "grad_norm": 0.04980336129665375,
2504
+ "learning_rate": 6.698729810778065e-06,
2505
+ "loss": 0.1277,
2506
+ "step": 8350
2507
+ },
2508
+ {
2509
+ "epoch": 3.8720148388592626,
2510
+ "eval_loss": 0.16002303550437,
2511
+ "eval_runtime": 60.2742,
2512
+ "eval_samples_per_second": 689.947,
2513
+ "eval_steps_per_second": 0.68,
2514
+ "step": 8350
2515
+ },
2516
+ {
2517
+ "epoch": 3.8952005564572225,
2518
+ "grad_norm": 0.057385146617889404,
2519
+ "learning_rate": 6.3075311465107535e-06,
2520
+ "loss": 0.129,
2521
+ "step": 8400
2522
+ },
2523
+ {
2524
+ "epoch": 3.8952005564572225,
2525
+ "eval_loss": 0.1601535826416112,
2526
+ "eval_runtime": 60.4053,
2527
+ "eval_samples_per_second": 688.45,
2528
+ "eval_steps_per_second": 0.679,
2529
+ "step": 8400
2530
+ },
2531
+ {
2532
+ "epoch": 3.918386274055182,
2533
+ "grad_norm": 0.045788682997226715,
2534
+ "learning_rate": 5.927331827620903e-06,
2535
+ "loss": 0.1286,
2536
+ "step": 8450
2537
+ },
2538
+ {
2539
+ "epoch": 3.918386274055182,
2540
+ "eval_loss": 0.15926720973175468,
2541
+ "eval_runtime": 60.6783,
2542
+ "eval_samples_per_second": 685.352,
2543
+ "eval_steps_per_second": 0.676,
2544
+ "step": 8450
2545
+ },
2546
+ {
2547
+ "epoch": 3.9415719916531415,
2548
+ "grad_norm": 0.045575451105833054,
2549
+ "learning_rate": 5.558227567253832e-06,
2550
+ "loss": 0.1281,
2551
+ "step": 8500
2552
+ },
2553
+ {
2554
+ "epoch": 3.9415719916531415,
2555
+ "eval_loss": 0.16032033338606583,
2556
+ "eval_runtime": 60.4563,
2557
+ "eval_samples_per_second": 687.868,
2558
+ "eval_steps_per_second": 0.678,
2559
+ "step": 8500
2560
+ },
2561
+ {
2562
+ "epoch": 3.964757709251101,
2563
+ "grad_norm": 0.034972067922353745,
2564
+ "learning_rate": 5.200311285433213e-06,
2565
+ "loss": 0.1285,
2566
+ "step": 8550
2567
+ },
2568
+ {
2569
+ "epoch": 3.964757709251101,
2570
+ "eval_loss": 0.1590997571686103,
2571
+ "eval_runtime": 60.7642,
2572
+ "eval_samples_per_second": 684.384,
2573
+ "eval_steps_per_second": 0.675,
2574
+ "step": 8550
2575
+ },
2576
+ {
2577
+ "epoch": 3.987943426849061,
2578
+ "grad_norm": 0.05060684680938721,
2579
+ "learning_rate": 4.853673085668947e-06,
2580
+ "loss": 0.1293,
2581
+ "step": 8600
2582
+ },
2583
+ {
2584
+ "epoch": 3.987943426849061,
2585
+ "eval_loss": 0.15924322809570868,
2586
+ "eval_runtime": 60.0799,
2587
+ "eval_samples_per_second": 692.178,
2588
+ "eval_steps_per_second": 0.682,
2589
+ "step": 8600
2590
+ },
2591
+ {
2592
+ "epoch": 4.011129144447021,
2593
+ "grad_norm": 0.04898017644882202,
2594
+ "learning_rate": 4.5184002322740785e-06,
2595
+ "loss": 0.1283,
2596
+ "step": 8650
2597
+ },
2598
+ {
2599
+ "epoch": 4.011129144447021,
2600
+ "eval_loss": 0.1587491140112498,
2601
+ "eval_runtime": 60.6393,
2602
+ "eval_samples_per_second": 685.793,
2603
+ "eval_steps_per_second": 0.676,
2604
+ "step": 8650
2605
+ },
2606
+ {
2607
+ "epoch": 4.0343148620449805,
2608
+ "grad_norm": 0.058361586183309555,
2609
+ "learning_rate": 4.19457712839652e-06,
2610
+ "loss": 0.1277,
2611
+ "step": 8700
2612
+ },
2613
+ {
2614
+ "epoch": 4.0343148620449805,
2615
+ "eval_loss": 0.1597737118597627,
2616
+ "eval_runtime": 61.5486,
2617
+ "eval_samples_per_second": 675.661,
2618
+ "eval_steps_per_second": 0.666,
2619
+ "step": 8700
2620
+ },
2621
+ {
2622
+ "epoch": 4.05750057964294,
2623
+ "grad_norm": 0.05138258635997772,
2624
+ "learning_rate": 3.8822852947709375e-06,
2625
+ "loss": 0.1283,
2626
+ "step": 8750
2627
+ },
2628
+ {
2629
+ "epoch": 4.05750057964294,
2630
+ "eval_loss": 0.15985116115580386,
2631
+ "eval_runtime": 60.5634,
2632
+ "eval_samples_per_second": 686.652,
2633
+ "eval_steps_per_second": 0.677,
2634
+ "step": 8750
2635
+ },
2636
+ {
2637
+ "epoch": 4.0806862972408995,
2638
+ "grad_norm": 0.0461881086230278,
2639
+ "learning_rate": 3.581603349196372e-06,
2640
+ "loss": 0.1288,
2641
+ "step": 8800
2642
+ },
2643
+ {
2644
+ "epoch": 4.0806862972408995,
2645
+ "eval_loss": 0.15788726458515429,
2646
+ "eval_runtime": 60.6057,
2647
+ "eval_samples_per_second": 686.173,
2648
+ "eval_steps_per_second": 0.677,
2649
+ "step": 8800
2650
+ },
2651
+ {
2652
+ "epoch": 4.103872014838859,
2653
+ "grad_norm": 0.0618111789226532,
2654
+ "learning_rate": 3.2926069867446675e-06,
2655
+ "loss": 0.1287,
2656
+ "step": 8850
2657
+ },
2658
+ {
2659
+ "epoch": 4.103872014838859,
2660
+ "eval_loss": 0.15881183094974458,
2661
+ "eval_runtime": 60.3747,
2662
+ "eval_samples_per_second": 688.799,
2663
+ "eval_steps_per_second": 0.679,
2664
+ "step": 8850
2665
+ },
2666
+ {
2667
+ "epoch": 4.1270577324368185,
2668
+ "grad_norm": 0.04804789274930954,
2669
+ "learning_rate": 3.0153689607045845e-06,
2670
+ "loss": 0.1294,
2671
+ "step": 8900
2672
+ },
2673
+ {
2674
+ "epoch": 4.1270577324368185,
2675
+ "eval_loss": 0.1607356553004913,
2676
+ "eval_runtime": 60.6979,
2677
+ "eval_samples_per_second": 685.131,
2678
+ "eval_steps_per_second": 0.675,
2679
+ "step": 8900
2680
+ },
2681
+ {
2682
+ "epoch": 4.150243450034779,
2683
+ "grad_norm": 0.04835003986954689,
2684
+ "learning_rate": 2.7499590642665774e-06,
2685
+ "loss": 0.1277,
2686
+ "step": 8950
2687
+ },
2688
+ {
2689
+ "epoch": 4.150243450034779,
2690
+ "eval_loss": 0.1598761189516689,
2691
+ "eval_runtime": 61.2003,
2692
+ "eval_samples_per_second": 679.507,
2693
+ "eval_steps_per_second": 0.67,
2694
+ "step": 8950
2695
+ },
2696
+ {
2697
+ "epoch": 4.173429167632738,
2698
+ "grad_norm": 0.05750919133424759,
2699
+ "learning_rate": 2.496444112952734e-06,
2700
+ "loss": 0.1285,
2701
+ "step": 9000
2702
+ },
2703
+ {
2704
+ "epoch": 4.173429167632738,
2705
+ "eval_loss": 0.15946166188972705,
2706
+ "eval_runtime": 60.6795,
2707
+ "eval_samples_per_second": 685.339,
2708
+ "eval_steps_per_second": 0.676,
2709
+ "step": 9000
2710
+ },
2711
+ {
2712
+ "epoch": 4.196614885230698,
2713
+ "grad_norm": 0.06801807135343552,
2714
+ "learning_rate": 2.2548879277963064e-06,
2715
+ "loss": 0.1289,
2716
+ "step": 9050
2717
+ },
2718
+ {
2719
+ "epoch": 4.196614885230698,
2720
+ "eval_loss": 0.1609577221237089,
2721
+ "eval_runtime": 61.0186,
2722
+ "eval_samples_per_second": 681.53,
2723
+ "eval_steps_per_second": 0.672,
2724
+ "step": 9050
2725
+ },
2726
+ {
2727
+ "epoch": 4.219800602828657,
2728
+ "grad_norm": 0.04383298382163048,
2729
+ "learning_rate": 2.0253513192751373e-06,
2730
+ "loss": 0.1289,
2731
+ "step": 9100
2732
+ },
2733
+ {
2734
+ "epoch": 4.219800602828657,
2735
+ "eval_loss": 0.1598739506376352,
2736
+ "eval_runtime": 60.6256,
2737
+ "eval_samples_per_second": 685.948,
2738
+ "eval_steps_per_second": 0.676,
2739
+ "step": 9100
2740
+ },
2741
+ {
2742
+ "epoch": 4.242986320426617,
2743
+ "grad_norm": 0.044339120388031006,
2744
+ "learning_rate": 1.807892072002898e-06,
2745
+ "loss": 0.1283,
2746
+ "step": 9150
2747
+ },
2748
+ {
2749
+ "epoch": 4.242986320426617,
2750
+ "eval_loss": 0.158920794598519,
2751
+ "eval_runtime": 60.5454,
2752
+ "eval_samples_per_second": 686.856,
2753
+ "eval_steps_per_second": 0.677,
2754
+ "step": 9150
2755
+ },
2756
+ {
2757
+ "epoch": 4.2661720380245765,
2758
+ "grad_norm": 0.04090524837374687,
2759
+ "learning_rate": 1.6025649301821876e-06,
2760
+ "loss": 0.1282,
2761
+ "step": 9200
2762
+ },
2763
+ {
2764
+ "epoch": 4.2661720380245765,
2765
+ "eval_loss": 0.1596859048948022,
2766
+ "eval_runtime": 60.7716,
2767
+ "eval_samples_per_second": 684.3,
2768
+ "eval_steps_per_second": 0.675,
2769
+ "step": 9200
2770
+ },
2771
+ {
2772
+ "epoch": 4.289357755622537,
2773
+ "grad_norm": 0.042642634361982346,
2774
+ "learning_rate": 1.4094215838229176e-06,
2775
+ "loss": 0.1286,
2776
+ "step": 9250
2777
+ },
2778
+ {
2779
+ "epoch": 4.289357755622537,
2780
+ "eval_loss": 0.16079005979316527,
2781
+ "eval_runtime": 60.6239,
2782
+ "eval_samples_per_second": 685.967,
2783
+ "eval_steps_per_second": 0.676,
2784
+ "step": 9250
2785
+ },
2786
+ {
2787
+ "epoch": 4.312543473220496,
2788
+ "grad_norm": 0.04924129322171211,
2789
+ "learning_rate": 1.2285106557296477e-06,
2790
+ "loss": 0.1287,
2791
+ "step": 9300
2792
+ },
2793
+ {
2794
+ "epoch": 4.312543473220496,
2795
+ "eval_loss": 0.16084020796323667,
2796
+ "eval_runtime": 60.2581,
2797
+ "eval_samples_per_second": 690.131,
2798
+ "eval_steps_per_second": 0.68,
2799
+ "step": 9300
2800
+ },
2801
+ {
2802
+ "epoch": 4.335729190818456,
2803
+ "grad_norm": 0.04222133755683899,
2804
+ "learning_rate": 1.0598776892610685e-06,
2805
+ "loss": 0.1287,
2806
+ "step": 9350
2807
+ },
2808
+ {
2809
+ "epoch": 4.335729190818456,
2810
+ "eval_loss": 0.16020395921618655,
2811
+ "eval_runtime": 60.5816,
2812
+ "eval_samples_per_second": 686.446,
2813
+ "eval_steps_per_second": 0.677,
2814
+ "step": 9350
2815
+ },
2816
+ {
2817
+ "epoch": 4.358914908416415,
2818
+ "grad_norm": 0.05593874678015709,
2819
+ "learning_rate": 9.035651368646648e-07,
2820
+ "loss": 0.1286,
2821
+ "step": 9400
2822
+ },
2823
+ {
2824
+ "epoch": 4.358914908416415,
2825
+ "eval_loss": 0.15957607987630548,
2826
+ "eval_runtime": 60.8726,
2827
+ "eval_samples_per_second": 683.164,
2828
+ "eval_steps_per_second": 0.674,
2829
+ "step": 9400
2830
+ },
2831
+ {
2832
+ "epoch": 4.382100626014375,
2833
+ "grad_norm": 0.059049129486083984,
2834
+ "learning_rate": 7.596123493895991e-07,
2835
+ "loss": 0.1289,
2836
+ "step": 9450
2837
+ },
2838
+ {
2839
+ "epoch": 4.382100626014375,
2840
+ "eval_loss": 0.15975197211451994,
2841
+ "eval_runtime": 60.6704,
2842
+ "eval_samples_per_second": 685.441,
2843
+ "eval_steps_per_second": 0.676,
2844
+ "step": 9450
2845
+ },
2846
+ {
2847
+ "epoch": 4.405286343612334,
2848
+ "grad_norm": 0.053555767983198166,
2849
+ "learning_rate": 6.280555661802856e-07,
2850
+ "loss": 0.1286,
2851
+ "step": 9500
2852
+ },
2853
+ {
2854
+ "epoch": 4.405286343612334,
2855
+ "eval_loss": 0.16117730336557945,
2856
+ "eval_runtime": 61.9214,
2857
+ "eval_samples_per_second": 671.593,
2858
+ "eval_steps_per_second": 0.662,
2859
+ "step": 9500
2860
+ },
2861
+ {
2862
+ "epoch": 4.428472061210295,
2863
+ "grad_norm": 0.04488294571638107,
2864
+ "learning_rate": 5.089279059533658e-07,
2865
+ "loss": 0.1281,
2866
+ "step": 9550
2867
+ },
2868
+ {
2869
+ "epoch": 4.428472061210295,
2870
+ "eval_loss": 0.15896389365558133,
2871
+ "eval_runtime": 62.0889,
2872
+ "eval_samples_per_second": 669.782,
2873
+ "eval_steps_per_second": 0.66,
2874
+ "step": 9550
2875
+ },
2876
+ {
2877
+ "epoch": 4.451657778808254,
2878
+ "grad_norm": 0.044143371284008026,
2879
+ "learning_rate": 4.02259358460233e-07,
2880
+ "loss": 0.1276,
2881
+ "step": 9600
2882
+ },
2883
+ {
2884
+ "epoch": 4.451657778808254,
2885
+ "eval_loss": 0.15880485748262804,
2886
+ "eval_runtime": 61.9007,
2887
+ "eval_samples_per_second": 671.818,
2888
+ "eval_steps_per_second": 0.662,
2889
+ "step": 9600
2890
+ },
2891
+ {
2892
+ "epoch": 4.474843496406214,
2893
+ "grad_norm": 0.054890409111976624,
2894
+ "learning_rate": 3.080767769372939e-07,
2895
+ "loss": 0.1289,
2896
+ "step": 9650
2897
+ },
2898
+ {
2899
+ "epoch": 4.474843496406214,
2900
+ "eval_loss": 0.15899979579394047,
2901
+ "eval_runtime": 61.7264,
2902
+ "eval_samples_per_second": 673.714,
2903
+ "eval_steps_per_second": 0.664,
2904
+ "step": 9650
2905
+ },
2906
+ {
2907
+ "epoch": 4.498029214004173,
2908
+ "grad_norm": 0.04276006668806076,
2909
+ "learning_rate": 2.2640387134577058e-07,
2910
+ "loss": 0.1284,
2911
+ "step": 9700
2912
+ },
2913
+ {
2914
+ "epoch": 4.498029214004173,
2915
+ "eval_loss": 0.1587265635928511,
2916
+ "eval_runtime": 61.254,
2917
+ "eval_samples_per_second": 678.911,
2918
+ "eval_steps_per_second": 0.669,
2919
+ "step": 9700
2920
+ },
2921
+ {
2922
+ "epoch": 4.521214931602133,
2923
+ "grad_norm": 0.04374442994594574,
2924
+ "learning_rate": 1.5726120240288634e-07,
2925
+ "loss": 0.1284,
2926
+ "step": 9750
2927
+ },
2928
+ {
2929
+ "epoch": 4.521214931602133,
2930
+ "eval_loss": 0.1596641951113874,
2931
+ "eval_runtime": 61.5999,
2932
+ "eval_samples_per_second": 675.099,
2933
+ "eval_steps_per_second": 0.666,
2934
+ "step": 9750
2935
+ },
2936
+ {
2937
+ "epoch": 4.544400649200092,
2938
+ "grad_norm": 0.039518803358078,
2939
+ "learning_rate": 1.0066617640578368e-07,
2940
+ "loss": 0.1297,
2941
+ "step": 9800
2942
+ },
2943
+ {
2944
+ "epoch": 4.544400649200092,
2945
+ "eval_loss": 0.15941302591091938,
2946
+ "eval_runtime": 61.7763,
2947
+ "eval_samples_per_second": 673.17,
2948
+ "eval_steps_per_second": 0.664,
2949
+ "step": 9800
2950
+ },
2951
+ {
2952
+ "epoch": 4.567586366798053,
2953
+ "grad_norm": 0.037454187870025635,
2954
+ "learning_rate": 5.663304084960186e-08,
2955
+ "loss": 0.1276,
2956
+ "step": 9850
2957
+ },
2958
+ {
2959
+ "epoch": 4.567586366798053,
2960
+ "eval_loss": 0.15932704533345807,
2961
+ "eval_runtime": 61.0673,
2962
+ "eval_samples_per_second": 680.987,
2963
+ "eval_steps_per_second": 0.671,
2964
+ "step": 9850
2965
+ },
2966
+ {
2967
+ "epoch": 4.590772084396012,
2968
+ "grad_norm": 0.05642937496304512,
2969
+ "learning_rate": 2.5172880840745873e-08,
2970
+ "loss": 0.129,
2971
+ "step": 9900
2972
+ },
2973
+ {
2974
+ "epoch": 4.590772084396012,
2975
+ "eval_loss": 0.15923822751292327,
2976
+ "eval_runtime": 61.7949,
2977
+ "eval_samples_per_second": 672.968,
2978
+ "eval_steps_per_second": 0.663,
2979
+ "step": 9900
2980
+ },
2981
+ {
2982
+ "epoch": 4.613957801993972,
2983
+ "grad_norm": 0.03662274032831192,
2984
+ "learning_rate": 6.293616306246586e-09,
2985
+ "loss": 0.1285,
2986
+ "step": 9950
2987
+ },
2988
+ {
2989
+ "epoch": 4.613957801993972,
2990
+ "eval_loss": 0.160277338388973,
2991
+ "eval_runtime": 62.2438,
2992
+ "eval_samples_per_second": 668.115,
2993
+ "eval_steps_per_second": 0.659,
2994
+ "step": 9950
2995
+ },
2996
+ {
2997
+ "epoch": 4.637143519591931,
2998
+ "grad_norm": 0.0563049279153347,
2999
+ "learning_rate": 0.0,
3000
+ "loss": 0.1282,
3001
+ "step": 10000
3002
+ },
3003
+ {
3004
+ "epoch": 4.637143519591931,
3005
+ "eval_loss": 0.16006394581914293,
3006
+ "eval_runtime": 61.4917,
3007
+ "eval_samples_per_second": 676.286,
3008
+ "eval_steps_per_second": 0.667,
3009
+ "step": 10000
3010
+ },
3011
+ {
3012
+ "epoch": 4.637143519591931,
3013
+ "step": 10000,
3014
+ "total_flos": 2.3231400526217216e+17,
3015
+ "train_loss": 0.13326009378433226,
3016
+ "train_runtime": 41368.3669,
3017
+ "train_samples_per_second": 495.064,
3018
+ "train_steps_per_second": 0.242
3019
+ }
3020
+ ],
3021
+ "logging_steps": 50,
3022
+ "max_steps": 10000,
3023
+ "num_input_tokens_seen": 0,
3024
+ "num_train_epochs": 5,
3025
+ "save_steps": 50,
3026
+ "total_flos": 2.3231400526217216e+17,
3027
+ "train_batch_size": 1024,
3028
+ "trial_name": null,
3029
+ "trial_params": null
3030
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3df77597c69632ab7d5a5d7987fd817d663727a57d77ca069bc92a72238da772
3
+ size 5393