LH-Tech-AI committed
Commit de01003 · verified · 1 Parent(s): 107cbdc

Create logs.log

Files changed (1)
  1. logs.log +528 -0
logs.log ADDED
@@ -0,0 +1,528 @@
+ [*] Loading libraries...
+ [*] Loading tokenizer...
+ [*] Gathering 100 million tokens by streaming dataset...
+ Resolving data files: 100%|██████████████| 2410/2410 [00:00<00:00, 30853.46it/s]
+ [*] Gathering tokens: 100%|██| 400000000/400000000 [13:58<00:00, 477048.96tok/s]
+ [+] Collected 400,000,000 tokens → 1,562,500 chunks.
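The token and chunk counts reported above imply a fixed chunk length. A minimal sketch of that arithmetic, assuming the chunks are equal-sized and non-overlapping (the log does not state this); the variable names are illustrative only:

```python
# Figures copied from the "Collected ..." log line.
total_tokens = 400_000_000
num_chunks = 1_562_500

# Assumption: every chunk holds the same number of tokens, with no overlap.
tokens_per_chunk, remainder = divmod(total_tokens, num_chunks)

print(f"tokens per chunk: {tokens_per_chunk}, leftover tokens: {remainder}")
```

Under that assumption the division is exact, so no tokens are dropped when chunking.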
+ [*] Setting up model...
+ [*] Model parameters: 465,504
+ [*] Defining training arguments...
+ [*] Starting training...
+ {'loss': '5.986', 'grad_norm': '0.5017', 'learning_rate': '9.9e-05', 'epoch': '0.008192'}
+ {'loss': '5.403', 'grad_norm': '0.394', 'learning_rate': '0.000199', 'epoch': '0.01638'}
+ {'loss': '4.75', 'grad_norm': '0.9517', 'learning_rate': '0.000299', 'epoch': '0.02458'}
+ {'loss': '4.192', 'grad_norm': '1.073', 'learning_rate': '0.000399', 'epoch': '0.03277'}
+ {'loss': '3.702', 'grad_norm': '1.364', 'learning_rate': '0.000499', 'epoch': '0.04096'}
+ 1%|▌ | 500/36624 [00:34<40:21, 14.92it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 126.22it/s]
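The learning rates logged during the first 500 steps rise by about 1e-4 per entry, which is consistent with a linear warmup. A minimal sketch of such a schedule follows. Everything in it is inferred from the log rather than stated by it: the peak rate of 5e-4, the 500 warmup steps, the linear decay to zero at step 36,624, one log entry per 100 steps, and the convention that each entry reports the rate of the step just completed. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=5e-4, warmup_steps=500, total_steps=36_624):
    """Learning rate at a 0-based optimizer step (all defaults are assumptions)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Assumption: entries are logged every 100 steps and report step - 1.
for logged_step in (100, 200, 300, 400, 500):
    print(logged_step, f"{linear_schedule_lr(logged_step - 1):.3g}")
```

If these assumptions hold, the printed values can be compared directly with the `learning_rate` fields of the five entries above.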
+ {'loss': '3.378', 'grad_norm': '1.906', 'learning_rate': '0.0004986', 'epoch': '0.04915'}
+ {'loss': '3.195', 'grad_norm': '1.332', 'learning_rate': '0.0004972', 'epoch': '0.05734'}
+ {'loss': '3.085', 'grad_norm': '1.36', 'learning_rate': '0.0004959', 'epoch': '0.06553'}
+ {'loss': '3.011', 'grad_norm': '1.354', 'learning_rate': '0.0004945', 'epoch': '0.07373'}
+ {'loss': '2.955', 'grad_norm': '1.423', 'learning_rate': '0.0004931', 'epoch': '0.08192'}
+ 3%|█ | 1000/36624 [01:08<40:59, 14.48it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 185.00it/s]
+ {'loss': '2.914', 'grad_norm': '1.194', 'learning_rate': '0.0004917', 'epoch': '0.09011'}
+ {'loss': '2.887', 'grad_norm': '1.145', 'learning_rate': '0.0004903', 'epoch': '0.0983'}
+ {'loss': '2.861', 'grad_norm': '1.353', 'learning_rate': '0.0004889', 'epoch': '0.1065'}
+ {'loss': '2.833', 'grad_norm': '1.226', 'learning_rate': '0.0004876', 'epoch': '0.1147'}
+ {'loss': '2.824', 'grad_norm': '1.226', 'learning_rate': '0.0004862', 'epoch': '0.1229'}
+ 4%|█▌ | 1500/36624 [01:42<40:32, 14.44it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.87it/s]
+ {'loss': '2.806', 'grad_norm': '1.204', 'learning_rate': '0.0004848', 'epoch': '0.1311'}
+ {'loss': '2.786', 'grad_norm': '1.139', 'learning_rate': '0.0004834', 'epoch': '0.1393'}
+ {'loss': '2.777', 'grad_norm': '1.099', 'learning_rate': '0.000482', 'epoch': '0.1475'}
+ {'loss': '2.765', 'grad_norm': '1.127', 'learning_rate': '0.0004806', 'epoch': '0.1556'}
+ {'loss': '2.754', 'grad_norm': '1.186', 'learning_rate': '0.0004793', 'epoch': '0.1638'}
+ 5%|██ | 2000/36624 [02:16<39:37, 14.56it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.92it/s]
+ {'loss': '2.749', 'grad_norm': '1.068', 'learning_rate': '0.0004779', 'epoch': '0.172'}
+ {'loss': '2.732', 'grad_norm': '1.086', 'learning_rate': '0.0004765', 'epoch': '0.1802'}
+ {'loss': '2.73', 'grad_norm': '1.105', 'learning_rate': '0.0004751', 'epoch': '0.1884'}
+ {'loss': '2.721', 'grad_norm': '1.213', 'learning_rate': '0.0004737', 'epoch': '0.1966'}
+ {'loss': '2.717', 'grad_norm': '1.168', 'learning_rate': '0.0004723', 'epoch': '0.2048'}
+ 7%|██▌ | 2500/36624 [02:50<39:00, 14.58it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.91it/s]
+ {'loss': '2.708', 'grad_norm': '1.081', 'learning_rate': '0.0004709', 'epoch': '0.213'}
+ {'loss': '2.705', 'grad_norm': '1.083', 'learning_rate': '0.0004696', 'epoch': '0.2212'}
+ {'loss': '2.697', 'grad_norm': '1.079', 'learning_rate': '0.0004682', 'epoch': '0.2294'}
+ {'loss': '2.692', 'grad_norm': '1.123', 'learning_rate': '0.0004668', 'epoch': '0.2376'}
+ {'loss': '2.687', 'grad_norm': '1.147', 'learning_rate': '0.0004654', 'epoch': '0.2458'}
+ 8%|███ | 3000/36624 [03:24<37:58, 14.76it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.12it/s]
+ {'loss': '2.681', 'grad_norm': '1.052', 'learning_rate': '0.000464', 'epoch': '0.2539'}
+ {'loss': '2.676', 'grad_norm': '1.099', 'learning_rate': '0.0004626', 'epoch': '0.2621'}
+ {'loss': '2.674', 'grad_norm': '1.084', 'learning_rate': '0.0004613', 'epoch': '0.2703'}
+ {'loss': '2.672', 'grad_norm': '1.057', 'learning_rate': '0.0004599', 'epoch': '0.2785'}
+ {'loss': '2.672', 'grad_norm': '1.103', 'learning_rate': '0.0004585', 'epoch': '0.2867'}
+ 10%|███▋ | 3500/36624 [03:59<38:12, 14.45it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.64it/s]
+ {'loss': '2.661', 'grad_norm': '1.062', 'learning_rate': '0.0004571', 'epoch': '0.2949'}
+ {'loss': '2.658', 'grad_norm': '1.055', 'learning_rate': '0.0004557', 'epoch': '0.3031'}
+ {'loss': '2.656', 'grad_norm': '1.06', 'learning_rate': '0.0004543', 'epoch': '0.3113'}
+ {'loss': '2.653', 'grad_norm': '1.1', 'learning_rate': '0.000453', 'epoch': '0.3195'}
+ {'loss': '2.651', 'grad_norm': '1.137', 'learning_rate': '0.0004516', 'epoch': '0.3277'}
+ 11%|████▏ | 4000/36624 [04:33<37:14, 14.60it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.63it/s]
+ {'loss': '2.648', 'grad_norm': '1.009', 'learning_rate': '0.0004502', 'epoch': '0.3359'}
+ {'loss': '2.639', 'grad_norm': '1', 'learning_rate': '0.0004488', 'epoch': '0.3441'}
+ {'loss': '2.641', 'grad_norm': '1.044', 'learning_rate': '0.0004474', 'epoch': '0.3522'}
+ {'loss': '2.641', 'grad_norm': '1.039', 'learning_rate': '0.000446', 'epoch': '0.3604'}
+ {'loss': '2.637', 'grad_norm': '1.036', 'learning_rate': '0.0004446', 'epoch': '0.3686'}
+ 12%|████▋ | 4500/36624 [05:07<36:26, 14.69it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.36it/s]
+ {'loss': '2.632', 'grad_norm': '0.9873', 'learning_rate': '0.0004433', 'epoch': '0.3768'}
+ {'loss': '2.631', 'grad_norm': '1.043', 'learning_rate': '0.0004419', 'epoch': '0.385'}
+ {'loss': '2.63', 'grad_norm': '1.063', 'learning_rate': '0.0004405', 'epoch': '0.3932'}
+ {'loss': '2.624', 'grad_norm': '1.026', 'learning_rate': '0.0004391', 'epoch': '0.4014'}
+ {'loss': '2.624', 'grad_norm': '1.011', 'learning_rate': '0.0004377', 'epoch': '0.4096'}
+ 14%|█████▏ | 5000/36624 [05:41<36:09, 14.58it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.30it/s]
+ {'loss': '2.625', 'grad_norm': '1.08', 'learning_rate': '0.0004363', 'epoch': '0.4178'}
+ {'loss': '2.621', 'grad_norm': '1.007', 'learning_rate': '0.000435', 'epoch': '0.426'}
+ {'loss': '2.618', 'grad_norm': '1.025', 'learning_rate': '0.0004336', 'epoch': '0.4342'}
+ {'loss': '2.616', 'grad_norm': '0.9491', 'learning_rate': '0.0004322', 'epoch': '0.4424'}
+ {'loss': '2.615', 'grad_norm': '1.072', 'learning_rate': '0.0004308', 'epoch': '0.4505'}
+ 15%|█████▋ | 5500/36624 [06:15<35:20, 14.67it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.28it/s]
+ {'loss': '2.604', 'grad_norm': '0.986', 'learning_rate': '0.0004294', 'epoch': '0.4587'}
+ {'loss': '2.609', 'grad_norm': '0.9908', 'learning_rate': '0.000428', 'epoch': '0.4669'}
+ {'loss': '2.606', 'grad_norm': '0.9686', 'learning_rate': '0.0004267', 'epoch': '0.4751'}
+ {'loss': '2.61', 'grad_norm': '1.009', 'learning_rate': '0.0004253', 'epoch': '0.4833'}
+ {'loss': '2.606', 'grad_norm': '1.003', 'learning_rate': '0.0004239', 'epoch': '0.4915'}
+ 16%|██████▏ | 6000/36624 [06:49<34:56, 14.61it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.44it/s]
+ {'loss': '2.602', 'grad_norm': '0.9795', 'learning_rate': '0.0004225', 'epoch': '0.4997'}
+ {'loss': '2.601', 'grad_norm': '1.023', 'learning_rate': '0.0004211', 'epoch': '0.5079'}
+ {'loss': '2.596', 'grad_norm': '1.023', 'learning_rate': '0.0004197', 'epoch': '0.5161'}
+ {'loss': '2.598', 'grad_norm': '0.9583', 'learning_rate': '0.0004184', 'epoch': '0.5243'}
+ {'loss': '2.597', 'grad_norm': '0.9572', 'learning_rate': '0.000417', 'epoch': '0.5325'}
+ 18%|██████▋ | 6500/36624 [07:24<34:21, 14.61it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 207.81it/s]
+ {'loss': '2.596', 'grad_norm': '1.056', 'learning_rate': '0.0004156', 'epoch': '0.5407'}
+ {'loss': '2.594', 'grad_norm': '1.007', 'learning_rate': '0.0004142', 'epoch': '0.5488'}
+ {'loss': '2.593', 'grad_norm': '0.9365', 'learning_rate': '0.0004128', 'epoch': '0.557'}
+ {'loss': '2.593', 'grad_norm': '0.9879', 'learning_rate': '0.0004114', 'epoch': '0.5652'}
+ {'loss': '2.594', 'grad_norm': '1.078', 'learning_rate': '0.00041', 'epoch': '0.5734'}
+ 19%|███████▎ | 7000/36624 [07:58<33:22, 14.79it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 149.60it/s]
+ {'loss': '2.589', 'grad_norm': '1.011', 'learning_rate': '0.0004087', 'epoch': '0.5816'}
+ {'loss': '2.585', 'grad_norm': '0.9979', 'learning_rate': '0.0004073', 'epoch': '0.5898'}
+ {'loss': '2.587', 'grad_norm': '0.9675', 'learning_rate': '0.0004059', 'epoch': '0.598'}
+ {'loss': '2.584', 'grad_norm': '0.9291', 'learning_rate': '0.0004045', 'epoch': '0.6062'}
+ {'loss': '2.583', 'grad_norm': '0.9513', 'learning_rate': '0.0004031', 'epoch': '0.6144'}
+ 20%|███████▊ | 7500/36624 [08:32<33:15, 14.60it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 179.61it/s]
+ {'loss': '2.584', 'grad_norm': '1.012', 'learning_rate': '0.0004017', 'epoch': '0.6226'}
+ {'loss': '2.585', 'grad_norm': '1.012', 'learning_rate': '0.0004004', 'epoch': '0.6308'}
+ {'loss': '2.578', 'grad_norm': '1.016', 'learning_rate': '0.000399', 'epoch': '0.639'}
+ {'loss': '2.58', 'grad_norm': '0.994', 'learning_rate': '0.0003976', 'epoch': '0.6471'}
+ {'loss': '2.578', 'grad_norm': '1.003', 'learning_rate': '0.0003962', 'epoch': '0.6553'}
+ 22%|████████▎ | 8000/36624 [09:06<32:34, 14.64it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.38it/s]
+ {'loss': '2.581', 'grad_norm': '1.01', 'learning_rate': '0.0003948', 'epoch': '0.6635'}
+ {'loss': '2.573', 'grad_norm': '0.9192', 'learning_rate': '0.0003934', 'epoch': '0.6717'}
+ {'loss': '2.577', 'grad_norm': '0.955', 'learning_rate': '0.0003921', 'epoch': '0.6799'}
+ {'loss': '2.575', 'grad_norm': '1.005', 'learning_rate': '0.0003907', 'epoch': '0.6881'}
+ {'loss': '2.577', 'grad_norm': '0.922', 'learning_rate': '0.0003893', 'epoch': '0.6963'}
+ 23%|████████▊ | 8500/36624 [09:40<31:54, 14.69it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.22it/s]
+ {'loss': '2.573', 'grad_norm': '0.9621', 'learning_rate': '0.0003879', 'epoch': '0.7045'}
+ {'loss': '2.57', 'grad_norm': '0.9889', 'learning_rate': '0.0003865', 'epoch': '0.7127'}
+ {'loss': '2.568', 'grad_norm': '0.9244', 'learning_rate': '0.0003851', 'epoch': '0.7209'}
+ {'loss': '2.57', 'grad_norm': '1.009', 'learning_rate': '0.0003837', 'epoch': '0.7291'}
+ {'loss': '2.567', 'grad_norm': '0.9754', 'learning_rate': '0.0003824', 'epoch': '0.7373'}
+ 25%|█████████▎ | 9000/36624 [10:14<31:31, 14.60it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.72it/s]
+ {'loss': '2.57', 'grad_norm': '0.964', 'learning_rate': '0.000381', 'epoch': '0.7454'}
+ {'loss': '2.567', 'grad_norm': '0.9354', 'learning_rate': '0.0003796', 'epoch': '0.7536'}
+ {'loss': '2.569', 'grad_norm': '0.9461', 'learning_rate': '0.0003782', 'epoch': '0.7618'}
+ {'loss': '2.565', 'grad_norm': '0.9415', 'learning_rate': '0.0003768', 'epoch': '0.77'}
+ {'loss': '2.566', 'grad_norm': '0.9319', 'learning_rate': '0.0003754', 'epoch': '0.7782'}
+ 26%|█████████▊ | 9500/36624 [10:49<31:23, 14.40it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.52it/s]
+ {'loss': '2.56', 'grad_norm': '0.917', 'learning_rate': '0.0003741', 'epoch': '0.7864'}
+ {'loss': '2.562', 'grad_norm': '0.982', 'learning_rate': '0.0003727', 'epoch': '0.7946'}
+ {'loss': '2.563', 'grad_norm': '0.9996', 'learning_rate': '0.0003713', 'epoch': '0.8028'}
+ {'loss': '2.559', 'grad_norm': '0.9066', 'learning_rate': '0.0003699', 'epoch': '0.811'}
+ {'loss': '2.562', 'grad_norm': '0.9582', 'learning_rate': '0.0003685', 'epoch': '0.8192'}
+ 27%|██████████ | 10000/36624 [11:23<30:09, 14.72it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.08it/s]
+ {'loss': '2.557', 'grad_norm': '0.9477', 'learning_rate': '0.0003671', 'epoch': '0.8274'}
+ {'loss': '2.56', 'grad_norm': '0.9513', 'learning_rate': '0.0003658', 'epoch': '0.8356'}
+ {'loss': '2.559', 'grad_norm': '0.9462', 'learning_rate': '0.0003644', 'epoch': '0.8437'}
+ {'loss': '2.558', 'grad_norm': '0.9505', 'learning_rate': '0.000363', 'epoch': '0.8519'}
+ {'loss': '2.556', 'grad_norm': '0.9055', 'learning_rate': '0.0003616', 'epoch': '0.8601'}
+ 29%|██████████▌ | 10500/36624 [11:57<29:42, 14.66it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 186.19it/s]
+ {'loss': '2.552', 'grad_norm': '0.9765', 'learning_rate': '0.0003602', 'epoch': '0.8683'}
+ {'loss': '2.557', 'grad_norm': '0.9443', 'learning_rate': '0.0003588', 'epoch': '0.8765'}
+ {'loss': '2.555', 'grad_norm': '0.8971', 'learning_rate': '0.0003574', 'epoch': '0.8847'}
+ {'loss': '2.553', 'grad_norm': '0.9489', 'learning_rate': '0.0003561', 'epoch': '0.8929'}
+ {'loss': '2.552', 'grad_norm': '1', 'learning_rate': '0.0003547', 'epoch': '0.9011'}
+ 30%|███████████ | 11000/36624 [12:31<28:47, 14.83it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.59it/s]
+ {'loss': '2.557', 'grad_norm': '0.915', 'learning_rate': '0.0003533', 'epoch': '0.9093'}
+ {'loss': '2.552', 'grad_norm': '0.911', 'learning_rate': '0.0003519', 'epoch': '0.9175'}
+ {'loss': '2.554', 'grad_norm': '0.9488', 'learning_rate': '0.0003505', 'epoch': '0.9257'}
+ {'loss': '2.547', 'grad_norm': '0.9326', 'learning_rate': '0.0003491', 'epoch': '0.9339'}
+ {'loss': '2.555', 'grad_norm': '0.9041', 'learning_rate': '0.0003478', 'epoch': '0.942'}
+ 31%|███████████▌ | 11500/36624 [13:06<28:39, 14.61it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.03it/s]
+ {'loss': '2.547', 'grad_norm': '0.9229', 'learning_rate': '0.0003464', 'epoch': '0.9502'}
+ {'loss': '2.547', 'grad_norm': '0.9645', 'learning_rate': '0.000345', 'epoch': '0.9584'}
+ {'loss': '2.548', 'grad_norm': '0.9408', 'learning_rate': '0.0003436', 'epoch': '0.9666'}
+ {'loss': '2.546', 'grad_norm': '0.9032', 'learning_rate': '0.0003422', 'epoch': '0.9748'}
+ {'loss': '2.549', 'grad_norm': '0.918', 'learning_rate': '0.0003408', 'epoch': '0.983'}
+ 33%|████████████ | 12000/36624 [13:40<28:04, 14.62it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.44it/s]
+ {'loss': '2.547', 'grad_norm': '0.9086', 'learning_rate': '0.0003395', 'epoch': '0.9912'}
+ {'loss': '2.544', 'grad_norm': '0.9125', 'learning_rate': '0.0003381', 'epoch': '0.9994'}
+ {'loss': '2.541', 'grad_norm': '0.9181', 'learning_rate': '0.0003367', 'epoch': '1.008'}
+ {'loss': '2.545', 'grad_norm': '0.9132', 'learning_rate': '0.0003353', 'epoch': '1.016'}
+ {'loss': '2.542', 'grad_norm': '0.9156', 'learning_rate': '0.0003339', 'epoch': '1.024'}
+ 34%|████████████▋ | 12500/36624 [14:15<27:29, 14.62it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 114.07it/s]
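The logged `epoch` value passes 1.0 in the entries just above, which gives a way to cross-check the step count in the progress bar. A minimal sketch, assuming an effective batch size of 128 chunks per step, three epochs, and one log entry per 100 steps; none of these is stated in the log, they are simply values that reproduce its numbers, and the helper name `epoch_at` is hypothetical:

```python
import math

num_chunks = 1_562_500  # from the "Collected ..." log line
batch_size = 128        # assumption: inferred from the logged epoch fractions
num_epochs = 3          # assumption: inferred from the 36624-step total


def epoch_at(step):
    """Fraction of the dataset seen after `step` optimizer steps."""
    return step * batch_size / num_chunks


steps_per_epoch = math.ceil(num_chunks / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
print(f"{epoch_at(12_200):.4g}", f"{epoch_at(12_300):.4g}")
```

Under these assumptions the first epoch ends between steps 12,200 and 12,300, matching the point where the logged `epoch` field moves from 0.9994 to 1.008.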
+ {'loss': '2.538', 'grad_norm': '0.9441', 'learning_rate': '0.0003325', 'epoch': '1.032'}
+ {'loss': '2.542', 'grad_norm': '0.9385', 'learning_rate': '0.0003312', 'epoch': '1.04'}
+ {'loss': '2.536', 'grad_norm': '0.9842', 'learning_rate': '0.0003298', 'epoch': '1.048'}
+ {'loss': '2.542', 'grad_norm': '0.9319', 'learning_rate': '0.0003284', 'epoch': '1.057'}
+ {'loss': '2.537', 'grad_norm': '0.8883', 'learning_rate': '0.000327', 'epoch': '1.065'}
+ 35%|█████████████▏ | 13000/36624 [14:50<27:04, 14.54it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 171.43it/s]
+ {'loss': '2.54', 'grad_norm': '0.9869', 'learning_rate': '0.0003256', 'epoch': '1.073'}
+ {'loss': '2.539', 'grad_norm': '0.8919', 'learning_rate': '0.0003242', 'epoch': '1.081'}
+ {'loss': '2.533', 'grad_norm': '0.9155', 'learning_rate': '0.0003228', 'epoch': '1.089'}
+ {'loss': '2.537', 'grad_norm': '0.9485', 'learning_rate': '0.0003215', 'epoch': '1.098'}
+ {'loss': '2.539', 'grad_norm': '0.9354', 'learning_rate': '0.0003201', 'epoch': '1.106'}
+ 37%|█████████████▋ | 13500/36624 [15:24<26:16, 14.67it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 199.73it/s]
+ {'loss': '2.535', 'grad_norm': '0.9028', 'learning_rate': '0.0003187', 'epoch': '1.114'}
+ {'loss': '2.533', 'grad_norm': '0.9042', 'learning_rate': '0.0003173', 'epoch': '1.122'}
+ {'loss': '2.533', 'grad_norm': '0.9192', 'learning_rate': '0.0003159', 'epoch': '1.13'}
+ {'loss': '2.533', 'grad_norm': '0.8816', 'learning_rate': '0.0003145', 'epoch': '1.139'}
+ {'loss': '2.53', 'grad_norm': '0.9064', 'learning_rate': '0.0003132', 'epoch': '1.147'}
+ 38%|██████████████▏ | 14000/36624 [15:58<26:09, 14.42it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.55it/s]
+ {'loss': '2.534', 'grad_norm': '0.9424', 'learning_rate': '0.0003118', 'epoch': '1.155'}
+ {'loss': '2.53', 'grad_norm': '0.9198', 'learning_rate': '0.0003104', 'epoch': '1.163'}
+ {'loss': '2.53', 'grad_norm': '0.9234', 'learning_rate': '0.000309', 'epoch': '1.171'}
+ {'loss': '2.533', 'grad_norm': '1.027', 'learning_rate': '0.0003076', 'epoch': '1.18'}
+ {'loss': '2.531', 'grad_norm': '0.9083', 'learning_rate': '0.0003062', 'epoch': '1.188'}
+ 40%|██████████████▋ | 14500/36624 [16:32<25:17, 14.58it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.31it/s]
+ {'loss': '2.53', 'grad_norm': '0.8941', 'learning_rate': '0.0003049', 'epoch': '1.196'}
+ {'loss': '2.533', 'grad_norm': '0.9395', 'learning_rate': '0.0003035', 'epoch': '1.204'}
+ {'loss': '2.53', 'grad_norm': '0.9605', 'learning_rate': '0.0003021', 'epoch': '1.212'}
+ {'loss': '2.53', 'grad_norm': '0.9029', 'learning_rate': '0.0003007', 'epoch': '1.221'}
+ {'loss': '2.529', 'grad_norm': '0.9056', 'learning_rate': '0.0002993', 'epoch': '1.229'}
+ 41%|███████████████▏ | 15000/36624 [17:07<24:39, 14.62it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.68it/s]
+ {'loss': '2.528', 'grad_norm': '0.8955', 'learning_rate': '0.0002979', 'epoch': '1.237'}
+ {'loss': '2.53', 'grad_norm': '0.9041', 'learning_rate': '0.0002965', 'epoch': '1.245'}
+ {'loss': '2.527', 'grad_norm': '0.9242', 'learning_rate': '0.0002952', 'epoch': '1.253'}
+ {'loss': '2.525', 'grad_norm': '0.9313', 'learning_rate': '0.0002938', 'epoch': '1.261'}
+ {'loss': '2.525', 'grad_norm': '0.9721', 'learning_rate': '0.0002924', 'epoch': '1.27'}
+ 42%|███████████████▋ | 15500/36624 [17:41<23:50, 14.77it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 195.36it/s]
+ {'loss': '2.522', 'grad_norm': '0.9043', 'learning_rate': '0.000291', 'epoch': '1.278'}
+ {'loss': '2.524', 'grad_norm': '0.9181', 'learning_rate': '0.0002896', 'epoch': '1.286'}
+ {'loss': '2.527', 'grad_norm': '0.9111', 'learning_rate': '0.0002882', 'epoch': '1.294'}
+ {'loss': '2.523', 'grad_norm': '0.9105', 'learning_rate': '0.0002869', 'epoch': '1.302'}
+ {'loss': '2.526', 'grad_norm': '1.005', 'learning_rate': '0.0002855', 'epoch': '1.311'}
+ 44%|████████████████▏ | 16000/36624 [18:15<23:29, 14.63it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 192.30it/s]
+ {'loss': '2.526', 'grad_norm': '0.9184', 'learning_rate': '0.0002841', 'epoch': '1.319'}
+ {'loss': '2.52', 'grad_norm': '0.8872', 'learning_rate': '0.0002827', 'epoch': '1.327'}
+ {'loss': '2.519', 'grad_norm': '0.9441', 'learning_rate': '0.0002813', 'epoch': '1.335'}
+ {'loss': '2.525', 'grad_norm': '0.9462', 'learning_rate': '0.0002799', 'epoch': '1.343'}
+ {'loss': '2.525', 'grad_norm': '0.9307', 'learning_rate': '0.0002786', 'epoch': '1.352'}
+ 45%|████████████████▋ | 16500/36624 [18:49<23:00, 14.58it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.49it/s]
+ {'loss': '2.519', 'grad_norm': '0.9708', 'learning_rate': '0.0002772', 'epoch': '1.36'}
+ {'loss': '2.522', 'grad_norm': '0.9035', 'learning_rate': '0.0002758', 'epoch': '1.368'}
+ {'loss': '2.518', 'grad_norm': '0.9394', 'learning_rate': '0.0002744', 'epoch': '1.376'}
+ {'loss': '2.521', 'grad_norm': '0.9519', 'learning_rate': '0.000273', 'epoch': '1.384'}
+ {'loss': '2.518', 'grad_norm': '0.915', 'learning_rate': '0.0002716', 'epoch': '1.393'}
+ 46%|█████████████████▏ | 17000/36624 [19:23<22:15, 14.69it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.87it/s]
+ {'loss': '2.517', 'grad_norm': '0.9166', 'learning_rate': '0.0002702', 'epoch': '1.401'}
+ {'loss': '2.513', 'grad_norm': '0.9377', 'learning_rate': '0.0002689', 'epoch': '1.409'}
+ {'loss': '2.516', 'grad_norm': '0.9178', 'learning_rate': '0.0002675', 'epoch': '1.417'}
+ {'loss': '2.519', 'grad_norm': '0.9151', 'learning_rate': '0.0002661', 'epoch': '1.425'}
+ {'loss': '2.515', 'grad_norm': '0.9612', 'learning_rate': '0.0002647', 'epoch': '1.434'}
+ 48%|█████████████████▋ | 17500/36624 [19:58<21:56, 14.53it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 176.02it/s]
+ {'loss': '2.519', 'grad_norm': '0.9229', 'learning_rate': '0.0002633', 'epoch': '1.442'}
+ {'loss': '2.518', 'grad_norm': '0.9195', 'learning_rate': '0.0002619', 'epoch': '1.45'}
+ {'loss': '2.514', 'grad_norm': '0.9046', 'learning_rate': '0.0002606', 'epoch': '1.458'}
+ {'loss': '2.52', 'grad_norm': '0.9383', 'learning_rate': '0.0002592', 'epoch': '1.466'}
+ {'loss': '2.516', 'grad_norm': '0.9361', 'learning_rate': '0.0002578', 'epoch': '1.474'}
+ 49%|██████████████████▏ | 18000/36624 [20:32<21:18, 14.57it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.81it/s]
+ {'loss': '2.509', 'grad_norm': '0.9623', 'learning_rate': '0.0002564', 'epoch': '1.483'}
+ {'loss': '2.511', 'grad_norm': '0.9627', 'learning_rate': '0.000255', 'epoch': '1.491'}
+ {'loss': '2.516', 'grad_norm': '0.9481', 'learning_rate': '0.0002536', 'epoch': '1.499'}
+ {'loss': '2.516', 'grad_norm': '0.9699', 'learning_rate': '0.0002523', 'epoch': '1.507'}
+ {'loss': '2.514', 'grad_norm': '0.9232', 'learning_rate': '0.0002509', 'epoch': '1.515'}
+ 51%|██████████████████▋ | 18500/36624 [21:06<20:39, 14.63it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.07it/s]
+ {'loss': '2.508', 'grad_norm': '0.8967', 'learning_rate': '0.0002495', 'epoch': '1.524'}
+ {'loss': '2.51', 'grad_norm': '0.9512', 'learning_rate': '0.0002481', 'epoch': '1.532'}
+ {'loss': '2.511', 'grad_norm': '0.9096', 'learning_rate': '0.0002467', 'epoch': '1.54'}
+ {'loss': '2.509', 'grad_norm': '0.9213', 'learning_rate': '0.0002453', 'epoch': '1.548'}
+ {'loss': '2.513', 'grad_norm': '0.9172', 'learning_rate': '0.000244', 'epoch': '1.556'}
+ 52%|███████████████████▏ | 19000/36624 [21:40<20:00, 14.69it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.06it/s]
+ {'loss': '2.51', 'grad_norm': '0.9369', 'learning_rate': '0.0002426', 'epoch': '1.565'}
+ {'loss': '2.512', 'grad_norm': '0.9091', 'learning_rate': '0.0002412', 'epoch': '1.573'}
+ {'loss': '2.512', 'grad_norm': '0.8935', 'learning_rate': '0.0002398', 'epoch': '1.581'}
+ {'loss': '2.51', 'grad_norm': '0.9206', 'learning_rate': '0.0002384', 'epoch': '1.589'}
+ {'loss': '2.507', 'grad_norm': '0.9272', 'learning_rate': '0.000237', 'epoch': '1.597'}
+ 53%|███████████████████▋ | 19500/36624 [22:15<19:28, 14.66it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.45it/s]
+ {'loss': '2.51', 'grad_norm': '0.9499', 'learning_rate': '0.0002356', 'epoch': '1.606'}
+ {'loss': '2.513', 'grad_norm': '0.9095', 'learning_rate': '0.0002343', 'epoch': '1.614'}
+ {'loss': '2.508', 'grad_norm': '0.9086', 'learning_rate': '0.0002329', 'epoch': '1.622'}
+ {'loss': '2.507', 'grad_norm': '0.9389', 'learning_rate': '0.0002315', 'epoch': '1.63'}
+ {'loss': '2.514', 'grad_norm': '0.8963', 'learning_rate': '0.0002301', 'epoch': '1.638'}
+ 55%|████████████████████▏ | 20000/36624 [22:49<18:57, 14.61it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.24it/s]
+ {'loss': '2.506', 'grad_norm': '0.978', 'learning_rate': '0.0002287', 'epoch': '1.646'}
+ {'loss': '2.507', 'grad_norm': '0.9966', 'learning_rate': '0.0002273', 'epoch': '1.655'}
+ {'loss': '2.507', 'grad_norm': '0.9281', 'learning_rate': '0.000226', 'epoch': '1.663'}
+ {'loss': '2.51', 'grad_norm': '0.9063', 'learning_rate': '0.0002246', 'epoch': '1.671'}
+ {'loss': '2.509', 'grad_norm': '0.9708', 'learning_rate': '0.0002232', 'epoch': '1.679'}
+ 56%|████████████████████▋ | 20500/36624 [23:23<18:18, 14.67it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 182.16it/s]
+ {'loss': '2.505', 'grad_norm': '0.946', 'learning_rate': '0.0002218', 'epoch': '1.687'}
+ {'loss': '2.507', 'grad_norm': '0.9184', 'learning_rate': '0.0002204', 'epoch': '1.696'}
+ {'loss': '2.506', 'grad_norm': '0.9702', 'learning_rate': '0.000219', 'epoch': '1.704'}
+ {'loss': '2.499', 'grad_norm': '0.9535', 'learning_rate': '0.0002177', 'epoch': '1.712'}
+ {'loss': '2.502', 'grad_norm': '0.9017', 'learning_rate': '0.0002163', 'epoch': '1.72'}
+ 57%|█████████████████████▏ | 21000/36624 [23:57<18:03, 14.42it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 190.95it/s]
+ {'loss': '2.509', 'grad_norm': '0.9587', 'learning_rate': '0.0002149', 'epoch': '1.728'}
+ {'loss': '2.504', 'grad_norm': '0.9648', 'learning_rate': '0.0002135', 'epoch': '1.737'}
+ {'loss': '2.503', 'grad_norm': '0.953', 'learning_rate': '0.0002121', 'epoch': '1.745'}
+ {'loss': '2.5', 'grad_norm': '0.9445', 'learning_rate': '0.0002107', 'epoch': '1.753'}
+ {'loss': '2.501', 'grad_norm': '0.9414', 'learning_rate': '0.0002093', 'epoch': '1.761'}
+ 59%|█████████████████████▋ | 21500/36624 [24:32<17:10, 14.67it/s]
+ Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 52.34it/s]
+ {'loss': '2.503', 'grad_norm': '0.9309', 'learning_rate': '0.000208', 'epoch': '1.769'}
+ {'loss': '2.502', 'grad_norm': '0.9301', 'learning_rate': '0.0002066', 'epoch': '1.778'}
+ {'loss': '2.504', 'grad_norm': '0.895', 'learning_rate': '0.0002052', 'epoch': '1.786'}
+ {'loss': '2.502', 'grad_norm': '0.9428', 'learning_rate': '0.0002038', 'epoch': '1.794'}
+ {'loss': '2.501', 'grad_norm': '0.9539', 'learning_rate': '0.0002024', 'epoch': '1.802'}
+ 60%|██████████████████████▏ | 22000/36624 [25:06<16:28, 14.79it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 203.41it/s]
319
+ {'loss': '2.5', 'grad_norm': '0.9179', 'learning_rate': '0.000201', 'epoch': '1.81'}
+ {'loss': '2.501', 'grad_norm': '0.9195', 'learning_rate': '0.0001997', 'epoch': '1.819'}
+ {'loss': '2.499', 'grad_norm': '1.047', 'learning_rate': '0.0001983', 'epoch': '1.827'}
+ {'loss': '2.499', 'grad_norm': '0.931', 'learning_rate': '0.0001969', 'epoch': '1.835'}
+ {'loss': '2.499', 'grad_norm': '0.9269', 'learning_rate': '0.0001955', 'epoch': '1.843'}
+ 61%|██████████████████████▋              | 22500/36624 [25:40<16:03, 14.65it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.41it/s]
+ {'loss': '2.501', 'grad_norm': '0.939', 'learning_rate': '0.0001941', 'epoch': '1.851'}
+ {'loss': '2.495', 'grad_norm': '0.9119', 'learning_rate': '0.0001927', 'epoch': '1.859'}
+ {'loss': '2.499', 'grad_norm': '0.9755', 'learning_rate': '0.0001914', 'epoch': '1.868'}
+ {'loss': '2.497', 'grad_norm': '0.9444', 'learning_rate': '0.00019', 'epoch': '1.876'}
+ {'loss': '2.496', 'grad_norm': '0.9551', 'learning_rate': '0.0001886', 'epoch': '1.884'}
+ 63%|███████████████████████▏             | 23000/36624 [26:15<15:34, 14.58it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.11it/s]
+ {'loss': '2.5', 'grad_norm': '0.9524', 'learning_rate': '0.0001872', 'epoch': '1.892'}
+ {'loss': '2.502', 'grad_norm': '0.9583', 'learning_rate': '0.0001858', 'epoch': '1.9'}
+ {'loss': '2.497', 'grad_norm': '0.9206', 'learning_rate': '0.0001844', 'epoch': '1.909'}
+ {'loss': '2.495', 'grad_norm': '0.9133', 'learning_rate': '0.0001831', 'epoch': '1.917'}
+ {'loss': '2.491', 'grad_norm': '0.9201', 'learning_rate': '0.0001817', 'epoch': '1.925'}
+ 64%|███████████████████████▋             | 23500/36624 [26:49<14:51, 14.72it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 175.74it/s]
+ {'loss': '2.499', 'grad_norm': '0.9536', 'learning_rate': '0.0001803', 'epoch': '1.933'}
+ {'loss': '2.497', 'grad_norm': '0.9332', 'learning_rate': '0.0001789', 'epoch': '1.941'}
+ {'loss': '2.491', 'grad_norm': '0.9358', 'learning_rate': '0.0001775', 'epoch': '1.95'}
+ {'loss': '2.493', 'grad_norm': '0.9568', 'learning_rate': '0.0001761', 'epoch': '1.958'}
+ {'loss': '2.494', 'grad_norm': '0.9585', 'learning_rate': '0.0001747', 'epoch': '1.966'}
+ 66%|████████████████████████▏            | 24000/36624 [27:23<14:14, 14.78it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 178.42it/s]
+ {'loss': '2.496', 'grad_norm': '0.9171', 'learning_rate': '0.0001734', 'epoch': '1.974'}
+ {'loss': '2.495', 'grad_norm': '0.9789', 'learning_rate': '0.000172', 'epoch': '1.982'}
+ {'loss': '2.493', 'grad_norm': '0.9548', 'learning_rate': '0.0001706', 'epoch': '1.991'}
+ {'loss': '2.495', 'grad_norm': '1.021', 'learning_rate': '0.0001692', 'epoch': '1.999'}
+ {'loss': '2.486', 'grad_norm': '0.9158', 'learning_rate': '0.0001678', 'epoch': '2.007'}
+ 67%|████████████████████████▊            | 24500/36624 [27:58<14:15, 14.17it/s]
+ Writing model shards: 100%|███████████████████████| 1/1 [00:00<00:00, 42.99it/s]
+ {'loss': '2.494', 'grad_norm': '0.9763', 'learning_rate': '0.0001664', 'epoch': '2.015'}
+ {'loss': '2.495', 'grad_norm': '0.9613', 'learning_rate': '0.0001651', 'epoch': '2.023'}
+ {'loss': '2.489', 'grad_norm': '0.9664', 'learning_rate': '0.0001637', 'epoch': '2.031'}
+ {'loss': '2.498', 'grad_norm': '1.016', 'learning_rate': '0.0001623', 'epoch': '2.04'}
+ {'loss': '2.488', 'grad_norm': '0.9416', 'learning_rate': '0.0001609', 'epoch': '2.048'}
+ 68%|█████████████████████████▎           | 25000/36624 [28:33<13:16, 14.60it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 180.22it/s]
+ {'loss': '2.489', 'grad_norm': '0.9494', 'learning_rate': '0.0001595', 'epoch': '2.056'}
+ {'loss': '2.486', 'grad_norm': '0.9252', 'learning_rate': '0.0001581', 'epoch': '2.064'}
+ {'loss': '2.492', 'grad_norm': '0.9568', 'learning_rate': '0.0001568', 'epoch': '2.072'}
+ {'loss': '2.492', 'grad_norm': '0.9466', 'learning_rate': '0.0001554', 'epoch': '2.081'}
+ {'loss': '2.486', 'grad_norm': '0.9349', 'learning_rate': '0.000154', 'epoch': '2.089'}
+ 70%|█████████████████████████▊           | 25500/36624 [29:07<12:43, 14.57it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 167.93it/s]
+ {'loss': '2.489', 'grad_norm': '0.9689', 'learning_rate': '0.0001526', 'epoch': '2.097'}
+ {'loss': '2.488', 'grad_norm': '0.9909', 'learning_rate': '0.0001512', 'epoch': '2.105'}
+ {'loss': '2.49', 'grad_norm': '0.9703', 'learning_rate': '0.0001498', 'epoch': '2.113'}
+ {'loss': '2.487', 'grad_norm': '1.02', 'learning_rate': '0.0001484', 'epoch': '2.122'}
+ {'loss': '2.485', 'grad_norm': '1.005', 'learning_rate': '0.0001471', 'epoch': '2.13'}
+ 71%|██████████████████████████▎          | 26000/36624 [29:41<12:02, 14.70it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.60it/s]
+ {'loss': '2.486', 'grad_norm': '0.9853', 'learning_rate': '0.0001457', 'epoch': '2.138'}
+ {'loss': '2.485', 'grad_norm': '0.9916', 'learning_rate': '0.0001443', 'epoch': '2.146'}
+ {'loss': '2.488', 'grad_norm': '0.9691', 'learning_rate': '0.0001429', 'epoch': '2.154'}
+ {'loss': '2.488', 'grad_norm': '0.9773', 'learning_rate': '0.0001415', 'epoch': '2.163'}
+ {'loss': '2.482', 'grad_norm': '0.953', 'learning_rate': '0.0001401', 'epoch': '2.171'}
+ 72%|██████████████████████████▊          | 26500/36624 [30:15<11:31, 14.63it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.70it/s]
+ {'loss': '2.487', 'grad_norm': '0.9432', 'learning_rate': '0.0001388', 'epoch': '2.179'}
+ {'loss': '2.487', 'grad_norm': '0.962', 'learning_rate': '0.0001374', 'epoch': '2.187'}
+ {'loss': '2.488', 'grad_norm': '0.9646', 'learning_rate': '0.000136', 'epoch': '2.195'}
+ {'loss': '2.483', 'grad_norm': '0.9822', 'learning_rate': '0.0001346', 'epoch': '2.203'}
+ {'loss': '2.484', 'grad_norm': '0.9462', 'learning_rate': '0.0001332', 'epoch': '2.212'}
+ 74%|███████████████████████████▎         | 27000/36624 [30:50<11:02, 14.53it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 148.78it/s]
+ {'loss': '2.485', 'grad_norm': '0.983', 'learning_rate': '0.0001318', 'epoch': '2.22'}
+ {'loss': '2.483', 'grad_norm': '0.9827', 'learning_rate': '0.0001305', 'epoch': '2.228'}
+ {'loss': '2.486', 'grad_norm': '0.987', 'learning_rate': '0.0001291', 'epoch': '2.236'}
+ {'loss': '2.486', 'grad_norm': '1.003', 'learning_rate': '0.0001277', 'epoch': '2.244'}
+ {'loss': '2.487', 'grad_norm': '0.9763', 'learning_rate': '0.0001263', 'epoch': '2.253'}
+ 75%|███████████████████████████▊         | 27500/36624 [31:24<10:20, 14.70it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.76it/s]
+ {'loss': '2.484', 'grad_norm': '0.9718', 'learning_rate': '0.0001249', 'epoch': '2.261'}
+ {'loss': '2.482', 'grad_norm': '0.964', 'learning_rate': '0.0001235', 'epoch': '2.269'}
+ {'loss': '2.486', 'grad_norm': '0.9918', 'learning_rate': '0.0001221', 'epoch': '2.277'}
+ {'loss': '2.482', 'grad_norm': '0.9895', 'learning_rate': '0.0001208', 'epoch': '2.285'}
+ {'loss': '2.483', 'grad_norm': '0.9978', 'learning_rate': '0.0001194', 'epoch': '2.294'}
+ 76%|████████████████████████████▎        | 28000/36624 [31:58<10:07, 14.20it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 187.44it/s]
+ {'loss': '2.484', 'grad_norm': '1.008', 'learning_rate': '0.000118', 'epoch': '2.302'}
+ {'loss': '2.484', 'grad_norm': '1.029', 'learning_rate': '0.0001166', 'epoch': '2.31'}
+ {'loss': '2.482', 'grad_norm': '0.9828', 'learning_rate': '0.0001152', 'epoch': '2.318'}
+ {'loss': '2.484', 'grad_norm': '0.9815', 'learning_rate': '0.0001138', 'epoch': '2.326'}
+ {'loss': '2.484', 'grad_norm': '0.97', 'learning_rate': '0.0001125', 'epoch': '2.335'}
+ 78%|████████████████████████████▊        | 28500/36624 [32:32<09:16, 14.61it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 194.61it/s]
+ {'loss': '2.48', 'grad_norm': '1.007', 'learning_rate': '0.0001111', 'epoch': '2.343'}
+ {'loss': '2.477', 'grad_norm': '1.018', 'learning_rate': '0.0001097', 'epoch': '2.351'}
+ {'loss': '2.477', 'grad_norm': '0.945', 'learning_rate': '0.0001083', 'epoch': '2.359'}
+ {'loss': '2.479', 'grad_norm': '1.001', 'learning_rate': '0.0001069', 'epoch': '2.367'}
+ {'loss': '2.48', 'grad_norm': '0.9743', 'learning_rate': '0.0001055', 'epoch': '2.376'}
+ 79%|█████████████████████████████▎       | 29000/36624 [33:07<08:43, 14.57it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 193.78it/s]
+ {'loss': '2.48', 'grad_norm': '1.005', 'learning_rate': '0.0001042', 'epoch': '2.384'}
+ {'loss': '2.482', 'grad_norm': '1.016', 'learning_rate': '0.0001028', 'epoch': '2.392'}
+ {'loss': '2.473', 'grad_norm': '0.9854', 'learning_rate': '0.0001014', 'epoch': '2.4'}
+ {'loss': '2.476', 'grad_norm': '0.9408', 'learning_rate': '0.0001', 'epoch': '2.408'}
+ {'loss': '2.475', 'grad_norm': '0.9968', 'learning_rate': '9.862e-05', 'epoch': '2.416'}
+ 81%|█████████████████████████████▊       | 29500/36624 [33:41<08:09, 14.55it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 189.43it/s]
+ {'loss': '2.476', 'grad_norm': '1.016', 'learning_rate': '9.723e-05', 'epoch': '2.425'}
+ {'loss': '2.48', 'grad_norm': '0.9962', 'learning_rate': '9.585e-05', 'epoch': '2.433'}
+ {'loss': '2.478', 'grad_norm': '1.032', 'learning_rate': '9.447e-05', 'epoch': '2.441'}
+ {'loss': '2.477', 'grad_norm': '1.001', 'learning_rate': '9.308e-05', 'epoch': '2.449'}
+ {'loss': '2.477', 'grad_norm': '0.9866', 'learning_rate': '9.17e-05', 'epoch': '2.457'}
+ 82%|██████████████████████████████▎      | 30000/36624 [34:15<07:36, 14.53it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.80it/s]
+ {'loss': '2.476', 'grad_norm': '1.028', 'learning_rate': '9.031e-05', 'epoch': '2.466'}
+ {'loss': '2.475', 'grad_norm': '0.9801', 'learning_rate': '8.893e-05', 'epoch': '2.474'}
+ {'loss': '2.48', 'grad_norm': '0.9972', 'learning_rate': '8.755e-05', 'epoch': '2.482'}
+ {'loss': '2.479', 'grad_norm': '0.9972', 'learning_rate': '8.616e-05', 'epoch': '2.49'}
+ {'loss': '2.479', 'grad_norm': '1.076', 'learning_rate': '8.478e-05', 'epoch': '2.498'}
+ 83%|██████████████████████████████▊      | 30500/36624 [34:50<07:03, 14.47it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 156.83it/s]
+ {'loss': '2.472', 'grad_norm': '0.9888', 'learning_rate': '8.339e-05', 'epoch': '2.507'}
+ {'loss': '2.477', 'grad_norm': '1.045', 'learning_rate': '8.201e-05', 'epoch': '2.515'}
+ {'loss': '2.475', 'grad_norm': '1.039', 'learning_rate': '8.063e-05', 'epoch': '2.523'}
+ {'loss': '2.476', 'grad_norm': '1.038', 'learning_rate': '7.924e-05', 'epoch': '2.531'}
+ {'loss': '2.474', 'grad_norm': '1.013', 'learning_rate': '7.786e-05', 'epoch': '2.539'}
+ 85%|███████████████████████████████▎     | 31000/36624 [35:25<06:26, 14.55it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 173.79it/s]
+ {'loss': '2.478', 'grad_norm': '0.9911', 'learning_rate': '7.647e-05', 'epoch': '2.548'}
+ {'loss': '2.475', 'grad_norm': '1.003', 'learning_rate': '7.509e-05', 'epoch': '2.556'}
+ {'loss': '2.476', 'grad_norm': '0.9986', 'learning_rate': '7.37e-05', 'epoch': '2.564'}
+ {'loss': '2.475', 'grad_norm': '1.034', 'learning_rate': '7.232e-05', 'epoch': '2.572'}
+ {'loss': '2.476', 'grad_norm': '0.9733', 'learning_rate': '7.094e-05', 'epoch': '2.58'}
+ 86%|███████████████████████████████▊     | 31500/36624 [35:59<05:53, 14.51it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.57it/s]
+ {'loss': '2.468', 'grad_norm': '1.055', 'learning_rate': '6.955e-05', 'epoch': '2.588'}
+ {'loss': '2.475', 'grad_norm': '1.026', 'learning_rate': '6.817e-05', 'epoch': '2.597'}
+ {'loss': '2.476', 'grad_norm': '1.029', 'learning_rate': '6.678e-05', 'epoch': '2.605'}
+ {'loss': '2.471', 'grad_norm': '1.032', 'learning_rate': '6.54e-05', 'epoch': '2.613'}
+ {'loss': '2.473', 'grad_norm': '1.007', 'learning_rate': '6.402e-05', 'epoch': '2.621'}
+ 87%|████████████████████████████████▎    | 32000/36624 [36:34<05:15, 14.66it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 196.78it/s]
+ {'loss': '2.47', 'grad_norm': '1.028', 'learning_rate': '6.263e-05', 'epoch': '2.629'}
+ {'loss': '2.47', 'grad_norm': '0.9969', 'learning_rate': '6.125e-05', 'epoch': '2.638'}
+ {'loss': '2.473', 'grad_norm': '1.037', 'learning_rate': '5.986e-05', 'epoch': '2.646'}
+ {'loss': '2.47', 'grad_norm': '1', 'learning_rate': '5.848e-05', 'epoch': '2.654'}
+ {'loss': '2.468', 'grad_norm': '1.034', 'learning_rate': '5.71e-05', 'epoch': '2.662'}
+ 89%|████████████████████████████████▊    | 32500/36624 [37:08<04:47, 14.34it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 181.32it/s]
+ {'loss': '2.471', 'grad_norm': '1.073', 'learning_rate': '5.571e-05', 'epoch': '2.67'}
+ {'loss': '2.47', 'grad_norm': '1.003', 'learning_rate': '5.433e-05', 'epoch': '2.679'}
+ {'loss': '2.472', 'grad_norm': '1.033', 'learning_rate': '5.294e-05', 'epoch': '2.687'}
+ {'loss': '2.469', 'grad_norm': '1.076', 'learning_rate': '5.156e-05', 'epoch': '2.695'}
+ {'loss': '2.469', 'grad_norm': '1.061', 'learning_rate': '5.017e-05', 'epoch': '2.703'}
+ 90%|█████████████████████████████████▎   | 33000/36624 [37:43<04:11, 14.39it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 188.79it/s]
+ {'loss': '2.468', 'grad_norm': '1.043', 'learning_rate': '4.879e-05', 'epoch': '2.711'}
+ {'loss': '2.474', 'grad_norm': '1.115', 'learning_rate': '4.741e-05', 'epoch': '2.72'}
+ {'loss': '2.469', 'grad_norm': '1.028', 'learning_rate': '4.602e-05', 'epoch': '2.728'}
+ {'loss': '2.468', 'grad_norm': '1.017', 'learning_rate': '4.464e-05', 'epoch': '2.736'}
+ {'loss': '2.471', 'grad_norm': '0.9948', 'learning_rate': '4.325e-05', 'epoch': '2.744'}
+ 91%|█████████████████████████████████▊   | 33500/36624 [38:18<03:37, 14.39it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 155.26it/s]
+ {'loss': '2.466', 'grad_norm': '1.027', 'learning_rate': '4.187e-05', 'epoch': '2.752'}
+ {'loss': '2.467', 'grad_norm': '1.023', 'learning_rate': '4.049e-05', 'epoch': '2.761'}
+ {'loss': '2.47', 'grad_norm': '1.021', 'learning_rate': '3.91e-05', 'epoch': '2.769'}
+ {'loss': '2.468', 'grad_norm': '0.9946', 'learning_rate': '3.772e-05', 'epoch': '2.777'}
+ {'loss': '2.464', 'grad_norm': '1.031', 'learning_rate': '3.633e-05', 'epoch': '2.785'}
+ 93%|██████████████████████████████████▎  | 34000/36624 [38:52<03:00, 14.52it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 183.19it/s]
+ {'loss': '2.467', 'grad_norm': '1.05', 'learning_rate': '3.495e-05', 'epoch': '2.793'}
+ {'loss': '2.469', 'grad_norm': '1.043', 'learning_rate': '3.356e-05', 'epoch': '2.801'}
+ {'loss': '2.468', 'grad_norm': '0.9955', 'learning_rate': '3.218e-05', 'epoch': '2.81'}
+ {'loss': '2.461', 'grad_norm': '0.9882', 'learning_rate': '3.08e-05', 'epoch': '2.818'}
+ {'loss': '2.463', 'grad_norm': '1.023', 'learning_rate': '2.941e-05', 'epoch': '2.826'}
+ 94%|██████████████████████████████████▊  | 34500/36624 [39:27<02:28, 14.27it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 201.85it/s]
+ {'loss': '2.464', 'grad_norm': '1.062', 'learning_rate': '2.803e-05', 'epoch': '2.834'}
+ {'loss': '2.465', 'grad_norm': '1.065', 'learning_rate': '2.664e-05', 'epoch': '2.842'}
+ {'loss': '2.468', 'grad_norm': '1.01', 'learning_rate': '2.526e-05', 'epoch': '2.851'}
+ {'loss': '2.463', 'grad_norm': '0.9994', 'learning_rate': '2.388e-05', 'epoch': '2.859'}
+ {'loss': '2.464', 'grad_norm': '1.032', 'learning_rate': '2.249e-05', 'epoch': '2.867'}
+ 96%|███████████████████████████████████▎ | 35000/36624 [40:02<01:53, 14.32it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 174.93it/s]
+ {'loss': '2.467', 'grad_norm': '1.233', 'learning_rate': '2.111e-05', 'epoch': '2.875'}
+ {'loss': '2.466', 'grad_norm': '1.069', 'learning_rate': '1.972e-05', 'epoch': '2.883'}
+ {'loss': '2.469', 'grad_norm': '1.015', 'learning_rate': '1.834e-05', 'epoch': '2.892'}
+ {'loss': '2.466', 'grad_norm': '1.033', 'learning_rate': '1.696e-05', 'epoch': '2.9'}
+ {'loss': '2.463', 'grad_norm': '1.04', 'learning_rate': '1.557e-05', 'epoch': '2.908'}
+ 97%|███████████████████████████████████▊ | 35500/36624 [40:37<01:17, 14.42it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 184.19it/s]
+ {'loss': '2.469', 'grad_norm': '1.007', 'learning_rate': '1.419e-05', 'epoch': '2.916'}
+ {'loss': '2.468', 'grad_norm': '1.025', 'learning_rate': '1.28e-05', 'epoch': '2.924'}
+ {'loss': '2.465', 'grad_norm': '1.033', 'learning_rate': '1.142e-05', 'epoch': '2.933'}
+ {'loss': '2.464', 'grad_norm': '1.045', 'learning_rate': '1.003e-05', 'epoch': '2.941'}
+ {'loss': '2.464', 'grad_norm': '1.008', 'learning_rate': '8.651e-06', 'epoch': '2.949'}
+ 98%|████████████████████████████████████▎| 36000/36624 [41:11<00:43, 14.40it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 163.65it/s]
+ {'loss': '2.463', 'grad_norm': '1.035', 'learning_rate': '7.267e-06', 'epoch': '2.957'}
+ {'loss': '2.46', 'grad_norm': '1.018', 'learning_rate': '5.883e-06', 'epoch': '2.965'}
+ {'loss': '2.463', 'grad_norm': '1.01', 'learning_rate': '4.498e-06', 'epoch': '2.973'}
+ {'loss': '2.463', 'grad_norm': '1.014', 'learning_rate': '3.114e-06', 'epoch': '2.982'}
+ {'loss': '2.459', 'grad_norm': '0.9757', 'learning_rate': '1.73e-06', 'epoch': '2.99'}
+ 100%|████████████████████████████████████▊| 36500/36624 [41:46<00:08, 14.23it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 177.29it/s]
+ {'loss': '2.464', 'grad_norm': '1.004', 'learning_rate': '3.46e-07', 'epoch': '2.998'}
+ 100%|████████████████████████████████████▉| 36623/36624 [41:55<00:00, 11.70it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 197.55it/s]
+ {'train_runtime': '2516', 'train_samples_per_second': '1863', 'train_steps_per_second': '14.56', 'train_loss': '2.575', 'epoch': '3'}
+ 100%|█████████████████████████████████████| 36624/36624 [41:56<00:00, 14.56it/s]
+ Writing model shards: 100%|██████████████████████| 1/1 [00:00<00:00, 160.99it/s]
+ [*] Training finished.