Downloading shards: 0%| | 0/4 [00:00<?, ?it/s]
Downloading shards: 100%|██████████| 4/4 [00:00<00:00, 15019.89it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|██▌ | 1/4 [02:32<07:37, 152.47s/it]
Loading checkpoint shards: 50%|█████ | 2/4 [05:15<05:17, 158.95s/it]
Loading checkpoint shards: 75%|███████▌ | 3/4 [06:29<02:00, 120.08s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [08:05<00:00, 110.39s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [08:05<00:00, 121.32s/it]
2025-04-12 02:22:34 - INFO - __main__ - *** Initializing model kwargs ***
Downloading shards: 0%| | 0/4 [00:00<?, ?it/s]
Downloading shards: 100%|██████████| 4/4 [00:00<00:00, 15087.42it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 25%|██▌ | 1/4 [02:49<08:27, 169.15s/it]
Loading checkpoint shards: 50%|█████ | 2/4 [05:19<05:16, 158.29s/it]
Loading checkpoint shards: 75%|███████▌ | 3/4 [06:36<02:01, 121.00s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [08:40<00:00, 122.19s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [08:40<00:00, 130.12s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-04-12 02:31:30 - INFO - __main__ - *** Train ***
2025-04-12 02:31:30 - INFO - __main__ - DeepseekV2ForCausalLM(
  (model): DeepseekV2Model(
    (embed_tokens): Embedding(102400, 2048)
    (layers): ModuleList(
      (0): DeepseekV2DecoderLayer(
        (self_attn): DeepseekV2FlashAttention2(
          (q_proj): Linear(in_features=2048, out_features=3072, bias=False)
          (kv_a_proj_with_mqa): Linear(in_features=2048, out_features=576, bias=False)
          (kv_a_layernorm): DeepseekV2RMSNorm()
          (kv_b_proj): Linear(in_features=512, out_features=4096, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): DeepseekV2YarnRotaryEmbedding()
        )
        (mlp): DeepseekV2MLP(
          (gate_proj): Linear(in_features=2048, out_features=10944, bias=False)
          (up_proj): Linear(in_features=2048, out_features=10944, bias=False)
          (down_proj): Linear(in_features=10944, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
      (1-26): 26 x DeepseekV2DecoderLayer(
        (self_attn): DeepseekV2FlashAttention2(
          (q_proj): Linear(in_features=2048, out_features=3072, bias=False)
          (kv_a_proj_with_mqa): Linear(in_features=2048, out_features=576, bias=False)
          (kv_a_layernorm): DeepseekV2RMSNorm()
          (kv_b_proj): Linear(in_features=512, out_features=4096, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): DeepseekV2YarnRotaryEmbedding()
        )
        (mlp): DeepseekV2MoE(
          (experts): ModuleList(
            (0-63): 64 x DeepseekV2MLP(
              (gate_proj): Linear(in_features=2048, out_features=1408, bias=False)
              (up_proj): Linear(in_features=2048, out_features=1408, bias=False)
              (down_proj): Linear(in_features=1408, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
          )
          (gate): MoEGate()
          (shared_experts): DeepseekV2MLP(
            (gate_proj): Linear(in_features=2048, out_features=2816, bias=False)
            (up_proj): Linear(in_features=2048, out_features=2816, bias=False)
            (down_proj): Linear(in_features=2816, out_features=2048, bias=False)
            (act_fn): SiLU()
          )
        )
        (input_layernorm): DeepseekV2RMSNorm()
        (post_attention_layernorm): DeepseekV2RMSNorm()
      )
    )
    (norm): DeepseekV2RMSNorm()
  )
  (lm_head): Linear(in_features=2048, out_features=102400, bias=False)
)
Parameter Offload: Total persistent parameters: 126464 in 82 params
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: hector_ (hector_-carnegie-mellon-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.8
wandb: Run data is saved locally in /ocean/projects/cis240137p/hhe4/deepseek/open-r1/wandb/run-20250412_023140-uclca8i9
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run data/DeepSeek-Coder-V2-Lite-Instruct
wandb: ⭐️ View project at https://wandb.ai/hector_-carnegie-mellon-university/huggingface
wandb: 🚀 View run at https://wandb.ai/hector_-carnegie-mellon-university/huggingface/runs/uclca8i9
0%| | 0/27018 [00:00<?, ?it/s]
0%| | 1/27018 [02:09<968:56:05, 129.11s/it]
{'loss': 1.0767, 'grad_norm': 3.761256242740549, 'learning_rate': 1.850481125092524e-08, 'mean_token_accuracy': 0.7361599057912827, 'epoch': 0.0}
0%| | 2/27018 [02:45<557:46:13, 74.33s/it]
{'loss': 1.0521, 'grad_norm': 3.3387507575165776, 'learning_rate': 3.700962250185048e-08, 'mean_token_accuracy': 0.7391770929098129, 'epoch': 0.0}
0%| | 3/27018 [03:21<428:49:44, 57.15s/it]
{'loss': 1.0531, 'grad_norm': 4.061402524697594, 'learning_rate': 5.551443375277572e-08, 'mean_token_accuracy': 0.7451232373714447, 'epoch': 0.0}
0%| | 4/27018 [03:57<364:56:09, 48.63s/it]
{'loss': 1.0937, 'grad_norm': 4.054227716840588, 'learning_rate': 7.401924500370096e-08, 'mean_token_accuracy': 0.7327221184968948, 'epoch': 0.0}
0%| | 5/27018 [04:33<332:16:26, 44.28s/it]
{'loss': 1.0599, 'grad_norm': 3.682744277789701, 'learning_rate': 9.25240562546262e-08, 'mean_token_accuracy': 0.7354009747505188, 'epoch': 0.0}
0%| | 6/27018 [05:09<309:39:42, 41.27s/it]
{'loss': 1.1111, 'grad_norm': 4.125894488396847, 'learning_rate': 1.1102886750555144e-07, 'mean_token_accuracy': 0.7249621748924255, 'epoch': 0.0}
0%| | 7/27018 [05:45<297:45:40, 39.69s/it]
{'loss': 1.0761, 'grad_norm': 3.721979003437037, 'learning_rate': 1.295336787564767e-07, 'mean_token_accuracy': 0.7334488332271576, 'epoch': 0.0}
0%| | 8/27018 [06:21<287:43:51, 38.35s/it]
{'loss': 1.0865, 'grad_norm': 3.7726095466697966, 'learning_rate': 1.4803849000740193e-07, 'mean_token_accuracy': 0.7343186438083649, 'epoch': 0.0}
0%| | 9/27018 [06:57<283:41:28, 37.81s/it]
{'loss': 1.0436, 'grad_norm': 3.610468598945047, 'learning_rate': 1.6654330125832717e-07, 'mean_token_accuracy': 0.7429732382297516, 'epoch': 0.0}
0%| | 10/27018 [07:32<277:22:11, 36.97s/it]
{'loss': 1.0122, 'grad_norm': 3.5965404038806796, 'learning_rate': 1.850481125092524e-07, 'mean_token_accuracy': 0.7525608986616135, 'epoch': 0.0}
0%| | 11/27018 [08:09<275:24:40, 36.71s/it]
{'loss': 1.0813, 'grad_norm': 3.931084760956039, 'learning_rate': 2.0355292376017764e-07, 'mean_token_accuracy': 0.7335087954998016, 'epoch': 0.0}
0%| | 12/27018 [08:44<272:56:28, 36.38s/it]
{'loss': 1.0753, 'grad_norm': 3.7079464351417846, 'learning_rate': 2.220577350111029e-07, 'mean_token_accuracy': 0.7332916706800461, 'epoch': 0.0}
0%| | 13/27018 [09:20<271:17:08, 36.16s/it]
{'loss': 1.0568, 'grad_norm': 4.125622861984457, 'learning_rate': 2.4056254626202816e-07, 'mean_token_accuracy': 0.7445002794265747, 'epoch': 0.0}
0%| | 14/27018 [09:56<270:31:15, 36.06s/it]
{'loss': 1.0557, 'grad_norm': 4.003037058726355, 'learning_rate': 2.590673575129534e-07, 'mean_token_accuracy': 0.7420943975448608, 'epoch': 0.0}
0%| | 15/27018 [10:32<270:03:14, 36.00s/it]
{'loss': 1.0727, 'grad_norm': 3.790358899157455, 'learning_rate': 2.7757216876387866e-07, 'mean_token_accuracy': 0.7354038208723068, 'epoch': 0.0}
0%| | 16/27018 [11:07<268:57:33, 35.86s/it]
{'loss': 1.0238, 'grad_norm': 3.3304393221181923, 'learning_rate': 2.9607698001480385e-07, 'mean_token_accuracy': 0.7453643530607224, 'epoch': 0.0}
0%| | 17/27018 [11:43<268:32:58, 35.81s/it]
{'loss': 1.0195, 'grad_norm': 3.4746396715804218, 'learning_rate': 3.145817912657291e-07, 'mean_token_accuracy': 0.750056728720665, 'epoch': 0.0}
0%| | 18/27018 [12:19<268:35:39, 35.81s/it]
{'loss': 1.0552, 'grad_norm': 3.8093720705716314, 'learning_rate': 3.3308660251665435e-07, 'mean_token_accuracy': 0.7415639907121658, 'epoch': 0.0}
0%| | 19/27018 [12:56<271:16:28, 36.17s/it]
{'loss': 1.0326, 'grad_norm': 3.1067829742051125, 'learning_rate': 3.515914137675796e-07, 'mean_token_accuracy': 0.7420114874839783, 'epoch': 0.0}
0%| | 20/27018 [13:33<273:55:05, 36.53s/it]
{'loss': 1.0535, 'grad_norm': 3.795668018082126, 'learning_rate': 3.700962250185048e-07, 'mean_token_accuracy': 0.7384360879659653, 'epoch': 0.0}
0%| | 21/27018 [14:10<274:46:19, 36.64s/it]
{'loss': 1.0761, 'grad_norm': 3.8123932192381824, 'learning_rate': 3.886010362694301e-07, 'mean_token_accuracy': 0.7331705838441849, 'epoch': 0.0}
0%| | 22/27018 [14:46<272:59:09, 36.40s/it]
{'loss': 1.0696, 'grad_norm': 3.550652610501517, 'learning_rate': 4.071058475203553e-07, 'mean_token_accuracy': 0.7343079447746277, 'epoch': 0.0}
0%| | 23/27018 [15:22<271:47:36, 36.25s/it]
{'loss': 1.0861, 'grad_norm': 3.5457909221254247, 'learning_rate': 4.256106587712806e-07, 'mean_token_accuracy': 0.7308776825666428, 'epoch': 0.0}
0%| | 24/27018 [15:58<272:06:31, 36.29s/it]
{'loss': 1.0561, 'grad_norm': 3.340792237935092, 'learning_rate': 4.441154700222058e-07, 'mean_token_accuracy': 0.7375619858503342, 'epoch': 0.0}
0%| | 25/27018 [16:33<269:44:51, 35.98s/it]
{'loss': 0.9935, 'grad_norm': 3.2304598470651613, 'learning_rate': 4.626202812731311e-07, 'mean_token_accuracy': 0.7519623041152954, 'epoch': 0.0}
0%| | 26/27018 [17:10<270:36:15, 36.09s/it]
{'loss': 1.0704, 'grad_norm': 3.293349605415512, 'learning_rate': 4.811250925240563e-07, 'mean_token_accuracy': 0.7352389842271805, 'epoch': 0.0}
0%| | 27/27018 [17:45<269:02:34, 35.88s/it]
{'loss': 1.0484, 'grad_norm': 3.082820536283528, 'learning_rate': 4.996299037749815e-07, 'mean_token_accuracy': 0.7379477769136429, 'epoch': 0.0}
0%| | 28/27018 [18:21<270:15:27, 36.05s/it]
{'loss': 1.0014, 'grad_norm': 3.012780099153018, 'learning_rate': 5.181347150259068e-07, 'mean_token_accuracy': 0.7547429352998734, 'epoch': 0.0}
0%| | 29/27018 [18:57<268:35:06, 35.83s/it]
{'loss': 1.0073, 'grad_norm': 3.1423193425235794, 'learning_rate': 5.36639526276832e-07, 'mean_token_accuracy': 0.7512795478105545, 'epoch': 0.0}
0%| | 30/27018 [19:33<270:11:59, 36.04s/it]
{'loss': 1.0102, 'grad_norm': 2.8182207670842625, 'learning_rate': 5.551443375277573e-07, 'mean_token_accuracy': 0.7494419515132904, 'epoch': 0.0}
0%| | 31/27018 [20:09<269:34:34, 35.96s/it]
{'loss': 0.9696, 'grad_norm': 2.6473653597376305, 'learning_rate': 5.736491487786825e-07, 'mean_token_accuracy': 0.7585541754961014, 'epoch': 0.0}
0%| | 32/27018 [20:45<268:56:04, 35.88s/it]
{'loss': 1.0266, 'grad_norm': 2.711809066954901, 'learning_rate': 5.921539600296077e-07, 'mean_token_accuracy': 0.7423960566520691, 'epoch': 0.0}
0%| | 33/27018 [21:20<268:32:48, 35.83s/it]
{'loss': 1.0717, 'grad_norm': 2.708428658288387, 'learning_rate': 6.10658771280533e-07, 'mean_token_accuracy': 0.7314043194055557, 'epoch': 0.0}
0%| | 34/27018 [21:56<268:11:24, 35.78s/it]
{'loss': 1.0825, 'grad_norm': 2.709912464131371, 'learning_rate': 6.291635825314582e-07, 'mean_token_accuracy': 0.7291602790355682, 'epoch': 0.0}
0%| | 35/27018 [22:32<268:01:02, 35.76s/it]
{'loss': 1.0136, 'grad_norm': 2.408087149084196, 'learning_rate': 6.476683937823834e-07, 'mean_token_accuracy': 0.7427251785993576, 'epoch': 0.0}
0%| | 36/27018 [23:07<267:20:14, 35.67s/it]
{'loss': 0.991, 'grad_norm': 2.3137900824393136, 'learning_rate': 6.661732050333087e-07, 'mean_token_accuracy': 0.749654158949852, 'epoch': 0.0}
0%| | 37/27018 [23:43<267:11:06, 35.65s/it]
{'loss': 1.0549, 'grad_norm': 2.6148814226117136, 'learning_rate': 6.846780162842339e-07, 'mean_token_accuracy': 0.736319050192833, 'epoch': 0.0}
0%| | 38/27018 [24:18<266:56:56, 35.62s/it]
{'loss': 0.981, 'grad_norm': 2.371946280431981, 'learning_rate': 7.031828275351592e-07, 'mean_token_accuracy': 0.7526475787162781, 'epoch': 0.0}
0%| | 39/27018 [24:54<266:48:10, 35.60s/it]
{'loss': 1.0021, 'grad_norm': 2.193708934087373, 'learning_rate': 7.216876387860844e-07, 'mean_token_accuracy': 0.7496331185102463, 'epoch': 0.0}
0%| | 40/27018 [25:30<266:53:39, 35.61s/it]
{'loss': 0.996, 'grad_norm': 2.2381342463830642, 'learning_rate': 7.401924500370096e-07, 'mean_token_accuracy': 0.7501956969499588, 'epoch': 0.0}
0%| | 41/27018 [26:05<267:08:26, 35.65s/it]
{'loss': 1.1015, 'grad_norm': 2.427431617946151, 'learning_rate': 7.586972612879349e-07, 'mean_token_accuracy': 0.7215268909931183, 'epoch': 0.0}
0%| | 42/27018 [26:41<266:47:58, 35.60s/it]
{'loss': 0.9762, 'grad_norm': 2.015381037203806, 'learning_rate': 7.772020725388602e-07, 'mean_token_accuracy': 0.7515171468257904, 'epoch': 0.0}
0%| | 43/27018 [27:17<267:48:27, 35.74s/it]
{'loss': 1.0179, 'grad_norm': 1.894944561748791, 'learning_rate': 7.957068837897853e-07, 'mean_token_accuracy': 0.7424397468566895, 'epoch': 0.0}
0%| | 44/27018 [27:52<266:12:50, 35.53s/it]
{'loss': 0.9932, 'grad_norm': 1.6875287545644397, 'learning_rate': 8.142116950407106e-07, 'mean_token_accuracy': 0.7469182461500168, 'epoch': 0.0}
0%| | 45/27018 [28:28<267:39:00, 35.72s/it]
{'loss': 0.9765, 'grad_norm': 1.5711837552555594, 'learning_rate': 8.327165062916359e-07, 'mean_token_accuracy': 0.7469637244939804, 'epoch': 0.0}
0%| | 46/27018 [29:03<266:22:05, 35.55s/it]
{'loss': 0.9861, 'grad_norm': 1.5460518286078015, 'learning_rate': 8.512213175425612e-07, 'mean_token_accuracy': 0.7461169511079788, 'epoch': 0.01}
0%| | 47/27018 [29:40<267:46:00, 35.74s/it]
{'loss': 0.9319, 'grad_norm': 1.436017187203058, 'learning_rate': 8.697261287934863e-07, 'mean_token_accuracy': 0.7596640735864639, 'epoch': 0.01}
0%| | 48/27018 [30:15<267:35:45, 35.72s/it]
{'loss': 1.0189, 'grad_norm': 1.5034361240317038, 'learning_rate': 8.882309400444116e-07, 'mean_token_accuracy': 0.7376303225755692, 'epoch': 0.01}
0%| | 49/27018 [30:51<267:04:03, 35.65s/it]
{'loss': 0.9858, 'grad_norm': 1.4540662479396358, 'learning_rate': 9.067357512953368e-07, 'mean_token_accuracy': 0.7443555593490601, 'epoch': 0.01}
0%| | 50/27018 [31:26<267:19:03, 35.68s/it]
{'loss': 0.9897, 'grad_norm': 1.4219198434594897, 'learning_rate': 9.252405625462622e-07, 'mean_token_accuracy': 0.7405945956707001, 'epoch': 0.01}
0%| | 51/27018 [32:02<267:09:23, 35.66s/it]
{'loss': 0.9683, 'grad_norm': 1.3478602244140199, 'learning_rate': 9.437453737971873e-07, 'mean_token_accuracy': 0.7484601587057114, 'epoch': 0.01}
0%| | 52/27018 [32:38<267:10:30, 35.67s/it]
{'loss': 0.9942, 'grad_norm': 1.35817832477438, 'learning_rate': 9.622501850481126e-07, 'mean_token_accuracy': 0.7412698864936829, 'epoch': 0.01}
0%| | 53/27018 [33:13<266:44:11, 35.61s/it]
{'loss': 0.9529, 'grad_norm': 1.2962407609768567, 'learning_rate': 9.807549962990378e-07, 'mean_token_accuracy': 0.7520514279603958, 'epoch': 0.01}
0%| | 54/27018 [33:49<266:50:23, 35.63s/it]
{'loss': 0.9368, 'grad_norm': 1.3261655823572684, 'learning_rate': 9.99259807549963e-07, 'mean_token_accuracy': 0.7552744448184967, 'epoch': 0.01}
0%| | 55/27018 [34:24<266:30:29, 35.58s/it]
{'loss': 0.9704, 'grad_norm': 1.2358570958569075, 'learning_rate': 1.0177646188008883e-06, 'mean_token_accuracy': 0.7480522245168686, 'epoch': 0.01}
0%| | 56/27018 [35:00<266:43:25, 35.61s/it]
{'loss': 1.0048, 'grad_norm': 1.2065027662838093, 'learning_rate': 1.0362694300518136e-06, 'mean_token_accuracy': 0.7388745844364166, 'epoch': 0.01}
0%| | 57/27018 [35:35<266:07:10, 35.53s/it]
{'loss': 0.9815, 'grad_norm': 1.1387838763797022, 'learning_rate': 1.0547742413027388e-06, 'mean_token_accuracy': 0.741651862859726, 'epoch': 0.01}
0%| | 58/27018 [36:11<266:03:47, 35.53s/it]
{'loss': 0.9983, 'grad_norm': 1.097553072908179, 'learning_rate': 1.073279052553664e-06, 'mean_token_accuracy': 0.7401973009109497, 'epoch': 0.01}
0%| | 59/27018 [36:46<266:01:41, 35.52s/it]
{'loss': 0.953, 'grad_norm': 1.030468065506278, 'learning_rate': 1.0917838638045893e-06, 'mean_token_accuracy': 0.749171569943428, 'epoch': 0.01}
0%| | 60/27018 [37:22<266:45:21, 35.62s/it]
{'loss': 0.9184, 'grad_norm': 0.9742460961592471, 'learning_rate': 1.1102886750555146e-06, 'mean_token_accuracy': 0.7577142417430878, 'epoch': 0.01}
0%| | 61/27018 [37:58<266:07:53, 35.54s/it]
{'loss': 0.9073, 'grad_norm': 0.9379421342698382, 'learning_rate': 1.1287934863064398e-06, 'mean_token_accuracy': 0.7583868056535721, 'epoch': 0.01}
0%| | 62/27018 [38:33<266:11:55, 35.55s/it]
{'loss': 0.8743, 'grad_norm': 0.8959341830613178, 'learning_rate': 1.147298297557365e-06, 'mean_token_accuracy': 0.7668353468179703, 'epoch': 0.01}
0%| | 63/27018 [39:09<265:59:38, 35.53s/it]
{'loss': 0.9062, 'grad_norm': 0.935753593705495, 'learning_rate': 1.1658031088082903e-06, 'mean_token_accuracy': 0.7551182210445404, 'epoch': 0.01}
0%| | 64/27018 [39:44<266:02:05, 35.53s/it]
{'loss': 0.8541, 'grad_norm': 0.8666475431924053, 'learning_rate': 1.1843079200592154e-06, 'mean_token_accuracy': 0.7716537266969681, 'epoch': 0.01}
0%| | 65/27018 [40:20<265:49:43, 35.51s/it]
{'loss': 0.9431, 'grad_norm': 0.9410205958955291, 'learning_rate': 1.2028127313101408e-06, 'mean_token_accuracy': 0.7494966983795166, 'epoch': 0.01}
0%| | 66/27018 [40:56<266:58:40, 35.66s/it]
{'loss': 0.8943, 'grad_norm': 0.8400655731427539, 'learning_rate': 1.221317542561066e-06, 'mean_token_accuracy': 0.7571914792060852, 'epoch': 0.01}
0%| | 67/27018 [41:31<265:45:02, 35.50s/it]
{'loss': 0.9213, 'grad_norm': 0.8191163930178271, 'learning_rate': 1.2398223538119913e-06, 'mean_token_accuracy': 0.753596767783165, 'epoch': 0.01}
0%| | 68/27018 [42:07<266:53:33, 35.65s/it]
{'loss': 0.822, 'grad_norm': 0.8097934688084351, 'learning_rate': 1.2583271650629164e-06, 'mean_token_accuracy': 0.7756603062152863, 'epoch': 0.01}
0%| | 69/27018 [42:42<266:55:22, 35.66s/it]
{'loss': 0.838, 'grad_norm': 0.816407481402929, 'learning_rate': 1.2768319763138415e-06, 'mean_token_accuracy': 0.7720852047204971, 'epoch': 0.01}
0%| | 70/27018 [43:18<266:35:31, 35.61s/it]
{'loss': 0.9, 'grad_norm': 0.7972140050201161, 'learning_rate': 1.2953367875647669e-06, 'mean_token_accuracy': 0.7579542249441147, 'epoch': 0.01}
0%| | 71/27018 [43:54<266:45:10, 35.64s/it]
{'loss': 0.8341, 'grad_norm': 0.7601698407217086, 'learning_rate': 1.3138415988156922e-06, 'mean_token_accuracy': 0.7709158957004547, 'epoch': 0.01}
0%| | 72/27018 [44:29<266:28:20, 35.60s/it]
{'loss': 0.8812, 'grad_norm': 0.7755541278941626, 'learning_rate': 1.3323464100666174e-06, 'mean_token_accuracy': 0.7598071694374084, 'epoch': 0.01}
0%| | 73/27018 [45:05<266:26:18, 35.60s/it]
{'loss': 0.8308, 'grad_norm': 0.7130977725065148, 'learning_rate': 1.3508512213175425e-06, 'mean_token_accuracy': 0.7758853584527969, 'epoch': 0.01}
0%| | 74/27018 [45:40<265:47:03, 35.51s/it]
{'loss': 0.9045, 'grad_norm': 0.7447278218675328, 'learning_rate': 1.3693560325684679e-06, 'mean_token_accuracy': 0.7527081966400146, 'epoch': 0.01}
0%| | 75/27018 [46:16<265:53:30, 35.53s/it]
{'loss': 0.9194, 'grad_norm': 0.7575760410549229, 'learning_rate': 1.387860843819393e-06, 'mean_token_accuracy': 0.7522684335708618, 'epoch': 0.01}
0%| | 76/27018 [46:51<265:57:13, 35.54s/it]
{'loss': 0.7884, 'grad_norm': 0.677276657926885, 'learning_rate': 1.4063656550703184e-06, 'mean_token_accuracy': 0.7846749424934387, 'epoch': 0.01}
0%| | 77/27018 [47:27<266:13:56, 35.58s/it]
{'loss': 0.8379, 'grad_norm': 0.6868316084376532, 'learning_rate': 1.4248704663212437e-06, 'mean_token_accuracy': 0.772451713681221, 'epoch': 0.01}
0%| | 78/27018 [48:02<265:31:42, 35.48s/it]
{'loss': 0.8585, 'grad_norm': 0.7477854401253934, 'learning_rate': 1.4433752775721689e-06, 'mean_token_accuracy': 0.7650453746318817, 'epoch': 0.01}
0%| | 79/27018 [48:38<266:04:13, 35.56s/it]
{'loss': 0.7777, 'grad_norm': 0.6389417998246991, 'learning_rate': 1.461880088823094e-06, 'mean_token_accuracy': 0.7843832522630692, 'epoch': 0.01}
0%| | 80/27018 [49:13<265:17:00, 35.45s/it]
{'loss': 0.8268, 'grad_norm': 0.6868689316769971, 'learning_rate': 1.4803849000740192e-06, 'mean_token_accuracy': 0.7715295404195786, 'epoch': 0.01}
0%| | 81/27018 [49:49<265:28:29, 35.48s/it]
{'loss': 0.8193, 'grad_norm': 0.6637730419071131, 'learning_rate': 1.4988897113249447e-06, 'mean_token_accuracy': 0.7740767598152161, 'epoch': 0.01}
0%| | 82/27018 [50:24<265:24:22, 35.47s/it]
{'loss': 0.8683, 'grad_norm': 0.6894362487494093, 'learning_rate': 1.5173945225758699e-06, 'mean_token_accuracy': 0.7630463987588882, 'epoch': 0.01}
0%| | 83/27018 [51:00<266:26:36, 35.61s/it]
{'loss': 0.8346, 'grad_norm': 0.6377911308678318, 'learning_rate': 1.535899333826795e-06, 'mean_token_accuracy': 0.7699476480484009, 'epoch': 0.01}
0%| | 84/27018 [51:35<265:44:12, 35.52s/it]
{'loss': 0.8361, 'grad_norm': 0.6468651195025704, 'learning_rate': 1.5544041450777204e-06, 'mean_token_accuracy': 0.7670280039310455, 'epoch': 0.01}
0%| | 85/27018 [52:11<265:49:46, 35.53s/it]
{'loss': 0.8384, 'grad_norm': 0.6525954096402884, 'learning_rate': 1.5729089563286455e-06, 'mean_token_accuracy': 0.7692025601863861, 'epoch': 0.01}
0%| | 86/27018 [52:46<265:17:23, 35.46s/it]
{'loss': 0.8523, 'grad_norm': 0.6849049729888294, 'learning_rate': 1.5914137675795706e-06, 'mean_token_accuracy': 0.7634509801864624, 'epoch': 0.01}
0%| | 87/27018 [53:23<267:24:45, 35.75s/it]
{'loss': 0.7749, 'grad_norm': 0.6236266336943352, 'learning_rate': 1.6099185788304958e-06, 'mean_token_accuracy': 0.7860226482152939, 'epoch': 0.01}
0%| | 88/27018 [53:58<267:45:47, 35.79s/it]
{'loss': 0.8795, 'grad_norm': 0.648528593196904, 'learning_rate': 1.6284233900814211e-06, 'mean_token_accuracy': 0.7571646869182587, 'epoch': 0.01}
0%| | 89/27018 [54:35<268:36:31, 35.91s/it]
{'loss': 0.8484, 'grad_norm': 0.6308193394209017, 'learning_rate': 1.6469282013323467e-06, 'mean_token_accuracy': 0.7643623352050781, 'epoch': 0.01}
0%| | 90/27018 [55:12<270:54:46, 36.22s/it]
{'loss': 0.8267, 'grad_norm': 0.6466863046892821, 'learning_rate': 1.6654330125832718e-06, 'mean_token_accuracy': 0.7677317261695862, 'epoch': 0.01}
0%| | 91/27018 [55:47<270:05:38, 36.11s/it]
{'loss': 0.8067, 'grad_norm': 0.6103002414817679, 'learning_rate': 1.683937823834197e-06, 'mean_token_accuracy': 0.7732284963130951, 'epoch': 0.01}
0%| | 92/27018 [56:24<271:37:56, 36.32s/it]
{'loss': 0.8649, 'grad_norm': 0.641647518577714, 'learning_rate': 1.7024426350851223e-06, 'mean_token_accuracy': 0.7606352418661118, 'epoch': 0.01}
0%| | 93/27018 [57:00<270:22:01, 36.15s/it]
{'loss': 0.8243, 'grad_norm': 0.6223749103901997, 'learning_rate': 1.7209474463360475e-06, 'mean_token_accuracy': 0.7693850100040436, 'epoch': 0.01}
0%| | 94/27018 [57:36<270:22:10, 36.15s/it]
{'loss': 0.7611, 'grad_norm': 0.5971059828089051, 'learning_rate': 1.7394522575869726e-06, 'mean_token_accuracy': 0.7878217697143555, 'epoch': 0.01}
0%| | 95/27018 [58:12<269:27:15, 36.03s/it]
{'loss': 0.8265, 'grad_norm': 0.6116455417543558, 'learning_rate': 1.757957068837898e-06, 'mean_token_accuracy': 0.7702111601829529, 'epoch': 0.01}
0%| | 96/27018 [58:48<270:13:48, 36.14s/it]
{'loss': 0.756, 'grad_norm': 0.5753584852764662, 'learning_rate': 1.7764618800888231e-06, 'mean_token_accuracy': 0.7871744483709335, 'epoch': 0.01}
0%| | 97/27018 [59:24<270:17:36, 36.14s/it]
{'loss': 0.8121, 'grad_norm': 0.5915388529772352, 'learning_rate': 1.7949666913397482e-06, 'mean_token_accuracy': 0.7741342335939407, 'epoch': 0.01}
0%| | 98/27018 [1:00:01<270:12:52, 36.14s/it]
{'loss': 0.7978, 'grad_norm': 0.592854588373003, 'learning_rate': 1.8134715025906736e-06, 'mean_token_accuracy': 0.778115764260292, 'epoch': 0.01}
0%| | 99/27018 [1:00:36<269:33:10, 36.05s/it]
{'loss': 0.8165, 'grad_norm': 0.6015148556964603, 'learning_rate': 1.8319763138415992e-06, 'mean_token_accuracy': 0.7702448964118958, 'epoch': 0.01}
0%| | 100/27018 [1:01:13<270:18:37, 36.15s/it]
{'loss': 0.7789, 'grad_norm': 0.5812712534386835, 'learning_rate': 1.8504811250925243e-06, 'mean_token_accuracy': 0.7810820937156677, 'epoch': 0.01}
0%| | 101/27018 [1:01:49<269:44:59, 36.08s/it]
{'loss': 0.7765, 'grad_norm': 0.5782703861918091, 'learning_rate': 1.8689859363434495e-06, 'mean_token_accuracy': 0.7830040603876114, 'epoch': 0.01}
0%| | 102/27018 [1:02:25<270:56:12, 36.24s/it]
{'loss': 0.8159, 'grad_norm': 0.610733632765679, 'learning_rate': 1.8874907475943746e-06, 'mean_token_accuracy': 0.7720097601413727, 'epoch': 0.01}
0%| | 103/27018 [1:03:01<269:42:49, 36.08s/it]
{'loss': 0.7982, 'grad_norm': 0.5873488453449385, 'learning_rate': 1.9059955588453e-06, 'mean_token_accuracy': 0.7772262543439865, 'epoch': 0.01}
0%| | 104/27018 [1:03:38<271:09:57, 36.27s/it]
{'loss': 0.8718, 'grad_norm': 0.6174080230477802, 'learning_rate': 1.9245003700962253e-06, 'mean_token_accuracy': 0.7561975866556168, 'epoch': 0.01}
0%| | 105/27018 [1:04:14<270:18:56, 36.16s/it]
{'loss': 0.8345, 'grad_norm': 0.5923914949181017, 'learning_rate': 1.9430051813471504e-06, 'mean_token_accuracy': 0.7681046724319458, 'epoch': 0.01}
0%| | 106/27018 [1:04:50<271:12:50, 36.28s/it]
{'loss': 0.8012, 'grad_norm': 0.5864668621611937, 'learning_rate': 1.9615099925980756e-06, 'mean_token_accuracy': 0.7755322605371475, 'epoch': 0.01}
0%| | 107/27018 [1:05:27<271:37:19, 36.34s/it]
{'loss': 0.8272, 'grad_norm': 0.597449696505216, 'learning_rate': 1.9800148038490007e-06, 'mean_token_accuracy': 0.7676922231912613, 'epoch': 0.01}
0%| | 108/27018 [1:06:03<271:01:48, 36.26s/it]
{'loss': 0.8097, 'grad_norm': 0.5866985682626478, 'learning_rate': 1.998519615099926e-06, 'mean_token_accuracy': 0.7735882699489594, 'epoch': 0.01}
0%| | 109/27018 [1:06:39<271:46:38, 36.36s/it]
{'loss': 0.8409, 'grad_norm': 0.6097280709817102, 'learning_rate': 2.017024426350851e-06, 'mean_token_accuracy': 0.7624227404594421, 'epoch': 0.01}
0%| | 110/27018 [1:07:15<270:40:19, 36.21s/it]
{'loss': 0.7781, 'grad_norm': 0.5774045628240243, 'learning_rate': 2.0355292376017766e-06, 'mean_token_accuracy': 0.781528890132904, 'epoch': 0.01}
0%| | 111/27018 [1:07:52<272:21:27, 36.44s/it]
{'loss': 0.7814, 'grad_norm': 0.6040176239982079, 'learning_rate': 2.0540340488527017e-06, 'mean_token_accuracy': 0.7812371402978897, 'epoch': 0.01}
0%| | 112/27018 [1:08:28<270:52:15, 36.24s/it]
{'loss': 0.7832, 'grad_norm': 0.6046319138693763, 'learning_rate': 2.0725388601036273e-06, 'mean_token_accuracy': 0.7799837440252304, 'epoch': 0.01}
0%| | 113/27018 [1:09:05<272:44:41, 36.49s/it]
{'loss': 0.8037, 'grad_norm': 0.5880315067225539, 'learning_rate': 2.0910436713545524e-06, 'mean_token_accuracy': 0.776639997959137, 'epoch': 0.01}
0%| | 113/27018 [1:09:05<272:44:41, 36.49s/it]
0%| | 114/27018 [1:09:42<272:43:56, 36.49s/it]
{'loss': 0.8051, 'grad_norm': 0.5928608250253031, 'learning_rate': 2.1095484826054776e-06, 'mean_token_accuracy': 0.7722987532615662, 'epoch': 0.01}
0%| | 114/27018 [1:09:42<272:43:56, 36.49s/it]
0%| | 115/27018 [1:10:19<275:45:59, 36.90s/it]
{'loss': 0.761, 'grad_norm': 0.5718911307558772, 'learning_rate': 2.1280532938564027e-06, 'mean_token_accuracy': 0.7827950716018677, 'epoch': 0.01}
0%| | 115/27018 [1:10:19<275:45:59, 36.90s/it]
0%| | 116/27018 [1:10:56<275:54:19, 36.92s/it]
{'loss': 0.8099, 'grad_norm': 0.5925519974074734, 'learning_rate': 2.146558105107328e-06, 'mean_token_accuracy': 0.7742350548505783, 'epoch': 0.01}
0%| | 116/27018 [1:10:56<275:54:19, 36.92s/it]
0%| | 117/27018 [1:11:34<277:00:51, 37.07s/it]
{'loss': 0.7728, 'grad_norm': 0.5844337926292446, 'learning_rate': 2.165062916358253e-06, 'mean_token_accuracy': 0.7802036553621292, 'epoch': 0.01}
0%| | 117/27018 [1:11:34<277:00:51, 37.07s/it]
0%| | 118/27018 [1:12:10<275:45:07, 36.90s/it]
{'loss': 0.8345, 'grad_norm': 0.5921145114370471, 'learning_rate': 2.1835677276091785e-06, 'mean_token_accuracy': 0.7676735818386078, 'epoch': 0.01}
0%| | 118/27018 [1:12:10<275:45:07, 36.90s/it]
0%| | 119/27018 [1:12:48<277:04:41, 37.08s/it]
{'loss': 0.7462, 'grad_norm': 0.5739749733868359, 'learning_rate': 2.2020725388601037e-06, 'mean_token_accuracy': 0.7914443612098694, 'epoch': 0.01}
0%| | 119/27018 [1:12:48<277:04:41, 37.08s/it]
0%| | 120/27018 [1:13:24<274:22:15, 36.72s/it]
{'loss': 0.8043, 'grad_norm': 0.5881290439931065, 'learning_rate': 2.2205773501110293e-06, 'mean_token_accuracy': 0.7751884609460831, 'epoch': 0.01}
0%| | 120/27018 [1:13:24<274:22:15, 36.72s/it]
0%| | 121/27018 [1:14:01<274:48:36, 36.78s/it]
{'loss': 0.7706, 'grad_norm': 0.5676041980850483, 'learning_rate': 2.2390821613619544e-06, 'mean_token_accuracy': 0.7799371033906937, 'epoch': 0.01}
0%| | 121/27018 [1:14:01<274:48:36, 36.78s/it]
0%| | 122/27018 [1:14:36<272:25:29, 36.46s/it]
{'loss': 0.7498, 'grad_norm': 0.5735176985668837, 'learning_rate': 2.2575869726128795e-06, 'mean_token_accuracy': 0.7882247269153595, 'epoch': 0.01}
0%| | 122/27018 [1:14:36<272:25:29, 36.46s/it]
0%| | 123/27018 [1:15:13<272:16:55, 36.45s/it]
{'loss': 0.825, 'grad_norm': 0.5902339657457404, 'learning_rate': 2.2760917838638047e-06, 'mean_token_accuracy': 0.7664293348789215, 'epoch': 0.01}
0%| | 123/27018 [1:15:13<272:16:55, 36.45s/it]
0%| | 124/27018 [1:15:49<272:28:24, 36.47s/it]
{'loss': 0.7103, 'grad_norm': 0.5501161658645588, 'learning_rate': 2.29459659511473e-06, 'mean_token_accuracy': 0.7967648506164551, 'epoch': 0.01}
0%| | 124/27018 [1:15:49<272:28:24, 36.47s/it]
0%| | 125/27018 [1:16:25<271:40:13, 36.37s/it]
{'loss': 0.7619, 'grad_norm': 0.5701195633014386, 'learning_rate': 2.313101406365655e-06, 'mean_token_accuracy': 0.7843143343925476, 'epoch': 0.01}
0%| | 125/27018 [1:16:25<271:40:13, 36.37s/it]
0%| | 126/27018 [1:17:02<271:32:07, 36.35s/it]
{'loss': 0.8468, 'grad_norm': 0.603485337589627, 'learning_rate': 2.3316062176165805e-06, 'mean_token_accuracy': 0.7619727998971939, 'epoch': 0.01}
0%| | 126/27018 [1:17:02<271:32:07, 36.35s/it]
0%| | 127/27018 [1:17:38<271:45:15, 36.38s/it]
{'loss': 0.704, 'grad_norm': 0.5777032267611462, 'learning_rate': 2.3501110288675057e-06, 'mean_token_accuracy': 0.7986523807048798, 'epoch': 0.01}
0%| | 127/27018 [1:17:38<271:45:15, 36.38s/it]
0%| | 128/27018 [1:18:15<271:42:49, 36.38s/it]
{'loss': 0.8043, 'grad_norm': 0.5908105055911866, 'learning_rate': 2.368615840118431e-06, 'mean_token_accuracy': 0.7727077752351761, 'epoch': 0.01}
0%| | 128/27018 [1:18:15<271:42:49, 36.38s/it]
0%| | 129/27018 [1:18:51<272:24:35, 36.47s/it]
{'loss': 0.7839, 'grad_norm': 0.5807382482216293, 'learning_rate': 2.387120651369356e-06, 'mean_token_accuracy': 0.7773733139038086, 'epoch': 0.01}
0%| | 129/27018 [1:18:51<272:24:35, 36.47s/it]
0%| | 130/27018 [1:19:27<271:43:00, 36.38s/it]
{'loss': 0.7842, 'grad_norm': 0.5646013761211188, 'learning_rate': 2.4056254626202815e-06, 'mean_token_accuracy': 0.7758514881134033, 'epoch': 0.01}
0%| | 130/27018 [1:19:27<271:43:00, 36.38s/it]
0%| | 131/27018 [1:20:04<271:27:56, 36.35s/it]
{'loss': 0.783, 'grad_norm': 0.5699140000738375, 'learning_rate': 2.4241302738712067e-06, 'mean_token_accuracy': 0.7778105437755585, 'epoch': 0.01}
0%| | 131/27018 [1:20:04<271:27:56, 36.35s/it]
0%| | 132/27018 [1:20:41<273:22:05, 36.60s/it]
{'loss': 0.7583, 'grad_norm': 0.5725435052683828, 'learning_rate': 2.442635085122132e-06, 'mean_token_accuracy': 0.7819483727216721, 'epoch': 0.01}
0%| | 132/27018 [1:20:41<273:22:05, 36.60s/it]
0%| | 133/27018 [1:21:17<271:56:55, 36.41s/it]
{'loss': 0.7715, 'grad_norm': 0.5716719316808879, 'learning_rate': 2.4611398963730574e-06, 'mean_token_accuracy': 0.7799699008464813, 'epoch': 0.01}
0%| | 133/27018 [1:21:17<271:56:55, 36.41s/it]
0%| | 134/27018 [1:21:54<272:59:06, 36.56s/it]
{'loss': 0.7858, 'grad_norm': 0.5893362492810931, 'learning_rate': 2.4796447076239825e-06, 'mean_token_accuracy': 0.7768968045711517, 'epoch': 0.01}
0%| | 134/27018 [1:21:54<272:59:06, 36.56s/it]
0%| | 135/27018 [1:22:30<271:54:55, 36.41s/it]
{'loss': 0.7714, 'grad_norm': 0.555927956138565, 'learning_rate': 2.4981495188749076e-06, 'mean_token_accuracy': 0.7814275026321411, 'epoch': 0.01}
0%| | 135/27018 [1:22:30<271:54:55, 36.41s/it]
1%| | 136/27018 [1:23:07<272:42:09, 36.52s/it]
{'loss': 0.7072, 'grad_norm': 0.5449787390352295, 'learning_rate': 2.516654330125833e-06, 'mean_token_accuracy': 0.8002716153860092, 'epoch': 0.02}
1%| | 136/27018 [1:23:07<272:42:09, 36.52s/it]
1%| | 137/27018 [1:23:43<271:53:35, 36.41s/it]
{'loss': 0.7838, 'grad_norm': 0.5508835863675889, 'learning_rate': 2.535159141376758e-06, 'mean_token_accuracy': 0.7791081070899963, 'epoch': 0.02}
1%| | 137/27018 [1:23:43<271:53:35, 36.41s/it]
1%| | 138/27018 [1:24:20<273:02:34, 36.57s/it]
{'loss': 0.7404, 'grad_norm': 0.5882790664290809, 'learning_rate': 2.553663952627683e-06, 'mean_token_accuracy': 0.7900893539190292, 'epoch': 0.02}
1%| | 138/27018 [1:24:20<273:02:34, 36.57s/it]
1%| | 139/27018 [1:24:56<272:22:58, 36.48s/it]
{'loss': 0.753, 'grad_norm': 0.5604357390300815, 'learning_rate': 2.5721687638786086e-06, 'mean_token_accuracy': 0.7855844348669052, 'epoch': 0.02}
1%| | 139/27018 [1:24:56<272:22:58, 36.48s/it]
1%| | 140/27018 [1:25:33<273:31:05, 36.63s/it]
{'loss': 0.7593, 'grad_norm': 0.5738951118668525, 'learning_rate': 2.5906735751295338e-06, 'mean_token_accuracy': 0.783954530954361, 'epoch': 0.02}
1%| | 140/27018 [1:25:33<273:31:05, 36.63s/it]
1%| | 141/27018 [1:26:09<271:53:54, 36.42s/it]
{'loss': 0.7481, 'grad_norm': 0.5682732938047093, 'learning_rate': 2.6091783863804593e-06, 'mean_token_accuracy': 0.7859022617340088, 'epoch': 0.02}
1%| | 141/27018 [1:26:09<271:53:54, 36.42s/it]
1%| | 142/27018 [1:26:46<272:50:40, 36.55s/it]
{'loss': 0.7991, 'grad_norm': 0.599270842503034, 'learning_rate': 2.6276831976313845e-06, 'mean_token_accuracy': 0.7716686576604843, 'epoch': 0.02}
1%| | 142/27018 [1:26:46<272:50:40, 36.55s/it]
1%| | 143/27018 [1:27:22<271:50:31, 36.41s/it]
{'loss': 0.7388, 'grad_norm': 0.5603897643030938, 'learning_rate': 2.6461880088823096e-06, 'mean_token_accuracy': 0.7881675809621811, 'epoch': 0.02}
1%| | 143/27018 [1:27:22<271:50:31, 36.41s/it]
1%| | 144/27018 [1:27:58<272:27:11, 36.50s/it]
{'loss': 0.7466, 'grad_norm': 0.5696506489803409, 'learning_rate': 2.6646928201332348e-06, 'mean_token_accuracy': 0.7874024361371994, 'epoch': 0.02}
1%| | 144/27018 [1:27:58<272:27:11, 36.50s/it]
1%| | 145/27018 [1:28:34<271:17:43, 36.34s/it]
{'loss': 0.7466, 'grad_norm': 0.5578087355579777, 'learning_rate': 2.68319763138416e-06, 'mean_token_accuracy': 0.7878163605928421, 'epoch': 0.02}
1%| | 145/27018 [1:28:34<271:17:43, 36.34s/it]
1%| | 146/27018 [1:29:11<271:06:37, 36.32s/it]
{'loss': 0.7607, 'grad_norm': 0.5842094512863727, 'learning_rate': 2.701702442635085e-06, 'mean_token_accuracy': 0.7823595553636551, 'epoch': 0.02}
1%| | 146/27018 [1:29:11<271:06:37, 36.32s/it]
1%| | 147/27018 [1:29:47<270:49:01, 36.28s/it]
{'loss': 0.6992, 'grad_norm': 0.5544412614328007, 'learning_rate': 2.7202072538860106e-06, 'mean_token_accuracy': 0.7983556836843491, 'epoch': 0.02}
1%| | 147/27018 [1:29:47<270:49:01, 36.28s/it]
1%| | 148/27018 [1:30:23<270:39:27, 36.26s/it]
{'loss': 0.7938, 'grad_norm': 0.5975056403860753, 'learning_rate': 2.7387120651369358e-06, 'mean_token_accuracy': 0.7751787900924683, 'epoch': 0.02}
1%| | 148/27018 [1:30:23<270:39:27, 36.26s/it]
1%| | 149/27018 [1:30:59<269:26:35, 36.10s/it]
{'loss': 0.7432, 'grad_norm': 0.5659421070692527, 'learning_rate': 2.757216876387861e-06, 'mean_token_accuracy': 0.7883753627538681, 'epoch': 0.02}
1%| | 149/27018 [1:30:59<269:26:35, 36.10s/it]
1%| | 150/27018 [1:31:35<269:26:27, 36.10s/it]
{'loss': 0.7909, 'grad_norm': 0.5766302293651508, 'learning_rate': 2.775721687638786e-06, 'mean_token_accuracy': 0.7756650298833847, 'epoch': 0.02}
1%| | 150/27018 [1:31:35<269:26:27, 36.10s/it]
1%| | 151/27018 [1:32:11<270:09:53, 36.20s/it]
{'loss': 0.8005, 'grad_norm': 0.6583332446019091, 'learning_rate': 2.7942264988897116e-06, 'mean_token_accuracy': 0.770731046795845, 'epoch': 0.02}
1%| | 151/27018 [1:32:11<270:09:53, 36.20s/it]
1%| | 152/27018 [1:32:47<269:24:43, 36.10s/it]
{'loss': 0.7499, 'grad_norm': 0.5693059080627878, 'learning_rate': 2.8127313101406367e-06, 'mean_token_accuracy': 0.786851242184639, 'epoch': 0.02}
1%| | 152/27018 [1:32:47<269:24:43, 36.10s/it]
1%| | 153/27018 [1:33:24<270:01:29, 36.18s/it]
{'loss': 0.8033, 'grad_norm': 0.5667162469990945, 'learning_rate': 2.831236121391562e-06, 'mean_token_accuracy': 0.7726839184761047, 'epoch': 0.02}
1%| | 153/27018 [1:33:24<270:01:29, 36.18s/it]
1%| | 154/27018 [1:33:59<268:52:21, 36.03s/it]
{'loss': 0.7557, 'grad_norm': 0.5841078651401918, 'learning_rate': 2.8497409326424875e-06, 'mean_token_accuracy': 0.782449945807457, 'epoch': 0.02}
1%| | 154/27018 [1:33:59<268:52:21, 36.03s/it]
1%| | 155/27018 [1:34:36<270:21:22, 36.23s/it]
{'loss': 0.7567, 'grad_norm': 0.5713036051864312, 'learning_rate': 2.8682457438934126e-06, 'mean_token_accuracy': 0.7844241857528687, 'epoch': 0.02}
1%| | 155/27018 [1:34:36<270:21:22, 36.23s/it]
1%| | 156/27018 [1:35:12<269:26:55, 36.11s/it]
{'loss': 0.7345, 'grad_norm': 0.5676304343781067, 'learning_rate': 2.8867505551443377e-06, 'mean_token_accuracy': 0.7914735525846481, 'epoch': 0.02}
1%| | 156/27018 [1:35:12<269:26:55, 36.11s/it]
1%| | 157/27018 [1:35:48<270:31:13, 36.26s/it]
{'loss': 0.7464, 'grad_norm': 0.5843873149937405, 'learning_rate': 2.905255366395263e-06, 'mean_token_accuracy': 0.7851461172103882, 'epoch': 0.02}
1%| | 157/27018 [1:35:48<270:31:13, 36.26s/it]
1%| | 158/27018 [1:36:24<269:20:13, 36.10s/it]
{'loss': 0.7313, 'grad_norm': 0.5738658657953465, 'learning_rate': 2.923760177646188e-06, 'mean_token_accuracy': 0.7899124771356583, 'epoch': 0.02}
1%| | 158/27018 [1:36:24<269:20:13, 36.10s/it]
1%| | 159/27018 [1:37:01<270:56:08, 36.31s/it]
{'loss': 0.732, 'grad_norm': 0.5737474422396376, 'learning_rate': 2.942264988897113e-06, 'mean_token_accuracy': 0.7910120487213135, 'epoch': 0.02}
1%| | 159/27018 [1:37:01<270:56:08, 36.31s/it]
1%| | 160/27018 [1:37:37<269:39:02, 36.14s/it]
{'loss': 0.7129, 'grad_norm': 0.5640247001525789, 'learning_rate': 2.9607698001480383e-06, 'mean_token_accuracy': 0.7946542501449585, 'epoch': 0.02}
1%| | 160/27018 [1:37:37<269:39:02, 36.14s/it]
1%| | 161/27018 [1:38:13<270:46:10, 36.29s/it]
{'loss': 0.7722, 'grad_norm': 0.5842899897872303, 'learning_rate': 2.979274611398964e-06, 'mean_token_accuracy': 0.7779286652803421, 'epoch': 0.02}
1%| | 161/27018 [1:38:13<270:46:10, 36.29s/it]
1%| | 162/27018 [1:38:49<269:35:08, 36.14s/it]
{'loss': 0.7356, 'grad_norm': 0.5860715861198481, 'learning_rate': 2.9977794226498894e-06, 'mean_token_accuracy': 0.7899980247020721, 'epoch': 0.02}
1%| | 162/27018 [1:38:49<269:35:08, 36.14s/it]
1%| | 163/27018 [1:39:26<270:10:36, 36.22s/it]
{'loss': 0.7627, 'grad_norm': 0.5875159768924287, 'learning_rate': 3.0162842339008146e-06, 'mean_token_accuracy': 0.7779743522405624, 'epoch': 0.02}
1%| | 163/27018 [1:39:26<270:10:36, 36.22s/it]
1%| | 164/27018 [1:40:01<269:13:33, 36.09s/it]
{'loss': 0.7639, 'grad_norm': 0.5809756776197154, 'learning_rate': 3.0347890451517397e-06, 'mean_token_accuracy': 0.7827877998352051, 'epoch': 0.02}
1%| | 164/27018 [1:40:01<269:13:33, 36.09s/it]
1%| | 165/27018 [1:40:38<270:37:19, 36.28s/it]
{'loss': 0.7212, 'grad_norm': 0.5542239998175047, 'learning_rate': 3.053293856402665e-06, 'mean_token_accuracy': 0.7939881682395935, 'epoch': 0.02}
1%| | 165/27018 [1:40:38<270:37:19, 36.28s/it]
1%| | 166/27018 [1:41:14<268:59:28, 36.06s/it]
{'loss': 0.7509, 'grad_norm': 0.5857435808827416, 'learning_rate': 3.07179866765359e-06, 'mean_token_accuracy': 0.7838380187749863, 'epoch': 0.02}
1%| | 166/27018 [1:41:14<268:59:28, 36.06s/it]
1%| | 167/27018 [1:41:50<268:48:54, 36.04s/it]
{'loss': 0.75, 'grad_norm': 0.589641791673861, 'learning_rate': 3.090303478904515e-06, 'mean_token_accuracy': 0.786098524928093, 'epoch': 0.02}
1%| | 167/27018 [1:41:50<268:48:54, 36.04s/it]
1%| | 168/27018 [1:42:25<267:56:32, 35.93s/it]
{'loss': 0.7046, 'grad_norm': 0.5603118743658302, 'learning_rate': 3.1088082901554407e-06, 'mean_token_accuracy': 0.7995111495256424, 'epoch': 0.02}
1%| | 168/27018 [1:42:25<267:56:32, 35.93s/it]
1%| | 169/27018 [1:43:01<267:30:49, 35.87s/it]
{'loss': 0.7094, 'grad_norm': 0.5758804607101154, 'learning_rate': 3.127313101406366e-06, 'mean_token_accuracy': 0.7946171462535858, 'epoch': 0.02}
1%| | 169/27018 [1:43:01<267:30:49, 35.87s/it]
1%| | 170/27018 [1:43:37<268:05:19, 35.95s/it]
{'loss': 0.8012, 'grad_norm': 0.599125629915982, 'learning_rate': 3.145817912657291e-06, 'mean_token_accuracy': 0.7716994732618332, 'epoch': 0.02}
1%| | 170/27018 [1:43:37<268:05:19, 35.95s/it]
1%| | 171/27018 [1:44:12<266:34:44, 35.75s/it]
{'loss': 0.7436, 'grad_norm': 0.548367680953437, 'learning_rate': 3.164322723908216e-06, 'mean_token_accuracy': 0.7863889187574387, 'epoch': 0.02}
1%| | 171/27018 [1:44:12<266:34:44, 35.75s/it]
1%| | 172/27018 [1:44:49<267:37:22, 35.89s/it]
{'loss': 0.7136, 'grad_norm': 0.5485922774025175, 'learning_rate': 3.1828275351591413e-06, 'mean_token_accuracy': 0.7971586883068085, 'epoch': 0.02}
1%| | 172/27018 [1:44:49<267:37:22, 35.89s/it]
1%| | 173/27018 [1:45:24<266:09:26, 35.69s/it]
{'loss': 0.7249, 'grad_norm': 0.5559604802039144, 'learning_rate': 3.2013323464100664e-06, 'mean_token_accuracy': 0.7950968593358994, 'epoch': 0.02}
1%| | 173/27018 [1:45:24<266:09:26, 35.69s/it]
1%| | 174/27018 [1:46:00<267:29:05, 35.87s/it]
{'loss': 0.7282, 'grad_norm': 0.5693106698198828, 'learning_rate': 3.2198371576609916e-06, 'mean_token_accuracy': 0.7895412296056747, 'epoch': 0.02}
1%| | 174/27018 [1:46:00<267:29:05, 35.87s/it]
1%| | 175/27018 [1:46:35<265:57:07, 35.67s/it]
{'loss': 0.6961, 'grad_norm': 0.5615854831135817, 'learning_rate': 3.238341968911917e-06, 'mean_token_accuracy': 0.7980632781982422, 'epoch': 0.02}
1%| | 175/27018 [1:46:35<265:57:07, 35.67s/it]
1%| | 176/27018 [1:47:11<266:56:39, 35.80s/it]
{'loss': 0.7246, 'grad_norm': 0.5790052541778156, 'learning_rate': 3.2568467801628423e-06, 'mean_token_accuracy': 0.79371277987957, 'epoch': 0.02}
1%| | 176/27018 [1:47:11<266:56:39, 35.80s/it]
1%| | 177/27018 [1:47:47<266:00:44, 35.68s/it]
{'loss': 0.6903, 'grad_norm': 0.5528786698880457, 'learning_rate': 3.2753515914137682e-06, 'mean_token_accuracy': 0.7989947348833084, 'epoch': 0.02}
1%| | 177/27018 [1:47:47<266:00:44, 35.68s/it]
1%| | 178/27018 [1:48:23<266:41:48, 35.77s/it]
{'loss': 0.7896, 'grad_norm': 0.5870858945935968, 'learning_rate': 3.2938564026646934e-06, 'mean_token_accuracy': 0.7749260365962982, 'epoch': 0.02}
1%| | 178/27018 [1:48:23<266:41:48, 35.77s/it]
1%| | 179/27018 [1:48:58<265:38:31, 35.63s/it]
{'loss': 0.7558, 'grad_norm': 0.5849558467667858, 'learning_rate': 3.3123612139156185e-06, 'mean_token_accuracy': 0.7829612344503403, 'epoch': 0.02}
1%| | 179/27018 [1:48:58<265:38:31, 35.63s/it]
1%| | 180/27018 [1:49:34<266:16:25, 35.72s/it]
{'loss': 0.7674, 'grad_norm': 0.5950472948522422, 'learning_rate': 3.3308660251665437e-06, 'mean_token_accuracy': 0.7767878323793411, 'epoch': 0.02}
1%| | 180/27018 [1:49:34<266:16:25, 35.72s/it]
1%| | 181/27018 [1:50:09<265:00:52, 35.55s/it]
{'loss': 0.7235, 'grad_norm': 0.5690020592119117, 'learning_rate': 3.349370836417469e-06, 'mean_token_accuracy': 0.7911853343248367, 'epoch': 0.02}
1%| | 181/27018 [1:50:09<265:00:52, 35.55s/it]
1%| | 182/27018 [1:50:45<266:02:04, 35.69s/it]
{'loss': 0.6829, 'grad_norm': 0.5497745962746766, 'learning_rate': 3.367875647668394e-06, 'mean_token_accuracy': 0.803366094827652, 'epoch': 0.02}
1%| | 182/27018 [1:50:45<266:02:04, 35.69s/it]
1%| | 183/27018 [1:51:21<265:10:46, 35.57s/it]
{'loss': 0.727, 'grad_norm': 0.5574359293760289, 'learning_rate': 3.3863804589193195e-06, 'mean_token_accuracy': 0.7923353463411331, 'epoch': 0.02}
1%| | 183/27018 [1:51:21<265:10:46, 35.57s/it]
1%| | 184/27018 [1:51:57<267:00:34, 35.82s/it]
{'loss': 0.6842, 'grad_norm': 0.5566011288121414, 'learning_rate': 3.4048852701702447e-06, 'mean_token_accuracy': 0.8058837354183197, 'epoch': 0.02}
1%| | 184/27018 [1:51:57<267:00:34, 35.82s/it]
1%| | 185/27018 [1:52:32<265:55:48, 35.68s/it]
{'loss': 0.7516, 'grad_norm': 0.5794955456279601, 'learning_rate': 3.42339008142117e-06, 'mean_token_accuracy': 0.7843431383371353, 'epoch': 0.02}
1%| | 185/27018 [1:52:32<265:55:48, 35.68s/it]
1%| | 186/27018 [1:53:08<265:05:48, 35.57s/it]
{'loss': 0.7237, 'grad_norm': 0.5664713750674724, 'learning_rate': 3.441894892672095e-06, 'mean_token_accuracy': 0.7897331267595291, 'epoch': 0.02}
1%| | 186/27018 [1:53:08<265:05:48, 35.57s/it]
1%| | 187/27018 [1:53:44<266:04:30, 35.70s/it]
{'loss': 0.7626, 'grad_norm': 0.5789929332383672, 'learning_rate': 3.46039970392302e-06, 'mean_token_accuracy': 0.7813738733530045, 'epoch': 0.02}
1%| | 187/27018 [1:53:44<266:04:30, 35.70s/it]
1%| | 188/27018 [1:54:18<263:21:07, 35.34s/it]
{'loss': 0.7842, 'grad_norm': 0.582546429688926, 'learning_rate': 3.4789045151739452e-06, 'mean_token_accuracy': 0.7747888714075089, 'epoch': 0.02}
1%| | 188/27018 [1:54:18<263:21:07, 35.34s/it]
1%| | 189/27018 [1:54:54<264:08:19, 35.44s/it]
{'loss': 0.7495, 'grad_norm': 0.5926599399715496, 'learning_rate': 3.4974093264248704e-06, 'mean_token_accuracy': 0.7835961282253265, 'epoch': 0.02}
1%| | 189/27018 [1:54:54<264:08:19, 35.44s/it]
1%| | 190/27018 [1:55:29<262:56:24, 35.28s/it]
{'loss': 0.7246, 'grad_norm': 0.5647458005005965, 'learning_rate': 3.515914137675796e-06, 'mean_token_accuracy': 0.7919712215662003, 'epoch': 0.02}
1%| | 190/27018 [1:55:29<262:56:24, 35.28s/it]
1%| | 191/27018 [1:56:04<263:52:18, 35.41s/it]
{'loss': 0.7112, 'grad_norm': 0.5517622054077223, 'learning_rate': 3.534418948926721e-06, 'mean_token_accuracy': 0.7928812652826309, 'epoch': 0.02}
1%| | 191/27018 [1:56:04<263:52:18, 35.41s/it]
1%| | 192/27018 [1:56:39<262:29:45, 35.23s/it]
{'loss': 0.7447, 'grad_norm': 0.6056980313820571, 'learning_rate': 3.5529237601776462e-06, 'mean_token_accuracy': 0.7853602319955826, 'epoch': 0.02}
1%| | 192/27018 [1:56:39<262:29:45, 35.23s/it]
1%| | 193/27018 [1:57:15<262:53:04, 35.28s/it]
{'loss': 0.7417, 'grad_norm': 0.5765720650853381, 'learning_rate': 3.5714285714285714e-06, 'mean_token_accuracy': 0.7862393707036972, 'epoch': 0.02}
1%| | 193/27018 [1:57:15<262:53:04, 35.28s/it]
1%| | 194/27018 [1:57:50<262:08:58, 35.18s/it]
{'loss': 0.7342, 'grad_norm': 0.5803224079675525, 'learning_rate': 3.5899333826794965e-06, 'mean_token_accuracy': 0.7891211658716202, 'epoch': 0.02}
1%| | 194/27018 [1:57:50<262:08:58, 35.18s/it]
1%| | 195/27018 [1:58:25<263:07:43, 35.32s/it]
{'loss': 0.7555, 'grad_norm': 0.606517000925898, 'learning_rate': 3.6084381939304216e-06, 'mean_token_accuracy': 0.7834282517433167, 'epoch': 0.02}
1%| | 195/27018 [1:58:25<263:07:43, 35.32s/it]
1%| | 196/27018 [1:59:00<262:15:06, 35.20s/it]
{'loss': 0.7179, 'grad_norm': 0.5689665749704177, 'learning_rate': 3.626943005181347e-06, 'mean_token_accuracy': 0.7924306839704514, 'epoch': 0.02}
1%| | 196/27018 [1:59:00<262:15:06, 35.20s/it]
1%| | 197/27018 [1:59:35<262:24:31, 35.22s/it]
{'loss': 0.7545, 'grad_norm': 0.592631618061295, 'learning_rate': 3.6454478164322723e-06, 'mean_token_accuracy': 0.7827750891447067, 'epoch': 0.02}
1%| | 197/27018 [1:59:35<262:24:31, 35.22s/it]
1%| | 198/27018 [2:00:10<261:38:20, 35.12s/it]
{'loss': 0.7599, 'grad_norm': 0.5957749244861339, 'learning_rate': 3.6639526276831983e-06, 'mean_token_accuracy': 0.7802350372076035, 'epoch': 0.02}
1%| | 198/27018 [2:00:10<261:38:20, 35.12s/it]
1%| | 199/27018 [2:00:46<262:43:27, 35.27s/it]
{'loss': 0.7589, 'grad_norm': 0.5859678325348991, 'learning_rate': 3.6824574389341235e-06, 'mean_token_accuracy': 0.7813784331083298, 'epoch': 0.02}
1%| | 199/27018 [2:00:46<262:43:27, 35.27s/it]
1%| | 200/27018 [2:01:21<262:35:17, 35.25s/it]
{'loss': 0.7269, 'grad_norm': 0.5896250578002789, 'learning_rate': 3.7009622501850486e-06, 'mean_token_accuracy': 0.7907234877347946, 'epoch': 0.02}
1%| | 200/27018 [2:01:21<262:35:17, 35.25s/it]
1%| | 201/27018 [2:10:12<1371:10:49, 184.07s/it]
{'loss': 0.7259, 'grad_norm': 0.5862003846213085, 'learning_rate': 3.7194670614359738e-06, 'mean_token_accuracy': 0.7896487861871719, 'epoch': 0.02}
1%| | 201/27018 [2:10:12<1371:10:49, 184.07s/it]
1%| | 202/27018 [2:10:52<1047:33:43, 140.63s/it]
{'loss': 0.7457, 'grad_norm': 0.5857150380414821, 'learning_rate': 3.737971872686899e-06, 'mean_token_accuracy': 0.7841921001672745, 'epoch': 0.02}
1%| | 202/27018 [2:10:52<1047:33:43, 140.63s/it]
1%| | 203/27018 [2:11:31<820:16:34, 110.12s/it]
{'loss': 0.7513, 'grad_norm': 0.5627545811876944, 'learning_rate': 3.756476683937824e-06, 'mean_token_accuracy': 0.7838332951068878, 'epoch': 0.02}
1%| | 203/27018 [2:11:31<820:16:34, 110.12s/it]
1%| | 204/27018 [2:12:08<657:59:56, 88.34s/it]
{'loss': 0.7139, 'grad_norm': 0.5865478315771705, 'learning_rate': 3.774981495188749e-06, 'mean_token_accuracy': 0.7935836464166641, 'epoch': 0.02}
1%| | 204/27018 [2:12:08<657:59:56, 88.34s/it]
1%| | 205/27018 [2:12:44<539:45:48, 72.47s/it]
{'loss': 0.7443, 'grad_norm': 0.5696780012577445, 'learning_rate': 3.7934863064396747e-06, 'mean_token_accuracy': 0.7852583229541779, 'epoch': 0.02}
1%| | 205/27018 [2:12:44<539:45:48, 72.47s/it]
1%| | 206/27018 [2:13:20<458:33:10, 61.57s/it]
{'loss': 0.718, 'grad_norm': 0.5605578428533396, 'learning_rate': 3.8119911176906e-06, 'mean_token_accuracy': 0.7912840396165848, 'epoch': 0.02}
  1%|          | 206/27018 [2:13:20<458:33:10, 61.57s/it]