Commit 4588f4c
Parent(s): 229827f
new data
logs/main_log.txt +77 -0
logs/main_log.txt
CHANGED
|
@@ -8309,3 +8309,80 @@ valid loss at iteration 17000 | lm loss value: 1.893942E+00 | lm loss PPL: 6.645
|
|
| 8309 |
iteration 17400/ 296023 | consumed samples: 3829184 | consumed tokens: 7842168832 | elapsed time per iteration (ms): 4656.6 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 1.909471E+00 | loss scale: 32768.0 | grad norm: 5274.381 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8310 |
iteration 17600/ 296023 | consumed samples: 3931584 | consumed tokens: 8051884032 | elapsed time per iteration (ms): 4643.5 | learning rate: 1.988E-04 | global batch size: 512 | lm loss: 1.909943E+00 | loss scale: 32768.0 | grad norm: 6170.035 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8311 |
iteration 17800/ 296023 | consumed samples: 4033984 | consumed tokens: 8261599232 | elapsed time per iteration (ms): 4653.0 | learning rate: 1.987E-04 | global batch size: 512 | lm loss: 1.948319E+00 | loss scale: 32768.0 | grad norm: 6084.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8312 |
+
[2021-11-05 09:56:34,720] [INFO] [logging.py:68:log_dist] [Rank 0] step=18000, skipped=38, lr=[0.00019863630787211344, 0.00019863630787211344], mom=[(0.9, 0.999), (0.9, 0.999)]
|
| 8313 |
+
steps: 18000 loss: 2.0673 iter time (s): 0.002 samples/sec: 218843.518
|
| 8314 |
+
iteration 18000/ 296023 | consumed samples: 4136384 | consumed tokens: 8471314432 | elapsed time per iteration (ms): 4665.8 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 1.892513E+00 | loss scale: 65536.0 | grad norm: 12609.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8315 |
+
-------------------------------------------------------------------------------------------
|
| 8316 |
+
valid loss at iteration 18000 | lm loss value: 1.864149E+00 | lm loss PPL: 6.450444E+00 |
|
| 8317 |
+
-------------------------------------------------------------------------------------------
|
| 8318 |
+
saving checkpoint at iteration 18000 to /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints
|
| 8319 |
+
[2021-11-05 09:58:26,257] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/mp_rank_00_model_states.pt
|
| 8320 |
+
[2021-11-05 09:58:26,683] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_30_mp_rank_01_optim_states.pt
|
| 8321 |
+
[2021-11-05 09:58:26,683] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_31_mp_rank_01_optim_states.pt
|
| 8322 |
+
[2021-11-05 09:58:26,677] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_3_mp_rank_01_optim_states.pt
|
| 8323 |
+
[2021-11-05 09:58:26,685] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_3_mp_rank_00_optim_states.pt
|
| 8324 |
+
[2021-11-05 09:58:26,682] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_7_mp_rank_01_optim_states.pt
|
| 8325 |
+
[2021-11-05 09:58:26,685] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_5_mp_rank_00_optim_states.pt
|
| 8326 |
+
[2021-11-05 09:58:26,686] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_12_mp_rank_01_optim_states.pt
|
| 8327 |
+
[2021-11-05 09:58:26,691] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_14_mp_rank_01_optim_states.pt
|
| 8328 |
+
[2021-11-05 09:58:26,687] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_26_mp_rank_00_optim_states.pt
|
| 8329 |
+
[2021-11-05 09:58:26,688] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_25_mp_rank_00_optim_states.pt
|
| 8330 |
+
[2021-11-05 09:58:26,692] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_4_mp_rank_00_optim_states.pt
|
| 8331 |
+
[2021-11-05 09:58:26,694] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_6_mp_rank_01_optim_states.pt
|
| 8332 |
+
[2021-11-05 09:58:26,689] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_19_mp_rank_00_optim_states.pt
|
| 8333 |
+
[2021-11-05 09:58:26,695] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_1_mp_rank_00_optim_states.pt
|
| 8334 |
+
[2021-11-05 09:58:26,690] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_11_mp_rank_01_optim_states.pt
|
| 8335 |
+
[2021-11-05 09:58:26,690] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_9_mp_rank_00_optim_states.pt
|
| 8336 |
+
[2021-11-05 09:58:26,691] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_10_mp_rank_00_optim_states.pt
|
| 8337 |
+
[2021-11-05 09:58:26,690] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_18_mp_rank_01_optim_states.pt
|
| 8338 |
+
[2021-11-05 09:58:26,689] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_24_mp_rank_01_optim_states.pt
|
| 8339 |
+
[2021-11-05 09:58:26,690] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_27_mp_rank_01_optim_states.pt
|
| 8340 |
+
[2021-11-05 09:58:26,697] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_9_mp_rank_01_optim_states.pt
|
| 8341 |
+
[2021-11-05 09:58:26,702] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_17_mp_rank_01_optim_states.pt
|
| 8342 |
+
[2021-11-05 09:58:26,699] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_28_mp_rank_00_optim_states.pt
|
| 8343 |
+
[2021-11-05 09:58:26,699] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_21_mp_rank_01_optim_states.pt
|
| 8344 |
+
[2021-11-05 09:58:26,707] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_20_mp_rank_01_optim_states.pt
|
| 8345 |
+
[2021-11-05 09:58:26,712] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_25_mp_rank_01_optim_states.pt
|
| 8346 |
+
[2021-11-05 09:58:26,713] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_0_mp_rank_01_optim_states.pt
|
| 8347 |
+
[2021-11-05 09:58:26,713] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_16_mp_rank_00_optim_states.pt
|
| 8348 |
+
[2021-11-05 09:58:26,713] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_17_mp_rank_00_optim_states.pt
|
| 8349 |
+
[2021-11-05 09:58:26,714] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_4_mp_rank_01_optim_states.pt
|
| 8350 |
+
[2021-11-05 09:58:26,715] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_5_mp_rank_01_optim_states.pt
|
| 8351 |
+
[2021-11-05 09:58:26,717] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_29_mp_rank_00_optim_states.pt
|
| 8352 |
+
[2021-11-05 09:58:26,718] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_29_mp_rank_01_optim_states.pt
|
| 8353 |
+
[2021-11-05 09:58:26,718] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_18_mp_rank_00_optim_states.pt
|
| 8354 |
+
[2021-11-05 09:58:26,719] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_22_mp_rank_01_optim_states.pt
|
| 8355 |
+
[2021-11-05 09:58:26,719] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_19_mp_rank_01_optim_states.pt
|
| 8356 |
+
[2021-11-05 09:58:26,719] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_26_mp_rank_01_optim_states.pt
|
| 8357 |
+
[2021-11-05 09:58:26,717] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_13_mp_rank_00_optim_states.pt
|
| 8358 |
+
[2021-11-05 09:58:26,717] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_12_mp_rank_00_optim_states.pt
|
| 8359 |
+
[2021-11-05 09:58:26,722] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_7_mp_rank_00_optim_states.pt
|
| 8360 |
+
[2021-11-05 09:58:26,722] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_28_mp_rank_01_optim_states.pt
|
| 8361 |
+
[2021-11-05 09:58:26,724] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_11_mp_rank_00_optim_states.pt
|
| 8362 |
+
[2021-11-05 09:58:26,725] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_24_mp_rank_00_optim_states.pt
|
| 8363 |
+
[2021-11-05 09:58:26,726] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_27_mp_rank_00_optim_states.pt
|
| 8364 |
+
[2021-11-05 09:58:26,720] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_22_mp_rank_00_optim_states.pt
|
| 8365 |
+
[2021-11-05 09:58:26,721] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_21_mp_rank_00_optim_states.pt
|
| 8366 |
+
[2021-11-05 09:58:26,726] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_15_mp_rank_01_optim_states.pt
|
| 8367 |
+
[2021-11-05 09:58:26,726] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_10_mp_rank_01_optim_states.pt
|
| 8368 |
+
[2021-11-05 09:58:26,728] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_6_mp_rank_00_optim_states.pt
|
| 8369 |
+
[2021-11-05 09:58:26,728] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_2_mp_rank_00_optim_states.pt
|
| 8370 |
+
[2021-11-05 09:58:26,729] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_1_mp_rank_01_optim_states.pt
|
| 8371 |
+
[2021-11-05 09:58:26,729] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_23_mp_rank_01_optim_states.pt
|
| 8372 |
+
[2021-11-05 09:58:26,729] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_14_mp_rank_00_optim_states.pt
|
| 8373 |
+
[2021-11-05 09:58:26,730] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_8_mp_rank_00_optim_states.pt
|
| 8374 |
+
[2021-11-05 09:58:26,731] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_8_mp_rank_01_optim_states.pt
|
| 8375 |
+
[2021-11-05 09:58:26,732] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_31_mp_rank_00_optim_states.pt
|
| 8376 |
+
[2021-11-05 09:58:26,733] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_15_mp_rank_00_optim_states.pt
|
| 8377 |
+
[2021-11-05 09:58:26,734] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_20_mp_rank_00_optim_states.pt
|
| 8378 |
+
[2021-11-05 09:58:26,735] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_30_mp_rank_00_optim_states.pt
|
| 8379 |
+
[2021-11-05 09:58:26,742] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_23_mp_rank_00_optim_states.pt
|
| 8380 |
+
[2021-11-05 09:58:26,742] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_13_mp_rank_01_optim_states.pt
|
| 8381 |
+
[2021-11-05 09:58:26,745] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
| 8382 |
+
[2021-11-05 09:58:26,748] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_16_mp_rank_01_optim_states.pt
|
| 8383 |
+
[2021-11-05 09:58:26,750] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints/global_step18000/zero_pp_rank_2_mp_rank_01_optim_states.pt
|
| 8384 |
+
successfully saved checkpoint at iteration 18000 to /gpfsscratch/rech/six/commun/checkpoints/tr6e-1B3-pile/checkpoints
|
| 8385 |
+
time (ms) | save-checkpoint: 2752.92
|
| 8386 |
+
iteration 18200/ 296023 | consumed samples: 4238784 | consumed tokens: 8681029632 | elapsed time per iteration (ms): 5209.2 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 1.918148E+00 | loss scale: 65536.0 | grad norm: 9739.549 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8387 |
+
iteration 18400/ 296023 | consumed samples: 4341184 | consumed tokens: 8890744832 | elapsed time per iteration (ms): 4642.1 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 1.884550E+00 | loss scale: 16384.0 | grad norm: 2549.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 8388 |
+
iteration 18600/ 296023 | consumed samples: 4443584 | consumed tokens: 9100460032 | elapsed time per iteration (ms): 4652.5 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 1.880175E+00 | loss scale: 16384.0 | grad norm: 2965.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
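The `iteration ...` lines above share a regular `key: value | key: value` layout, so they can be checked mechanically. The following is a minimal sketch, not part of the training code: `parse_iteration_line` is a hypothetical helper name, and the sample strings are shortened copies of the iteration-18000 entries from this diff. It assumes the validation perplexity is simply `exp(lm loss)`, which the logged values are consistent with.

```python
import math
import re


def parse_iteration_line(line: str) -> dict:
    """Hypothetical helper: turn one 'iteration N/ TOTAL | key: value | ...' line into a dict."""
    head = re.match(r"\s*iteration\s+(\d+)/\s*(\d+)", line)
    if head is None:
        raise ValueError("not an iteration line")
    fields = {
        "iteration": int(head.group(1)),
        "total iterations": int(head.group(2)),
    }
    # Every remaining '|'-separated part looks like 'name: number'.
    for part in line.split("|")[1:]:
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


# Shortened copy of the iteration-18000 line from the log above.
sample = (
    " iteration 18000/ 296023 | consumed samples: 4136384 "
    "| consumed tokens: 8471314432 | global batch size: 512 "
    "| lm loss: 1.892513E+00 | loss scale: 65536.0 |"
)
row = parse_iteration_line(sample)

# Tokens per sample implied by this line (the sequence length, under the
# assumption that every sample has the same number of tokens).
tokens_per_sample = row["consumed tokens"] / row["consumed samples"]

# Validation entry at iteration 18000: loss 1.864149, logged PPL 6.450444.
valid_loss = 1.864149
valid_ppl = math.exp(valid_loss)

print(row["iteration"], tokens_per_sample, round(valid_ppl, 4))
```

The same parser can be applied to each `iteration` line to track `lm loss`, `loss scale`, and `grad norm` across the run; lines in other formats (checkpoint messages, separators) raise `ValueError` and can be skipped.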