Commit 2743386
1 Parent(s): a60dc4e
new data
logs/main_log.txt +83 -0
logs/main_log.txt
CHANGED
|
@@ -12205,3 +12205,86 @@ Number of parameters without embeddings: 1.20860672 billion
|
|
| 12205 |
iteration 18200/ 296023 | consumed samples: 4238784 | consumed tokens: 8681029632 | elapsed time per iteration (ms): 4638.8 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 1.881505E+00 | loss scale: 32768.0 | grad norm: 4246.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 12206 |
[Rank 0] (after 18200 iterations) memory (MB) | allocated: 1631.6650390625 | max allocated: 3929.2744140625 | reserved: 6816.0 | max reserved: 6816.0
|
| 12207 |
iteration 18400/ 296023 | consumed samples: 4341184 | consumed tokens: 8890744832 | elapsed time per iteration (ms): 4630.3 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 1.853725E+00 | loss scale: 16384.0 | grad norm: 2445.581 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 12208 |
+
iteration 18600/ 296023 | consumed samples: 4443584 | consumed tokens: 9100460032 | elapsed time per iteration (ms): 4642.4 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 1.849770E+00 | loss scale: 16384.0 | grad norm: 2880.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
| 12209 |
+
Killing subprocess 73114
|
| 12210 |
+
Killing subprocess 73115
|
| 12211 |
+
Killing subprocess 73116
|
| 12212 |
+
Killing subprocess 90168
|
| 12213 |
+
Killing subprocess 78775
|
| 12214 |
+
Killing subprocess 73117
|
| 12215 |
+
Killing subprocess 90017
|
| 12216 |
+
Killing subprocess 90169
|
| 12217 |
+
Killing subprocess 78776
|
| 12218 |
+
Main process received SIGTERM, exiting
|
| 12219 |
+
Killing subprocess 85084
|
| 12220 |
+
Killing subprocess 86427
|
| 12221 |
+
Killing subprocess 90170
|
| 12222 |
+
Killing subprocess 67764
|
| 12223 |
+
Killing subprocess 90018
|
| 12224 |
+
Killing subprocess 75423
|
| 12225 |
+
Killing subprocess 76711
|
| 12226 |
+
Killing subprocess 481575
|
| 12227 |
+
Killing subprocess 78777
|
| 12228 |
+
Killing subprocess 85085
|
| 12229 |
+
Killing subprocess 78778
|
| 12230 |
+
Killing subprocess 69992
|
| 12231 |
+
Killing subprocess 90171
|
| 12232 |
+
Killing subprocess 86428
|
| 12233 |
+
Killing subprocess 73589
|
| 12234 |
+
Killing subprocess 90019
|
| 12235 |
+
Killing subprocess 76712
|
| 12236 |
+
Killing subprocess 481576
|
| 12237 |
+
Killing subprocess 75424
|
| 12238 |
+
Killing subprocess 90020
|
| 12239 |
+
Killing subprocess 85086
|
| 12240 |
+
Killing subprocess 67765
|
| 12241 |
+
Main process received SIGTERM, exiting
|
| 12242 |
+
Killing subprocess 86429
|
| 12243 |
+
Killing subprocess 69993
|
| 12244 |
+
Killing subprocess 85087
|
| 12245 |
+
Main process received SIGTERM, exiting
|
| 12246 |
+
Killing subprocess 481612
|
| 12247 |
+
Killing subprocess 73590
|
| 12248 |
+
Killing subprocess 76713
|
| 12249 |
+
Killing subprocess 75425
|
| 12250 |
+
Killing subprocess 69467
|
| 12251 |
+
Killing subprocess 68888
|
| 12252 |
+
Killing subprocess 481577
|
| 12253 |
+
Killing subprocess 86430
|
| 12254 |
+
Main process received SIGTERM, exiting
|
| 12255 |
+
Killing subprocess 67766
|
| 12256 |
+
slurmstepd: error: *** STEP 1840970.0 ON r10i3n3 CANCELLED AT 2021-11-05T12:10:00 ***
|
| 12257 |
+
Killing subprocess 76714
|
| 12258 |
+
Killing subprocess 73591
|
| 12259 |
+
Killing subprocess 69994
|
| 12260 |
+
Killing subprocess 67767
|
| 12261 |
+
Killing subprocess 75426
|
| 12262 |
+
Main process received SIGTERM, exiting
|
| 12263 |
+
Main process received SIGTERM, exiting
|
| 12264 |
+
Killing subprocess 481613
|
| 12265 |
+
Killing subprocess 73592
|
| 12266 |
+
Killing subprocess 69468
|
| 12267 |
+
Main process received SIGTERM, exiting
|
| 12268 |
+
Killing subprocess 68889
|
| 12269 |
+
Killing subprocess 481578
|
| 12270 |
+
Killing subprocess 69469
|
| 12271 |
+
Main process received SIGTERM, exiting
|
| 12272 |
+
Killing subprocess 69995
|
| 12273 |
+
Main process received SIGTERM, exiting
|
| 12274 |
+
Main process received SIGTERM, exiting
|
| 12275 |
+
Killing subprocess 481614
|
| 12276 |
+
Killing subprocess 69470
|
| 12277 |
+
Killing subprocess 68890
|
| 12278 |
+
Killing subprocess 76149
|
| 12279 |
+
Main process received SIGTERM, exiting
|
| 12280 |
+
Killing subprocess 68891
|
| 12281 |
+
Main process received SIGTERM, exiting
|
| 12282 |
+
Main process received SIGTERM, exiting
|
| 12283 |
+
Killing subprocess 481616
|
| 12284 |
+
Killing subprocess 76150
|
| 12285 |
+
Main process received SIGTERM, exiting
|
| 12286 |
+
Main process received SIGTERM, exiting
|
| 12287 |
+
Killing subprocess 76151
|
| 12288 |
+
Killing subprocess 76152
|
| 12289 |
+
Main process received SIGTERM, exiting
|
| 12290 |
+
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
|