bigscience-bot committed on
Commit
2743386
·
1 Parent(s): a60dc4e
Files changed (1)
  1. logs/main_log.txt +83 -0
logs/main_log.txt CHANGED
@@ -12205,3 +12205,86 @@ Number of parameters without embeddings: 1.20860672 billion
12205
  iteration 18200/ 296023 | consumed samples: 4238784 | consumed tokens: 8681029632 | elapsed time per iteration (ms): 4638.8 | learning rate: 1.986E-04 | global batch size: 512 | lm loss: 1.881505E+00 | loss scale: 32768.0 | grad norm: 4246.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
12206
  [Rank 0] (after 18200 iterations) memory (MB) | allocated: 1631.6650390625 | max allocated: 3929.2744140625 | reserved: 6816.0 | max reserved: 6816.0
12207
  iteration 18400/ 296023 | consumed samples: 4341184 | consumed tokens: 8890744832 | elapsed time per iteration (ms): 4630.3 | learning rate: 1.985E-04 | global batch size: 512 | lm loss: 1.853725E+00 | loss scale: 16384.0 | grad norm: 2445.581 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
12208
+ iteration 18600/ 296023 | consumed samples: 4443584 | consumed tokens: 9100460032 | elapsed time per iteration (ms): 4642.4 | learning rate: 1.984E-04 | global batch size: 512 | lm loss: 1.849770E+00 | loss scale: 16384.0 | grad norm: 2880.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
12209
+ Killing subprocess 73114
12210
+ Killing subprocess 73115
12211
+ Killing subprocess 73116
12212
+ Killing subprocess 90168
12213
+ Killing subprocess 78775
12214
+ Killing subprocess 73117
12215
+ Killing subprocess 90017
12216
+ Killing subprocess 90169
12217
+ Killing subprocess 78776
12218
+ Main process received SIGTERM, exiting
12219
+ Killing subprocess 85084
12220
+ Killing subprocess 86427
12221
+ Killing subprocess 90170
12222
+ Killing subprocess 67764
12223
+ Killing subprocess 90018
12224
+ Killing subprocess 75423
12225
+ Killing subprocess 76711
12226
+ Killing subprocess 481575
12227
+ Killing subprocess 78777
12228
+ Killing subprocess 85085
12229
+ Killing subprocess 78778
12230
+ Killing subprocess 69992
12231
+ Killing subprocess 90171
12232
+ Killing subprocess 86428
12233
+ Killing subprocess 73589
12234
+ Killing subprocess 90019
12235
+ Killing subprocess 76712
12236
+ Killing subprocess 481576
12237
+ Killing subprocess 75424
12238
+ Killing subprocess 90020
12239
+ Killing subprocess 85086
12240
+ Killing subprocess 67765
12241
+ Main process received SIGTERM, exiting
12242
+ Killing subprocess 86429
12243
+ Killing subprocess 69993
12244
+ Killing subprocess 85087
12245
+ Main process received SIGTERM, exiting
12246
+ Killing subprocess 481612
12247
+ Killing subprocess 73590
12248
+ Killing subprocess 76713
12249
+ Killing subprocess 75425
12250
+ Killing subprocess 69467
12251
+ Killing subprocess 68888
12252
+ Killing subprocess 481577
12253
+ Killing subprocess 86430
12254
+ Main process received SIGTERM, exiting
12255
+ Killing subprocess 67766
12256
+ slurmstepd: error: *** STEP 1840970.0 ON r10i3n3 CANCELLED AT 2021-11-05T12:10:00 ***
12257
+ Killing subprocess 76714
12258
+ Killing subprocess 73591
12259
+ Killing subprocess 69994
12260
+ Killing subprocess 67767
12261
+ Killing subprocess 75426
12262
+ Main process received SIGTERM, exiting
12263
+ Main process received SIGTERM, exiting
12264
+ Killing subprocess 481613
12265
+ Killing subprocess 73592
12266
+ Killing subprocess 69468
12267
+ Main process received SIGTERM, exiting
12268
+ Killing subprocess 68889
12269
+ Killing subprocess 481578
12270
+ Killing subprocess 69469
12271
+ Main process received SIGTERM, exiting
12272
+ Killing subprocess 69995
12273
+ Main process received SIGTERM, exiting
12274
+ Main process received SIGTERM, exiting
12275
+ Killing subprocess 481614
12276
+ Killing subprocess 69470
12277
+ Killing subprocess 68890
12278
+ Killing subprocess 76149
12279
+ Main process received SIGTERM, exiting
12280
+ Killing subprocess 68891
12281
+ Main process received SIGTERM, exiting
12282
+ Main process received SIGTERM, exiting
12283
+ Killing subprocess 481616
12284
+ Killing subprocess 76150
12285
+ Main process received SIGTERM, exiting
12286
+ Main process received SIGTERM, exiting
12287
+ Killing subprocess 76151
12288
+ Killing subprocess 76152
12289
+ Main process received SIGTERM, exiting
12290
+ srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
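
The iteration lines in this log are pipe-separated `key: value` fields, so they are easy to turn into records for plotting or sanity checks. Below is a minimal sketch of one way to do that, assuming every field after the `iteration N/ M` prefix is numeric, as in the lines shown here. The helper name `parse_iteration_line` is hypothetical and is not part of the training code that wrote the log.

```python
import re

def parse_iteration_line(line):
    """Parse one 'iteration N/ M | key: value | ...' log line into a dict.

    Hypothetical helper: assumes every field after the iteration prefix
    is numeric, which holds for the lines in this log.
    """
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        if not part:
            continue
        # The first field looks like 'iteration 18200/ 296023'
        # (optionally preceded by a '+' diff marker).
        m = re.search(r"iteration\s+(\d+)/\s*(\d+)", part)
        if m:
            fields["iteration"] = int(m.group(1))
            fields["total iterations"] = int(m.group(2))
            continue
        # Remaining fields are 'key: value'; split on the last colon.
        key, sep, value = part.rpartition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields

sample = (
    "  iteration 18200/ 296023 | consumed samples: 4238784 | "
    "consumed tokens: 8681029632 | elapsed time per iteration (ms): 4638.8 | "
    "learning rate: 1.986E-04 | global batch size: 512 | "
    "lm loss: 1.881505E+00 | loss scale: 32768.0 | grad norm: 4246.005 | "
    "num zeros: 0.0 | number of skipped iterations: 0 | "
    "number of nan iterations: 0 |"
)
rec = parse_iteration_line(sample)

# Tokens per sample follows directly from the two counters in the line.
tokens_per_sample = rec["consumed tokens"] / rec["consumed samples"]
print(rec["iteration"], rec["lm loss"], tokens_per_sample)
```

For the line at iteration 18200 the two counters give 2048 tokens per sample, which is presumably the sequence length used for this run. The same parser handles the `+`-prefixed lines added in this diff, since the prefix is ignored when matching the iteration field.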