bigscience-bot commited on
Commit
1511d6d
·
1 Parent(s): 45630f8
Files changed (1) hide show
  1. logs/main_log.txt +120 -0
logs/main_log.txt CHANGED
@@ -11180,3 +11180,123 @@ time (ms)
11180
  time (ms)
11181
  iteration 331/ 159576 | consumed samples: 5296 | elapsed time per iteration (ms): 13678.9 | learning rate: 1.469E-06 | global batch size: 16 | lm loss: 8.243130E+00 | loss scale: 4096.0 | grad norm: 39935.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11182
  time (ms)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11180
  time (ms)
11181
  iteration 331/ 159576 | consumed samples: 5296 | elapsed time per iteration (ms): 13678.9 | learning rate: 1.469E-06 | global batch size: 16 | lm loss: 8.243130E+00 | loss scale: 4096.0 | grad norm: 39935.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11182
  time (ms)
11183
+ iteration 332/ 159576 | consumed samples: 5312 | elapsed time per iteration (ms): 13653.3 | learning rate: 1.473E-06 | global batch size: 16 | lm loss: 8.148146E+00 | loss scale: 4096.0 | grad norm: 31710.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11184
+ time (ms)
11185
+ iteration 333/ 159576 | consumed samples: 5328 | elapsed time per iteration (ms): 13982.9 | learning rate: 1.478E-06 | global batch size: 16 | lm loss: 8.055049E+00 | loss scale: 4096.0 | grad norm: 40555.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11186
+ time (ms)
11187
+ iteration 334/ 159576 | consumed samples: 5344 | elapsed time per iteration (ms): 13576.5 | learning rate: 1.482E-06 | global batch size: 16 | lm loss: 8.154724E+00 | loss scale: 4096.0 | grad norm: 98189.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11188
+ time (ms)
11189
+ iteration 335/ 159576 | consumed samples: 5360 | elapsed time per iteration (ms): 13666.3 | learning rate: 1.487E-06 | global batch size: 16 | lm loss: 8.056485E+00 | loss scale: 4096.0 | grad norm: 53277.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11190
+ time (ms)
11191
+ iteration 336/ 159576 | consumed samples: 5376 | elapsed time per iteration (ms): 13667.7 | learning rate: 1.491E-06 | global batch size: 16 | lm loss: 7.902112E+00 | loss scale: 4096.0 | grad norm: 35520.620 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11192
+ time (ms)
11193
+ iteration 337/ 159576 | consumed samples: 5392 | elapsed time per iteration (ms): 14189.1 | learning rate: 1.496E-06 | global batch size: 16 | lm loss: 8.211933E+00 | loss scale: 4096.0 | grad norm: 102636.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11194
+ time (ms)
11195
+ iteration 338/ 159576 | consumed samples: 5408 | elapsed time per iteration (ms): 13538.3 | learning rate: 1.500E-06 | global batch size: 16 | lm loss: 8.077993E+00 | loss scale: 4096.0 | grad norm: 74161.424 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11196
+ time (ms)
11197
+ iteration 339/ 159576 | consumed samples: 5424 | elapsed time per iteration (ms): 13690.1 | learning rate: 1.504E-06 | global batch size: 16 | lm loss: 8.002722E+00 | loss scale: 4096.0 | grad norm: 41178.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11198
+ time (ms)
11199
+ iteration 340/ 159576 | consumed samples: 5440 | elapsed time per iteration (ms): 13761.4 | learning rate: 1.509E-06 | global batch size: 16 | lm loss: 8.070647E+00 | loss scale: 4096.0 | grad norm: 146660.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11200
+ time (ms)
11201
+ iteration 341/ 159576 | consumed samples: 5456 | elapsed time per iteration (ms): 13679.6 | learning rate: 1.513E-06 | global batch size: 16 | lm loss: 8.211810E+00 | loss scale: 4096.0 | grad norm: 56011.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11202
+ time (ms)
11203
+ iteration 342/ 159576 | consumed samples: 5472 | elapsed time per iteration (ms): 13958.7 | learning rate: 1.518E-06 | global batch size: 16 | lm loss: 8.028828E+00 | loss scale: 4096.0 | grad norm: 45507.509 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11204
+ time (ms)
11205
+ iteration 343/ 159576 | consumed samples: 5488 | elapsed time per iteration (ms): 13796.1 | learning rate: 1.522E-06 | global batch size: 16 | lm loss: 8.000618E+00 | loss scale: 4096.0 | grad norm: 41366.016 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11206
+ time (ms)
11207
+ iteration 344/ 159576 | consumed samples: 5504 | elapsed time per iteration (ms): 13566.5 | learning rate: 1.527E-06 | global batch size: 16 | lm loss: 8.106353E+00 | loss scale: 4096.0 | grad norm: 86487.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11208
+ time (ms)
11209
+ iteration 345/ 159576 | consumed samples: 5520 | elapsed time per iteration (ms): 13617.7 | learning rate: 1.531E-06 | global batch size: 16 | lm loss: 8.130958E+00 | loss scale: 4096.0 | grad norm: 65559.636 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11210
+ time (ms)
11211
+ iteration 346/ 159576 | consumed samples: 5536 | elapsed time per iteration (ms): 14006.3 | learning rate: 1.536E-06 | global batch size: 16 | lm loss: 8.100373E+00 | loss scale: 4096.0 | grad norm: 50918.888 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11212
+ time (ms)
11213
+ iteration 347/ 159576 | consumed samples: 5552 | elapsed time per iteration (ms): 13652.0 | learning rate: 1.540E-06 | global batch size: 16 | lm loss: 8.193462E+00 | loss scale: 4096.0 | grad norm: 49482.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11214
+ time (ms)
11215
+ iteration 348/ 159576 | consumed samples: 5568 | elapsed time per iteration (ms): 13785.4 | learning rate: 1.544E-06 | global batch size: 16 | lm loss: 8.185720E+00 | loss scale: 4096.0 | grad norm: 33616.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11216
+ time (ms)
11217
+ iteration 349/ 159576 | consumed samples: 5584 | elapsed time per iteration (ms): 13534.7 | learning rate: 1.549E-06 | global batch size: 16 | lm loss: 7.997324E+00 | loss scale: 4096.0 | grad norm: 41224.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11218
+ time (ms)
11219
+ iteration 350/ 159576 | consumed samples: 5600 | elapsed time per iteration (ms): 14148.0 | learning rate: 1.553E-06 | global batch size: 16 | lm loss: 8.069170E+00 | loss scale: 4096.0 | grad norm: 61139.413 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11220
+ time (ms)
11221
+ iteration 351/ 159576 | consumed samples: 5616 | elapsed time per iteration (ms): 13626.0 | learning rate: 1.558E-06 | global batch size: 16 | lm loss: 8.052499E+00 | loss scale: 4096.0 | grad norm: 58965.426 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11222
+ time (ms)
11223
+ iteration 352/ 159576 | consumed samples: 5632 | elapsed time per iteration (ms): 13633.5 | learning rate: 1.562E-06 | global batch size: 16 | lm loss: 8.036291E+00 | loss scale: 4096.0 | grad norm: 38820.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11224
+ time (ms)
11225
+ iteration 353/ 159576 | consumed samples: 5648 | elapsed time per iteration (ms): 13648.6 | learning rate: 1.567E-06 | global batch size: 16 | lm loss: 8.007360E+00 | loss scale: 4096.0 | grad norm: 33342.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11226
+ time (ms)
11227
+ iteration 354/ 159576 | consumed samples: 5664 | elapsed time per iteration (ms): 13707.0 | learning rate: 1.571E-06 | global batch size: 16 | lm loss: 7.890161E+00 | loss scale: 4096.0 | grad norm: 62589.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11228
+ time (ms)
11229
+ iteration 355/ 159576 | consumed samples: 5680 | elapsed time per iteration (ms): 14101.4 | learning rate: 1.575E-06 | global batch size: 16 | lm loss: 8.034273E+00 | loss scale: 4096.0 | grad norm: 62100.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11230
+ time (ms)
11231
+ iteration 356/ 159576 | consumed samples: 5696 | elapsed time per iteration (ms): 13548.4 | learning rate: 1.580E-06 | global batch size: 16 | lm loss: 7.964279E+00 | loss scale: 4096.0 | grad norm: 37283.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11232
+ time (ms)
11233
+ iteration 357/ 159576 | consumed samples: 5712 | elapsed time per iteration (ms): 13655.3 | learning rate: 1.584E-06 | global batch size: 16 | lm loss: 7.882459E+00 | loss scale: 4096.0 | grad norm: 36278.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11234
+ time (ms)
11235
+ iteration 358/ 159576 | consumed samples: 5728 | elapsed time per iteration (ms): 13872.1 | learning rate: 1.589E-06 | global batch size: 16 | lm loss: 8.081428E+00 | loss scale: 4096.0 | grad norm: 59624.520 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11236
+ time (ms)
11237
+ iteration 359/ 159576 | consumed samples: 5744 | elapsed time per iteration (ms): 13830.3 | learning rate: 1.593E-06 | global batch size: 16 | lm loss: 8.345490E+00 | loss scale: 4096.0 | grad norm: 101818.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11238
+ time (ms)
11239
+ iteration 360/ 159576 | consumed samples: 5760 | elapsed time per iteration (ms): 13738.3 | learning rate: 1.598E-06 | global batch size: 16 | lm loss: 8.090802E+00 | loss scale: 4096.0 | grad norm: 37735.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11240
+ time (ms)
11241
+ iteration 361/ 159576 | consumed samples: 5776 | elapsed time per iteration (ms): 13673.7 | learning rate: 1.602E-06 | global batch size: 16 | lm loss: 7.934822E+00 | loss scale: 4096.0 | grad norm: 35051.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11242
+ time (ms)
11243
+ iteration 362/ 159576 | consumed samples: 5792 | elapsed time per iteration (ms): 13779.0 | learning rate: 1.607E-06 | global batch size: 16 | lm loss: 8.217977E+00 | loss scale: 4096.0 | grad norm: 81671.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11244
+ time (ms)
11245
+ iteration 363/ 159576 | consumed samples: 5808 | elapsed time per iteration (ms): 14148.6 | learning rate: 1.611E-06 | global batch size: 16 | lm loss: 7.956856E+00 | loss scale: 4096.0 | grad norm: 123728.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11246
+ time (ms)
11247
+ iteration 364/ 159576 | consumed samples: 5824 | elapsed time per iteration (ms): 13509.6 | learning rate: 1.615E-06 | global batch size: 16 | lm loss: 7.980748E+00 | loss scale: 4096.0 | grad norm: 64323.538 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11248
+ time (ms)
11249
+ iteration 365/ 159576 | consumed samples: 5840 | elapsed time per iteration (ms): 13791.1 | learning rate: 1.620E-06 | global batch size: 16 | lm loss: 7.927495E+00 | loss scale: 4096.0 | grad norm: 38595.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11250
+ time (ms)
11251
+ iteration 366/ 159576 | consumed samples: 5856 | elapsed time per iteration (ms): 13535.8 | learning rate: 1.624E-06 | global batch size: 16 | lm loss: 7.992770E+00 | loss scale: 4096.0 | grad norm: 34786.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11252
+ time (ms)
11253
+ iteration 367/ 159576 | consumed samples: 5872 | elapsed time per iteration (ms): 13709.6 | learning rate: 1.629E-06 | global batch size: 16 | lm loss: 8.033854E+00 | loss scale: 4096.0 | grad norm: 26681.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11254
+ time (ms)
11255
+ iteration 368/ 159576 | consumed samples: 5888 | elapsed time per iteration (ms): 13923.8 | learning rate: 1.633E-06 | global batch size: 16 | lm loss: 8.086361E+00 | loss scale: 4096.0 | grad norm: 116063.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11256
+ time (ms)
11257
+ iteration 369/ 159576 | consumed samples: 5904 | elapsed time per iteration (ms): 13743.2 | learning rate: 1.638E-06 | global batch size: 16 | lm loss: 8.136069E+00 | loss scale: 4096.0 | grad norm: 192843.981 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11258
+ time (ms)
11259
+ iteration 370/ 159576 | consumed samples: 5920 | elapsed time per iteration (ms): 13586.5 | learning rate: 1.642E-06 | global batch size: 16 | lm loss: 8.213842E+00 | loss scale: 4096.0 | grad norm: 66749.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11260
+ time (ms)
11261
+ iteration 371/ 159576 | consumed samples: 5936 | elapsed time per iteration (ms): 13637.5 | learning rate: 1.646E-06 | global batch size: 16 | lm loss: 7.862526E+00 | loss scale: 4096.0 | grad norm: 35628.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11262
+ time (ms)
11263
+ iteration 372/ 159576 | consumed samples: 5952 | elapsed time per iteration (ms): 14269.3 | learning rate: 1.651E-06 | global batch size: 16 | lm loss: 8.111351E+00 | loss scale: 4096.0 | grad norm: 51284.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11264
+ time (ms)
11265
+ iteration 373/ 159576 | consumed samples: 5968 | elapsed time per iteration (ms): 13424.8 | learning rate: 1.655E-06 | global batch size: 16 | lm loss: 7.860275E+00 | loss scale: 4096.0 | grad norm: 51885.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11266
+ time (ms)
11267
+ iteration 374/ 159576 | consumed samples: 5984 | elapsed time per iteration (ms): 13638.9 | learning rate: 1.660E-06 | global batch size: 16 | lm loss: 7.995843E+00 | loss scale: 4096.0 | grad norm: 40982.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11268
+ time (ms)
11269
+ iteration 375/ 159576 | consumed samples: 6000 | elapsed time per iteration (ms): 13719.8 | learning rate: 1.664E-06 | global batch size: 16 | lm loss: 7.989121E+00 | loss scale: 4096.0 | grad norm: 43694.588 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11270
+ time (ms)
11271
+ iteration 376/ 159576 | consumed samples: 6016 | elapsed time per iteration (ms): 13718.2 | learning rate: 1.669E-06 | global batch size: 16 | lm loss: 8.054690E+00 | loss scale: 4096.0 | grad norm: 56142.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11272
+ time (ms)
11273
+ iteration 377/ 159576 | consumed samples: 6032 | elapsed time per iteration (ms): 14087.0 | learning rate: 1.673E-06 | global batch size: 16 | lm loss: 8.145277E+00 | loss scale: 4096.0 | grad norm: 77837.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11274
+ time (ms)
11275
+ iteration 378/ 159576 | consumed samples: 6048 | elapsed time per iteration (ms): 13621.7 | learning rate: 1.678E-06 | global batch size: 16 | lm loss: 7.879861E+00 | loss scale: 4096.0 | grad norm: 35054.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11276
+ time (ms)
11277
+ iteration 379/ 159576 | consumed samples: 6064 | elapsed time per iteration (ms): 13676.7 | learning rate: 1.682E-06 | global batch size: 16 | lm loss: 7.996103E+00 | loss scale: 4096.0 | grad norm: 31871.611 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11278
+ time (ms)
11279
+ iteration 380/ 159576 | consumed samples: 6080 | elapsed time per iteration (ms): 13756.2 | learning rate: 1.686E-06 | global batch size: 16 | lm loss: 7.788074E+00 | loss scale: 4096.0 | grad norm: 30378.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11280
+ time (ms)
11281
+ iteration 381/ 159576 | consumed samples: 6096 | elapsed time per iteration (ms): 13731.7 | learning rate: 1.691E-06 | global batch size: 16 | lm loss: 7.998044E+00 | loss scale: 4096.0 | grad norm: 78167.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11282
+ time (ms)
11283
+ iteration 382/ 159576 | consumed samples: 6112 | elapsed time per iteration (ms): 13696.8 | learning rate: 1.695E-06 | global batch size: 16 | lm loss: 8.001510E+00 | loss scale: 4096.0 | grad norm: 57981.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11284
+ time (ms)
11285
+ iteration 383/ 159576 | consumed samples: 6128 | elapsed time per iteration (ms): 13688.0 | learning rate: 1.700E-06 | global batch size: 16 | lm loss: 8.043833E+00 | loss scale: 4096.0 | grad norm: 40631.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11286
+ time (ms)
11287
+ iteration 384/ 159576 | consumed samples: 6144 | elapsed time per iteration (ms): 13680.4 | learning rate: 1.704E-06 | global batch size: 16 | lm loss: 8.029270E+00 | loss scale: 4096.0 | grad norm: 31579.477 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11288
+ time (ms)
11289
+ iteration 385/ 159576 | consumed samples: 6160 | elapsed time per iteration (ms): 14057.5 | learning rate: 1.709E-06 | global batch size: 16 | lm loss: 8.156369E+00 | loss scale: 4096.0 | grad norm: 87842.060 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11290
+ time (ms)
11291
+ iteration 386/ 159576 | consumed samples: 6176 | elapsed time per iteration (ms): 13765.1 | learning rate: 1.713E-06 | global batch size: 16 | lm loss: 8.024692E+00 | loss scale: 4096.0 | grad norm: 56881.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11292
+ time (ms)
11293
+ iteration 387/ 159576 | consumed samples: 6192 | elapsed time per iteration (ms): 13768.8 | learning rate: 1.717E-06 | global batch size: 16 | lm loss: 7.997876E+00 | loss scale: 4096.0 | grad norm: 31105.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11294
+ time (ms)
11295
+ iteration 388/ 159576 | consumed samples: 6208 | elapsed time per iteration (ms): 13433.5 | learning rate: 1.722E-06 | global batch size: 16 | lm loss: 7.985063E+00 | loss scale: 4096.0 | grad norm: 78090.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11296
+ time (ms)
11297
+ iteration 389/ 159576 | consumed samples: 6224 | elapsed time per iteration (ms): 13675.2 | learning rate: 1.726E-06 | global batch size: 16 | lm loss: 7.926050E+00 | loss scale: 4096.0 | grad norm: 61534.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11298
+ time (ms)
11299
+ iteration 390/ 159576 | consumed samples: 6240 | elapsed time per iteration (ms): 13989.4 | learning rate: 1.731E-06 | global batch size: 16 | lm loss: 7.938218E+00 | loss scale: 4096.0 | grad norm: 37749.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11300
+ time (ms)
11301
+ iteration 391/ 159576 | consumed samples: 6256 | elapsed time per iteration (ms): 13663.4 | learning rate: 1.735E-06 | global batch size: 16 | lm loss: 7.835842E+00 | loss scale: 4096.0 | grad norm: 48700.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
11302
+ time (ms)