| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 19.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 4 | |
| Effective batch : 16 | |
| Max steps : 500 | |
| Warmup steps : 100 | |
| Starting at step : 0 | |
| ============================================================ | |
| step 10 | loss 10.2851 | ppl 29293.0 | lr 3.00e-05 | grad_norm 4.052 | tok/s 25,864 | tokens 0.1M | |
| step 20 | loss 10.0198 | ppl 22467.3 | lr 6.00e-05 | grad_norm 2.655 | tok/s 160,697 | tokens 0.2M | |
| step 30 | loss 9.3891 | ppl 11957.0 | lr 9.00e-05 | grad_norm 3.728 | tok/s 160,392 | tokens 0.2M | |
| step 40 | loss 9.5445 | ppl 13967.4 | lr 1.20e-04 | grad_norm 1.768 | tok/s 163,989 | tokens 0.3M | |
| step 50 | loss 9.1239 | ppl 9171.6 | lr 1.50e-04 | grad_norm 1.714 | tok/s 27,413 | tokens 0.4M | |
| step 60 | loss 8.7827 | ppl 6520.4 | lr 1.80e-04 | grad_norm 1.704 | tok/s 141,023 | tokens 0.5M | |
| step 70 | loss 8.1783 | ppl 3562.8 | lr 2.10e-04 | grad_norm 2.826 | tok/s 148,033 | tokens 0.6M | |
| step 80 | loss 7.6606 | ppl 2123.0 | lr 2.40e-04 | grad_norm 2.140 | tok/s 147,202 | tokens 0.7M | |
| step 90 | loss 7.5468 | ppl 1894.7 | lr 2.70e-04 | grad_norm 2.090 | tok/s 146,928 | tokens 0.7M | |
| step 100 | loss 7.4350 | ppl 1694.2 | lr 3.00e-04 | grad_norm 1.494 | tok/s 144,700 | tokens 0.8M | |
| step 110 | loss 6.8166 | ppl 912.9 | lr 3.00e-04 | grad_norm 1.506 | tok/s 146,963 | tokens 0.9M | |
| step 120 | loss 6.7045 | ppl 816.1 | lr 2.98e-04 | grad_norm 0.905 | tok/s 143,154 | tokens 1.0M | |
| step 130 | loss 7.1999 | ppl 1339.3 | lr 2.96e-04 | grad_norm 1.720 | tok/s 146,313 | tokens 1.1M | |
| step 140 | loss 6.4805 | ppl 652.3 | lr 2.93e-04 | grad_norm 1.222 | tok/s 146,029 | tokens 1.1M | |
| step 150 | loss 6.5224 | ppl 680.2 | lr 2.90e-04 | grad_norm 1.249 | tok/s 143,938 | tokens 1.2M | |
| step 160 | loss 6.1923 | ppl 489.0 | lr 2.85e-04 | grad_norm 0.837 | tok/s 142,391 | tokens 1.3M | |
| step 170 | loss 5.7988 | ppl 329.9 | lr 2.80e-04 | grad_norm 1.159 | tok/s 138,682 | tokens 1.4M | |
| step 180 | loss 5.4822 | ppl 240.4 | lr 2.74e-04 | grad_norm 1.830 | tok/s 140,621 | tokens 1.5M | |
| step 190 | loss 5.7958 | ppl 328.9 | lr 2.68e-04 | grad_norm 2.011 | tok/s 138,092 | tokens 1.6M | |
| step 200 | loss 5.5855 | ppl 266.5 | lr 2.60e-04 | grad_norm 1.324 | tok/s 137,042 | tokens 1.6M | |
| step 210 | loss 5.5561 | ppl 258.8 | lr 2.53e-04 | grad_norm 2.345 | tok/s 135,569 | tokens 1.7M | |
| step 220 | loss 6.1112 | ppl 450.9 | lr 2.44e-04 | grad_norm 1.923 | tok/s 32,184 | tokens 1.8M | |
| step 230 | loss 5.7451 | ppl 312.7 | lr 2.36e-04 | grad_norm 1.185 | tok/s 141,392 | tokens 1.9M | |
| step 240 | loss 5.8963 | ppl 363.7 | lr 2.26e-04 | grad_norm 1.359 | tok/s 141,273 | tokens 2.0M | |
| step 250 | loss 5.6616 | ppl 287.6 | lr 2.17e-04 | grad_norm 1.267 | tok/s 138,583 | tokens 2.0M | |
| step 260 | loss 5.9460 | ppl 382.2 | lr 2.07e-04 | grad_norm 1.080 | tok/s 118,349 | tokens 2.1M | |
| step 270 | loss 5.8066 | ppl 332.5 | lr 1.97e-04 | grad_norm 1.272 | tok/s 139,363 | tokens 2.2M | |
| step 280 | loss 5.3027 | ppl 200.9 | lr 1.86e-04 | grad_norm 1.128 | tok/s 134,230 | tokens 2.3M | |
| step 290 | loss 5.2220 | ppl 185.3 | lr 1.76e-04 | grad_norm 1.407 | tok/s 135,357 | tokens 2.4M | |
| step 300 | loss 5.3298 | ppl 206.4 | lr 1.65e-04 | grad_norm 1.152 | tok/s 132,527 | tokens 2.5M | |
| step 310 | loss 5.4330 | ppl 228.8 | lr 1.54e-04 | grad_norm 1.025 | tok/s 131,576 | tokens 2.5M | |
| step 320 | loss 4.8691 | ppl 130.2 | lr 1.44e-04 | grad_norm 5.866 | tok/s 99,206 | tokens 2.6M | |
| step 330 | loss 3.9856 | ppl 53.8 | lr 1.33e-04 | grad_norm 3.827 | tok/s 127,265 | tokens 2.7M | |
| step 340 | loss 3.6698 | ppl 39.2 | lr 1.23e-04 | grad_norm 3.089 | tok/s 133,201 | tokens 2.8M | |
| step 350 | loss 3.3906 | ppl 29.7 | lr 1.13e-04 | grad_norm 2.439 | tok/s 137,736 | tokens 2.9M | |
| step 360 | loss 3.1164 | ppl 22.6 | lr 1.04e-04 | grad_norm 2.383 | tok/s 137,665 | tokens 2.9M | |
| step 370 | loss 6.2330 | ppl 509.3 | lr 9.45e-05 | grad_norm 1.745 | tok/s 131,404 | tokens 3.0M | |
| step 380 | loss 5.7636 | ppl 318.5 | lr 8.56e-05 | grad_norm 1.608 | tok/s 126,951 | tokens 3.1M | |
| step 390 | loss 5.7618 | ppl 317.9 | lr 7.73e-05 | grad_norm 1.380 | tok/s 128,218 | tokens 3.2M | |
| step 400 | loss 5.4140 | ppl 224.5 | lr 6.95e-05 | grad_norm 1.076 | tok/s 122,565 | tokens 3.3M | |
| step 410 | loss 5.7961 | ppl 329.0 | lr 6.23e-05 | grad_norm 1.375 | tok/s 128,397 | tokens 3.4M | |
| step 420 | loss 5.3759 | ppl 216.1 | lr 5.58e-05 | grad_norm 1.226 | tok/s 35,905 | tokens 3.4M | |
| step 430 | loss 5.1593 | ppl 174.0 | lr 4.99e-05 | grad_norm 2.204 | tok/s 130,237 | tokens 3.5M | |
| step 440 | loss 5.2702 | ppl 194.4 | lr 4.47e-05 | grad_norm 1.829 | tok/s 124,501 | tokens 3.6M | |
| step 450 | loss 5.8422 | ppl 344.5 | lr 4.03e-05 | grad_norm 2.139 | tok/s 129,253 | tokens 3.7M | |
| step 460 | loss 5.4688 | ppl 237.2 | lr 3.66e-05 | grad_norm 1.551 | tok/s 126,371 | tokens 3.8M | |
| step 470 | loss 6.1985 | ppl 492.0 | lr 3.37e-05 | grad_norm 1.626 | tok/s 123,476 | tokens 3.9M | |
| step 480 | loss 5.0077 | ppl 149.6 | lr 3.17e-05 | grad_norm 1.168 | tok/s 123,073 | tokens 3.9M | |
| step 490 | loss 5.2723 | ppl 194.9 | lr 3.04e-05 | grad_norm 1.290 | tok/s 129,078 | tokens 4.0M | |
| step 500 | loss 5.4226 | ppl 226.5 | lr 3.00e-05 | grad_norm 1.398 | tok/s 124,580 | tokens 4.1M | |
| Training complete. | |
| Total tokens seen: 0.004B | |
| Resuming from checkpoint: step_0000500 | |
| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 55.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 64 | |
| Effective batch : 256 | |
| Max steps : 100,000 | |
| Warmup steps : 2,000 | |
| Starting at step : 0 | |
| ============================================================ | |
| Training complete. | |
| Total tokens seen: 0.001B | |
| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 55.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 64 | |
| Effective batch : 256 | |
| Max steps : 100,000 | |
| Warmup steps : 2,000 | |
| Starting at step : 0 | |
| ============================================================ | |
| step 50 | loss 9.4613 | ppl 12852.1 | lr 7.50e-06 | grad 2.951 | tok/s 8 | seen 26.2M | |
| step 100 | loss 8.7782 | ppl 6490.9 | lr 1.50e-05 | grad 2.457 | tok/s 8 | seen 52.4M | |
| step 150 | loss 8.0846 | ppl 3244.2 | lr 2.25e-05 | grad 2.020 | tok/s 8 | seen 78.6M | |
| step 200 | loss 7.3689 | ppl 1585.9 | lr 3.00e-05 | grad 2.172 | tok/s 8 | seen 104.9M | |
| step 250 | loss 6.6136 | ppl 745.2 | lr 3.75e-05 | grad 2.493 | tok/s 8 | seen 131.1M | |
| step 300 | loss 6.0596 | ppl 428.2 | lr 4.50e-05 | grad 2.529 | tok/s 8 | seen 157.3M | |
| step 350 | loss 5.6656 | ppl 288.8 | lr 5.25e-05 | grad 2.855 | tok/s 7 | seen 183.5M | |
| step 400 | loss 5.4235 | ppl 226.7 | lr 6.00e-05 | grad 2.430 | tok/s 7 | seen 209.7M | |
| step 450 | loss 5.2467 | ppl 189.9 | lr 6.75e-05 | grad 3.429 | tok/s 7 | seen 235.9M | |
| step 500 | loss 4.6429 | ppl 103.8 | lr 7.50e-05 | grad 2.066 | tok/s 7 | seen 262.1M | |
| Interrupted at step 516 - saving checkpoint... | |
| Saving final checkpoint... | |
| Training complete. | |
| Final step : 516 | |
| Tokens seen : 0.2705B | |
| Best loss : 4.6429 | |
| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 55.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 64 | |
| Effective batch : 256 | |
| Max steps : 100,000 | |
| Warmup steps : 2,000 | |
| Starting at step : 0 | |
| ============================================================ | |
| Interrupted at step 39 - saving checkpoint... | |
| Saving final checkpoint... | |
| Training complete. | |
| Final step : 39 | |
| Tokens seen : 0.0204B | |
| Best loss : inf | |
| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 55.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 64 | |
| Effective batch : 256 | |
| Max steps : 100,000 | |
| Warmup steps : 2,000 | |
| Starting at step : 0 | |
| ============================================================ | |
| step 10 | loss 10.4490 | ppl 34508.1 | lr 1.50e-06 | grad 4.535 | tok/s 58 | seen 2.6M | |
| step 20 | loss 10.3129 | ppl 30117.4 | lr 3.00e-06 | grad 4.583 | tok/s 57 | seen 5.2M | |
| step 30 | loss 10.0833 | ppl 23938.8 | lr 4.50e-06 | grad 4.126 | tok/s 59 | seen 7.9M | |
| step 40 | loss 9.8450 | ppl 18863.7 | lr 6.00e-06 | grad 2.710 | tok/s 60 | seen 10.5M | |
| step 50 | loss 9.6893 | ppl 16143.3 | lr 7.50e-06 | grad 1.865 | tok/s 60 | seen 13.1M | |
| step 60 | loss 9.5579 | ppl 14155.6 | lr 9.00e-06 | grad 1.699 | tok/s 61 | seen 15.7M | |
| step 70 | loss 9.4685 | ppl 12945.8 | lr 1.05e-05 | grad 1.686 | tok/s 60 | seen 18.4M | |
| step 80 | loss 9.2852 | ppl 10777.0 | lr 1.20e-05 | grad 1.674 | tok/s 61 | seen 21.0M | |
| step 90 | loss 9.0739 | ppl 8724.5 | lr 1.35e-05 | grad 1.810 | tok/s 61 | seen 23.6M | |
| step 100 | loss 8.8261 | ppl 6809.6 | lr 1.50e-05 | grad 1.984 | tok/s 61 | seen 26.2M | |
| step 110 | loss 8.6343 | ppl 5621.0 | lr 1.65e-05 | grad 1.792 | tok/s 60 | seen 28.8M | |
| step 120 | loss 8.4026 | ppl 4458.7 | lr 1.80e-05 | grad 1.705 | tok/s 59 | seen 31.5M | |
| step 130 | loss 8.2071 | ppl 3666.7 | lr 1.95e-05 | grad 1.680 | tok/s 59 | seen 34.1M | |
| step 140 | loss 8.0191 | ppl 3038.4 | lr 2.10e-05 | grad 2.358 | tok/s 60 | seen 36.7M | |
| step 150 | loss 7.8321 | ppl 2520.2 | lr 2.25e-05 | grad 1.718 | tok/s 60 | seen 39.3M | |
| step 160 | loss 7.6757 | ppl 2155.3 | lr 2.40e-05 | grad 1.794 | tok/s 60 | seen 41.9M | |
| step 170 | loss 7.4599 | ppl 1737.0 | lr 2.55e-05 | grad 1.644 | tok/s 59 | seen 44.6M | |
| step 180 | loss 7.2797 | ppl 1450.6 | lr 2.70e-05 | grad 1.721 | tok/s 59 | seen 47.2M | |
| step 190 | loss 7.1400 | ppl 1261.5 | lr 2.85e-05 | grad 2.358 | tok/s 60 | seen 49.8M | |
| step 200 | loss 6.9297 | ppl 1022.2 | lr 3.00e-05 | grad 3.360 | tok/s 61 | seen 52.4M | |
| step 210 | loss 6.8119 | ppl 908.6 | lr 3.15e-05 | grad 2.030 | tok/s 61 | seen 55.1M | |
| step 220 | loss 6.6312 | ppl 758.4 | lr 3.30e-05 | grad 2.938 | tok/s 61 | seen 57.7M | |
| step 230 | loss 6.5064 | ppl 669.4 | lr 3.45e-05 | grad 2.072 | tok/s 61 | seen 60.3M | |
| step 240 | loss 6.3065 | ppl 548.1 | lr 3.60e-05 | grad 2.517 | tok/s 61 | seen 62.9M | |
| step 250 | loss 6.1578 | ppl 472.4 | lr 3.75e-05 | grad 2.960 | tok/s 61 | seen 65.5M | |
| step 260 | loss 6.0176 | ppl 410.6 | lr 3.90e-05 | grad 2.333 | tok/s 61 | seen 68.2M | |
| step 270 | loss 5.9160 | ppl 370.9 | lr 4.05e-05 | grad 3.197 | tok/s 61 | seen 70.8M | |
| step 280 | loss 5.7455 | ppl 312.8 | lr 4.20e-05 | grad 1.903 | tok/s 62 | seen 73.4M | |
| step 290 | loss 5.5727 | ppl 263.1 | lr 4.35e-05 | grad 3.205 | tok/s 59 | seen 76.0M | |
| step 300 | loss 5.4635 | ppl 235.9 | lr 4.50e-05 | grad 3.429 | tok/s 60 | seen 78.6M | |
| step 310 | loss 5.3731 | ppl 215.5 | lr 4.65e-05 | grad 2.904 | tok/s 60 | seen 81.3M | |
| step 320 | loss 5.3094 | ppl 202.2 | lr 4.80e-05 | grad 2.950 | tok/s 61 | seen 83.9M | |
| step 330 | loss 5.1342 | ppl 169.7 | lr 4.95e-05 | grad 3.316 | tok/s 61 | seen 86.5M | |
| step 340 | loss 5.0432 | ppl 155.0 | lr 5.10e-05 | grad 2.353 | tok/s 61 | seen 89.1M | |
| step 350 | loss 4.9551 | ppl 141.9 | lr 5.25e-05 | grad 3.384 | tok/s 62 | seen 91.8M | |
| step 360 | loss 4.7992 | ppl 121.4 | lr 5.40e-05 | grad 3.811 | tok/s 61 | seen 94.4M | |
| step 370 | loss 4.7974 | ppl 121.2 | lr 5.55e-05 | grad 3.617 | tok/s 61 | seen 97.0M | |
| step 380 | loss 4.7003 | ppl 110.0 | lr 5.70e-05 | grad 3.102 | tok/s 61 | seen 99.6M | |
| step 390 | loss 4.6559 | ppl 105.2 | lr 5.85e-05 | grad 3.267 | tok/s 62 | seen 102.2M | |
| step 400 | loss 4.4966 | ppl 89.7 | lr 6.00e-05 | grad 3.379 | tok/s 62 | seen 104.9M | |
| step 410 | loss 4.4830 | ppl 88.5 | lr 6.15e-05 | grad 2.970 | tok/s 62 | seen 107.5M | |
| step 420 | loss 4.4303 | ppl 84.0 | lr 6.30e-05 | grad 3.522 | tok/s 62 | seen 110.1M | |
| step 430 | loss 4.3359 | ppl 76.4 | lr 6.45e-05 | grad 4.247 | tok/s 62 | seen 112.7M | |
| step 440 | loss 4.2892 | ppl 72.9 | lr 6.60e-05 | grad 3.448 | tok/s 62 | seen 115.3M | |
| step 450 | loss 4.1724 | ppl 64.9 | lr 6.75e-05 | grad 3.333 | tok/s 62 | seen 118.0M | |
| step 460 | loss 4.1520 | ppl 63.6 | lr 6.90e-05 | grad 3.319 | tok/s 62 | seen 120.6M | |
| step 470 | loss 4.0679 | ppl 58.4 | lr 7.05e-05 | grad 4.136 | tok/s 62 | seen 123.2M | |
| step 480 | loss 4.0105 | ppl 55.2 | lr 7.20e-05 | grad 3.893 | tok/s 62 | seen 125.8M | |
| step 490 | loss 3.9564 | ppl 52.3 | lr 7.35e-05 | grad 3.888 | tok/s 62 | seen 128.5M | |
| step 500 | loss 3.9136 | ppl 50.1 | lr 7.50e-05 | grad 3.667 | tok/s 61 | seen 131.1M | |
| step 510 | loss 3.9078 | ppl 49.8 | lr 7.65e-05 | grad 3.179 | tok/s 61 | seen 133.7M | |
| step 520 | loss 3.7887 | ppl 44.2 | lr 7.80e-05 | grad 3.993 | tok/s 62 | seen 136.3M | |
| step 530 | loss 3.8218 | ppl 45.7 | lr 7.95e-05 | grad 3.302 | tok/s 62 | seen 138.9M | |
| step 540 | loss 3.7728 | ppl 43.5 | lr 8.10e-05 | grad 3.918 | tok/s 62 | seen 141.6M | |
| step 550 | loss 3.7233 | ppl 41.4 | lr 8.25e-05 | grad 3.469 | tok/s 62 | seen 144.2M | |
| step 560 | loss 3.6869 | ppl 39.9 | lr 8.40e-05 | grad 3.651 | tok/s 62 | seen 146.8M | |
| step 570 | loss 3.6061 | ppl 36.8 | lr 8.55e-05 | grad 3.574 | tok/s 62 | seen 149.4M | |
| step 580 | loss 3.6062 | ppl 36.8 | lr 8.70e-05 | grad 3.951 | tok/s 62 | seen 152.0M | |
| step 590 | loss 3.6136 | ppl 37.1 | lr 8.85e-05 | grad 3.335 | tok/s 62 | seen 154.7M | |
| step 600 | loss 3.5540 | ppl 35.0 | lr 9.00e-05 | grad 3.088 | tok/s 62 | seen 157.3M | |
| step 610 | loss 3.5245 | ppl 33.9 | lr 9.15e-05 | grad 3.727 | tok/s 62 | seen 159.9M | |
| step 620 | loss 3.4652 | ppl 32.0 | lr 9.30e-05 | grad 2.677 | tok/s 62 | seen 162.5M | |
| step 630 | loss 3.5270 | ppl 34.0 | lr 9.45e-05 | grad 3.213 | tok/s 62 | seen 165.2M | |
| step 640 | loss 3.4596 | ppl 31.8 | lr 9.60e-05 | grad 4.020 | tok/s 62 | seen 167.8M | |
| step 650 | loss 3.3859 | ppl 29.5 | lr 9.75e-05 | grad 3.493 | tok/s 62 | seen 170.4M | |
| step 660 | loss 3.3820 | ppl 29.4 | lr 9.90e-05 | grad 4.305 | tok/s 62 | seen 173.0M | |
| step 670 | loss 3.3203 | ppl 27.7 | lr 1.01e-04 | grad 2.714 | tok/s 62 | seen 175.6M | |
| step 680 | loss 3.3330 | ppl 28.0 | lr 1.02e-04 | grad 2.684 | tok/s 62 | seen 178.3M | |
| step 690 | loss 3.2408 | ppl 25.6 | lr 1.03e-04 | grad 2.748 | tok/s 62 | seen 180.9M | |
| step 700 | loss 3.2040 | ppl 24.6 | lr 1.05e-04 | grad 3.681 | tok/s 62 | seen 183.5M | |
| step 710 | loss 3.2320 | ppl 25.3 | lr 1.06e-04 | grad 3.225 | tok/s 62 | seen 186.1M | |
| step 720 | loss 3.1874 | ppl 24.2 | lr 1.08e-04 | grad 3.005 | tok/s 62 | seen 188.7M | |
| step 730 | loss 3.1258 | ppl 22.8 | lr 1.09e-04 | grad 3.105 | tok/s 62 | seen 191.4M | |
| step 740 | loss 3.1309 | ppl 22.9 | lr 1.11e-04 | grad 2.935 | tok/s 62 | seen 194.0M | |
| step 750 | loss 3.0680 | ppl 21.5 | lr 1.12e-04 | grad 2.580 | tok/s 62 | seen 196.6M | |
| step 760 | loss 3.0686 | ppl 21.5 | lr 1.14e-04 | grad 3.297 | tok/s 62 | seen 199.2M | |
| step 770 | loss 3.0904 | ppl 22.0 | lr 1.15e-04 | grad 2.188 | tok/s 62 | seen 201.9M | |
| step 780 | loss 2.9938 | ppl 20.0 | lr 1.17e-04 | grad 2.493 | tok/s 62 | seen 204.5M | |
| step 790 | loss 2.9467 | ppl 19.0 | lr 1.18e-04 | grad 2.523 | tok/s 62 | seen 207.1M | |
| step 800 | loss 2.9396 | ppl 18.9 | lr 1.20e-04 | grad 2.738 | tok/s 62 | seen 209.7M | |
| step 810 | loss 2.9950 | ppl 20.0 | lr 1.21e-04 | grad 3.004 | tok/s 63 | seen 212.3M | |
| step 820 | loss 2.9561 | ppl 19.2 | lr 1.23e-04 | grad 3.003 | tok/s 62 | seen 215.0M | |
| step 830 | loss 2.9605 | ppl 19.3 | lr 1.24e-04 | grad 2.962 | tok/s 62 | seen 217.6M | |
| step 840 | loss 2.9312 | ppl 18.7 | lr 1.26e-04 | grad 2.862 | tok/s 62 | seen 220.2M | |
| step 850 | loss 2.8155 | ppl 16.7 | lr 1.27e-04 | grad 2.646 | tok/s 62 | seen 222.8M | |
| step 860 | loss 2.8339 | ppl 17.0 | lr 1.29e-04 | grad 2.421 | tok/s 62 | seen 225.4M | |
| step 870 | loss 2.7365 | ppl 15.4 | lr 1.31e-04 | grad 2.453 | tok/s 62 | seen 228.1M | |
| step 880 | loss 2.8475 | ppl 17.2 | lr 1.32e-04 | grad 2.291 | tok/s 62 | seen 230.7M | |
| step 890 | loss 2.7599 | ppl 15.8 | lr 1.33e-04 | grad 2.094 | tok/s 63 | seen 233.3M | |
| step 900 | loss 2.8108 | ppl 16.6 | lr 1.35e-04 | grad 2.012 | tok/s 63 | seen 235.9M | |
| step 910 | loss 2.7658 | ppl 15.9 | lr 1.36e-04 | grad 2.045 | tok/s 62 | seen 238.6M | |
| step 920 | loss 2.6955 | ppl 14.8 | lr 1.38e-04 | grad 2.460 | tok/s 62 | seen 241.2M | |
| step 930 | loss 2.7042 | ppl 14.9 | lr 1.40e-04 | grad 2.337 | tok/s 62 | seen 243.8M | |
| step 940 | loss 2.6937 | ppl 14.8 | lr 1.41e-04 | grad 2.088 | tok/s 62 | seen 246.4M | |
| step 950 | loss 2.6568 | ppl 14.3 | lr 1.42e-04 | grad 2.562 | tok/s 62 | seen 249.0M | |
| step 960 | loss 2.6710 | ppl 14.5 | lr 1.44e-04 | grad 2.191 | tok/s 63 | seen 251.7M | |
| step 970 | loss 2.6961 | ppl 14.8 | lr 1.45e-04 | grad 2.023 | tok/s 63 | seen 254.3M | |
| step 980 | loss 2.5773 | ppl 13.2 | lr 1.47e-04 | grad 1.896 | tok/s 63 | seen 256.9M | |
| step 990 | loss 2.6115 | ppl 13.6 | lr 1.48e-04 | grad 2.380 | tok/s 63 | seen 259.5M | |
| step 1,000 | loss 2.5569 | ppl 12.9 | lr 1.50e-04 | grad 2.033 | tok/s 63 | seen 262.1M | |
| step 1,010 | loss 2.5382 | ppl 12.7 | lr 1.51e-04 | grad 2.219 | tok/s 63 | seen 264.8M | |
| step 1,020 | loss 2.5224 | ppl 12.5 | lr 1.53e-04 | grad 2.164 | tok/s 63 | seen 267.4M | |
| step 1,030 | loss 2.5360 | ppl 12.6 | lr 1.54e-04 | grad 1.924 | tok/s 62 | seen 270.0M | |
| step 1,040 | loss 2.4306 | ppl 11.4 | lr 1.56e-04 | grad 2.096 | tok/s 63 | seen 272.6M | |
| step 1,050 | loss 2.4811 | ppl 12.0 | lr 1.57e-04 | grad 2.001 | tok/s 63 | seen 275.3M | |
| step 1,060 | loss 2.4113 | ppl 11.1 | lr 1.59e-04 | grad 1.713 | tok/s 63 | seen 277.9M | |
| step 1,070 | loss 2.4479 | ppl 11.6 | lr 1.60e-04 | grad 2.032 | tok/s 63 | seen 280.5M | |
| step 1,080 | loss 2.3633 | ppl 10.6 | lr 1.62e-04 | grad 1.738 | tok/s 63 | seen 283.1M | |
| step 1,090 | loss 2.4176 | ppl 11.2 | lr 1.63e-04 | grad 1.932 | tok/s 63 | seen 285.7M | |
| step 1,100 | loss 2.4043 | ppl 11.1 | lr 1.65e-04 | grad 2.027 | tok/s 63 | seen 288.4M | |
| step 1,110 | loss 2.3005 | ppl 10.0 | lr 1.67e-04 | grad 1.805 | tok/s 63 | seen 291.0M | |
| step 1,120 | loss 2.3403 | ppl 10.4 | lr 1.68e-04 | grad 1.928 | tok/s 63 | seen 293.6M | |
| step 1,130 | loss 2.3166 | ppl 10.1 | lr 1.69e-04 | grad 1.748 | tok/s 63 | seen 296.2M | |
| step 1,140 | loss 2.3251 | ppl 10.2 | lr 1.71e-04 | grad 1.710 | tok/s 63 | seen 298.8M | |
| step 1,150 | loss 2.2780 | ppl 9.8 | lr 1.72e-04 | grad 1.720 | tok/s 63 | seen 301.5M | |
| step 1,160 | loss 2.2549 | ppl 9.5 | lr 1.74e-04 | grad 1.708 | tok/s 63 | seen 304.1M | |
| step 1,170 | loss 2.1659 | ppl 8.7 | lr 1.75e-04 | grad 1.929 | tok/s 63 | seen 306.7M | |
| step 1,180 | loss 2.2427 | ppl 9.4 | lr 1.77e-04 | grad 1.807 | tok/s 63 | seen 309.3M | |
| step 1,190 | loss 2.2096 | ppl 9.1 | lr 1.78e-04 | grad 1.682 | tok/s 63 | seen 312.0M | |
| step 1,200 | loss 2.2308 | ppl 9.3 | lr 1.80e-04 | grad 1.734 | tok/s 63 | seen 314.6M | |
| step 1,210 | loss 2.1796 | ppl 8.8 | lr 1.81e-04 | grad 1.510 | tok/s 63 | seen 317.2M | |
| step 1,220 | loss 2.1568 | ppl 8.6 | lr 1.83e-04 | grad 1.508 | tok/s 63 | seen 319.8M | |
| step 1,230 | loss 2.1347 | ppl 8.5 | lr 1.84e-04 | grad 1.632 | tok/s 63 | seen 322.4M | |
| step 1,240 | loss 2.1687 | ppl 8.7 | lr 1.86e-04 | grad 1.739 | tok/s 63 | seen 325.1M | |
| step 1,250 | loss 2.1187 | ppl 8.3 | lr 1.87e-04 | grad 1.867 | tok/s 63 | seen 327.7M | |
| step 1,260 | loss 2.1271 | ppl 8.4 | lr 1.89e-04 | grad 1.686 | tok/s 63 | seen 330.3M | |
| step 1,270 | loss 2.1035 | ppl 8.2 | lr 1.90e-04 | grad 1.623 | tok/s 63 | seen 332.9M | |
| step 1,280 | loss 2.1355 | ppl 8.5 | lr 1.92e-04 | grad 1.558 | tok/s 63 | seen 335.5M | |
| step 1,290 | loss 2.0740 | ppl 8.0 | lr 1.93e-04 | grad 1.429 | tok/s 63 | seen 338.2M | |
| step 1,300 | loss 2.0338 | ppl 7.6 | lr 1.95e-04 | grad 1.621 | tok/s 63 | seen 340.8M | |
| step 1,310 | loss 2.0089 | ppl 7.5 | lr 1.96e-04 | grad 1.306 | tok/s 63 | seen 343.4M | |
| step 1,320 | loss 2.0743 | ppl 8.0 | lr 1.98e-04 | grad 1.503 | tok/s 63 | seen 346.0M | |
| step 1,330 | loss 2.0003 | ppl 7.4 | lr 1.99e-04 | grad 1.552 | tok/s 63 | seen 348.7M | |
| step 1,340 | loss 1.9406 | ppl 7.0 | lr 2.01e-04 | grad 1.460 | tok/s 64 | seen 351.3M | |
| step 1,350 | loss 1.9834 | ppl 7.3 | lr 2.02e-04 | grad 1.368 | tok/s 64 | seen 353.9M | |
| step 1,360 | loss 2.0318 | ppl 7.6 | lr 2.04e-04 | grad 1.493 | tok/s 64 | seen 356.5M | |
| step 1,370 | loss 2.0265 | ppl 7.6 | lr 2.06e-04 | grad 1.488 | tok/s 64 | seen 359.1M | |
| step 1,380 | loss 1.9557 | ppl 7.1 | lr 2.07e-04 | grad 1.160 | tok/s 64 | seen 361.8M | |
| step 1,390 | loss 1.9612 | ppl 7.1 | lr 2.08e-04 | grad 1.297 | tok/s 64 | seen 364.4M | |
| step 1,400 | loss 1.9965 | ppl 7.4 | lr 2.10e-04 | grad 1.502 | tok/s 64 | seen 367.0M | |
| step 1,410 | loss 1.9459 | ppl 7.0 | lr 2.11e-04 | grad 1.385 | tok/s 64 | seen 369.6M | |
| step 1,420 | loss 1.9225 | ppl 6.8 | lr 2.13e-04 | grad 1.223 | tok/s 64 | seen 372.2M | |
| step 1,430 | loss 1.9616 | ppl 7.1 | lr 2.14e-04 | grad 1.363 | tok/s 64 | seen 374.9M | |
| step 1,440 | loss 1.9219 | ppl 6.8 | lr 2.16e-04 | grad 1.187 | tok/s 64 | seen 377.5M | |
| step 1,450 | loss 1.8887 | ppl 6.6 | lr 2.17e-04 | grad 1.135 | tok/s 64 | seen 380.1M | |
| step 1,460 | loss 1.9003 | ppl 6.7 | lr 2.19e-04 | grad 1.423 | tok/s 64 | seen 382.7M | |
| step 1,470 | loss 1.9138 | ppl 6.8 | lr 2.20e-04 | grad 1.261 | tok/s 64 | seen 385.4M | |
| step 1,480 | loss 1.8593 | ppl 6.4 | lr 2.22e-04 | grad 1.267 | tok/s 63 | seen 388.0M | |
| step 1,490 | loss 1.9467 | ppl 7.0 | lr 2.23e-04 | grad 1.285 | tok/s 63 | seen 390.6M | |
| step 1,500 | loss 1.9067 | ppl 6.7 | lr 2.25e-04 | grad 1.325 | tok/s 63 | seen 393.2M | |
| step 1,510 | loss 1.8412 | ppl 6.3 | lr 2.26e-04 | grad 1.224 | tok/s 63 | seen 395.8M | |
| step 1,520 | loss 1.8760 | ppl 6.5 | lr 2.28e-04 | grad 1.233 | tok/s 63 | seen 398.5M | |
| step 1,530 | loss 1.8651 | ppl 6.5 | lr 2.29e-04 | grad 1.291 | tok/s 63 | seen 401.1M | |
| step 1,540 | loss 1.8816 | ppl 6.6 | lr 2.31e-04 | grad 1.218 | tok/s 63 | seen 403.7M | |
| step 1,550 | loss 1.8030 | ppl 6.1 | lr 2.32e-04 | grad 1.240 | tok/s 63 | seen 406.3M | |
| step 1,560 | loss 1.7983 | ppl 6.0 | lr 2.34e-04 | grad 1.086 | tok/s 63 | seen 408.9M | |
| step 1,570 | loss 1.7998 | ppl 6.0 | lr 2.35e-04 | grad 1.295 | tok/s 63 | seen 411.6M | |
| step 1,580 | loss 1.8366 | ppl 6.3 | lr 2.37e-04 | grad 1.258 | tok/s 63 | seen 414.2M | |
| step 1,590 | loss 1.8160 | ppl 6.1 | lr 2.38e-04 | grad 1.277 | tok/s 62 | seen 416.8M | |
| step 1,600 | loss 1.8246 | ppl 6.2 | lr 2.40e-04 | grad 1.190 | tok/s 62 | seen 419.4M | |
| step 1,610 | loss 1.7614 | ppl 5.8 | lr 2.41e-04 | grad 1.187 | tok/s 62 | seen 422.1M | |
| step 1,620 | loss 1.7790 | ppl 5.9 | lr 2.43e-04 | grad 1.124 | tok/s 62 | seen 424.7M | |
| step 1,630 | loss 1.8163 | ppl 6.1 | lr 2.44e-04 | grad 1.082 | tok/s 62 | seen 427.3M | |
| step 1,640 | loss 1.7867 | ppl 6.0 | lr 2.46e-04 | grad 1.128 | tok/s 62 | seen 429.9M | |
| step 1,650 | loss 1.7226 | ppl 5.6 | lr 2.47e-04 | grad 1.070 | tok/s 62 | seen 432.5M | |
| step 1,660 | loss 1.7253 | ppl 5.6 | lr 2.49e-04 | grad 1.062 | tok/s 60 | seen 435.2M | |
| Interrupted at step 1,660 - saving checkpoint... | |
| Saving final checkpoint... | |
| Training complete. | |
| Final step : 1,660 | |
| Tokens seen : 0.4352B | |
| Best loss : 1.7226 | |
| No checkpoint found - starting from scratch. | |
| ============================================================ | |
| PyCraft-1 Training Started | |
| Device : cuda | |
| Parameters : 55.3M | |
| Micro batch size : 4 | |
| Grad accumulation: 64 | |
| Effective batch : 256 | |
| Max steps : 4,000 | |
| Warmup steps : 500 | |
| Starting at step : 0 | |
| ============================================================ | |
| step 10 | loss 10.3230 | ppl 30424.1 | lr 6.00e-06 | grad 5.035 | tok/s 66 | seen 2.6M | |
| step 20 | loss 9.8974 | ppl 19878.0 | lr 1.20e-05 | grad 3.066 | tok/s 61 | seen 5.2M | |
| step 30 | loss 9.5945 | ppl 14684.3 | lr 1.80e-05 | grad 1.865 | tok/s 59 | seen 7.9M | |
| step 40 | loss 9.4102 | ppl 12211.8 | lr 2.40e-05 | grad 1.777 | tok/s 58 | seen 10.5M | |
| step 50 | loss 9.0390 | ppl 8425.2 | lr 3.00e-05 | grad 1.872 | tok/s 58 | seen 13.1M | |
| step 60 | loss 8.6552 | ppl 5740.1 | lr 3.60e-05 | grad 1.762 | tok/s 57 | seen 15.7M | |
| step 70 | loss 8.1813 | ppl 3573.5 | lr 4.20e-05 | grad 1.681 | tok/s 57 | seen 18.4M | |
| step 80 | loss 7.8688 | ppl 2614.4 | lr 4.80e-05 | grad 1.699 | tok/s 57 | seen 21.0M | |
| step 90 | loss 7.4457 | ppl 1712.4 | lr 5.40e-05 | grad 1.442 | tok/s 57 | seen 23.6M | |
| step 100 | loss 7.1188 | ppl 1234.9 | lr 6.00e-05 | grad 1.951 | tok/s 57 | seen 26.2M | |
| step 110 | loss 6.7851 | ppl 884.6 | lr 6.60e-05 | grad 1.751 | tok/s 57 | seen 28.8M | |
| step 120 | loss 6.4157 | ppl 611.4 | lr 7.20e-05 | grad 1.636 | tok/s 57 | seen 31.5M | |
| step 130 | loss 6.1986 | ppl 492.1 | lr 7.80e-05 | grad 2.636 | tok/s 57 | seen 34.1M | |
| step 140 | loss 5.9078 | ppl 367.9 | lr 8.40e-05 | grad 2.430 | tok/s 57 | seen 36.7M | |
| step 150 | loss 5.6373 | ppl 280.7 | lr 9.00e-05 | grad 2.208 | tok/s 57 | seen 39.3M | |
| step 160 | loss 5.4770 | ppl 239.1 | lr 9.60e-05 | grad 1.756 | tok/s 57 | seen 41.9M | |
| step 170 | loss 5.2201 | ppl 185.0 | lr 1.02e-04 | grad 2.079 | tok/s 57 | seen 44.6M | |
| step 180 | loss 5.0313 | ppl 153.1 | lr 1.08e-04 | grad 2.149 | tok/s 56 | seen 47.2M | |
| step 190 | loss 4.8529 | ppl 128.1 | lr 1.14e-04 | grad 2.349 | tok/s 57 | seen 49.8M | |
| step 200 | loss 4.7704 | ppl 118.0 | lr 1.20e-04 | grad 2.118 | tok/s 57 | seen 52.4M | |
| step 210 | loss 4.5680 | ppl 96.4 | lr 1.26e-04 | grad 1.894 | tok/s 56 | seen 55.1M | |
| step 220 | loss 4.4398 | ppl 84.8 | lr 1.32e-04 | grad 1.801 | tok/s 56 | seen 57.7M | |
| step 230 | loss 4.3126 | ppl 74.6 | lr 1.38e-04 | grad 2.980 | tok/s 57 | seen 60.3M | |
| step 240 | loss 4.1675 | ppl 64.6 | lr 1.44e-04 | grad 2.058 | tok/s 57 | seen 62.9M | |
| step 250 | loss 4.0849 | ppl 59.4 | lr 1.50e-04 | grad 1.500 | tok/s 57 | seen 65.5M | |
| step 260 | loss 3.9831 | ppl 53.7 | lr 1.56e-04 | grad 3.392 | tok/s 57 | seen 68.2M | |
| step 270 | loss 3.9163 | ppl 50.2 | lr 1.62e-04 | grad 1.930 | tok/s 57 | seen 70.8M | |
| step 280 | loss 3.8393 | ppl 46.5 | lr 1.68e-04 | grad 3.109 | tok/s 57 | seen 73.4M | |
| step 290 | loss 3.7062 | ppl 40.7 | lr 1.74e-04 | grad 2.461 | tok/s 57 | seen 76.0M | |
| step 300 | loss 3.6852 | ppl 39.9 | lr 1.80e-04 | grad 2.153 | tok/s 57 | seen 78.6M | |
| step 310 | loss 3.5779 | ppl 35.8 | lr 1.86e-04 | grad 2.183 | tok/s 57 | seen 81.3M | |
| step 320 | loss 3.5259 | ppl 34.0 | lr 1.92e-04 | grad 2.756 | tok/s 57 | seen 83.9M | |
| step 330 | loss 3.3646 | ppl 28.9 | lr 1.98e-04 | grad 2.202 | tok/s 57 | seen 86.5M | |
| step 340 | loss 3.4045 | ppl 30.1 | lr 2.04e-04 | grad 2.430 | tok/s 57 | seen 89.1M | |
| step 350 | loss 3.3366 | ppl 28.1 | lr 2.10e-04 | grad 1.865 | tok/s 57 | seen 91.8M | |
| step 360 | loss 3.3037 | ppl 27.2 | lr 2.16e-04 | grad 2.192 | tok/s 57 | seen 94.4M | |
| step 370 | loss 3.2403 | ppl 25.5 | lr 2.22e-04 | grad 2.205 | tok/s 57 | seen 97.0M | |
| step 380 | loss 3.2429 | ppl 25.6 | lr 2.28e-04 | grad 1.899 | tok/s 57 | seen 99.6M | |
| step 390 | loss 3.1716 | ppl 23.8 | lr 2.34e-04 | grad 1.634 | tok/s 57 | seen 102.2M | |
| step 400 | loss 3.0589 | ppl 21.3 | lr 2.40e-04 | grad 1.636 | tok/s 57 | seen 104.9M | |
| step 410 | loss 3.0145 | ppl 20.4 | lr 2.46e-04 | grad 1.776 | tok/s 56 | seen 107.5M | |
| step 420 | loss 2.9357 | ppl 18.8 | lr 2.52e-04 | grad 1.861 | tok/s 57 | seen 110.1M | |
| step 430 | loss 2.9177 | ppl 18.5 | lr 2.58e-04 | grad 1.613 | tok/s 57 | seen 112.7M | |
| step 440 | loss 2.9252 | ppl 18.6 | lr 2.64e-04 | grad 2.105 | tok/s 57 | seen 115.3M | |
| step 450 | loss 2.8453 | ppl 17.2 | lr 2.70e-04 | grad 1.509 | tok/s 57 | seen 118.0M | |
| step 460 | loss 2.7901 | ppl 16.3 | lr 2.76e-04 | grad 1.697 | tok/s 57 | seen 120.6M | |
| step 470 | loss 2.7561 | ppl 15.7 | lr 2.82e-04 | grad 1.414 | tok/s 57 | seen 123.2M | |
| step 480 | loss 2.7792 | ppl 16.1 | lr 2.88e-04 | grad 1.693 | tok/s 57 | seen 125.8M | |
| step 490 | loss 2.7021 | ppl 14.9 | lr 2.94e-04 | grad 1.478 | tok/s 59 | seen 128.5M | |
| step 500 | loss 2.6292 | ppl 13.9 | lr 3.00e-04 | grad 1.454 | tok/s 57 | seen 131.1M | |
| step 510 | loss 2.5968 | ppl 13.4 | lr 3.00e-04 | grad 1.974 | tok/s 57 | seen 133.7M | |
| step 520 | loss 2.6079 | ppl 13.6 | lr 3.00e-04 | grad 1.365 | tok/s 57 | seen 136.3M | |
| step 530 | loss 2.5736 | ppl 13.1 | lr 3.00e-04 | grad 1.310 | tok/s 57 | seen 138.9M | |
| step 540 | loss 2.5335 | ppl 12.6 | lr 3.00e-04 | grad 1.282 | tok/s 57 | seen 141.6M | |
| step 550 | loss 2.4480 | ppl 11.6 | lr 3.00e-04 | grad 1.320 | tok/s 57 | seen 144.2M | |
| step 560 | loss 2.4519 | ppl 11.6 | lr 3.00e-04 | grad 1.305 | tok/s 57 | seen 146.8M | |
| step 570 | loss 2.3955 | ppl 11.0 | lr 3.00e-04 | grad 1.241 | tok/s 57 | seen 149.4M | |
| step 580 | loss 2.4332 | ppl 11.4 | lr 3.00e-04 | grad 1.240 | tok/s 57 | seen 152.0M | |
| step 590 | loss 2.2697 | ppl 9.7 | lr 3.00e-04 | grad 1.220 | tok/s 57 | seen 154.7M | |
| step 600 | loss 2.3130 | ppl 10.1 | lr 2.99e-04 | grad 1.275 | tok/s 57 | seen 157.3M | |
| step 610 | loss 2.2905 | ppl 9.9 | lr 2.99e-04 | grad 1.158 | tok/s 57 | seen 159.9M | |
| step 620 | loss 2.3017 | ppl 10.0 | lr 2.99e-04 | grad 1.180 | tok/s 57 | seen 162.5M | |
| step 630 | loss 2.2533 | ppl 9.5 | lr 2.99e-04 | grad 1.144 | tok/s 57 | seen 165.2M | |
| step 640 | loss 2.1677 | ppl 8.7 | lr 2.99e-04 | grad 1.105 | tok/s 57 | seen 167.8M | |
| step 650 | loss 2.1335 | ppl 8.4 | lr 2.99e-04 | grad 1.114 | tok/s 57 | seen 170.4M | |
| step 660 | loss 2.2124 | ppl 9.1 | lr 2.99e-04 | grad 1.085 | tok/s 57 | seen 173.0M | |
| step 670 | loss 2.0965 | ppl 8.1 | lr 2.98e-04 | grad 1.063 | tok/s 57 | seen 175.6M | |
| step 680 | loss 2.1743 | ppl 8.8 | lr 2.98e-04 | grad 1.046 | tok/s 57 | seen 178.3M | |
| step 690 | loss 2.0715 | ppl 7.9 | lr 2.98e-04 | grad 0.948 | tok/s 57 | seen 180.9M | |
| step 700 | loss 2.0708 | ppl 7.9 | lr 2.98e-04 | grad 0.948 | tok/s 57 | seen 183.5M | |
| step 710 | loss 2.0653 | ppl 7.9 | lr 2.98e-04 | grad 0.915 | tok/s 57 | seen 186.1M | |
| step 720 | loss 2.0174 | ppl 7.5 | lr 2.97e-04 | grad 1.009 | tok/s 57 | seen 188.7M | |
| step 730 | loss 2.0367 | ppl 7.7 | lr 2.97e-04 | grad 1.007 | tok/s 57 | seen 191.4M | |
| step 740 | loss 2.0274 | ppl 7.6 | lr 2.97e-04 | grad 0.976 | tok/s 57 | seen 194.0M | |
| step 750 | loss 1.9941 | ppl 7.3 | lr 2.97e-04 | grad 0.884 | tok/s 57 | seen 196.6M | |
| step 760 | loss 2.0177 | ppl 7.5 | lr 2.96e-04 | grad 0.920 | tok/s 57 | seen 199.2M | |
| step 770 | loss 1.9504 | ppl 7.0 | lr 2.96e-04 | grad 1.167 | tok/s 57 | seen 201.9M | |
| step 780 | loss 1.9398 | ppl 7.0 | lr 2.96e-04 | grad 0.958 | tok/s 57 | seen 204.5M | |
| step 790 | loss 1.9348 | ppl 6.9 | lr 2.95e-04 | grad 0.792 | tok/s 57 | seen 207.1M | |
| step 800 | loss 1.9556 | ppl 7.1 | lr 2.95e-04 | grad 0.818 | tok/s 57 | seen 209.7M | |
| step 810 | loss 1.8964 | ppl 6.7 | lr 2.95e-04 | grad 0.849 | tok/s 55 | seen 212.3M | |
| step 820 | loss 1.9287 | ppl 6.9 | lr 2.94e-04 | grad 0.928 | tok/s 57 | seen 215.0M | |
| step 830 | loss 1.8769 | ppl 6.5 | lr 2.94e-04 | grad 0.913 | tok/s 57 | seen 217.6M | |
| step 840 | loss 1.8397 | ppl 6.3 | lr 2.94e-04 | grad 0.796 | tok/s 57 | seen 220.2M | |
| step 850 | loss 1.8810 | ppl 6.6 | lr 2.93e-04 | grad 0.809 | tok/s 57 | seen 222.8M | |
| step 860 | loss 1.8435 | ppl 6.3 | lr 2.93e-04 | grad 0.754 | tok/s 57 | seen 225.4M | |
| step 870 | loss 1.8302 | ppl 6.2 | lr 2.93e-04 | grad 0.768 | tok/s 57 | seen 228.1M | |
| step 880 | loss 1.8024 | ppl 6.1 | lr 2.92e-04 | grad 0.768 | tok/s 57 | seen 230.7M | |
| step 890 | loss 1.8353 | ppl 6.3 | lr 2.92e-04 | grad 0.726 | tok/s 57 | seen 233.3M | |
| step 900 | loss 1.8259 | ppl 6.2 | lr 2.91e-04 | grad 0.745 | tok/s 57 | seen 235.9M | |
| step 910 | loss 1.8235 | ppl 6.2 | lr 2.91e-04 | grad 0.760 | tok/s 57 | seen 238.6M | |
| step 920 | loss 1.7810 | ppl 5.9 | lr 2.91e-04 | grad 0.717 | tok/s 57 | seen 241.2M | |
| step 930 | loss 1.7450 | ppl 5.7 | lr 2.90e-04 | grad 0.813 | tok/s 57 | seen 243.8M | |
| step 940 | loss 1.8123 | ppl 6.1 | lr 2.90e-04 | grad 0.781 | tok/s 57 | seen 246.4M | |
| step 950 | loss 1.8226 | ppl 6.2 | lr 2.89e-04 | grad 0.763 | tok/s 57 | seen 249.0M | |
| step 960 | loss 1.7667 | ppl 5.9 | lr 2.89e-04 | grad 0.798 | tok/s 57 | seen 251.7M | |
| step 970 | loss 1.7453 | ppl 5.7 | lr 2.88e-04 | grad 0.699 | tok/s 57 | seen 254.3M | |
| step 980 | loss 1.7293 | ppl 5.6 | lr 2.88e-04 | grad 0.824 | tok/s 57 | seen 256.9M | |
| step 990 | loss 1.7747 | ppl 5.9 | lr 2.87e-04 | grad 0.773 | tok/s 57 | seen 259.5M | |
| step 1,000 | loss 1.7364 | ppl 5.7 | lr 2.87e-04 | grad 0.689 | tok/s 57 | seen 262.1M | |
| step 1,010 | loss 1.7449 | ppl 5.7 | lr 2.86e-04 | grad 0.741 | tok/s 57 | seen 264.8M | |
| step 1,020 | loss 1.7394 | ppl 5.7 | lr 2.86e-04 | grad 0.694 | tok/s 57 | seen 267.4M | |
| step 1,030 | loss 1.7321 | ppl 5.7 | lr 2.85e-04 | grad 0.676 | tok/s 57 | seen 270.0M | |
| step 1,040 | loss 1.7541 | ppl 5.8 | lr 2.84e-04 | grad 0.741 | tok/s 57 | seen 272.6M | |
| step 1,050 | loss 1.7101 | ppl 5.5 | lr 2.84e-04 | grad 0.687 | tok/s 57 | seen 275.3M | |
| step 1,060 | loss 1.7167 | ppl 5.6 | lr 2.83e-04 | grad 0.690 | tok/s 57 | seen 277.9M | |
| step 1,070 | loss 1.6940 | ppl 5.4 | lr 2.83e-04 | grad 0.655 | tok/s 57 | seen 280.5M | |
| step 1,080 | loss 1.6732 | ppl 5.3 | lr 2.82e-04 | grad 0.852 | tok/s 57 | seen 283.1M | |
| step 1,090 | loss 1.6762 | ppl 5.3 | lr 2.82e-04 | grad 0.662 | tok/s 57 | seen 285.7M | |
| step 1,100 | loss 1.7399 | ppl 5.7 | lr 2.81e-04 | grad 0.688 | tok/s 57 | seen 288.4M | |
| step 1,110 | loss 1.7198 | ppl 5.6 | lr 2.80e-04 | grad 0.683 | tok/s 57 | seen 291.0M | |
| step 1,120 | loss 1.6600 | ppl 5.3 | lr 2.80e-04 | grad 0.636 | tok/s 57 | seen 293.6M | |
| step 1,130 | loss 1.6766 | ppl 5.3 | lr 2.79e-04 | grad 0.704 | tok/s 57 | seen 296.2M | |
| step 1,140 | loss 1.7074 | ppl 5.5 | lr 2.78e-04 | grad 0.641 | tok/s 57 | seen 298.8M | |
| step 1,150 | loss 1.6967 | ppl 5.5 | lr 2.78e-04 | grad 0.695 | tok/s 57 | seen 301.5M | |
| step 1,160 | loss 1.6265 | ppl 5.1 | lr 2.77e-04 | grad 0.663 | tok/s 57 | seen 304.1M | |
| step 1,170 | loss 1.6746 | ppl 5.3 | lr 2.76e-04 | grad 0.658 | tok/s 57 | seen 306.7M | |
| step 1,180 | loss 1.6714 | ppl 5.3 | lr 2.76e-04 | grad 0.700 | tok/s 57 | seen 309.3M | |
| step 1,190 | loss 1.6384 | ppl 5.1 | lr 2.75e-04 | grad 0.626 | tok/s 57 | seen 312.0M | |
| step 1,200 | loss 1.5998 | ppl 5.0 | lr 2.74e-04 | grad 0.635 | tok/s 57 | seen 314.6M | |
| step 1,210 | loss 1.6109 | ppl 5.0 | lr 2.74e-04 | grad 0.647 | tok/s 57 | seen 317.2M | |
| step 1,220 | loss 1.5746 | ppl 4.8 | lr 2.73e-04 | grad 0.666 | tok/s 57 | seen 319.8M | |
| step 1,230 | loss 1.5716 | ppl 4.8 | lr 2.72e-04 | grad 0.637 | tok/s 57 | seen 322.4M | |
| step 1,240 | loss 1.6181 | ppl 5.0 | lr 2.71e-04 | grad 0.627 | tok/s 58 | seen 325.1M | |
| step 1,250 | loss 1.5479 | ppl 4.7 | lr 2.71e-04 | grad 0.623 | tok/s 58 | seen 327.7M | |
| step 1,260 | loss 1.6376 | ppl 5.1 | lr 2.70e-04 | grad 0.628 | tok/s 58 | seen 330.3M | |
| step 1,270 | loss 1.5593 | ppl 4.8 | lr 2.69e-04 | grad 0.602 | tok/s 58 | seen 332.9M | |
| step 1,280 | loss 1.5509 | ppl 4.7 | lr 2.68e-04 | grad 0.586 | tok/s 58 | seen 335.5M | |
| step 1,290 | loss 1.6246 | ppl 5.1 | lr 2.67e-04 | grad 0.649 | tok/s 58 | seen 338.2M | |
| step 1,300 | loss 1.5674 | ppl 4.8 | lr 2.67e-04 | grad 0.599 | tok/s 58 | seen 340.8M | |
| step 1,310 | loss 1.6015 | ppl 5.0 | lr 2.66e-04 | grad 0.655 | tok/s 58 | seen 343.4M | |
| step 1,320 | loss 1.5756 | ppl 4.8 | lr 2.65e-04 | grad 0.597 | tok/s 58 | seen 346.0M | |
| step 1,330 | loss 1.5090 | ppl 4.5 | lr 2.64e-04 | grad 0.571 | tok/s 58 | seen 348.7M | |
| step 1,340 | loss 1.5669 | ppl 4.8 | lr 2.63e-04 | grad 0.585 | tok/s 58 | seen 351.3M | |
| step 1,350 | loss 1.5667 | ppl 4.8 | lr 2.63e-04 | grad 0.600 | tok/s 58 | seen 353.9M | |
| step 1,360 | loss 1.5854 | ppl 4.9 | lr 2.62e-04 | grad 0.597 | tok/s 58 | seen 356.5M | |
| step 1,370 | loss 1.5606 | ppl 4.8 | lr 2.61e-04 | grad 0.612 | tok/s 58 | seen 359.1M | |
| step 1,380 | loss 1.5639 | ppl 4.8 | lr 2.60e-04 | grad 0.603 | tok/s 58 | seen 361.8M | |
| step 1,390 | loss 1.5274 | ppl 4.6 | lr 2.59e-04 | grad 0.618 | tok/s 58 | seen 364.4M | |
| step 1,400 | loss 1.5404 | ppl 4.7 | lr 2.58e-04 | grad 0.638 | tok/s 58 | seen 367.0M | |
| step 1,410 | loss 1.5314 | ppl 4.6 | lr 2.57e-04 | grad 0.598 | tok/s 58 | seen 369.6M | |
| step 1,420 | loss 1.5640 | ppl 4.8 | lr 2.57e-04 | grad 0.622 | tok/s 58 | seen 372.2M | |
| step 1,430 | loss 1.5395 | ppl 4.7 | lr 2.56e-04 | grad 0.562 | tok/s 58 | seen 374.9M | |
| step 1,440 | loss 1.5255 | ppl 4.6 | lr 2.55e-04 | grad 0.596 | tok/s 58 | seen 377.5M | |
| step 1,450 | loss 1.5312 | ppl 4.6 | lr 2.54e-04 | grad 0.612 | tok/s 58 | seen 380.1M | |
| step 1,460 | loss 1.5239 | ppl 4.6 | lr 2.53e-04 | grad 0.584 | tok/s 58 | seen 382.7M | |
| step 1,470 | loss 1.5683 | ppl 4.8 | lr 2.52e-04 | grad 0.640 | tok/s 59 | seen 385.4M | |
| step 1,480 | loss 1.4998 | ppl 4.5 | lr 2.51e-04 | grad 0.621 | tok/s 59 | seen 388.0M | |
| step 1,490 | loss 1.5534 | ppl 4.7 | lr 2.50e-04 | grad 0.629 | tok/s 59 | seen 390.6M | |
| step 1,500 | loss 1.4911 | ppl 4.4 | lr 2.49e-04 | grad 0.553 | tok/s 58 | seen 393.2M | |
| step 1,510 | loss 1.5040 | ppl 4.5 | lr 2.48e-04 | grad 0.578 | tok/s 59 | seen 395.8M | |
| step 1,520 | loss 1.4723 | ppl 4.4 | lr 2.47e-04 | grad 0.563 | tok/s 59 | seen 398.5M | |
| step 1,530 | loss 1.5121 | ppl 4.5 | lr 2.46e-04 | grad 0.594 | tok/s 58 | seen 401.1M | |
| step 1,540 | loss 1.5332 | ppl 4.6 | lr 2.45e-04 | grad 0.663 | tok/s 58 | seen 403.7M | |
| step 1,550 | loss 1.5393 | ppl 4.7 | lr 2.44e-04 | grad 0.586 | tok/s 58 | seen 406.3M | |
| step 1,560 | loss 1.4987 | ppl 4.5 | lr 2.43e-04 | grad 0.589 | tok/s 59 | seen 408.9M | |
| step 1,570 | loss 1.5277 | ppl 4.6 | lr 2.42e-04 | grad 0.535 | tok/s 58 | seen 411.6M | |
| step 1,580 | loss 1.5285 | ppl 4.6 | lr 2.41e-04 | grad 0.588 | tok/s 58 | seen 414.2M | |
| step 1,590 | loss 1.4689 | ppl 4.3 | lr 2.40e-04 | grad 0.575 | tok/s 59 | seen 416.8M | |
| step 1,600 | loss 1.4010 | ppl 4.1 | lr 2.39e-04 | grad 0.552 | tok/s 59 | seen 419.4M | |
| step 1,610 | loss 1.4277 | ppl 4.2 | lr 2.38e-04 | grad 0.522 | tok/s 58 | seen 422.1M | |
| step 1,620 | loss 1.4426 | ppl 4.2 | lr 2.37e-04 | grad 0.550 | tok/s 59 | seen 424.7M | |
| step 1,630 | loss 1.4743 | ppl 4.4 | lr 2.36e-04 | grad 0.614 | tok/s 59 | seen 427.3M | |
| step 1,640 | loss 1.4862 | ppl 4.4 | lr 2.35e-04 | grad 0.610 | tok/s 58 | seen 429.9M | |
| step 1,650 | loss 1.4728 | ppl 4.4 | lr 2.34e-04 | grad 0.556 | tok/s 58 | seen 432.5M | |
| step 1,660 | loss 1.4623 | ppl 4.3 | lr 2.33e-04 | grad 0.554 | tok/s 58 | seen 435.2M | |
| step 1,670 | loss 1.4911 | ppl 4.4 | lr 2.32e-04 | grad 0.560 | tok/s 59 | seen 437.8M | |
| step 1,680 | loss 1.4195 | ppl 4.1 | lr 2.31e-04 | grad 0.569 | tok/s 59 | seen 440.4M | |
| step 1,690 | loss 1.4594 | ppl 4.3 | lr 2.30e-04 | grad 0.577 | tok/s 59 | seen 443.0M | |
| step 1,700 | loss 1.4602 | ppl 4.3 | lr 2.29e-04 | grad 0.533 | tok/s 59 | seen 445.6M | |
| step 1,710 | loss 1.4384 | ppl 4.2 | lr 2.28e-04 | grad 0.552 | tok/s 59 | seen 448.3M | |
| step 1,720 | loss 1.4193 | ppl 4.1 | lr 2.27e-04 | grad 0.559 | tok/s 59 | seen 450.9M | |
| step 1,730 | loss 1.4588 | ppl 4.3 | lr 2.26e-04 | grad 0.571 | tok/s 59 | seen 453.5M | |
| step 1,740 | loss 1.4739 | ppl 4.4 | lr 2.25e-04 | grad 0.548 | tok/s 59 | seen 456.1M | |
| step 1,750 | loss 1.4628 | ppl 4.3 | lr 2.24e-04 | grad 0.555 | tok/s 59 | seen 458.8M | |
| step 1,760 | loss 1.4391 | ppl 4.2 | lr 2.22e-04 | grad 0.539 | tok/s 59 | seen 461.4M | |
| step 1,770 | loss 1.4450 | ppl 4.2 | lr 2.21e-04 | grad 0.526 | tok/s 59 | seen 464.0M | |
| step 1,780 | loss 1.3943 | ppl 4.0 | lr 2.20e-04 | grad 0.534 | tok/s 59 | seen 466.6M | |
| step 1,790 | loss 1.4525 | ppl 4.3 | lr 2.19e-04 | grad 0.525 | tok/s 59 | seen 469.2M | |
| step 1,800 | loss 1.4265 | ppl 4.2 | lr 2.18e-04 | grad 0.567 | tok/s 59 | seen 471.9M | |
| step 1,810 | loss 1.4591 | ppl 4.3 | lr 2.17e-04 | grad 0.532 | tok/s 58 | seen 474.5M | |
| step 1,820 | loss 1.4127 | ppl 4.1 | lr 2.16e-04 | grad 0.534 | tok/s 59 | seen 477.1M | |
| step 1,830 | loss 1.4135 | ppl 4.1 | lr 2.15e-04 | grad 0.547 | tok/s 59 | seen 479.7M | |
| step 1,840 | loss 1.4651 | ppl 4.3 | lr 2.14e-04 | grad 0.534 | tok/s 59 | seen 482.3M | |
| step 1,850 | loss 1.4555 | ppl 4.3 | lr 2.12e-04 | grad 0.543 | tok/s 59 | seen 485.0M | |
| step 1,860 | loss 1.4228 | ppl 4.1 | lr 2.11e-04 | grad 0.527 | tok/s 59 | seen 487.6M | |
| step 1,870 | loss 1.4268 | ppl 4.2 | lr 2.10e-04 | grad 0.546 | tok/s 59 | seen 490.2M | |
| step 1,880 | loss 1.4224 | ppl 4.1 | lr 2.09e-04 | grad 0.556 | tok/s 59 | seen 492.8M | |
| step 1,890 | loss 1.4267 | ppl 4.2 | lr 2.08e-04 | grad 0.536 | tok/s 59 | seen 495.5M | |
| step 1,900 | loss 1.4466 | ppl 4.2 | lr 2.07e-04 | grad 0.535 | tok/s 59 | seen 498.1M | |
| step 1,910 | loss 1.4374 | ppl 4.2 | lr 2.06e-04 | grad 0.570 | tok/s 59 | seen 500.7M | |
| step 1,920 | loss 1.4744 | ppl 4.4 | lr 2.04e-04 | grad 0.543 | tok/s 59 | seen 503.3M | |
| step 1,930 | loss 1.4350 | ppl 4.2 | lr 2.03e-04 | grad 0.520 | tok/s 59 | seen 505.9M | |
| step 1,940 | loss 1.4270 | ppl 4.2 | lr 2.02e-04 | grad 0.525 | tok/s 59 | seen 508.6M | |
| step 1,950 | loss 1.4058 | ppl 4.1 | lr 2.01e-04 | grad 0.537 | tok/s 59 | seen 511.2M | |
| step 1,960 | loss 1.3876 | ppl 4.0 | lr 2.00e-04 | grad 0.535 | tok/s 59 | seen 513.8M | |
| step 1,970 | loss 1.4374 | ppl 4.2 | lr 1.99e-04 | grad 0.534 | tok/s 59 | seen 516.4M | |
| step 1,980 | loss 1.4111 | ppl 4.1 | lr 1.97e-04 | grad 0.526 | tok/s 59 | seen 519.0M | |
| step 1,990 | loss 1.3803 | ppl 4.0 | lr 1.96e-04 | grad 0.543 | tok/s 61 | seen 521.7M | |
| step 2,000 | loss 1.3959 | ppl 4.0 | lr 1.95e-04 | grad 0.545 | tok/s 62 | seen 524.3M | |
| step 2,010 | loss 1.3872 | ppl 4.0 | lr 1.94e-04 | grad 0.545 | tok/s 62 | seen 526.9M | |
| step 2,020 | loss 1.3285 | ppl 3.8 | lr 1.93e-04 | grad 0.521 | tok/s 62 | seen 529.5M | |
| step 2,030 | loss 1.3978 | ppl 4.0 | lr 1.91e-04 | grad 0.533 | tok/s 62 | seen 532.2M | |
| step 2,040 | loss 1.3695 | ppl 3.9 | lr 1.90e-04 | grad 0.520 | tok/s 63 | seen 534.8M | |
| step 2,050 | loss 1.3594 | ppl 3.9 | lr 1.89e-04 | grad 0.529 | tok/s 63 | seen 537.4M | |
| step 2,060 | loss 1.3945 | ppl 4.0 | lr 1.88e-04 | grad 0.527 | tok/s 63 | seen 540.0M | |
| step 2,070 | loss 1.3783 | ppl 4.0 | lr 1.87e-04 | grad 0.526 | tok/s 63 | seen 542.6M | |
| step 2,080 | loss 1.3849 | ppl 4.0 | lr 1.86e-04 | grad 0.522 | tok/s 63 | seen 545.3M | |
| step 2,090 | loss 1.3592 | ppl 3.9 | lr 1.84e-04 | grad 0.528 | tok/s 63 | seen 547.9M | |
| step 2,100 | loss 1.3811 | ppl 4.0 | lr 1.83e-04 | grad 0.543 | tok/s 63 | seen 550.5M | |
| step 2,110 | loss 1.3528 | ppl 3.9 | lr 1.82e-04 | grad 0.511 | tok/s 63 | seen 553.1M | |
| step 2,120 | loss 1.4071 | ppl 4.1 | lr 1.81e-04 | grad 0.534 | tok/s 63 | seen 555.7M | |
| step 2,130 | loss 1.3711 | ppl 3.9 | lr 1.80e-04 | grad 0.518 | tok/s 63 | seen 558.4M | |
| step 2,140 | loss 1.3426 | ppl 3.8 | lr 1.78e-04 | grad 0.525 | tok/s 63 | seen 561.0M | |
| step 2,150 | loss 1.3526 | ppl 3.9 | lr 1.77e-04 | grad 0.527 | tok/s 63 | seen 563.6M | |
| step 2,160 | loss 1.3453 | ppl 3.8 | lr 1.76e-04 | grad 0.538 | tok/s 63 | seen 566.2M | |
| step 2,170 | loss 1.3362 | ppl 3.8 | lr 1.75e-04 | grad 0.516 | tok/s 63 | seen 568.9M | |
| step 2,180 | loss 1.3613 | ppl 3.9 | lr 1.73e-04 | grad 0.549 | tok/s 63 | seen 571.5M | |
| step 2,190 | loss 1.3799 | ppl 4.0 | lr 1.72e-04 | grad 0.517 | tok/s 63 | seen 574.1M | |
| step 2,200 | loss 1.3588 | ppl 3.9 | lr 1.71e-04 | grad 0.525 | tok/s 63 | seen 576.7M | |
| step 2,210 | loss 1.4289 | ppl 4.2 | lr 1.70e-04 | grad 0.552 | tok/s 63 | seen 579.3M | |
| step 2,220 | loss 1.3540 | ppl 3.9 | lr 1.69e-04 | grad 0.522 | tok/s 63 | seen 582.0M | |
| step 2,230 | loss 1.3525 | ppl 3.9 | lr 1.67e-04 | grad 0.537 | tok/s 63 | seen 584.6M | |
| step 2,240 | loss 1.3547 | ppl 3.9 | lr 1.66e-04 | grad 0.521 | tok/s 63 | seen 587.2M | |
| step 2,250 | loss 1.3762 | ppl 4.0 | lr 1.65e-04 | grad 0.510 | tok/s 63 | seen 589.8M | |
| step 2,260 | loss 1.3351 | ppl 3.8 | lr 1.64e-04 | grad 0.511 | tok/s 63 | seen 592.4M | |
| step 2,270 | loss 1.3659 | ppl 3.9 | lr 1.63e-04 | grad 0.519 | tok/s 63 | seen 595.1M | |
| step 2,280 | loss 1.3509 | ppl 3.9 | lr 1.61e-04 | grad 0.520 | tok/s 63 | seen 597.7M | |
| step 2,290 | loss 1.3708 | ppl 3.9 | lr 1.60e-04 | grad 0.526 | tok/s 63 | seen 600.3M | |
| step 2,300 | loss 1.3411 | ppl 3.8 | lr 1.59e-04 | grad 0.497 | tok/s 63 | seen 602.9M | |
| step 2,310 | loss 1.3410 | ppl 3.8 | lr 1.58e-04 | grad 0.528 | tok/s 63 | seen 605.6M | |
| step 2,320 | loss 1.3388 | ppl 3.8 | lr 1.57e-04 | grad 0.511 | tok/s 64 | seen 608.2M | |
| step 2,330 | loss 1.3340 | ppl 3.8 | lr 1.55e-04 | grad 0.499 | tok/s 63 | seen 610.8M | |
| step 2,340 | loss 1.3439 | ppl 3.8 | lr 1.54e-04 | grad 0.512 | tok/s 63 | seen 613.4M | |
| step 2,350 | loss 1.3647 | ppl 3.9 | lr 1.53e-04 | grad 0.519 | tok/s 63 | seen 616.0M | |
| step 2,360 | loss 1.3385 | ppl 3.8 | lr 1.52e-04 | grad 0.490 | tok/s 63 | seen 618.7M | |
| step 2,370 | loss 1.3095 | ppl 3.7 | lr 1.50e-04 | grad 0.506 | tok/s 63 | seen 621.3M | |
| step 2,380 | loss 1.3752 | ppl 4.0 | lr 1.49e-04 | grad 0.510 | tok/s 63 | seen 623.9M | |
| step 2,390 | loss 1.3228 | ppl 3.8 | lr 1.48e-04 | grad 0.510 | tok/s 64 | seen 626.5M | |
| step 2,400 | loss 1.3089 | ppl 3.7 | lr 1.47e-04 | grad 0.535 | tok/s 64 | seen 629.1M | |
| step 2,410 | loss 1.2863 | ppl 3.6 | lr 1.46e-04 | grad 0.499 | tok/s 63 | seen 631.8M | |
| step 2,420 | loss 1.3000 | ppl 3.7 | lr 1.44e-04 | grad 0.517 | tok/s 64 | seen 634.4M | |
| step 2,430 | loss 1.2946 | ppl 3.6 | lr 1.43e-04 | grad 0.512 | tok/s 64 | seen 637.0M | |
| step 2,440 | loss 1.2932 | ppl 3.6 | lr 1.42e-04 | grad 0.558 | tok/s 64 | seen 639.6M | |
| step 2,450 | loss 1.3527 | ppl 3.9 | lr 1.41e-04 | grad 0.523 | tok/s 64 | seen 642.3M | |
| step 2,460 | loss 1.3005 | ppl 3.7 | lr 1.40e-04 | grad 0.522 | tok/s 64 | seen 644.9M | |
| step 2,470 | loss 1.2616 | ppl 3.5 | lr 1.39e-04 | grad 0.507 | tok/s 64 | seen 647.5M | |
| step 2,480 | loss 1.3170 | ppl 3.7 | lr 1.37e-04 | grad 0.543 | tok/s 64 | seen 650.1M | |
| step 2,490 | loss 1.3067 | ppl 3.7 | lr 1.36e-04 | grad 0.529 | tok/s 64 | seen 652.7M | |
| step 2,500 | loss 1.3279 | ppl 3.8 | lr 1.35e-04 | grad 0.519 | tok/s 64 | seen 655.4M | |
| step 2,510 | loss 1.3272 | ppl 3.8 | lr 1.34e-04 | grad 0.501 | tok/s 64 | seen 658.0M | |
| step 2,520 | loss 1.2764 | ppl 3.6 | lr 1.33e-04 | grad 0.502 | tok/s 64 | seen 660.6M | |
| step 2,530 | loss 1.2892 | ppl 3.6 | lr 1.31e-04 | grad 0.546 | tok/s 64 | seen 663.2M | |
| step 2,540 | loss 1.3309 | ppl 3.8 | lr 1.30e-04 | grad 0.517 | tok/s 64 | seen 665.8M | |
| step 2,550 | loss 1.3253 | ppl 3.8 | lr 1.29e-04 | grad 0.551 | tok/s 64 | seen 668.5M | |
| step 2,560 | loss 1.3246 | ppl 3.8 | lr 1.28e-04 | grad 0.512 | tok/s 64 | seen 671.1M | |
| step 2,570 | loss 1.3187 | ppl 3.7 | lr 1.27e-04 | grad 0.521 | tok/s 64 | seen 673.7M | |
| step 2,580 | loss 1.3006 | ppl 3.7 | lr 1.26e-04 | grad 0.513 | tok/s 64 | seen 676.3M | |
| step 2,590 | loss 1.3258 | ppl 3.8 | lr 1.24e-04 | grad 0.519 | tok/s 63 | seen 679.0M | |
| step 2,600 | loss 1.2560 | ppl 3.5 | lr 1.23e-04 | grad 0.508 | tok/s 64 | seen 681.6M | |
| step 2,610 | loss 1.3475 | ppl 3.8 | lr 1.22e-04 | grad 0.511 | tok/s 64 | seen 684.2M | |
| step 2,620 | loss 1.2918 | ppl 3.6 | lr 1.21e-04 | grad 0.522 | tok/s 64 | seen 686.8M | |
| step 2,630 | loss 1.3070 | ppl 3.7 | lr 1.20e-04 | grad 0.497 | tok/s 64 | seen 689.4M | |
| step 2,640 | loss 1.3027 | ppl 3.7 | lr 1.19e-04 | grad 0.513 | tok/s 64 | seen 692.1M | |
| step 2,650 | loss 1.2837 | ppl 3.6 | lr 1.18e-04 | grad 0.518 | tok/s 64 | seen 694.7M | |
| step 2,660 | loss 1.3582 | ppl 3.9 | lr 1.16e-04 | grad 0.519 | tok/s 64 | seen 697.3M | |
| step 2,670 | loss 1.3361 | ppl 3.8 | lr 1.15e-04 | grad 0.523 | tok/s 64 | seen 699.9M | |
| step 2,680 | loss 1.2806 | ppl 3.6 | lr 1.14e-04 | grad 0.506 | tok/s 64 | seen 702.5M | |
| step 2,690 | loss 1.2838 | ppl 3.6 | lr 1.13e-04 | grad 0.530 | tok/s 64 | seen 705.2M | |
| step 2,700 | loss 1.2396 | ppl 3.5 | lr 1.12e-04 | grad 0.499 | tok/s 64 | seen 707.8M | |
| step 2,710 | loss 1.2650 | ppl 3.5 | lr 1.11e-04 | grad 0.511 | tok/s 64 | seen 710.4M | |
| step 2,720 | loss 1.2956 | ppl 3.7 | lr 1.10e-04 | grad 0.524 | tok/s 64 | seen 713.0M | |
| step 2,730 | loss 1.2879 | ppl 3.6 | lr 1.09e-04 | grad 0.509 | tok/s 64 | seen 715.7M | |
| step 2,740 | loss 1.2923 | ppl 3.6 | lr 1.08e-04 | grad 0.504 | tok/s 64 | seen 718.3M | |
| step 2,750 | loss 1.2578 | ppl 3.5 | lr 1.06e-04 | grad 0.503 | tok/s 64 | seen 720.9M | |
| step 2,760 | loss 1.2621 | ppl 3.5 | lr 1.05e-04 | grad 0.525 | tok/s 64 | seen 723.5M | |
| step 2,770 | loss 1.3129 | ppl 3.7 | lr 1.04e-04 | grad 0.523 | tok/s 64 | seen 726.1M | |
| step 2,780 | loss 1.2588 | ppl 3.5 | lr 1.03e-04 | grad 0.513 | tok/s 64 | seen 728.8M | |
| step 2,790 | loss 1.2878 | ppl 3.6 | lr 1.02e-04 | grad 0.501 | tok/s 64 | seen 731.4M | |
| step 2,800 | loss 1.3137 | ppl 3.7 | lr 1.01e-04 | grad 0.524 | tok/s 64 | seen 734.0M | |
| step 2,810 | loss 1.2240 | ppl 3.4 | lr 1.00e-04 | grad 0.487 | tok/s 63 | seen 736.6M | |
| step 2,820 | loss 1.2051 | ppl 3.3 | lr 9.89e-05 | grad 0.509 | tok/s 64 | seen 739.2M | |
| step 2,830 | loss 1.2754 | ppl 3.6 | lr 9.79e-05 | grad 0.524 | tok/s 64 | seen 741.9M | |
| step 2,840 | loss 1.2595 | ppl 3.5 | lr 9.68e-05 | grad 0.514 | tok/s 64 | seen 744.5M | |
| step 2,850 | loss 1.2563 | ppl 3.5 | lr 9.58e-05 | grad 0.516 | tok/s 64 | seen 747.1M | |
| step 2,860 | loss 1.2484 | ppl 3.5 | lr 9.47e-05 | grad 0.521 | tok/s 64 | seen 749.7M | |
| step 2,870 | loss 1.2693 | ppl 3.6 | lr 9.37e-05 | grad 0.530 | tok/s 64 | seen 752.4M | |
| step 2,880 | loss 1.2784 | ppl 3.6 | lr 9.27e-05 | grad 0.523 | tok/s 64 | seen 755.0M | |
| step 2,890 | loss 1.2588 | ppl 3.5 | lr 9.16e-05 | grad 0.519 | tok/s 64 | seen 757.6M | |
| step 2,900 | loss 1.2644 | ppl 3.5 | lr 9.06e-05 | grad 0.511 | tok/s 64 | seen 760.2M | |
| step 2,910 | loss 1.2606 | ppl 3.5 | lr 8.96e-05 | grad 0.516 | tok/s 64 | seen 762.8M | |
| step 2,920 | loss 1.2443 | ppl 3.5 | lr 8.86e-05 | grad 0.513 | tok/s 64 | seen 765.5M | |
| step 2,930 | loss 1.2588 | ppl 3.5 | lr 8.76e-05 | grad 0.525 | tok/s 64 | seen 768.1M | |
| step 2,940 | loss 1.2786 | ppl 3.6 | lr 8.66e-05 | grad 0.510 | tok/s 64 | seen 770.7M | |
| step 2,950 | loss 1.3157 | ppl 3.7 | lr 8.56e-05 | grad 0.528 | tok/s 64 | seen 773.3M | |
| step 2,960 | loss 1.2825 | ppl 3.6 | lr 8.47e-05 | grad 0.511 | tok/s 64 | seen 775.9M | |
| step 2,970 | loss 1.2700 | ppl 3.6 | lr 8.37e-05 | grad 0.508 | tok/s 64 | seen 778.6M | |
| step 2,980 | loss 1.2741 | ppl 3.6 | lr 8.27e-05 | grad 0.522 | tok/s 64 | seen 781.2M | |
| step 2,990 | loss 1.2295 | ppl 3.4 | lr 8.18e-05 | grad 0.506 | tok/s 64 | seen 783.8M | |
| step 3,000 | loss 1.2523 | ppl 3.5 | lr 8.08e-05 | grad 0.510 | tok/s 64 | seen 786.4M | |
| step 3,010 | loss 1.2198 | ppl 3.4 | lr 7.99e-05 | grad 0.497 | tok/s 63 | seen 789.1M | |
| step 3,020 | loss 1.2821 | ppl 3.6 | lr 7.89e-05 | grad 0.514 | tok/s 64 | seen 791.7M | |
| step 3,030 | loss 1.2176 | ppl 3.4 | lr 7.80e-05 | grad 0.521 | tok/s 64 | seen 794.3M | |
| step 3,040 | loss 1.2382 | ppl 3.4 | lr 7.71e-05 | grad 0.508 | tok/s 64 | seen 796.9M | |
| step 3,050 | loss 1.2500 | ppl 3.5 | lr 7.62e-05 | grad 0.512 | tok/s 64 | seen 799.5M | |
| step 3,060 | loss 1.2412 | ppl 3.5 | lr 7.53e-05 | grad 0.501 | tok/s 64 | seen 802.2M | |
| step 3,070 | loss 1.2596 | ppl 3.5 | lr 7.44e-05 | grad 0.505 | tok/s 64 | seen 804.8M | |
| step 3,080 | loss 1.2248 | ppl 3.4 | lr 7.35e-05 | grad 0.502 | tok/s 64 | seen 807.4M | |
| step 3,090 | loss 1.2279 | ppl 3.4 | lr 7.26e-05 | grad 0.506 | tok/s 64 | seen 810.0M | |
| step 3,100 | loss 1.2373 | ppl 3.4 | lr 7.17e-05 | grad 0.503 | tok/s 64 | seen 812.6M | |
| step 3,110 | loss 1.2502 | ppl 3.5 | lr 7.08e-05 | grad 0.523 | tok/s 64 | seen 815.3M | |
| step 3,120 | loss 1.2799 | ppl 3.6 | lr 7.00e-05 | grad 0.502 | tok/s 64 | seen 817.9M | |
| step 3,130 | loss 1.2858 | ppl 3.6 | lr 6.91e-05 | grad 0.511 | tok/s 64 | seen 820.5M | |
| step 3,140 | loss 1.2492 | ppl 3.5 | lr 6.83e-05 | grad 0.512 | tok/s 64 | seen 823.1M | |
| step 3,150 | loss 1.2335 | ppl 3.4 | lr 6.74e-05 | grad 0.501 | tok/s 64 | seen 825.8M | |
| step 3,160 | loss 1.2362 | ppl 3.4 | lr 6.66e-05 | grad 0.547 | tok/s 64 | seen 828.4M | |
| step 3,170 | loss 1.2696 | ppl 3.6 | lr 6.58e-05 | grad 0.519 | tok/s 63 | seen 831.0M | |
| step 3,180 | loss 1.2343 | ppl 3.4 | lr 6.49e-05 | grad 0.516 | tok/s 63 | seen 833.6M | |
| step 3,190 | loss 1.1984 | ppl 3.3 | lr 6.41e-05 | grad 0.502 | tok/s 63 | seen 836.2M | |
| step 3,200 | loss 1.2428 | ppl 3.5 | lr 6.33e-05 | grad 0.513 | tok/s 63 | seen 838.9M | |
| step 3,210 | loss 1.2191 | ppl 3.4 | lr 6.25e-05 | grad 0.523 | tok/s 63 | seen 841.5M | |
| step 3,220 | loss 1.1735 | ppl 3.2 | lr 6.18e-05 | grad 0.505 | tok/s 63 | seen 844.1M | |
| step 3,230 | loss 1.2099 | ppl 3.4 | lr 6.10e-05 | grad 0.512 | tok/s 63 | seen 846.7M | |
| step 3,240 | loss 1.2177 | ppl 3.4 | lr 6.02e-05 | grad 0.514 | tok/s 63 | seen 849.3M | |
| step 3,250 | loss 1.2414 | ppl 3.5 | lr 5.95e-05 | grad 0.518 | tok/s 63 | seen 852.0M | |
| step 3,260 | loss 1.1982 | ppl 3.3 | lr 5.87e-05 | grad 0.502 | tok/s 63 | seen 854.6M | |
| step 3,270 | loss 1.1889 | ppl 3.3 | lr 5.80e-05 | grad 0.511 | tok/s 63 | seen 857.2M | |
| step 3,280 | loss 1.2147 | ppl 3.4 | lr 5.72e-05 | grad 0.518 | tok/s 63 | seen 859.8M | |
| step 3,290 | loss 1.2687 | ppl 3.6 | lr 5.65e-05 | grad 0.515 | tok/s 64 | seen 862.5M | |
| step 3,300 | loss 1.2240 | ppl 3.4 | lr 5.58e-05 | grad 0.505 | tok/s 64 | seen 865.1M | |
| step 3,310 | loss 1.2034 | ppl 3.3 | lr 5.51e-05 | grad 0.502 | tok/s 63 | seen 867.7M | |
| step 3,320 | loss 1.2319 | ppl 3.4 | lr 5.44e-05 | grad 0.515 | tok/s 63 | seen 870.3M | |
| step 3,330 | loss 1.2278 | ppl 3.4 | lr 5.37e-05 | grad 0.516 | tok/s 63 | seen 872.9M | |
| step 3,340 | loss 1.1976 | ppl 3.3 | lr 5.30e-05 | grad 0.521 | tok/s 63 | seen 875.6M | |
| step 3,350 | loss 1.2107 | ppl 3.4 | lr 5.23e-05 | grad 0.511 | tok/s 63 | seen 878.2M | |
| step 3,360 | loss 1.2313 | ppl 3.4 | lr 5.17e-05 | grad 0.505 | tok/s 63 | seen 880.8M | |
| step 3,370 | loss 1.2654 | ppl 3.5 | lr 5.10e-05 | grad 0.517 | tok/s 63 | seen 883.4M | |
| step 3,380 | loss 1.1916 | ppl 3.3 | lr 5.04e-05 | grad 0.505 | tok/s 63 | seen 886.0M | |
| step 3,390 | loss 1.2010 | ppl 3.3 | lr 4.97e-05 | grad 0.511 | tok/s 63 | seen 888.7M | |
| step 3,400 | loss 1.1997 | ppl 3.3 | lr 4.91e-05 | grad 0.533 | tok/s 63 | seen 891.3M | |
| step 3,410 | loss 1.2209 | ppl 3.4 | lr 4.85e-05 | grad 0.523 | tok/s 63 | seen 893.9M | |
| step 3,420 | loss 1.2224 | ppl 3.4 | lr 4.79e-05 | grad 0.508 | tok/s 63 | seen 896.5M | |
| step 3,430 | loss 1.2278 | ppl 3.4 | lr 4.73e-05 | grad 0.505 | tok/s 63 | seen 899.2M | |
| step 3,440 | loss 1.2105 | ppl 3.4 | lr 4.67e-05 | grad 0.504 | tok/s 63 | seen 901.8M | |
| step 3,450 | loss 1.1947 | ppl 3.3 | lr 4.61e-05 | grad 0.507 | tok/s 63 | seen 904.4M | |
| step 3,460 | loss 1.1913 | ppl 3.3 | lr 4.56e-05 | grad 0.508 | tok/s 63 | seen 907.0M | |
| step 3,470 | loss 1.2388 | ppl 3.5 | lr 4.50e-05 | grad 0.515 | tok/s 63 | seen 909.6M | |
| step 3,480 | loss 1.2580 | ppl 3.5 | lr 4.44e-05 | grad 0.515 | tok/s 63 | seen 912.3M | |
| step 3,490 | loss 1.2398 | ppl 3.5 | lr 4.39e-05 | grad 0.521 | tok/s 63 | seen 914.9M | |
| step 3,500 | loss 1.2240 | ppl 3.4 | lr 4.34e-05 | grad 0.509 | tok/s 63 | seen 917.5M | |
| step 3,510 | loss 1.2183 | ppl 3.4 | lr 4.28e-05 | grad 0.506 | tok/s 63 | seen 920.1M | |
| step 3,520 | loss 1.2292 | ppl 3.4 | lr 4.23e-05 | grad 0.503 | tok/s 63 | seen 922.7M | |
| step 3,530 | loss 1.1830 | ppl 3.3 | lr 4.18e-05 | grad 0.506 | tok/s 63 | seen 925.4M | |
| step 3,540 | loss 1.2100 | ppl 3.4 | lr 4.13e-05 | grad 0.521 | tok/s 63 | seen 928.0M | |
| step 3,550 | loss 1.2232 | ppl 3.4 | lr 4.09e-05 | grad 0.510 | tok/s 62 | seen 930.6M | |
| step 3,560 | loss 1.2414 | ppl 3.5 | lr 4.04e-05 | grad 0.521 | tok/s 63 | seen 933.2M | |
| step 3,570 | loss 1.2217 | ppl 3.4 | lr 3.99e-05 | grad 0.520 | tok/s 63 | seen 935.9M | |
| step 3,580 | loss 1.1973 | ppl 3.3 | lr 3.95e-05 | grad 0.506 | tok/s 63 | seen 938.5M | |
| step 3,590 | loss 1.2089 | ppl 3.3 | lr 3.90e-05 | grad 0.508 | tok/s 63 | seen 941.1M | |
| step 3,600 | loss 1.2372 | ppl 3.4 | lr 3.86e-05 | grad 0.522 | tok/s 63 | seen 943.7M | |
| step 3,610 | loss 1.2045 | ppl 3.3 | lr 3.82e-05 | grad 0.524 | tok/s 63 | seen 946.3M | |
| step 3,620 | loss 1.1954 | ppl 3.3 | lr 3.78e-05 | grad 0.517 | tok/s 63 | seen 949.0M | |
| step 3,630 | loss 1.2038 | ppl 3.3 | lr 3.74e-05 | grad 0.508 | tok/s 63 | seen 951.6M | |
| step 3,640 | loss 1.1971 | ppl 3.3 | lr 3.70e-05 | grad 0.511 | tok/s 63 | seen 954.2M | |
| step 3,650 | loss 1.2432 | ppl 3.5 | lr 3.66e-05 | grad 0.521 | tok/s 63 | seen 956.8M | |
| step 3,660 | loss 1.1866 | ppl 3.3 | lr 3.62e-05 | grad 0.517 | tok/s 63 | seen 959.4M | |
| step 3,670 | loss 1.2110 | ppl 3.4 | lr 3.59e-05 | grad 0.515 | tok/s 63 | seen 962.1M | |
| step 3,680 | loss 1.2066 | ppl 3.3 | lr 3.55e-05 | grad 0.522 | tok/s 63 | seen 964.7M | |
| step 3,690 | loss 1.1887 | ppl 3.3 | lr 3.52e-05 | grad 0.525 | tok/s 63 | seen 967.3M | |
| step 3,700 | loss 1.1960 | ppl 3.3 | lr 3.49e-05 | grad 0.509 | tok/s 63 | seen 969.9M | |
| step 3,710 | loss 1.1898 | ppl 3.3 | lr 3.45e-05 | grad 0.518 | tok/s 63 | seen 972.6M | |
| step 3,720 | loss 1.1965 | ppl 3.3 | lr 3.42e-05 | grad 0.518 | tok/s 63 | seen 975.2M | |
| step 3,730 | loss 1.2207 | ppl 3.4 | lr 3.39e-05 | grad 0.517 | tok/s 63 | seen 977.8M | |
| step 3,740 | loss 1.1882 | ppl 3.3 | lr 3.37e-05 | grad 0.527 | tok/s 63 | seen 980.4M | |
| step 3,750 | loss 1.2241 | ppl 3.4 | lr 3.34e-05 | grad 0.519 | tok/s 63 | seen 983.0M | |
| step 3,760 | loss 1.1975 | ppl 3.3 | lr 3.31e-05 | grad 0.512 | tok/s 62 | seen 985.7M | |
| step 3,770 | loss 1.1929 | ppl 3.3 | lr 3.29e-05 | grad 0.509 | tok/s 63 | seen 988.3M | |
| step 3,780 | loss 1.2036 | ppl 3.3 | lr 3.26e-05 | grad 0.516 | tok/s 62 | seen 990.9M | |
| step 3,790 | loss 1.1773 | ppl 3.2 | lr 3.24e-05 | grad 0.536 | tok/s 63 | seen 993.5M | |
| step 3,800 | loss 1.2273 | ppl 3.4 | lr 3.22e-05 | grad 0.519 | tok/s 63 | seen 996.1M | |
| step 3,810 | loss 1.1736 | ppl 3.2 | lr 3.20e-05 | grad 0.519 | tok/s 62 | seen 998.8M | |
| step 3,820 | loss 1.1894 | ppl 3.3 | lr 3.18e-05 | grad 0.515 | tok/s 63 | seen 1001.4M | |
| step 3,830 | loss 1.2044 | ppl 3.3 | lr 3.16e-05 | grad 0.516 | tok/s 63 | seen 1004.0M | |
| step 3,840 | loss 1.2339 | ppl 3.4 | lr 3.14e-05 | grad 0.523 | tok/s 63 | seen 1006.6M | |
| step 3,850 | loss 1.1948 | ppl 3.3 | lr 3.12e-05 | grad 0.516 | tok/s 63 | seen 1009.3M | |
| step 3,860 | loss 1.2080 | ppl 3.3 | lr 3.11e-05 | grad 0.532 | tok/s 63 | seen 1011.9M | |
| step 3,870 | loss 1.1916 | ppl 3.3 | lr 3.09e-05 | grad 0.522 | tok/s 62 | seen 1014.5M | |
| step 3,880 | loss 1.2145 | ppl 3.4 | lr 3.08e-05 | grad 0.526 | tok/s 62 | seen 1017.1M | |
| step 3,890 | loss 1.2064 | ppl 3.3 | lr 3.07e-05 | grad 0.525 | tok/s 62 | seen 1019.7M | |
| step 3,900 | loss 1.2042 | ppl 3.3 | lr 3.05e-05 | grad 0.522 | tok/s 62 | seen 1022.4M | |
| step 3,910 | loss 1.1845 | ppl 3.3 | lr 3.04e-05 | grad 0.514 | tok/s 62 | seen 1025.0M | |
| step 3,920 | loss 1.1864 | ppl 3.3 | lr 3.03e-05 | grad 0.516 | tok/s 62 | seen 1027.6M | |
| step 3,930 | loss 1.1912 | ppl 3.3 | lr 3.03e-05 | grad 0.515 | tok/s 62 | seen 1030.2M | |
| step 3,940 | loss 1.2038 | ppl 3.3 | lr 3.02e-05 | grad 0.521 | tok/s 62 | seen 1032.8M | |
| step 3,950 | loss 1.2111 | ppl 3.4 | lr 3.01e-05 | grad 0.518 | tok/s 62 | seen 1035.5M | |
| step 3,960 | loss 1.2355 | ppl 3.4 | lr 3.01e-05 | grad 0.531 | tok/s 62 | seen 1038.1M | |
| step 3,970 | loss 1.1690 | ppl 3.2 | lr 3.00e-05 | grad 0.515 | tok/s 62 | seen 1040.7M | |
| step 3,980 | loss 1.1651 | ppl 3.2 | lr 3.00e-05 | grad 0.518 | tok/s 62 | seen 1043.3M | |
| step 3,990 | loss 1.1813 | ppl 3.3 | lr 3.00e-05 | grad 0.516 | tok/s 62 | seen 1046.0M | |
| step 4,000 | loss 1.1611 | ppl 3.2 | lr 3.00e-05 | grad 0.517 | tok/s 62 | seen 1048.6M | |
| Training complete. | |
| Final step : 4,000 | |
| Tokens seen : 1.0486B | |
| Best loss : 1.1611 | |
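
The rows above can be sanity-checked mechanically. The following is a minimal sketch, not part of the training script: the names `ROW`, `parse_row`, `ppl_matches_loss`, and `tokens_per_step` are hypothetical helpers, and the regular expression assumes the row layout shown in this log. It checks that the printed `ppl` equals `exp(loss)` to within rounding, and estimates tokens consumed per optimizer step from the `seen` column.

```python
import math
import re

# Assumed row layout, copied from this log, e.g.
# "| step 1,000 | loss 1.7364 | ppl 5.7 | lr 2.87e-04 | grad 0.689 | tok/s 57 | seen 262.1M | |"
ROW = re.compile(
    r"step\s+([\d,]+)\s*\|\s*loss\s+([\d.]+)\s*\|\s*ppl\s+([\d.]+)"
    r".*?seen\s+([\d.]+)M"
)

def parse_row(line):
    """Return a dict for a metrics row, or None for any other line."""
    m = ROW.search(line)
    if m is None:
        return None
    return {
        "step": int(m.group(1).replace(",", "")),  # "1,000" -> 1000
        "loss": float(m.group(2)),
        "ppl": float(m.group(3)),
        "seen_m": float(m.group(4)),  # tokens seen, in millions
    }

def ppl_matches_loss(row, tol=0.06):
    """ppl is printed to one decimal, so allow slightly more than 0.05."""
    return abs(math.exp(row["loss"]) - row["ppl"]) <= tol

def tokens_per_step(first, last):
    """Average tokens per optimizer step between two parsed rows."""
    return (last["seen_m"] - first["seen_m"]) * 1e6 / (last["step"] - first["step"])

# Three rows copied verbatim from the log above.
sample = [
    "| step 590 | loss 2.2697 | ppl 9.7 | lr 3.00e-04 | grad 1.220 | tok/s 57 | seen 154.7M | |",
    "| step 1,000 | loss 1.7364 | ppl 5.7 | lr 2.87e-04 | grad 0.689 | tok/s 57 | seen 262.1M | |",
    "| step 4,000 | loss 1.1611 | ppl 3.2 | lr 3.00e-05 | grad 0.517 | tok/s 62 | seen 1048.6M | |",
]
rows = [parse_row(line) for line in sample]

for row in rows:
    print(row["step"], ppl_matches_loss(row))
print(round(tokens_per_step(rows[0], rows[-1])))
```

On these rows the per-step estimate comes out close to 262,144 tokens (2^18), which is consistent with the reported total of 1.0486B tokens over 4,000 steps. The `seen` column is rounded to 0.1M, so the estimate is approximate.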