| Loaded dataset with 100,000,000 tokens |
|
|
| Model Report |
| ──────────────────────── |
| Total parameters: 16,013,568 |
| Embedding parameters: 12,865,792 |
| Parameters per layer: 786,944 |
|
|
| Training Report |
| ──────────────────────── |
| Tokens per step: 512 |
| Total training steps: 1,953 |
| Target tokens: 1,000,000 |
|
|
| Memory Report |
| ──────────────────────── |
| Parameter memory: 0.03 GB |
| Optimizer memory: 0.13 GB |
| Gradient memory: 0.03 GB |
| Estimated total: 0.19 GB |
|
|
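The three reports above are mutually consistent under a few assumptions the log does not state: 2 bytes per parameter and per gradient (for example bf16), 8 bytes of optimizer state per parameter (for example two fp32 Adam moments), and decimal gigabytes (1 GB = 1e9 bytes). A minimal sketch that reproduces the reported figures under those assumptions; the function names are hypothetical and not taken from the training script:

```python
# Values copied from the Model Report and Training Report.
TOTAL_PARAMS = 16_013_568
EMBEDDING_PARAMS = 12_865_792
PARAMS_PER_LAYER = 786_944
TOKENS_PER_STEP = 512
TARGET_TOKENS = 1_000_000

# Assumptions (not stated in the log).
BYTES_PER_PARAM = 2       # e.g. bf16 weights
BYTES_PER_GRAD = 2        # gradients stored in the same dtype
BYTES_PER_OPT_STATE = 8   # e.g. two fp32 moments per parameter
GB = 1e9                  # decimal gigabytes


def implied_layers() -> float:
    """Number of layers implied by the non-embedding parameter count."""
    return (TOTAL_PARAMS - EMBEDDING_PARAMS) / PARAMS_PER_LAYER


def total_steps() -> int:
    """Whole steps needed to consume the target token budget."""
    return TARGET_TOKENS // TOKENS_PER_STEP


def memory_report() -> dict[str, float]:
    """Estimated memory in GB, rounded to two decimals like the log."""
    params = TOTAL_PARAMS * BYTES_PER_PARAM / GB
    optimizer = TOTAL_PARAMS * BYTES_PER_OPT_STATE / GB
    grads = TOTAL_PARAMS * BYTES_PER_GRAD / GB
    return {
        "parameters": round(params, 2),
        "optimizer": round(optimizer, 2),
        "gradients": round(grads, 2),
        "total": round(params + optimizer + grads, 2),
    }


if __name__ == "__main__":
    print("layers:", implied_layers())
    print("steps:", total_steps())
    print("memory (GB):", memory_report())
```

The total is rounded from the unrounded sum, which is why it reads 0.19 GB in the Memory Report.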
|
|
| ============================================================ |
| Training for 488 optimizer steps |
| Effective tokens per step: 512 |
| ============================================================ |
|
|
| STEP PROGRESS │ LOSS PPL GNORM │ LR │ TOK/S MFU │ SEEN REMAINING ETA |
| ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── |
| * 10 2.0% │ 10.8392 51.0K n/a │ 3.00e-05 │ 2,241 0.014% │ 5.1K 994.3K 7m23s |
| * 20 4.1% │ 10.6394 41.7K n/a │ 6.00e-05 │ 14,593 0.089% │ 10.2K 989.2K 1m07s |
| * 30 6.1% │ 10.3197 30.3K n/a │ 9.00e-05 │ 18,688 0.114% │ 15.4K 984.1K 52s |
| * 40 8.2% │ 10.0608 23.4K n/a │ 1.20e-04 │ 20,733 0.126% │ 20.5K 978.9K 47s |
| [checkpoint] saving step 40 |
| [checkpoint] saved step 40 |
| * 50 10.2% │ 9.8320 18.6K n/a │ 1.50e-04 │ 19,617 0.120% │ 25.6K 973.8K 49s |
| * 60 12.3% │ 9.4081 12.2K n/a │ 1.80e-04 │ 20,566 0.125% │ 30.7K 968.7K 47s |
| * 70 14.3% │ 8.9761 7912 n/a │ 2.10e-04 │ 21,564 0.131% │ 35.8K 963.6K 44s |
| * 80 16.4% │ 8.6242 5565 n/a │ 2.40e-04 │ 22,321 0.136% │ 41.0K 958.5K 42s |
| [checkpoint] saving step 80 |
| [checkpoint] saved step 80 |
| * 90 18.4% │ 8.0923 3269 n/a │ 2.70e-04 │ 21,204 0.129% │ 46.1K 953.3K 44s |
| * 100 20.5% │ 7.8742 2629 n/a │ 3.00e-04 │ 21,187 0.129% │ 51.2K 948.2K 44s |
| 110 22.5% │ 8.2624 3875 n/a │ 3.00e-04 │ 21,539 0.131% │ 56.3K 943.1K 43s |
| 120 24.6% │ 8.8018 6646 n/a │ 2.98e-04 │ 21,909 0.134% │ 61.4K 938.0K 42s |
| [checkpoint] saving step 120 |
| [checkpoint] saved step 120 |
| * 130 26.6% │ 7.6350 2069 n/a │ 2.96e-04 │ 21,479 0.131% │ 66.6K 932.9K 43s |
| * 140 28.7% │ 7.2791 1450 n/a │ 2.93e-04 │ 21,780 0.133% │ 71.7K 927.7K 42s |
| 150 30.7% │ 7.5348 1872 n/a │ 2.89e-04 │ 22,120 0.135% │ 76.8K 922.6K 41s |
| * 160 32.8% │ 7.2704 1437 n/a │ 2.84e-04 │ 22,391 0.137% │ 81.9K 917.5K 40s |
| [checkpoint] saving step 160 |
| [checkpoint] saved step 160 |
| 170 34.8% │ 7.6200 2039 n/a │ 2.79e-04 │ 21,774 0.133% │ 87.0K 912.4K 41s |
| * 180 36.9% │ 7.1109 1225 n/a │ 2.73e-04 │ 21,763 0.133% │ 92.2K 907.3K 41s |
| * 190 38.9% │ 6.5831 722.8 n/a │ 2.66e-04 │ 21,957 0.134% │ 97.3K 902.1K 41s |
| 200 41.0% │ 7.8899 2670 n/a │ 2.58e-04 │ 22,053 0.134% │ 102.4K 897.0K 40s |
| [checkpoint] saving step 200 |
| [checkpoint] saved step 200 |
| 210 43.0% │ 7.4864 1784 n/a │ 2.50e-04 │ 21,606 0.132% │ 107.5K 891.9K 41s |
| 220 45.1% │ 7.7538 2330 n/a │ 2.41e-04 │ 21,455 0.131% │ 112.6K 886.8K 41s |
| 230 47.1% │ 7.0994 1211 n/a │ 2.32e-04 │ 21,550 0.131% │ 117.8K 881.7K 40s |
| 240 49.2% │ 6.9114 1004 n/a │ 2.22e-04 │ 21,638 0.132% │ 122.9K 876.5K 40s |
| [checkpoint] saving step 240 |
| [checkpoint] saved step 240 |
| 250 51.2% │ 7.7004 2209 n/a │ 2.12e-04 │ 21,159 0.129% │ 128.0K 871.4K 41s |
| 260 53.3% │ 7.1510 1275 n/a │ 2.02e-04 │ 21,109 0.129% │ 133.1K 866.3K 41s |
| 270 55.3% │ 7.4216 1672 n/a │ 1.91e-04 │ 21,189 0.129% │ 138.2K 861.2K 40s |
| 280 57.4% │ 7.2410 1395 n/a │ 1.80e-04 │ 21,361 0.130% │ 143.4K 856.1K 40s |
| [checkpoint] saving step 280 |
| [checkpoint] saved step 280 |
| 290 59.4% │ 7.3611 1574 n/a │ 1.69e-04 │ 21,116 0.129% │ 148.5K 850.9K 40s |
| 300 61.5% │ 7.0222 1121 n/a │ 1.58e-04 │ 21,209 0.129% │ 153.6K 845.8K 39s |
| 310 63.5% │ 6.6481 771.3 n/a │ 1.48e-04 │ 21,365 0.130% │ 158.7K 840.7K 39s |
| 320 65.6% │ 7.1535 1279 n/a │ 1.37e-04 │ 21,495 0.131% │ 163.8K 835.6K 38s |
| [checkpoint] saving step 320 |
| [checkpoint] saved step 320 |
| 330 67.6% │ 6.9375 1030 n/a │ 1.26e-04 │ 21,330 0.130% │ 169.0K 830.5K 38s |
| 340 69.7% │ 7.0372 1138 n/a │ 1.16e-04 │ 21,440 0.131% │ 174.1K 825.3K 38s |
| 350 71.7% │ 7.3495 1555 n/a │ 1.06e-04 │ 21,521 0.131% │ 179.2K 820.2K 38s |
| 360 73.8% │ 6.9398 1033 n/a │ 9.62e-05 │ 21,580 0.132% │ 184.3K 815.1K 37s |
| [checkpoint] saving step 360 |
| [checkpoint] saved step 360 |
| 370 75.8% │ 7.0623 1167 n/a │ 8.71e-05 │ 21,329 0.130% │ 189.4K 810.0K 37s |
| * 380 77.9% │ 6.5629 708.3 n/a │ 7.84e-05 │ 21,287 0.130% │ 194.6K 804.9K 37s |
| 390 79.9% │ 6.6633 783.1 n/a │ 7.03e-05 │ 21,421 0.131% │ 199.7K 799.7K 37s |
| 400 82.0% │ 6.8306 925.7 n/a │ 6.28e-05 │ 21,535 0.131% │ 204.8K 794.6K 36s |
| [checkpoint] saving step 400 |
| [checkpoint] saved step 400 |
| 410 84.0% │ 7.2384 1392 n/a │ 5.60e-05 │ 21,414 0.131% │ 209.9K 789.5K 36s |
| 420 86.1% │ 7.4031 1641 n/a │ 5.00e-05 │ 21,476 0.131% │ 215.0K 784.4K 36s |
|
|
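Two columns of the progress table can be cross-checked from the logged values. PPL matches exp(LOSS), and the LR values are consistent with a linear warmup to 3.00e-04 over the first 100 steps followed by a cosine decay toward 3.00e-05 at step 488. The schedule shape, the warmup length, and the 10% floor are inferred from the numbers and are not stated anywhere in the log; the function names are hypothetical. A minimal sketch:

```python
import math

PEAK_LR = 3.0e-4      # highest LR shown in the table (step 100)
MIN_LR = 3.0e-5       # assumed floor: 10% of the peak
WARMUP_STEPS = 100    # assumed: LR grows by 3.00e-05 every 10 steps until step 100
TOTAL_STEPS = 488     # "Training for 488 optimizer steps"


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to MIN_LR (assumed schedule)."""
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    # Compare against a few logged rows at the table's display precision.
    for step, loss in [(10, 10.8392), (100, 7.8742), (190, 6.5831), (300, 7.0222)]:
        print(f"{step:>4}  ppl={perplexity(loss):,.1f}  lr={learning_rate(step):.2e}")
```

The logged losses are rounded to four decimals, so recomputed perplexities agree with the table only to its displayed precision.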