Loaded dataset with 100,000,000 tokens

Model Report
────────────────────────
Total parameters:      16,013,568
Embedding parameters:  12,865,792
Parameters per layer:  786,944

Training Report
────────────────────────
Tokens per step:       512
Total training steps:  1,953
Target tokens:         1,000,000

Memory Report
────────────────────────
Parameter memory:      0.03 GB
Optimizer memory:      0.13 GB
Gradient memory:       0.03 GB
Estimated total:       0.19 GB

============================================================
Training for 488 optimizer steps
Effective tokens per step: 512
============================================================

  STEP  PROGRESS │    LOSS     PPL  GNORM │ LR         │  TOK/S     MFU │   SEEN  REMAINING    ETA
──────────────────────────────────────────────────────────────────────────────────────────────────
*   10      2.0% │ 10.8392   51.0K    n/a │ ↑ 3.00e-05 │  2,241  0.014% │   5.1K     994.3K  7m23s
*   20      4.1% │ 10.6394   41.7K    n/a │ ↑ 6.00e-05 │ 14,593  0.089% │  10.2K     989.2K  1m07s
*   30      6.1% │ 10.3197   30.3K    n/a │ ↑ 9.00e-05 │ 18,688  0.114% │  15.4K     984.1K    52s
*   40      8.2% │ 10.0608   23.4K    n/a │ ↑ 1.20e-04 │ 20,733  0.126% │  20.5K     978.9K    47s
[checkpoint] saving step 40
[checkpoint] saving step 40
[checkpoint] saved step 40
*   50     10.2% │  9.8320   18.6K    n/a │ ↑ 1.50e-04 │ 19,617  0.120% │  25.6K     973.8K    49s
*   60     12.3% │  9.4081   12.2K    n/a │ ↑ 1.80e-04 │ 20,566  0.125% │  30.7K     968.7K    47s
*   70     14.3% │  8.9761    7912    n/a │ ↑ 2.10e-04 │ 21,564  0.131% │  35.8K     963.6K    44s
*   80     16.4% │  8.6242    5565    n/a │ ↑ 2.40e-04 │ 22,321  0.136% │  41.0K     958.5K    42s
[checkpoint] saving step 80
[checkpoint] saving step 80
[checkpoint] saved step 80
*   90     18.4% │  8.0923    3269    n/a │ ↑ 2.70e-04 │ 21,204  0.129% │  46.1K     953.3K    44s
*  100     20.5% │  7.8742    2629    n/a │ ↑ 3.00e-04 │ 21,187  0.129% │  51.2K     948.2K    44s
   110     22.5% │  8.2624    3875    n/a │ — 3.00e-04 │ 21,539  0.131% │  56.3K     943.1K    43s
   120     24.6% │  8.8018    6646    n/a │ — 2.98e-04 │ 21,909  0.134% │  61.4K     938.0K    42s
[checkpoint] saving step 120
[checkpoint] saving step 120
[checkpoint] saved step 120
*  130     26.6% │  7.6350    2069    n/a │ — 2.96e-04 │ 21,479  0.131% │  66.6K     932.9K    43s
*  140     28.7% │  7.2791    1450    n/a │ — 2.93e-04 │ 21,780  0.133% │  71.7K     927.7K    42s
   150     30.7% │  7.5348    1872    n/a │ — 2.89e-04 │ 22,120  0.135% │  76.8K     922.6K    41s
*  160     32.8% │  7.2704    1437    n/a │ — 2.84e-04 │ 22,391  0.137% │  81.9K     917.5K    40s
[checkpoint] saving step 160
[checkpoint] saving step 160
[checkpoint] saved step 160
   170     34.8% │  7.6200    2039    n/a │ — 2.79e-04 │ 21,774  0.133% │  87.0K     912.4K    41s
*  180     36.9% │  7.1109    1225    n/a │ — 2.73e-04 │ 21,763  0.133% │  92.2K     907.3K    41s
*  190     38.9% │  6.5831   722.8    n/a │ — 2.66e-04 │ 21,957  0.134% │  97.3K     902.1K    41s
   200     41.0% │  7.8899    2670    n/a │ — 2.58e-04 │ 22,053  0.134% │ 102.4K     897.0K    40s
[checkpoint] saving step 200
[checkpoint] saving step 200
[checkpoint] saved step 200
   210     43.0% │  7.4864    1784    n/a │ — 2.50e-04 │ 21,606  0.132% │ 107.5K     891.9K    41s
   220     45.1% │  7.7538    2330    n/a │ — 2.41e-04 │ 21,455  0.131% │ 112.6K     886.8K    41s
   230     47.1% │  7.0994    1211    n/a │ — 2.32e-04 │ 21,550  0.131% │ 117.8K     881.7K    40s
   240     49.2% │  6.9114    1004    n/a │ — 2.22e-04 │ 21,638  0.132% │ 122.9K     876.5K    40s
[checkpoint] saving step 240
[checkpoint] saving step 240
[checkpoint] saved step 240
   250     51.2% │  7.7004    2209    n/a │ — 2.12e-04 │ 21,159  0.129% │ 128.0K     871.4K    41s
   260     53.3% │  7.1510    1275    n/a │ — 2.02e-04 │ 21,109  0.129% │ 133.1K     866.3K    41s
   270     55.3% │  7.4216    1672    n/a │ — 1.91e-04 │ 21,189  0.129% │ 138.2K     861.2K    40s
   280     57.4% │  7.2410    1395    n/a │ — 1.80e-04 │ 21,361  0.130% │ 143.4K     856.1K    40s
[checkpoint] saving step 280
[checkpoint] saving step 280
[checkpoint] saved step 280
   290     59.4% │  7.3611    1574    n/a │ — 1.69e-04 │ 21,116  0.129% │ 148.5K     850.9K    40s
   300     61.5% │  7.0222    1121    n/a │ — 1.58e-04 │ 21,209  0.129% │ 153.6K     845.8K    39s
   310     63.5% │  6.6481   771.3    n/a │ — 1.48e-04 │ 21,365  0.130% │ 158.7K     840.7K    39s
   320     65.6% │  7.1535    1279    n/a │ — 1.37e-04 │ 21,495  0.131% │ 163.8K     835.6K    38s
[checkpoint] saving step 320
[checkpoint] saving step 320
[checkpoint] saved step 320
   330     67.6% │  6.9375    1030    n/a │ — 1.26e-04 │ 21,330  0.130% │ 169.0K     830.5K    38s
   340     69.7% │  7.0372    1138    n/a │ — 1.16e-04 │ 21,440  0.131% │ 174.1K     825.3K    38s
   350     71.7% │  7.3495    1555    n/a │ — 1.06e-04 │ 21,521  0.131% │ 179.2K     820.2K    38s
   360     73.8% │  6.9398    1033    n/a │ — 9.62e-05 │ 21,580  0.132% │ 184.3K     815.1K    37s
[checkpoint] saving step 360
[checkpoint] saving step 360
[checkpoint] saved step 360
   370     75.8% │  7.0623    1167    n/a │ — 8.71e-05 │ 21,329  0.130% │ 189.4K     810.0K    37s
*  380     77.9% │  6.5629   708.3    n/a │ — 7.84e-05 │ 21,287  0.130% │ 194.6K     804.9K    37s
   390     79.9% │  6.6633   783.1    n/a │ — 7.03e-05 │ 21,421  0.131% │ 199.7K     799.7K    37s
   400     82.0% │  6.8306   925.7    n/a │ — 6.28e-05 │ 21,535  0.131% │ 204.8K     794.6K    36s
[checkpoint] saving step 400
[checkpoint] saving step 400
[checkpoint] saved step 400
   410     84.0% │  7.2384    1392    n/a │ — 5.60e-05 │ 21,414  0.131% │ 209.9K     789.5K    36s
   420     86.1% │  7.4031    1641    n/a │ — 5.00e-05 │ 21,476  0.131% │ 215.0K     784.4K    36s