How to use AlexWortega/moe100m-physics-tinybpe with Transformers:

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("AlexWortega/moe100m-physics-tinybpe", dtype="auto")

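The training log that follows reports a `scale` value on every line and occasional `NaN/Inf grad -> skip` events. The pattern looks like dynamic loss scaling for mixed-precision training: the scale starts at 16384, doubles at steps 200, 400, 600 and so on, and is halved whenever a step is skipped. The sketch below only illustrates that pattern and is not the author's training code. The class name `LossScaler`, the growth interval of 200 clean steps, and the factor of 2 are assumptions inferred from the logged values.

```python
from dataclasses import dataclass


@dataclass
class LossScaler:
    """Toy dynamic loss scaler; every default is an assumption read off the log."""

    scale: float = 16384.0      # initial scale printed at step 0
    growth_interval: int = 200  # clean steps before the scale grows (assumed)
    factor: float = 2.0         # growth and backoff multiplier (assumed)
    good_steps: int = 0         # clean steps since the last scale change
    skipped: int = 0            # running count of skipped steps (the nan#N counter)

    def update(self, grads_finite: bool) -> bool:
        """Record one step; return True if the optimizer step should be applied."""
        if not grads_finite:
            # Overflow: shrink the scale, skip the step, restart the clean streak.
            self.scale /= self.factor
            self.good_steps = 0
            self.skipped += 1
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A full clean streak: try a larger scale.
            self.scale *= self.factor
            self.good_steps = 0
        return True


# Replay steps 0..1471 with a single overflow at step 1471, as in the log.
scaler = LossScaler()
for step in range(1472):
    scaler.update(grads_finite=(step != 1471))
```

Under these assumptions the replay ends with `scaler.scale == 1048576` and `scaler.skipped == 1`, which matches the values the log prints for `nan#1`.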
| step 0 loss=6.3216 lr=1.20e-06 scale=16384 cv=0.734 tok=8192 tok/s=6783 elapsed=0.01h | |
| step 1 loss=6.3316 lr=2.40e-06 scale=16384 cv=0.847 tok=16384 tok/s=9593 elapsed=0.01h | |
| step 2 loss=6.2914 lr=3.60e-06 scale=16384 cv=0.852 tok=24576 tok/s=9943 elapsed=0.01h | |
| step 3 loss=6.3074 lr=4.80e-06 scale=16384 cv=0.797 tok=32768 tok/s=11057 elapsed=0.01h | |
| step 4 loss=6.2747 lr=6.00e-06 scale=16384 cv=0.687 tok=40960 tok/s=11521 elapsed=0.01h | |
| step 20 loss=5.3724 lr=2.52e-05 scale=16384 cv=0.732 tok=172032 tok/s=15812 elapsed=0.01h | |
| step 40 loss=4.0021 lr=4.92e-05 scale=16384 cv=0.821 tok=335872 tok/s=16528 elapsed=0.02h | |
| step 60 loss=3.3498 lr=7.32e-05 scale=16384 cv=0.740 tok=499712 tok/s=16496 elapsed=0.02h | |
| step 80 loss=3.0195 lr=9.72e-05 scale=16384 cv=0.647 tok=663552 tok/s=16597 elapsed=0.02h | |
| step 100 loss=2.8593 lr=1.21e-04 scale=16384 cv=0.764 tok=827392 tok/s=16513 elapsed=0.03h | |
| step 120 loss=2.6296 lr=1.45e-04 scale=16384 cv=0.754 tok=991232 tok/s=16656 elapsed=0.03h | |
| step 140 loss=2.2051 lr=1.69e-04 scale=16384 cv=0.421 tok=1155072 tok/s=16925 elapsed=0.03h | |
| step 160 loss=2.4748 lr=1.93e-04 scale=16384 cv=0.570 tok=1318912 tok/s=16811 elapsed=0.03h | |
| step 180 loss=1.9776 lr=2.17e-04 scale=16384 cv=0.500 tok=1482752 tok/s=16797 elapsed=0.04h | |
| step 200 loss=3.0876 lr=2.41e-04 scale=32768 cv=0.858 tok=1646592 tok/s=16923 elapsed=0.04h | |
| step 220 loss=2.2373 lr=2.65e-04 scale=32768 cv=0.824 tok=1810432 tok/s=17079 elapsed=0.04h | |
| step 240 loss=1.5503 lr=2.89e-04 scale=32768 cv=0.693 tok=1974272 tok/s=16861 elapsed=0.05h | |
| step 260 loss=1.3633 lr=3.13e-04 scale=32768 cv=0.775 tok=2138112 tok/s=17024 elapsed=0.05h | |
| step 280 loss=2.3023 lr=3.37e-04 scale=32768 cv=0.912 tok=2301952 tok/s=17106 elapsed=0.05h | |
| step 300 loss=2.1010 lr=3.61e-04 scale=32768 cv=0.767 tok=2465792 tok/s=17045 elapsed=0.05h | |
| step 320 loss=2.1059 lr=3.85e-04 scale=32768 cv=0.845 tok=2629632 tok/s=17021 elapsed=0.06h | |
| step 340 loss=1.7111 lr=4.09e-04 scale=32768 cv=0.684 tok=2793472 tok/s=16956 elapsed=0.06h | |
| step 360 loss=2.0405 lr=4.33e-04 scale=32768 cv=0.812 tok=2957312 tok/s=16791 elapsed=0.06h | |
| step 380 loss=1.5877 lr=4.57e-04 scale=32768 cv=0.697 tok=3121152 tok/s=16685 elapsed=0.07h | |
| step 400 loss=1.3777 lr=4.81e-04 scale=65536 cv=0.676 tok=3284992 tok/s=16672 elapsed=0.07h | |
| step 420 loss=1.9997 lr=5.05e-04 scale=65536 cv=0.871 tok=3448832 tok/s=16722 elapsed=0.07h | |
| step 440 loss=1.9936 lr=5.29e-04 scale=65536 cv=0.711 tok=3612672 tok/s=16610 elapsed=0.07h | |
| step 460 loss=1.6167 lr=5.53e-04 scale=65536 cv=0.713 tok=3776512 tok/s=16536 elapsed=0.08h | |
| step 480 loss=1.5321 lr=5.77e-04 scale=65536 cv=0.567 tok=3940352 tok/s=16509 elapsed=0.08h | |
| step 500 loss=1.8837 lr=6.00e-04 scale=65536 cv=0.730 tok=4104192 tok/s=16656 elapsed=0.08h | |
| step 520 loss=1.7478 lr=6.00e-04 scale=65536 cv=0.636 tok=4268032 tok/s=16889 elapsed=0.09h | |
| step 540 loss=2.0262 lr=6.00e-04 scale=65536 cv=0.618 tok=4431872 tok/s=16921 elapsed=0.09h | |
| step 560 loss=2.1649 lr=6.00e-04 scale=65536 cv=0.602 tok=4595712 tok/s=16781 elapsed=0.09h | |
| step 580 loss=2.4664 lr=6.00e-04 scale=65536 cv=0.594 tok=4759552 tok/s=16876 elapsed=0.09h | |
| step 600 loss=1.9217 lr=6.00e-04 scale=131072 cv=0.642 tok=4923392 tok/s=16621 elapsed=0.10h | |
| step 620 loss=1.8453 lr=6.00e-04 scale=131072 cv=0.620 tok=5087232 tok/s=16558 elapsed=0.10h | |
| step 640 loss=1.9432 lr=6.00e-04 scale=131072 cv=0.554 tok=5251072 tok/s=16517 elapsed=0.10h | |
| step 660 loss=1.8606 lr=6.00e-04 scale=131072 cv=0.571 tok=5414912 tok/s=16796 elapsed=0.10h | |
| step 680 loss=2.0971 lr=6.00e-04 scale=131072 cv=0.603 tok=5578752 tok/s=15878 elapsed=0.11h | |
| step 700 loss=1.8432 lr=6.00e-04 scale=131072 cv=0.574 tok=5742592 tok/s=16288 elapsed=0.11h | |
| step 720 loss=1.0548 lr=6.00e-04 scale=131072 cv=0.543 tok=5906432 tok/s=16235 elapsed=0.11h | |
| step 740 loss=1.9154 lr=6.00e-04 scale=131072 cv=0.599 tok=6070272 tok/s=15567 elapsed=0.12h | |
| step 760 loss=1.4569 lr=6.00e-04 scale=131072 cv=0.666 tok=6234112 tok/s=15851 elapsed=0.12h | |
| step 780 loss=2.2947 lr=6.00e-04 scale=131072 cv=0.653 tok=6397952 tok/s=16267 elapsed=0.12h | |
| step 800 loss=1.7238 lr=6.00e-04 scale=262144 cv=0.594 tok=6561792 tok/s=15746 elapsed=0.13h | |
| step 820 loss=1.4620 lr=6.00e-04 scale=262144 cv=0.585 tok=6725632 tok/s=16376 elapsed=0.13h | |
| step 840 loss=1.6442 lr=6.00e-04 scale=262144 cv=0.522 tok=6889472 tok/s=16181 elapsed=0.13h | |
| step 860 loss=1.5553 lr=6.00e-04 scale=262144 cv=0.644 tok=7053312 tok/s=16031 elapsed=0.13h | |
| step 880 loss=1.5625 lr=6.00e-04 scale=262144 cv=0.532 tok=7217152 tok/s=16256 elapsed=0.14h | |
| step 900 loss=1.7961 lr=6.00e-04 scale=262144 cv=0.530 tok=7380992 tok/s=16040 elapsed=0.14h | |
| step 920 loss=1.9146 lr=6.00e-04 scale=262144 cv=0.745 tok=7544832 tok/s=16629 elapsed=0.14h | |
| step 940 loss=1.8375 lr=6.00e-04 scale=262144 cv=0.536 tok=7708672 tok/s=16818 elapsed=0.15h | |
| step 960 loss=1.7170 lr=6.00e-04 scale=262144 cv=0.476 tok=7872512 tok/s=16559 elapsed=0.15h | |
| step 980 loss=1.7127 lr=6.00e-04 scale=262144 cv=0.593 tok=8036352 tok/s=16425 elapsed=0.15h | |
| step 1000 loss=1.3178 lr=6.00e-04 scale=524288 cv=0.512 tok=8200192 tok/s=16560 elapsed=0.16h | |
| step 1020 loss=1.7007 lr=6.00e-04 scale=524288 cv=0.570 tok=8364032 tok/s=17062 elapsed=0.17h | |
| step 1040 loss=1.3744 lr=6.00e-04 scale=524288 cv=0.488 tok=8527872 tok/s=17181 elapsed=0.17h | |
| step 1060 loss=1.8751 lr=6.00e-04 scale=524288 cv=0.644 tok=8691712 tok/s=17074 elapsed=0.18h | |
| step 1080 loss=1.9774 lr=6.00e-04 scale=524288 cv=0.544 tok=8855552 tok/s=16966 elapsed=0.18h | |
| step 1100 loss=1.7595 lr=6.00e-04 scale=524288 cv=0.453 tok=9019392 tok/s=17035 elapsed=0.18h | |
| step 1120 loss=1.3240 lr=6.00e-04 scale=524288 cv=0.487 tok=9183232 tok/s=16874 elapsed=0.18h | |
| step 1140 loss=1.6388 lr=6.00e-04 scale=524288 cv=0.621 tok=9347072 tok/s=14916 elapsed=0.19h | |
| step 1160 loss=1.8835 lr=6.00e-04 scale=524288 cv=0.613 tok=9510912 tok/s=12888 elapsed=0.19h | |
| step 1180 loss=1.4623 lr=6.00e-04 scale=524288 cv=0.507 tok=9674752 tok/s=12864 elapsed=0.19h | |
| step 1200 loss=1.5054 lr=6.00e-04 scale=1048576 cv=0.679 tok=9838592 tok/s=12824 elapsed=0.20h | |
| step 1220 loss=1.5370 lr=6.00e-04 scale=1048576 cv=0.545 tok=10002432 tok/s=12816 elapsed=0.20h | |
| step 1240 loss=1.8455 lr=6.00e-04 scale=1048576 cv=0.659 tok=10166272 tok/s=11974 elapsed=0.21h | |
| step 1260 loss=1.6698 lr=6.00e-04 scale=1048576 cv=0.494 tok=10330112 tok/s=12036 elapsed=0.21h | |
| step 1280 loss=1.9114 lr=6.00e-04 scale=1048576 cv=0.620 tok=10493952 tok/s=12536 elapsed=0.21h | |
| step 1300 loss=2.0210 lr=6.00e-04 scale=1048576 cv=0.663 tok=10657792 tok/s=12478 elapsed=0.22h | |
| step 1320 loss=1.8005 lr=6.00e-04 scale=1048576 cv=0.567 tok=10821632 tok/s=12488 elapsed=0.22h | |
| step 1340 loss=1.9519 lr=6.00e-04 scale=1048576 cv=0.603 tok=10985472 tok/s=12437 elapsed=0.23h | |
| step 1360 loss=1.6195 lr=6.00e-04 scale=1048576 cv=0.479 tok=11149312 tok/s=12382 elapsed=0.23h | |
| step 1380 loss=1.8023 lr=6.00e-04 scale=1048576 cv=0.514 tok=11313152 tok/s=12274 elapsed=0.23h | |
| step 1400 loss=1.8614 lr=6.00e-04 scale=2097152 cv=0.619 tok=11476992 tok/s=12219 elapsed=0.24h | |
| step 1420 loss=1.5916 lr=6.00e-04 scale=2097152 cv=0.541 tok=11640832 tok/s=13978 elapsed=0.24h | |
| step 1440 loss=1.7542 lr=6.00e-04 scale=2097152 cv=0.636 tok=11804672 tok/s=16681 elapsed=0.24h | |
| step 1460 loss=1.6513 lr=6.00e-04 scale=2097152 cv=0.563 tok=11968512 tok/s=16669 elapsed=0.25h | |
| step 1471 NaN/Inf grad -> skip; scale=1048576 (nan#1) | |
| step 1480 loss=1.2209 lr=6.00e-04 scale=1048576 cv=0.561 tok=12124160 tok/s=16581 elapsed=0.25h | |
| step 1500 loss=1.9946 lr=6.00e-04 scale=1048576 cv=0.638 tok=12288000 tok/s=16496 elapsed=0.25h | |
| step 1520 loss=1.9879 lr=6.00e-04 scale=1048576 cv=0.622 tok=12451840 tok/s=16506 elapsed=0.25h | |
| step 1540 loss=1.7972 lr=6.00e-04 scale=1048576 cv=0.485 tok=12615680 tok/s=16926 elapsed=0.26h | |
| step 1560 loss=1.7992 lr=6.00e-04 scale=1048576 cv=0.580 tok=12779520 tok/s=16845 elapsed=0.26h | |
| step 1580 loss=1.8146 lr=6.00e-04 scale=1048576 cv=0.689 tok=12943360 tok/s=17100 elapsed=0.26h | |
| step 1600 loss=1.9341 lr=6.00e-04 scale=1048576 cv=0.574 tok=13107200 tok/s=16823 elapsed=0.27h | |
| step 1620 loss=1.7553 lr=6.00e-04 scale=1048576 cv=0.535 tok=13271040 tok/s=16715 elapsed=0.27h | |
| step 1640 loss=1.5950 lr=6.00e-04 scale=1048576 cv=0.557 tok=13434880 tok/s=16425 elapsed=0.27h | |
| step 1660 loss=2.2846 lr=6.00e-04 scale=1048576 cv=0.649 tok=13598720 tok/s=16482 elapsed=0.27h | |
| step 1680 loss=1.4226 lr=6.00e-04 scale=2097152 cv=0.635 tok=13762560 tok/s=14083 elapsed=0.28h | |
| step 1700 loss=1.5163 lr=6.00e-04 scale=2097152 cv=0.584 tok=13926400 tok/s=12776 elapsed=0.28h | |
| step 1712 NaN/Inf grad -> skip; scale=1048576 (nan#2) | |
| step 1713 NaN/Inf grad -> skip; scale=524288 (nan#3) | |
| step 1720 loss=1.6761 lr=6.00e-04 scale=524288 cv=0.600 tok=14073856 tok/s=12899 elapsed=0.29h | |
| step 1740 loss=1.9600 lr=6.00e-04 scale=524288 cv=0.654 tok=14237696 tok/s=12749 elapsed=0.29h | |
| step 1760 loss=1.8619 lr=6.00e-04 scale=524288 cv=0.611 tok=14401536 tok/s=12534 elapsed=0.29h | |
| step 1780 loss=2.0155 lr=6.00e-04 scale=524288 cv=0.570 tok=14565376 tok/s=12600 elapsed=0.30h | |
| step 1800 loss=1.4750 lr=6.00e-04 scale=524288 cv=0.518 tok=14729216 tok/s=12491 elapsed=0.30h | |
| step 1820 loss=1.6176 lr=6.00e-04 scale=524288 cv=0.514 tok=14893056 tok/s=12464 elapsed=0.30h | |
| step 1840 loss=1.0225 lr=6.00e-04 scale=524288 cv=0.498 tok=15056896 tok/s=12378 elapsed=0.31h | |
| step 1860 loss=1.7882 lr=6.00e-04 scale=524288 cv=0.595 tok=15220736 tok/s=12315 elapsed=0.31h | |
| step 1880 loss=1.8028 lr=6.00e-04 scale=524288 cv=0.648 tok=15384576 tok/s=12298 elapsed=0.32h | |
| step 1900 loss=2.0790 lr=6.00e-04 scale=524288 cv=0.645 tok=15548416 tok/s=12256 elapsed=0.32h | |
| step 1920 loss=1.6917 lr=6.00e-04 scale=1048576 cv=0.639 tok=15712256 tok/s=12555 elapsed=0.32h | |
| step 1940 loss=1.2776 lr=6.00e-04 scale=1048576 cv=0.564 tok=15876096 tok/s=12704 elapsed=0.33h | |
| step 1960 loss=1.9289 lr=6.00e-04 scale=1048576 cv=0.601 tok=16039936 tok/s=12587 elapsed=0.33h | |
| step 1980 loss=1.7536 lr=6.00e-04 scale=1048576 cv=0.679 tok=16203776 tok/s=12513 elapsed=0.33h | |
| step 2000 loss=1.6234 lr=6.00e-04 scale=1048576 cv=0.600 tok=16367616 tok/s=12463 elapsed=0.34h | |
| step 2020 loss=1.7304 lr=6.00e-04 scale=1048576 cv=0.560 tok=16531456 tok/s=12486 elapsed=0.36h | |
| step 2040 loss=1.9196 lr=6.00e-04 scale=1048576 cv=0.603 tok=16695296 tok/s=12490 elapsed=0.36h | |
| step 2060 loss=1.6380 lr=6.00e-04 scale=1048576 cv=0.541 tok=16859136 tok/s=12395 elapsed=0.36h | |
| step 2080 loss=1.8508 lr=6.00e-04 scale=1048576 cv=0.615 tok=17022976 tok/s=12489 elapsed=0.37h | |
| step 2100 loss=1.5877 lr=6.00e-04 scale=1048576 cv=0.594 tok=17186816 tok/s=12483 elapsed=0.37h | |
| step 2120 loss=1.7593 lr=5.99e-04 scale=2097152 cv=0.681 tok=17350656 tok/s=12397 elapsed=0.38h | |
| step 2140 loss=1.3744 lr=5.99e-04 scale=2097152 cv=0.605 tok=17514496 tok/s=12422 elapsed=0.38h | |
| step 2160 loss=1.9319 lr=5.99e-04 scale=2097152 cv=0.617 tok=17678336 tok/s=12518 elapsed=0.38h | |
| step 2180 loss=1.7640 lr=5.99e-04 scale=2097152 cv=0.621 tok=17842176 tok/s=12427 elapsed=0.39h | |
| step 2200 loss=1.9715 lr=5.99e-04 scale=2097152 cv=0.736 tok=18006016 tok/s=12374 elapsed=0.39h | |
| step 2220 loss=1.6881 lr=5.99e-04 scale=2097152 cv=0.622 tok=18169856 tok/s=12515 elapsed=0.39h | |
| step 2240 loss=1.8614 lr=5.99e-04 scale=2097152 cv=0.573 tok=18333696 tok/s=12426 elapsed=0.40h | |
| step 2250 NaN/Inf grad -> skip; scale=1048576 (nan#4) | |
| step 2260 loss=1.4815 lr=5.99e-04 scale=1048576 cv=0.523 tok=18489344 tok/s=12330 elapsed=0.40h | |
| step 2280 loss=1.9072 lr=5.99e-04 scale=1048576 cv=0.726 tok=18653184 tok/s=12553 elapsed=0.41h | |
| step 2300 loss=1.9415 lr=5.99e-04 scale=1048576 cv=0.722 tok=18817024 tok/s=12450 elapsed=0.41h | |
| step 2320 loss=1.8309 lr=5.99e-04 scale=1048576 cv=0.682 tok=18980864 tok/s=12322 elapsed=0.41h | |
| step 2340 loss=1.7877 lr=5.99e-04 scale=1048576 cv=0.670 tok=19144704 tok/s=12518 elapsed=0.42h | |
| step 2360 loss=1.8659 lr=5.99e-04 scale=1048576 cv=0.761 tok=19308544 tok/s=12442 elapsed=0.42h | |
| step 2380 loss=1.7983 lr=5.99e-04 scale=1048576 cv=0.692 tok=19472384 tok/s=12351 elapsed=0.42h | |
| step 2400 loss=1.9224 lr=5.99e-04 scale=1048576 cv=0.635 tok=19636224 tok/s=12516 elapsed=0.43h | |
| step 2420 loss=1.8887 lr=5.99e-04 scale=1048576 cv=0.676 tok=19800064 tok/s=12453 elapsed=0.43h | |
| step 2440 loss=1.8232 lr=5.99e-04 scale=1048576 cv=0.713 tok=19963904 tok/s=12319 elapsed=0.44h | |
| step 2460 loss=1.3209 lr=5.99e-04 scale=2097152 cv=0.663 tok=20127744 tok/s=12538 elapsed=0.44h | |
| step 2480 loss=1.9841 lr=5.99e-04 scale=2097152 cv=0.696 tok=20291584 tok/s=12460 elapsed=0.44h | |
| step 2500 loss=1.6886 lr=5.99e-04 scale=2097152 cv=0.679 tok=20455424 tok/s=12352 elapsed=0.45h | |
| step 2520 loss=1.5107 lr=5.99e-04 scale=2097152 cv=0.540 tok=20619264 tok/s=12425 elapsed=0.45h | |
| step 2540 loss=1.7952 lr=5.99e-04 scale=2097152 cv=0.625 tok=20783104 tok/s=12460 elapsed=0.45h | |
| step 2541 NaN/Inf grad -> skip; scale=1048576 (nan#5) | |
| step 2560 loss=1.6652 lr=5.99e-04 scale=1048576 cv=0.588 tok=20938752 tok/s=12407 elapsed=0.46h | |
| step 2580 loss=1.8816 lr=5.99e-04 scale=1048576 cv=0.701 tok=21102592 tok/s=12429 elapsed=0.46h | |
| step 2600 loss=1.7867 lr=5.99e-04 scale=1048576 cv=0.609 tok=21266432 tok/s=12490 elapsed=0.47h | |
| step 2620 loss=2.1709 lr=5.99e-04 scale=1048576 cv=0.569 tok=21430272 tok/s=12379 elapsed=0.47h | |
| step 2621 NaN/Inf grad -> skip; scale=524288 (nan#6) | |
| step 2640 loss=1.2292 lr=5.99e-04 scale=524288 cv=0.550 tok=21585920 tok/s=12345 elapsed=0.47h | |
| step 2660 loss=1.5940 lr=5.99e-04 scale=524288 cv=0.710 tok=21749760 tok/s=12524 elapsed=0.48h | |
| step 2680 loss=1.8622 lr=5.99e-04 scale=524288 cv=0.678 tok=21913600 tok/s=12441 elapsed=0.48h | |
| step 2700 loss=1.6721 lr=5.99e-04 scale=524288 cv=0.616 tok=22077440 tok/s=12309 elapsed=0.49h | |
| step 2720 loss=1.2964 lr=5.99e-04 scale=524288 cv=0.563 tok=22241280 tok/s=12653 elapsed=0.49h | |
| step 2740 loss=1.9051 lr=5.99e-04 scale=524288 cv=0.693 tok=22405120 tok/s=12495 elapsed=0.49h | |
| step 2760 loss=1.7360 lr=5.99e-04 scale=524288 cv=0.705 tok=22568960 tok/s=12478 elapsed=0.50h | |
| step 2780 loss=1.3948 lr=5.99e-04 scale=524288 cv=0.660 tok=22732800 tok/s=12293 elapsed=0.50h | |
| step 2800 loss=1.6900 lr=5.99e-04 scale=524288 cv=0.671 tok=22896640 tok/s=15285 elapsed=0.50h | |
| step 2820 loss=1.8143 lr=5.99e-04 scale=524288 cv=0.829 tok=23060480 tok/s=16794 elapsed=0.51h | |
| step 2840 loss=1.6421 lr=5.99e-04 scale=1048576 cv=0.664 tok=23224320 tok/s=14567 elapsed=0.51h | |
| step 2860 loss=0.9686 lr=5.99e-04 scale=1048576 cv=0.594 tok=23388160 tok/s=15517 elapsed=0.51h | |
| step 2867 NaN/Inf grad -> skip; scale=524288 (nan#7) | |
| step 2880 loss=1.9176 lr=5.99e-04 scale=524288 cv=0.729 tok=23543808 tok/s=16418 elapsed=0.52h | |
| step 2900 loss=1.6320 lr=5.99e-04 scale=524288 cv=0.579 tok=23707648 tok/s=16512 elapsed=0.52h | |
| step 2920 loss=1.3309 lr=5.99e-04 scale=524288 cv=0.541 tok=23871488 tok/s=16506 elapsed=0.52h | |
| step 2940 loss=1.4468 lr=5.99e-04 scale=524288 cv=0.580 tok=24035328 tok/s=16388 elapsed=0.52h | |
| step 2960 loss=1.4032 lr=5.99e-04 scale=524288 cv=0.589 tok=24199168 tok/s=16666 elapsed=0.53h | |
| step 2980 loss=1.3630 lr=5.99e-04 scale=524288 cv=0.567 tok=24363008 tok/s=17063 elapsed=0.53h | |
| step 3000 loss=1.7925 lr=5.99e-04 scale=524288 cv=0.644 tok=24526848 tok/s=16893 elapsed=0.53h | |
| step 3020 loss=1.4607 lr=5.99e-04 scale=524288 cv=0.596 tok=24690688 tok/s=17018 elapsed=0.55h | |
| step 3040 loss=0.8825 lr=5.99e-04 scale=524288 cv=0.597 tok=24854528 tok/s=16950 elapsed=0.55h | |
| step 3060 loss=1.7627 lr=5.99e-04 scale=524288 cv=0.585 tok=25018368 tok/s=17042 elapsed=0.56h | |
| step 3080 loss=1.6114 lr=5.99e-04 scale=1048576 cv=0.590 tok=25182208 tok/s=16750 elapsed=0.56h | |
| step 3100 loss=1.5344 lr=5.99e-04 scale=1048576 cv=0.770 tok=25346048 tok/s=16770 elapsed=0.56h | |
| step 3120 loss=1.6353 lr=5.99e-04 scale=1048576 cv=0.648 tok=25509888 tok/s=16820 elapsed=0.56h | |
| step 3140 loss=1.6957 lr=5.99e-04 scale=1048576 cv=0.605 tok=25673728 tok/s=16805 elapsed=0.57h | |
| step 3160 loss=1.2052 lr=5.99e-04 scale=1048576 cv=0.576 tok=25837568 tok/s=16833 elapsed=0.57h | |
| step 3180 loss=1.2867 lr=5.99e-04 scale=1048576 cv=0.626 tok=26001408 tok/s=16941 elapsed=0.57h | |
| step 3200 loss=1.7531 lr=5.99e-04 scale=1048576 cv=0.715 tok=26165248 tok/s=16680 elapsed=0.58h | |
| step 3220 loss=1.3698 lr=5.99e-04 scale=1048576 cv=0.627 tok=26329088 tok/s=16640 elapsed=0.58h | |
| step 3240 loss=1.3669 lr=5.99e-04 scale=1048576 cv=0.591 tok=26492928 tok/s=15882 elapsed=0.58h | |
| step 3260 loss=1.8160 lr=5.99e-04 scale=1048576 cv=0.652 tok=26656768 tok/s=16471 elapsed=0.58h | |
| step 3280 loss=1.6378 lr=5.98e-04 scale=2097152 cv=0.630 tok=26820608 tok/s=16620 elapsed=0.59h | |
| step 3300 loss=1.7808 lr=5.98e-04 scale=2097152 cv=0.580 tok=26984448 tok/s=16450 elapsed=0.59h | |
| step 3320 loss=1.8652 lr=5.98e-04 scale=2097152 cv=0.651 tok=27148288 tok/s=16611 elapsed=0.59h | |
| step 3334 NaN/Inf grad -> skip; scale=1048576 (nan#8) | |
| step 3340 loss=1.9555 lr=5.98e-04 scale=1048576 cv=0.749 tok=27303936 tok/s=16810 elapsed=0.60h | |
| step 3360 loss=1.6746 lr=5.98e-04 scale=1048576 cv=0.568 tok=27467776 tok/s=16876 elapsed=0.60h | |
| step 3380 loss=1.3509 lr=5.98e-04 scale=1048576 cv=0.581 tok=27631616 tok/s=17125 elapsed=0.60h | |
| step 3400 loss=1.4293 lr=5.98e-04 scale=1048576 cv=0.592 tok=27795456 tok/s=16830 elapsed=0.60h | |
| step 3420 loss=1.9755 lr=5.98e-04 scale=1048576 cv=0.658 tok=27959296 tok/s=16850 elapsed=0.61h | |
| step 3440 loss=1.6982 lr=5.98e-04 scale=1048576 cv=0.575 tok=28123136 tok/s=16908 elapsed=0.61h | |
| step 3460 loss=1.7961 lr=5.98e-04 scale=1048576 cv=0.645 tok=28286976 tok/s=16942 elapsed=0.61h | |
| step 3480 loss=1.8034 lr=5.98e-04 scale=1048576 cv=0.584 tok=28450816 tok/s=16963 elapsed=0.62h | |
| step 3500 loss=1.7417 lr=5.98e-04 scale=1048576 cv=0.560 tok=28614656 tok/s=16924 elapsed=0.62h | |
| step 3520 loss=1.1235 lr=5.98e-04 scale=1048576 cv=0.663 tok=28778496 tok/s=16879 elapsed=0.62h | |
| step 3540 loss=1.7085 lr=5.98e-04 scale=2097152 cv=0.544 tok=28942336 tok/s=16977 elapsed=0.62h | |
| step 3560 loss=1.4954 lr=5.98e-04 scale=2097152 cv=0.651 tok=29106176 tok/s=16920 elapsed=0.63h | |
| step 3580 loss=1.6071 lr=5.98e-04 scale=2097152 cv=0.592 tok=29270016 tok/s=17027 elapsed=0.63h | |
| step 3600 loss=1.3671 lr=5.98e-04 scale=2097152 cv=0.508 tok=29433856 tok/s=17026 elapsed=0.63h | |
| step 3620 loss=1.1122 lr=5.98e-04 scale=2097152 cv=0.492 tok=29597696 tok/s=16917 elapsed=0.64h | |
| step 3640 loss=1.3696 lr=5.98e-04 scale=2097152 cv=0.492 tok=29761536 tok/s=16880 elapsed=0.64h | |
| step 3660 loss=1.7994 lr=5.98e-04 scale=2097152 cv=0.581 tok=29925376 tok/s=16700 elapsed=0.64h | |
| step 3680 loss=1.7279 lr=5.98e-04 scale=2097152 cv=0.569 tok=30089216 tok/s=16748 elapsed=0.64h | |
| step 3700 loss=1.6572 lr=5.98e-04 scale=2097152 cv=0.588 tok=30253056 tok/s=16721 elapsed=0.65h | |
| step 3720 loss=1.9131 lr=5.98e-04 scale=2097152 cv=0.649 tok=30416896 tok/s=16691 elapsed=0.65h | |
| step 3737 NaN/Inf grad -> skip; scale=2097152 (nan#9) | |
| step 3740 loss=1.7658 lr=5.98e-04 scale=2097152 cv=0.635 tok=30572544 tok/s=16788 elapsed=0.65h | |
| step 3760 loss=1.3139 lr=5.98e-04 scale=2097152 cv=0.640 tok=30736384 tok/s=16701 elapsed=0.66h | |
| step 3780 loss=1.3549 lr=5.98e-04 scale=2097152 cv=0.541 tok=30900224 tok/s=16603 elapsed=0.66h | |
| step 3800 loss=1.2483 lr=5.98e-04 scale=2097152 cv=0.681 tok=31064064 tok/s=16468 elapsed=0.66h | |
| step 3820 loss=1.4353 lr=5.98e-04 scale=2097152 cv=0.534 tok=31227904 tok/s=16515 elapsed=0.66h | |
| step 3840 loss=1.4866 lr=5.98e-04 scale=2097152 cv=0.519 tok=31391744 tok/s=16514 elapsed=0.67h | |
| step 3854 NaN/Inf grad -> skip; scale=1048576 (nan#10) | |
| step 3860 loss=1.7787 lr=5.98e-04 scale=1048576 cv=0.710 tok=31547392 tok/s=16495 elapsed=0.67h | |
| step 3880 loss=1.1737 lr=5.98e-04 scale=1048576 cv=0.631 tok=31711232 tok/s=16581 elapsed=0.67h | |
| step 3900 loss=1.7902 lr=5.98e-04 scale=1048576 cv=0.690 tok=31875072 tok/s=16323 elapsed=0.68h | |
| step 3920 loss=1.2182 lr=5.98e-04 scale=1048576 cv=0.641 tok=32038912 tok/s=16374 elapsed=0.68h | |
| step 3940 loss=1.8114 lr=5.98e-04 scale=1048576 cv=0.646 tok=32202752 tok/s=16436 elapsed=0.68h | |
| step 3960 loss=1.4763 lr=5.98e-04 scale=1048576 cv=0.652 tok=32366592 tok/s=16473 elapsed=0.69h | |
| step 3980 loss=1.2773 lr=5.98e-04 scale=1048576 cv=0.654 tok=32530432 tok/s=16470 elapsed=0.69h | |
| step 4000 loss=1.7615 lr=5.98e-04 scale=1048576 cv=0.647 tok=32694272 tok/s=16508 elapsed=0.69h | |
| step 4020 loss=1.1238 lr=5.98e-04 scale=1048576 cv=0.724 tok=32858112 tok/s=16473 elapsed=0.71h | |
| step 4040 loss=1.8557 lr=5.98e-04 scale=1048576 cv=0.674 tok=33021952 tok/s=16539 elapsed=0.71h | |
| step 4060 loss=1.8920 lr=5.98e-04 scale=2097152 cv=0.610 tok=33185792 tok/s=16522 elapsed=0.72h | |
| step 4080 loss=1.0971 lr=5.98e-04 scale=2097152 cv=0.677 tok=33349632 tok/s=16662 elapsed=0.72h | |
| step 4100 loss=1.2060 lr=5.97e-04 scale=2097152 cv=0.616 tok=33513472 tok/s=16483 elapsed=0.72h | |
| step 4120 loss=0.9686 lr=5.97e-04 scale=2097152 cv=0.592 tok=33677312 tok/s=16448 elapsed=0.73h | |
| step 4140 loss=1.4025 lr=5.97e-04 scale=2097152 cv=0.588 tok=33841152 tok/s=16460 elapsed=0.73h | |
| step 4160 loss=1.3178 lr=5.97e-04 scale=2097152 cv=0.616 tok=34004992 tok/s=16598 elapsed=0.73h | |
| step 4180 loss=1.0863 lr=5.97e-04 scale=2097152 cv=0.563 tok=34168832 tok/s=16659 elapsed=0.74h | |
| step 4200 loss=1.9432 lr=5.97e-04 scale=2097152 cv=0.712 tok=34332672 tok/s=16613 elapsed=0.74h | |
| step 4220 loss=1.1577 lr=5.97e-04 scale=2097152 cv=0.595 tok=34496512 tok/s=16503 elapsed=0.74h | |
| step 4240 loss=1.8034 lr=5.97e-04 scale=2097152 cv=0.761 tok=34660352 tok/s=16741 elapsed=0.74h | |
| step 4260 loss=1.4381 lr=5.97e-04 scale=4194304 cv=0.621 tok=34824192 tok/s=16631 elapsed=0.75h | |
| step 4280 loss=1.6343 lr=5.97e-04 scale=4194304 cv=0.574 tok=34988032 tok/s=16658 elapsed=0.75h | |
| step 4300 loss=1.5374 lr=5.97e-04 scale=4194304 cv=0.638 tok=35151872 tok/s=16471 elapsed=0.75h | |
| step 4320 loss=1.6199 lr=5.97e-04 scale=4194304 cv=0.647 tok=35315712 tok/s=16517 elapsed=0.76h | |
| step 4340 loss=1.3760 lr=5.97e-04 scale=4194304 cv=0.593 tok=35479552 tok/s=16493 elapsed=0.76h | |
| step 4360 loss=1.7765 lr=5.97e-04 scale=4194304 cv=0.618 tok=35643392 tok/s=16505 elapsed=0.76h | |
| step 4380 loss=1.7942 lr=5.97e-04 scale=4194304 cv=0.712 tok=35807232 tok/s=16536 elapsed=0.76h | |
| step 4400 loss=1.7810 lr=5.97e-04 scale=4194304 cv=0.580 tok=35971072 tok/s=16522 elapsed=0.77h | |
| step 4420 loss=1.6380 lr=5.97e-04 scale=4194304 cv=0.489 tok=36134912 tok/s=16466 elapsed=0.77h | |
| step 4440 loss=1.8000 lr=5.97e-04 scale=4194304 cv=0.566 tok=36298752 tok/s=16487 elapsed=0.77h | |
| step 4460 loss=1.7391 lr=5.97e-04 scale=8388608 cv=0.568 tok=36462592 tok/s=16526 elapsed=0.78h | |
| step 4475 NaN/Inf grad -> skip; scale=4194304 (nan#11) | |
| step 4476 NaN/Inf grad -> skip; scale=2097152 (nan#12) | |
| step 4480 loss=1.0064 lr=5.97e-04 scale=2097152 cv=0.590 tok=36610048 tok/s=16464 elapsed=0.78h | |
| step 4500 loss=0.6248 lr=5.97e-04 scale=2097152 cv=0.717 tok=36773888 tok/s=16456 elapsed=0.78h | |
| step 4520 loss=1.4792 lr=5.97e-04 scale=2097152 cv=0.583 tok=36937728 tok/s=16657 elapsed=0.78h | |
| step 4540 loss=1.1978 lr=5.97e-04 scale=2097152 cv=0.684 tok=37101568 tok/s=16971 elapsed=0.79h | |
| step 4560 loss=1.8814 lr=5.97e-04 scale=2097152 cv=0.755 tok=37265408 tok/s=16976 elapsed=0.79h | |
| step 4580 loss=1.3235 lr=5.97e-04 scale=2097152 cv=0.577 tok=37429248 tok/s=16878 elapsed=0.79h | |
| step 4600 loss=1.9293 lr=5.97e-04 scale=2097152 cv=0.675 tok=37593088 tok/s=16936 elapsed=0.80h | |
| step 4609 NaN/Inf grad -> skip; scale=1048576 (nan#13) | |
| step 4620 loss=1.4251 lr=5.97e-04 scale=1048576 cv=0.535 tok=37748736 tok/s=16983 elapsed=0.80h | |
| step 4640 loss=1.6672 lr=5.97e-04 scale=1048576 cv=0.579 tok=37912576 tok/s=16768 elapsed=0.80h | |
| step 4660 loss=1.6533 lr=5.97e-04 scale=1048576 cv=0.682 tok=38076416 tok/s=16921 elapsed=0.80h | |
| step 4680 loss=1.4521 lr=5.97e-04 scale=1048576 cv=0.734 tok=38240256 tok/s=16954 elapsed=0.81h | |
| step 4700 loss=1.6805 lr=5.97e-04 scale=1048576 cv=0.642 tok=38404096 tok/s=16851 elapsed=0.81h | |
| step 4720 loss=1.6319 lr=5.97e-04 scale=1048576 cv=0.597 tok=38567936 tok/s=16869 elapsed=0.81h | |
| step 4740 loss=1.6209 lr=5.97e-04 scale=1048576 cv=0.535 tok=38731776 tok/s=16519 elapsed=0.82h | |
| step 4760 loss=1.9060 lr=5.96e-04 scale=1048576 cv=0.634 tok=38895616 tok/s=16425 elapsed=0.82h | |
| step 4780 loss=1.6260 lr=5.96e-04 scale=1048576 cv=0.576 tok=39059456 tok/s=16533 elapsed=0.82h | |
| step 4800 loss=1.3516 lr=5.96e-04 scale=1048576 cv=0.529 tok=39223296 tok/s=16469 elapsed=0.83h | |
| step 4820 loss=0.8469 lr=5.96e-04 scale=2097152 cv=0.553 tok=39387136 tok/s=16399 elapsed=0.83h | |
| step 4840 loss=1.3225 lr=5.96e-04 scale=2097152 cv=0.577 tok=39550976 tok/s=16458 elapsed=0.83h | |
| step 4860 loss=1.6898 lr=5.96e-04 scale=2097152 cv=0.568 tok=39714816 tok/s=16508 elapsed=0.83h | |
| step 4880 loss=1.0248 lr=5.96e-04 scale=2097152 cv=0.707 tok=39878656 tok/s=16499 elapsed=0.84h | |
| step 4900 loss=1.5611 lr=5.96e-04 scale=2097152 cv=0.549 tok=40042496 tok/s=16691 elapsed=0.84h | |
| step 4911 NaN/Inf grad -> skip; scale=1048576 (nan#14) | |
| step 4920 loss=1.3647 lr=5.96e-04 scale=1048576 cv=0.712 tok=40198144 tok/s=16928 elapsed=0.84h | |
| step 4940 loss=1.1804 lr=5.96e-04 scale=1048576 cv=0.787 tok=40361984 tok/s=16864 elapsed=0.85h | |
| step 4960 loss=1.4058 lr=5.96e-04 scale=1048576 cv=0.602 tok=40525824 tok/s=16845 elapsed=0.85h | |
| step 4980 loss=1.6238 lr=5.96e-04 scale=1048576 cv=0.663 tok=40689664 tok/s=16885 elapsed=0.85h | |
| step 5000 loss=1.7602 lr=5.96e-04 scale=1048576 cv=0.657 tok=40853504 tok/s=16900 elapsed=0.85h | |
| step 5020 loss=1.7731 lr=5.96e-04 scale=1048576 cv=0.628 tok=41017344 tok/s=16587 elapsed=0.87h | |
| step 5040 loss=1.0175 lr=5.96e-04 scale=1048576 cv=0.495 tok=41181184 tok/s=16616 elapsed=0.87h | |
| step 5060 loss=1.1411 lr=5.96e-04 scale=1048576 cv=0.549 tok=41345024 tok/s=16718 elapsed=0.88h | |
| step 5080 loss=1.2673 lr=5.96e-04 scale=1048576 cv=0.485 tok=41508864 tok/s=16821 elapsed=0.88h | |
| step 5100 loss=1.7256 lr=5.96e-04 scale=1048576 cv=0.635 tok=41672704 tok/s=16432 elapsed=0.88h | |
| step 5118 NaN/Inf grad -> skip; scale=1048576 (nan#15) | |
| step 5120 loss=0.7711 lr=5.96e-04 scale=1048576 cv=0.580 tok=41828352 tok/s=16562 elapsed=0.88h | |
| step 5140 loss=1.8680 lr=5.96e-04 scale=1048576 cv=0.606 tok=41992192 tok/s=16535 elapsed=0.89h | |
| step 5160 loss=1.5701 lr=5.96e-04 scale=1048576 cv=0.640 tok=42156032 tok/s=16510 elapsed=0.89h | |
| step 5180 loss=1.8318 lr=5.96e-04 scale=1048576 cv=0.671 tok=42319872 tok/s=16546 elapsed=0.89h | |
| step 5200 loss=0.8308 lr=5.96e-04 scale=1048576 cv=0.662 tok=42483712 tok/s=16738 elapsed=0.90h | |
| step 5220 loss=0.9371 lr=5.96e-04 scale=1048576 cv=0.678 tok=42647552 tok/s=16625 elapsed=0.90h | |
| step 5240 loss=0.8712 lr=5.96e-04 scale=1048576 cv=0.670 tok=42811392 tok/s=16607 elapsed=0.90h | |
| step 5260 loss=1.2865 lr=5.96e-04 scale=1048576 cv=0.551 tok=42975232 tok/s=16488 elapsed=0.91h | |
| step 5280 loss=1.2664 lr=5.96e-04 scale=1048576 cv=0.570 tok=43139072 tok/s=16432 elapsed=0.91h | |
| step 5300 loss=0.9733 lr=5.96e-04 scale=1048576 cv=0.581 tok=43302912 tok/s=16466 elapsed=0.91h | |
| step 5320 loss=1.7909 lr=5.95e-04 scale=2097152 cv=0.680 tok=43466752 tok/s=16531 elapsed=0.91h | |
| step 5340 loss=1.6090 lr=5.95e-04 scale=2097152 cv=0.671 tok=43630592 tok/s=16658 elapsed=0.92h | |
| step 5360 loss=1.6508 lr=5.95e-04 scale=2097152 cv=0.646 tok=43794432 tok/s=16573 elapsed=0.92h | |
| step 5380 loss=1.1391 lr=5.95e-04 scale=2097152 cv=0.655 tok=43958272 tok/s=16455 elapsed=0.92h | |
| step 5400 loss=1.7529 lr=5.95e-04 scale=2097152 cv=0.757 tok=44122112 tok/s=16483 elapsed=0.93h | |
| step 5420 loss=0.8010 lr=5.95e-04 scale=2097152 cv=0.772 tok=44285952 tok/s=16539 elapsed=0.93h | |
| step 5440 loss=1.0516 lr=5.95e-04 scale=2097152 cv=0.641 tok=44449792 tok/s=16448 elapsed=0.93h | |
| step 5460 loss=1.9268 lr=5.95e-04 scale=2097152 cv=0.604 tok=44613632 tok/s=16475 elapsed=0.93h | |
| step 5480 loss=1.5866 lr=5.95e-04 scale=2097152 cv=0.501 tok=44777472 tok/s=16479 elapsed=0.94h | |
| step 5500 loss=1.7272 lr=5.95e-04 scale=2097152 cv=0.574 tok=44941312 tok/s=16628 elapsed=0.94h | |
| step 5520 loss=1.4257 lr=5.95e-04 scale=4194304 cv=0.540 tok=45105152 tok/s=16578 elapsed=0.94h | |
| step 5540 loss=1.3284 lr=5.95e-04 scale=4194304 cv=0.489 tok=45268992 tok/s=16543 elapsed=0.95h | |
| step 5560 loss=1.0973 lr=5.95e-04 scale=4194304 cv=0.517 tok=45432832 tok/s=16451 elapsed=0.95h | |
| step 5580 loss=1.6825 lr=5.95e-04 scale=4194304 cv=0.601 tok=45596672 tok/s=16519 elapsed=0.95h | |
| step 5583 NaN/Inf grad -> skip; scale=2097152 (nan#16) | |
| step 5600 loss=1.6256 lr=5.95e-04 scale=2097152 cv=0.614 tok=45752320 tok/s=16871 elapsed=0.96h | |
| step 5620 loss=0.6656 lr=5.95e-04 scale=2097152 cv=0.689 tok=45916160 tok/s=16730 elapsed=0.96h | |
| step 5640 loss=1.7282 lr=5.95e-04 scale=2097152 cv=0.581 tok=46080000 tok/s=16981 elapsed=0.96h | |
| step 5660 loss=1.2027 lr=5.95e-04 scale=2097152 cv=0.531 tok=46243840 tok/s=16847 elapsed=0.96h | |
| step 5680 loss=1.1994 lr=5.95e-04 scale=2097152 cv=0.717 tok=46407680 tok/s=16798 elapsed=0.97h | |
| step 5700 loss=1.4809 lr=5.95e-04 scale=2097152 cv=0.600 tok=46571520 tok/s=16898 elapsed=0.97h | |
| step 5720 loss=1.3426 lr=5.95e-04 scale=2097152 cv=0.564 tok=46735360 tok/s=16930 elapsed=0.97h | |
| step 5740 loss=1.7531 lr=5.95e-04 scale=2097152 cv=0.686 tok=46899200 tok/s=16625 elapsed=0.98h | |
| step 5760 loss=1.7045 lr=5.95e-04 scale=2097152 cv=0.621 tok=47063040 tok/s=16206 elapsed=0.98h | |
| step 5780 loss=0.5637 lr=5.95e-04 scale=2097152 cv=0.576 tok=47226880 tok/s=16369 elapsed=0.98h | |
| step 5795 NaN/Inf grad -> skip; scale=2097152 (nan#17) | |
| step 5800 loss=1.7696 lr=5.95e-04 scale=2097152 cv=0.715 tok=47382528 tok/s=16376 elapsed=0.98h | |
| step 5820 loss=1.8519 lr=5.95e-04 scale=2097152 cv=0.679 tok=47546368 tok/s=16475 elapsed=0.99h | |
| step 5840 loss=1.7651 lr=5.94e-04 scale=2097152 cv=0.606 tok=47710208 tok/s=16462 elapsed=0.99h | |
| step 5860 loss=1.6883 lr=5.94e-04 scale=2097152 cv=0.562 tok=47874048 tok/s=16400 elapsed=0.99h | |
| step 5880 loss=1.7750 lr=5.94e-04 scale=2097152 cv=0.537 tok=48037888 tok/s=16423 elapsed=1.00h | |
| step 5900 loss=1.5458 lr=5.94e-04 scale=2097152 cv=0.557 tok=48201728 tok/s=16503 elapsed=1.00h | |
| step 5920 loss=1.7582 lr=5.94e-04 scale=2097152 cv=0.605 tok=48365568 tok/s=16425 elapsed=1.00h | |
| step 5940 loss=1.7078 lr=5.94e-04 scale=2097152 cv=0.676 tok=48529408 tok/s=16396 elapsed=1.00h | |
| step 5960 loss=1.8043 lr=5.94e-04 scale=2097152 cv=0.615 tok=48693248 tok/s=16474 elapsed=1.01h | |
| step 5980 loss=1.5165 lr=5.94e-04 scale=2097152 cv=0.550 tok=48857088 tok/s=16443 elapsed=1.01h | |
| step 6000 loss=1.1632 lr=5.94e-04 scale=4194304 cv=0.512 tok=49020928 tok/s=16539 elapsed=1.01h | |
| step 6009 NaN/Inf grad -> skip; scale=2097152 (nan#18) | |
| step 6020 loss=1.7098 lr=5.94e-04 scale=2097152 cv=0.632 tok=49176576 tok/s=16595 elapsed=1.03h | |
| step 6040 loss=0.6740 lr=5.94e-04 scale=2097152 cv=0.564 tok=49340416 tok/s=16503 elapsed=1.04h | |
| step 6060 loss=1.6517 lr=5.94e-04 scale=2097152 cv=0.744 tok=49504256 tok/s=16699 elapsed=1.04h | |
| step 6080 loss=1.7669 lr=5.94e-04 scale=2097152 cv=0.642 tok=49668096 tok/s=16379 elapsed=1.04h | |
| step 6100 loss=1.6608 lr=5.94e-04 scale=2097152 cv=0.752 tok=49831936 tok/s=16543 elapsed=1.05h | |
| step 6120 loss=1.2045 lr=5.94e-04 scale=2097152 cv=0.570 tok=49995776 tok/s=16694 elapsed=1.05h | |
| step 6140 loss=1.4646 lr=5.94e-04 scale=2097152 cv=0.589 tok=50159616 tok/s=16796 elapsed=1.05h | |
| step 6160 loss=1.5230 lr=5.94e-04 scale=2097152 cv=0.520 tok=50323456 tok/s=16977 elapsed=1.05h | |
| step 6180 loss=0.4850 lr=5.94e-04 scale=2097152 cv=0.694 tok=50487296 tok/s=16936 elapsed=1.06h | |
| step 6200 loss=1.4431 lr=5.94e-04 scale=2097152 cv=0.667 tok=50651136 tok/s=16496 elapsed=1.06h | |
| step 6215 NaN/Inf grad -> skip; scale=2097152 (nan#19) | |
| step 6220 loss=1.8690 lr=5.94e-04 scale=2097152 cv=0.626 tok=50806784 tok/s=16538 elapsed=1.06h | |
| step 6240 loss=1.5669 lr=5.94e-04 scale=2097152 cv=0.608 tok=50970624 tok/s=16432 elapsed=1.07h | |
| step 6260 loss=1.6127 lr=5.94e-04 scale=2097152 cv=0.551 tok=51134464 tok/s=16662 elapsed=1.07h | |
| step 6264 NaN/Inf grad -> skip; scale=1048576 (nan#20) | |
| step 6280 loss=1.9136 lr=5.94e-04 scale=1048576 cv=0.611 tok=51290112 tok/s=16624 elapsed=1.07h | |
| step 6300 loss=1.7623 lr=5.93e-04 scale=1048576 cv=0.616 tok=51453952 tok/s=16444 elapsed=1.08h | |
| step 6320 loss=0.8029 lr=5.93e-04 scale=1048576 cv=0.575 tok=51617792 tok/s=16547 elapsed=1.08h | |
| step 6340 loss=1.4429 lr=5.93e-04 scale=1048576 cv=0.557 tok=51781632 tok/s=16867 elapsed=1.08h | |
| step 6360 loss=1.2907 lr=5.93e-04 scale=1048576 cv=0.557 tok=51945472 tok/s=16868 elapsed=1.08h | |
| step 6380 loss=1.2795 lr=5.93e-04 scale=1048576 cv=0.512 tok=52109312 tok/s=16828 elapsed=1.09h | |
| step 6400 loss=1.6258 lr=5.93e-04 scale=1048576 cv=0.619 tok=52273152 tok/s=16885 elapsed=1.09h | |
| step 6420 loss=1.6130 lr=5.93e-04 scale=1048576 cv=0.597 tok=52436992 tok/s=17375 elapsed=1.09h | |
| step 6440 loss=1.0615 lr=5.93e-04 scale=1048576 cv=0.538 tok=52600832 tok/s=17365 elapsed=1.10h | |
| step 6460 loss=1.7535 lr=5.93e-04 scale=1048576 cv=0.748 tok=52764672 tok/s=17354 elapsed=1.10h | |
| step 6480 loss=1.0498 lr=5.93e-04 scale=2097152 cv=0.563 tok=52928512 tok/s=17331 elapsed=1.10h | |
| step 6500 loss=1.8048 lr=5.93e-04 scale=2097152 cv=0.623 tok=53092352 tok/s=16960 elapsed=1.10h | |
| step 6520 loss=1.8079 lr=5.93e-04 scale=2097152 cv=0.657 tok=53256192 tok/s=16705 elapsed=1.11h | |
| step 6540 loss=1.5149 lr=5.93e-04 scale=2097152 cv=0.651 tok=53420032 tok/s=16677 elapsed=1.11h | |
| step 6560 loss=1.3345 lr=5.93e-04 scale=2097152 cv=0.578 tok=53583872 tok/s=16761 elapsed=1.11h | |
| step 6580 loss=1.5787 lr=5.93e-04 scale=2097152 cv=0.598 tok=53747712 tok/s=16724 elapsed=1.12h | |
| step 6600 loss=1.5599 lr=5.93e-04 scale=2097152 cv=0.495 tok=53911552 tok/s=16544 elapsed=1.12h | |
| step 6620 loss=1.7231 lr=5.93e-04 scale=2097152 cv=0.620 tok=54075392 tok/s=16380 elapsed=1.12h | |
| step 6640 loss=0.9672 lr=5.93e-04 scale=2097152 cv=0.570 tok=54239232 tok/s=16450 elapsed=1.12h | |
| step 6660 loss=1.6704 lr=5.93e-04 scale=2097152 cv=0.652 tok=54403072 tok/s=16946 elapsed=1.13h | |
| step 6680 loss=1.8990 lr=5.93e-04 scale=4194304 cv=0.601 tok=54566912 tok/s=16972 elapsed=1.13h | |
| step 6700 loss=1.6520 lr=5.93e-04 scale=4194304 cv=0.470 tok=54730752 tok/s=16782 elapsed=1.13h | |
| step 6720 loss=1.7745 lr=5.92e-04 scale=4194304 cv=0.594 tok=54894592 tok/s=16462 elapsed=1.14h | |
| step 6731 NaN/Inf grad -> skip; scale=2097152 (nan#21) | |
| step 6740 loss=1.7556 lr=5.92e-04 scale=2097152 cv=0.551 tok=55050240 tok/s=16433 elapsed=1.14h | |
| step 6760 loss=0.6817 lr=5.92e-04 scale=2097152 cv=0.660 tok=55214080 tok/s=16495 elapsed=1.14h | |
| step 6780 loss=1.8204 lr=5.92e-04 scale=2097152 cv=0.641 tok=55377920 tok/s=16365 elapsed=1.14h | |
| step 6800 loss=1.4742 lr=5.92e-04 scale=2097152 cv=0.593 tok=55541760 tok/s=16350 elapsed=1.15h | |
| step 6820 loss=1.5987 lr=5.92e-04 scale=2097152 cv=0.610 tok=55705600 tok/s=16440 elapsed=1.15h | |
| step 6840 loss=1.8143 lr=5.92e-04 scale=2097152 cv=0.627 tok=55869440 tok/s=16472 elapsed=1.15h | |
| step 6860 loss=1.7195 lr=5.92e-04 scale=2097152 cv=0.563 tok=56033280 tok/s=16461 elapsed=1.16h | |
| step 6880 loss=1.6433 lr=5.92e-04 scale=2097152 cv=0.635 tok=56197120 tok/s=16441 elapsed=1.16h | |
| step 6900 loss=1.4985 lr=5.92e-04 scale=2097152 cv=0.560 tok=56360960 tok/s=16480 elapsed=1.16h | |
| step 6920 loss=1.0155 lr=5.92e-04 scale=2097152 cv=0.539 tok=56524800 tok/s=16439 elapsed=1.16h | |
| step 6940 loss=1.8419 lr=5.92e-04 scale=4194304 cv=0.595 tok=56688640 tok/s=16485 elapsed=1.17h | |
| step 6960 loss=1.3481 lr=5.92e-04 scale=4194304 cv=0.555 tok=56852480 tok/s=16501 elapsed=1.17h | |
| step 6980 loss=1.0243 lr=5.92e-04 scale=4194304 cv=0.640 tok=57016320 tok/s=16522 elapsed=1.17h | |
| step 7000 loss=0.8455 lr=5.92e-04 scale=4194304 cv=0.660 tok=57180160 tok/s=16408 elapsed=1.18h | |
| step 7001 NaN/Inf grad -> skip; scale=2097152 (nan#22) | |
| step 7020 loss=1.4952 lr=5.92e-04 scale=2097152 cv=0.584 tok=57335808 tok/s=16605 elapsed=1.19h | |
| step 7040 loss=1.5306 lr=5.92e-04 scale=2097152 cv=0.554 tok=57499648 tok/s=16655 elapsed=1.20h | |
| step 7060 loss=0.9047 lr=5.92e-04 scale=2097152 cv=0.548 tok=57663488 tok/s=16965 elapsed=1.20h | |
| step 7080 loss=1.3666 lr=5.92e-04 scale=2097152 cv=0.561 tok=57827328 tok/s=16876 elapsed=1.20h | |
| step 7100 loss=1.6332 lr=5.92e-04 scale=2097152 cv=0.508 tok=57991168 tok/s=16695 elapsed=1.20h | |
| step 7113 NaN/Inf grad -> skip; scale=1048576 (nan#23) | |
| step 7114 NaN/Inf grad -> skip; scale=524288 (nan#24) | |
| step 7120 loss=1.6144 lr=5.92e-04 scale=524288 cv=0.533 tok=58138624 tok/s=16811 elapsed=1.21h | |
| step 7140 loss=1.1928 lr=5.91e-04 scale=524288 cv=0.607 tok=58302464 tok/s=16798 elapsed=1.21h | |
| step 7160 loss=1.4724 lr=5.91e-04 scale=524288 cv=0.592 tok=58466304 tok/s=16723 elapsed=1.21h | |
| step 7180 loss=1.7312 lr=5.91e-04 scale=524288 cv=0.742 tok=58630144 tok/s=16673 elapsed=1.22h | |
| step 7200 loss=0.9075 lr=5.91e-04 scale=524288 cv=0.547 tok=58793984 tok/s=16635 elapsed=1.22h | |
| step 7220 loss=0.7870 lr=5.91e-04 scale=524288 cv=0.575 tok=58957824 tok/s=16619 elapsed=1.22h | |
| step 7240 loss=1.6552 lr=5.91e-04 scale=524288 cv=0.682 tok=59121664 tok/s=16625 elapsed=1.22h | |
| step 7260 loss=1.6407 lr=5.91e-04 scale=524288 cv=0.737 tok=59285504 tok/s=17001 elapsed=1.23h | |
| step 7280 loss=1.9146 lr=5.91e-04 scale=524288 cv=0.719 tok=59449344 tok/s=16785 elapsed=1.23h | |
| step 7300 loss=1.7535 lr=5.91e-04 scale=524288 cv=0.670 tok=59613184 tok/s=16998 elapsed=1.23h | |
| step 7320 loss=0.8155 lr=5.91e-04 scale=1048576 cv=0.694 tok=59777024 tok/s=16979 elapsed=1.24h | |
| step 7340 loss=1.6280 lr=5.91e-04 scale=1048576 cv=0.591 tok=59940864 tok/s=16824 elapsed=1.24h | |
| step 7360 loss=1.2476 lr=5.91e-04 scale=1048576 cv=0.541 tok=60104704 tok/s=16754 elapsed=1.24h | |
| step 7380 loss=1.0191 lr=5.91e-04 scale=1048576 cv=0.548 tok=60268544 tok/s=16806 elapsed=1.24h | |
| step 7400 loss=1.2108 lr=5.91e-04 scale=1048576 cv=0.568 tok=60432384 tok/s=16731 elapsed=1.25h | |
| step 7420 loss=1.6435 lr=5.91e-04 scale=1048576 cv=0.676 tok=60596224 tok/s=16578 elapsed=1.25h | |
| step 7440 loss=1.6614 lr=5.91e-04 scale=1048576 cv=0.525 tok=60760064 tok/s=16568 elapsed=1.25h | |
| step 7460 loss=0.9850 lr=5.91e-04 scale=1048576 cv=0.688 tok=60923904 tok/s=16574 elapsed=1.26h | |
| step 7480 loss=1.4782 lr=5.91e-04 scale=1048576 cv=0.465 tok=61087744 tok/s=16570 elapsed=1.26h | |
| step 7500 loss=1.0806 lr=5.91e-04 scale=1048576 cv=0.572 tok=61251584 tok/s=16477 elapsed=1.26h | |
| step 7520 loss=1.2949 lr=5.90e-04 scale=2097152 cv=0.639 tok=61415424 tok/s=16467 elapsed=1.26h | |
| step 7540 loss=0.9820 lr=5.90e-04 scale=2097152 cv=0.636 tok=61579264 tok/s=16556 elapsed=1.27h | |
| step 7560 loss=1.8247 lr=5.90e-04 scale=2097152 cv=0.553 tok=61743104 tok/s=16634 elapsed=1.27h | |
| step 7580 loss=1.1407 lr=5.90e-04 scale=2097152 cv=0.475 tok=61906944 tok/s=16632 elapsed=1.27h | |
| step 7600 loss=0.8767 lr=5.90e-04 scale=2097152 cv=0.647 tok=62070784 tok/s=16545 elapsed=1.28h | |
| step 7620 loss=1.2217 lr=5.90e-04 scale=2097152 cv=0.751 tok=62234624 tok/s=16461 elapsed=1.28h | |
| step 7640 loss=1.7342 lr=5.90e-04 scale=2097152 cv=0.628 tok=62398464 tok/s=16504 elapsed=1.28h | |
| step 7660 loss=1.0999 lr=5.90e-04 scale=2097152 cv=0.563 tok=62562304 tok/s=16456 elapsed=1.28h | |
| step 7680 loss=1.3024 lr=5.90e-04 scale=2097152 cv=0.533 tok=62726144 tok/s=16431 elapsed=1.29h | |
| step 7700 loss=0.7318 lr=5.90e-04 scale=2097152 cv=0.565 tok=62889984 tok/s=16451 elapsed=1.29h | |
| step 7720 loss=0.6104 lr=5.90e-04 scale=4194304 cv=0.535 tok=63053824 tok/s=16376 elapsed=1.29h | |
| step 7724 NaN/Inf grad -> skip; scale=2097152 (nan#25) | |
| step 7740 loss=1.8154 lr=5.90e-04 scale=2097152 cv=0.670 tok=63209472 tok/s=16691 elapsed=1.30h | |
| step 7760 loss=1.5778 lr=5.90e-04 scale=2097152 cv=0.579 tok=63373312 tok/s=16433 elapsed=1.30h | |
| step 7780 loss=1.2424 lr=5.90e-04 scale=2097152 cv=0.593 tok=63537152 tok/s=16426 elapsed=1.30h | |
| step 7800 loss=1.8056 lr=5.90e-04 scale=2097152 cv=0.636 tok=63700992 tok/s=16576 elapsed=1.31h | |
| step 7820 loss=1.7552 lr=5.90e-04 scale=2097152 cv=0.533 tok=63864832 tok/s=16550 elapsed=1.31h | |
| step 7840 loss=1.5473 lr=5.90e-04 scale=2097152 cv=0.596 tok=64028672 tok/s=16374 elapsed=1.31h | |
| step 7860 loss=0.5448 lr=5.90e-04 scale=2097152 cv=0.627 tok=64192512 tok/s=16461 elapsed=1.31h | |
| step 7880 loss=0.6113 lr=5.89e-04 scale=2097152 cv=0.544 tok=64356352 tok/s=16408 elapsed=1.32h | |
| step 7900 loss=1.5997 lr=5.89e-04 scale=2097152 cv=0.652 tok=64520192 tok/s=16623 elapsed=1.32h | |
| step 7920 loss=0.2932 lr=5.89e-04 scale=2097152 cv=0.525 tok=64684032 tok/s=16403 elapsed=1.32h | |
| step 7940 loss=1.3015 lr=5.89e-04 scale=4194304 cv=0.534 tok=64847872 tok/s=16377 elapsed=1.33h | |
| step 7955 NaN/Inf grad -> skip; scale=2097152 (nan#26) | |
| step 7960 loss=1.7828 lr=5.89e-04 scale=2097152 cv=0.594 tok=65003520 tok/s=16360 elapsed=1.33h | |
| step 7980 loss=0.2883 lr=5.89e-04 scale=2097152 cv=0.563 tok=65167360 tok/s=16457 elapsed=1.33h | |
| step 8000 loss=1.6318 lr=5.89e-04 scale=2097152 cv=0.676 tok=65331200 tok/s=16462 elapsed=1.33h | |
| step 8020 loss=0.4600 lr=5.89e-04 scale=2097152 cv=0.745 tok=65495040 tok/s=16624 elapsed=1.35h | |
| step 8040 loss=1.2574 lr=5.89e-04 scale=2097152 cv=0.585 tok=65658880 tok/s=16618 elapsed=1.36h | |
| step 8060 loss=0.9544 lr=5.89e-04 scale=2097152 cv=0.743 tok=65822720 tok/s=16630 elapsed=1.36h | |
| step 8080 loss=1.4552 lr=5.89e-04 scale=2097152 cv=0.571 tok=65986560 tok/s=16662 elapsed=1.36h | |
| step 8100 loss=1.6482 lr=5.89e-04 scale=2097152 cv=0.596 tok=66150400 tok/s=16608 elapsed=1.36h | |
| step 8120 loss=0.8214 lr=5.89e-04 scale=2097152 cv=0.544 tok=66314240 tok/s=16492 elapsed=1.37h | |
| step 8140 loss=1.6445 lr=5.89e-04 scale=2097152 cv=0.573 tok=66478080 tok/s=16508 elapsed=1.37h | |
| step 8160 loss=1.8032 lr=5.89e-04 scale=4194304 cv=0.655 tok=66641920 tok/s=16456 elapsed=1.37h | |
| step 8180 loss=1.2268 lr=5.89e-04 scale=4194304 cv=0.767 tok=66805760 tok/s=16474 elapsed=1.38h | |
| step 8193 NaN/Inf grad -> skip; scale=2097152 (nan#27) | |
| step 8200 loss=1.3198 lr=5.89e-04 scale=2097152 cv=0.513 tok=66961408 tok/s=16478 elapsed=1.38h | |
| step 8220 loss=1.8123 lr=5.88e-04 scale=2097152 cv=0.587 tok=67125248 tok/s=16445 elapsed=1.38h | |
| step 8240 loss=1.1084 lr=5.88e-04 scale=2097152 cv=0.629 tok=67289088 tok/s=16455 elapsed=1.38h | |
| step 8260 loss=1.2188 lr=5.88e-04 scale=2097152 cv=0.549 tok=67452928 tok/s=16454 elapsed=1.39h | |
| step 8280 loss=0.4074 lr=5.88e-04 scale=2097152 cv=0.556 tok=67616768 tok/s=16594 elapsed=1.39h | |
| step 8300 loss=1.4657 lr=5.88e-04 scale=2097152 cv=0.558 tok=67780608 tok/s=16573 elapsed=1.39h | |
| step 8320 loss=0.9942 lr=5.88e-04 scale=2097152 cv=0.715 tok=67944448 tok/s=16474 elapsed=1.40h | |
| step 8340 loss=1.1377 lr=5.88e-04 scale=2097152 cv=0.588 tok=68108288 tok/s=16467 elapsed=1.40h | |
| step 8360 loss=1.4281 lr=5.88e-04 scale=2097152 cv=0.588 tok=68272128 tok/s=16494 elapsed=1.40h | |
| step 8380 loss=0.5599 lr=5.88e-04 scale=2097152 cv=0.557 tok=68435968 tok/s=16399 elapsed=1.41h | |
| step 8400 loss=1.8262 lr=5.88e-04 scale=4194304 cv=0.606 tok=68599808 tok/s=16415 elapsed=1.41h | |
| step 8418 NaN/Inf grad -> skip; scale=2097152 (nan#28) | |
| step 8420 loss=0.7837 lr=5.88e-04 scale=2097152 cv=0.540 tok=68755456 tok/s=16747 elapsed=1.41h | |
| step 8440 loss=0.7820 lr=5.88e-04 scale=2097152 cv=0.551 tok=68919296 tok/s=17095 elapsed=1.41h | |
| step 8460 loss=1.3485 lr=5.88e-04 scale=2097152 cv=0.578 tok=69083136 tok/s=17041 elapsed=1.42h | |
| step 8480 loss=1.8501 lr=5.88e-04 scale=2097152 cv=0.603 tok=69246976 tok/s=16902 elapsed=1.42h | |
| step 8500 loss=1.3549 lr=5.88e-04 scale=2097152 cv=0.494 tok=69410816 tok/s=16527 elapsed=1.42h | |
| step 8520 loss=0.6406 lr=5.88e-04 scale=2097152 cv=0.493 tok=69574656 tok/s=16494 elapsed=1.43h | |
| step 8540 loss=1.0950 lr=5.87e-04 scale=2097152 cv=0.493 tok=69738496 tok/s=16582 elapsed=1.43h | |
| step 8560 loss=1.6099 lr=5.87e-04 scale=2097152 cv=0.552 tok=69902336 tok/s=16470 elapsed=1.43h | |
| step 8580 loss=1.0175 lr=5.87e-04 scale=2097152 cv=0.544 tok=70066176 tok/s=16663 elapsed=1.43h | |
| step 8600 loss=1.7091 lr=5.87e-04 scale=2097152 cv=0.503 tok=70230016 tok/s=16919 elapsed=1.44h | |
| step 8620 loss=1.6913 lr=5.87e-04 scale=4194304 cv=0.563 tok=70393856 tok/s=16857 elapsed=1.44h | |
| step 8640 loss=1.8543 lr=5.87e-04 scale=4194304 cv=0.540 tok=70557696 tok/s=16818 elapsed=1.44h | |
| step 8660 loss=1.7198 lr=5.87e-04 scale=4194304 cv=0.491 tok=70721536 tok/s=16432 elapsed=1.45h | |
| step 8680 loss=1.4343 lr=5.87e-04 scale=4194304 cv=0.568 tok=70885376 tok/s=16367 elapsed=1.45h | |
| step 8684 NaN/Inf grad -> skip; scale=2097152 (nan#29) | |
| step 8700 loss=1.5114 lr=5.87e-04 scale=2097152 cv=0.575 tok=71041024 tok/s=16336 elapsed=1.45h | |
| step 8720 loss=1.7519 lr=5.87e-04 scale=2097152 cv=0.744 tok=71204864 tok/s=16403 elapsed=1.45h | |
| step 8740 loss=1.5714 lr=5.87e-04 scale=2097152 cv=0.581 tok=71368704 tok/s=16404 elapsed=1.46h | |
| step 8760 loss=0.7839 lr=5.87e-04 scale=2097152 cv=0.561 tok=71532544 tok/s=16374 elapsed=1.46h | |
| step 8780 loss=1.2667 lr=5.87e-04 scale=2097152 cv=0.551 tok=71696384 tok/s=16432 elapsed=1.46h | |
| step 8800 loss=1.1358 lr=5.87e-04 scale=2097152 cv=0.576 tok=71860224 tok/s=16575 elapsed=1.47h | |
| step 8820 loss=1.1594 lr=5.87e-04 scale=2097152 cv=0.529 tok=72024064 tok/s=16502 elapsed=1.47h | |
| step 8840 loss=1.2850 lr=5.87e-04 scale=2097152 cv=0.583 tok=72187904 tok/s=16307 elapsed=1.47h | |
| step 8860 loss=1.5485 lr=5.86e-04 scale=2097152 cv=0.534 tok=72351744 tok/s=16378 elapsed=1.48h | |
| step 8880 loss=1.2475 lr=5.86e-04 scale=2097152 cv=0.570 tok=72515584 tok/s=16385 elapsed=1.48h | |
| step 8885 NaN/Inf grad -> skip; scale=2097152 (nan#30) | |
| step 8900 loss=1.0925 lr=5.86e-04 scale=2097152 cv=0.529 tok=72671232 tok/s=16549 elapsed=1.48h | |
| step 8920 loss=0.6097 lr=5.86e-04 scale=2097152 cv=0.589 tok=72835072 tok/s=16519 elapsed=1.48h | |
| step 8940 loss=1.5132 lr=5.86e-04 scale=2097152 cv=0.579 tok=72998912 tok/s=16692 elapsed=1.49h | |
| step 8960 loss=1.6881 lr=5.86e-04 scale=2097152 cv=0.584 tok=73162752 tok/s=16677 elapsed=1.49h | |
| step 8980 loss=1.8213 lr=5.86e-04 scale=2097152 cv=0.520 tok=73326592 tok/s=16291 elapsed=1.49h | |
| step 9000 loss=1.6874 lr=5.86e-04 scale=2097152 cv=0.455 tok=73490432 tok/s=16281 elapsed=1.50h | |
| step 9020 loss=0.8240 lr=5.86e-04 scale=2097152 cv=0.428 tok=73654272 tok/s=16315 elapsed=1.51h | |
| step 9040 loss=1.5216 lr=5.86e-04 scale=2097152 cv=0.474 tok=73818112 tok/s=16469 elapsed=1.51h | |
| step 9060 loss=1.3298 lr=5.86e-04 scale=2097152 cv=0.540 tok=73981952 tok/s=16574 elapsed=1.52h | |
| step 9080 loss=1.7566 lr=5.86e-04 scale=2097152 cv=0.532 tok=74145792 tok/s=16124 elapsed=1.52h | |
| step 9092 NaN/Inf grad -> skip; scale=2097152 (nan#31) | |
| step 9100 loss=1.6177 lr=5.86e-04 scale=2097152 cv=0.525 tok=74301440 tok/s=13721 elapsed=1.52h | |
| step 9120 loss=1.8137 lr=5.86e-04 scale=2097152 cv=0.535 tok=74465280 tok/s=12940 elapsed=1.53h | |
| step 9140 loss=1.7336 lr=5.86e-04 scale=2097152 cv=0.451 tok=74629120 tok/s=12863 elapsed=1.53h | |
| step 9160 loss=1.6416 lr=5.86e-04 scale=2097152 cv=0.487 tok=74792960 tok/s=12792 elapsed=1.53h | |
| step 9180 loss=1.3193 lr=5.85e-04 scale=2097152 cv=0.398 tok=74956800 tok/s=12838 elapsed=1.54h | |
| step 9200 loss=1.1421 lr=5.85e-04 scale=2097152 cv=0.436 tok=75120640 tok/s=12666 elapsed=1.54h | |
| step 9220 loss=0.4568 lr=5.85e-04 scale=2097152 cv=0.471 tok=75284480 tok/s=8135 elapsed=1.55h | |
| step 9240 loss=1.0312 lr=5.85e-04 scale=2097152 cv=0.503 tok=75448320 tok/s=7229 elapsed=1.55h | |
| step 9260 loss=1.3888 lr=5.85e-04 scale=2097152 cv=0.569 tok=75612160 tok/s=7211 elapsed=1.56h | |
| step 9280 loss=0.8275 lr=5.85e-04 scale=2097152 cv=0.629 tok=75776000 tok/s=7061 elapsed=1.57h | |
| step 9300 NaN/Inf grad -> skip; scale=2097152 (nan#32) | |
| step 9320 loss=1.1206 lr=5.85e-04 scale=2097152 cv=0.489 tok=76095488 tok/s=6887 elapsed=1.58h | |
| step 9340 loss=1.6210 lr=5.85e-04 scale=2097152 cv=0.694 tok=76259328 tok/s=7301 elapsed=1.59h | |
| step 9360 loss=1.0556 lr=5.85e-04 scale=2097152 cv=0.587 tok=76423168 tok/s=7250 elapsed=1.59h | |
| step 9380 loss=1.1282 lr=5.85e-04 scale=2097152 cv=0.728 tok=76587008 tok/s=7185 elapsed=1.60h | |
| step 9400 loss=0.3135 lr=5.85e-04 scale=2097152 cv=0.515 tok=76750848 tok/s=7171 elapsed=1.61h | |
| step 9420 loss=1.4276 lr=5.85e-04 scale=2097152 cv=0.536 tok=76914688 tok/s=7297 elapsed=1.61h | |
| step 9440 loss=0.9835 lr=5.85e-04 scale=2097152 cv=0.439 tok=77078528 tok/s=7369 elapsed=1.62h | |
| step 9460 loss=1.7269 lr=5.84e-04 scale=2097152 cv=0.524 tok=77242368 tok/s=7238 elapsed=1.63h | |
| step 9480 loss=0.7757 lr=5.84e-04 scale=2097152 cv=0.553 tok=77406208 tok/s=7161 elapsed=1.63h | |
| step 9500 loss=0.1707 lr=5.84e-04 scale=4194304 cv=0.538 tok=77570048 tok/s=7114 elapsed=1.64h | |
| step 9519 NaN/Inf grad -> skip; scale=2097152 (nan#33) | |
| step 9520 loss=1.2550 lr=5.84e-04 scale=2097152 cv=0.449 tok=77725696 tok/s=7112 elapsed=1.65h | |
| step 9522 NaN/Inf grad -> skip; scale=1048576 (nan#34) | |
| step 9540 loss=1.0698 lr=5.84e-04 scale=1048576 cv=0.528 tok=77881344 tok/s=7314 elapsed=1.65h | |
| step 9560 loss=0.2440 lr=5.84e-04 scale=1048576 cv=0.701 tok=78045184 tok/s=7085 elapsed=1.66h | |
| step 9580 loss=0.1733 lr=5.84e-04 scale=1048576 cv=0.584 tok=78209024 tok/s=7031 elapsed=1.67h | |
| step 9600 loss=1.3030 lr=5.84e-04 scale=1048576 cv=0.592 tok=78372864 tok/s=7171 elapsed=1.67h | |
| step 9620 loss=1.6193 lr=5.84e-04 scale=1048576 cv=0.533 tok=78536704 tok/s=7305 elapsed=1.68h | |
| step 9640 loss=1.6419 lr=5.84e-04 scale=1048576 cv=0.526 tok=78700544 tok/s=7277 elapsed=1.68h | |
| step 9660 loss=1.6800 lr=5.84e-04 scale=1048576 cv=0.483 tok=78864384 tok/s=7385 elapsed=1.69h | |
| step 9680 loss=1.6986 lr=5.84e-04 scale=1048576 cv=0.455 tok=79028224 tok/s=7262 elapsed=1.70h | |
| step 9700 loss=1.7708 lr=5.84e-04 scale=1048576 cv=0.493 tok=79192064 tok/s=7257 elapsed=1.70h | |
| step 9720 loss=1.5096 lr=5.84e-04 scale=1048576 cv=0.497 tok=79355904 tok/s=7234 elapsed=1.71h | |
| step 9740 loss=1.7778 lr=5.84e-04 scale=2097152 cv=0.496 tok=79519744 tok/s=7057 elapsed=1.72h | |
| step 9760 loss=1.3951 lr=5.83e-04 scale=2097152 cv=0.446 tok=79683584 tok/s=7197 elapsed=1.72h | |
| step 9780 loss=1.2269 lr=5.83e-04 scale=2097152 cv=0.446 tok=79847424 tok/s=7259 elapsed=1.73h | |
| step 9800 loss=0.9745 lr=5.83e-04 scale=2097152 cv=0.479 tok=80011264 tok/s=7226 elapsed=1.74h | |
| step 9820 loss=1.6561 lr=5.83e-04 scale=2097152 cv=0.523 tok=80175104 tok/s=7000 elapsed=1.74h | |
| step 9840 loss=1.3589 lr=5.83e-04 scale=2097152 cv=0.520 tok=80338944 tok/s=7954 elapsed=1.75h | |
| step 9860 loss=0.5861 lr=5.83e-04 scale=2097152 cv=0.577 tok=80502784 tok/s=9022 elapsed=1.75h | |
| step 9880 loss=0.3592 lr=5.83e-04 scale=2097152 cv=0.571 tok=80666624 tok/s=9010 elapsed=1.76h | |
| step 9900 loss=1.3557 lr=5.83e-04 scale=2097152 cv=0.529 tok=80830464 tok/s=9106 elapsed=1.76h | |
| step 9920 loss=0.7722 lr=5.83e-04 scale=2097152 cv=0.508 tok=80994304 tok/s=9082 elapsed=1.77h | |
| step 9939 NaN/Inf grad -> skip; scale=2097152 (nan#35) | |
| step 9940 loss=0.8513 lr=5.83e-04 scale=2097152 cv=0.563 tok=81149952 tok/s=9210 elapsed=1.77h | |
| step 9960 loss=1.0958 lr=5.83e-04 scale=2097152 cv=0.555 tok=81313792 tok/s=9224 elapsed=1.78h | |
| step 9980 loss=1.6298 lr=5.83e-04 scale=2097152 cv=0.561 tok=81477632 tok/s=9069 elapsed=1.79h | |
| step 10000 loss=1.2966 lr=5.83e-04 scale=2097152 cv=0.525 tok=81641472 tok/s=9149 elapsed=1.79h | |
| step 10020 loss=1.4403 lr=5.83e-04 scale=2097152 cv=0.497 tok=81805312 tok/s=9154 elapsed=1.81h | |
| step 10040 loss=1.6131 lr=5.82e-04 scale=2097152 cv=0.520 tok=81969152 tok/s=8961 elapsed=1.81h | |
| step 10060 loss=1.5480 lr=5.82e-04 scale=2097152 cv=0.525 tok=82132992 tok/s=8802 elapsed=1.82h | |
| step 10000 loss=0.7432 lr=5.83e-04 scale=16384 cv=0.646 tok=81928192 tok/s=4531 elapsed=0.01h | |
| step 10020 loss=0.8048 lr=5.83e-04 scale=16384 cv=0.537 tok=82092032 tok/s=9244 elapsed=0.04h | |
| step 10040 loss=1.1227 lr=5.82e-04 scale=16384 cv=0.538 tok=82255872 tok/s=9147 elapsed=0.04h | |
| step 10060 loss=1.3650 lr=5.82e-04 scale=16384 cv=0.548 tok=82419712 tok/s=9135 elapsed=0.05h | |
| step 10080 loss=1.7216 lr=5.82e-04 scale=16384 cv=0.510 tok=82583552 tok/s=8933 elapsed=0.05h | |
| step 10100 loss=1.1475 lr=5.82e-04 scale=16384 cv=0.568 tok=82747392 tok/s=9011 elapsed=0.06h | |
| step 10120 loss=1.5547 lr=5.82e-04 scale=16384 cv=0.497 tok=82911232 tok/s=8994 elapsed=0.06h | |
| step 10140 loss=1.1809 lr=5.82e-04 scale=16384 cv=0.456 tok=83075072 tok/s=8991 elapsed=0.07h | |
| step 10160 loss=1.4797 lr=5.82e-04 scale=16384 cv=0.455 tok=83238912 tok/s=8964 elapsed=0.07h | |
| step 10180 loss=0.9072 lr=5.82e-04 scale=16384 cv=0.433 tok=83402752 tok/s=8872 elapsed=0.08h | |
| step 10200 loss=1.8668 lr=5.82e-04 scale=32768 cv=0.556 tok=83566592 tok/s=8841 elapsed=0.08h | |
| step 10220 loss=1.4887 lr=5.82e-04 scale=32768 cv=0.459 tok=83730432 tok/s=8826 elapsed=0.09h | |
| step 10240 loss=0.7891 lr=5.82e-04 scale=32768 cv=0.479 tok=83894272 tok/s=8915 elapsed=0.09h | |
| step 10260 loss=0.2157 lr=5.82e-04 scale=32768 cv=0.524 tok=84058112 tok/s=9055 elapsed=0.10h | |
| step 10280 loss=0.3618 lr=5.82e-04 scale=32768 cv=0.547 tok=84221952 tok/s=9102 elapsed=0.10h | |
| step 10300 loss=0.9939 lr=5.81e-04 scale=32768 cv=0.559 tok=84385792 tok/s=9191 elapsed=0.11h | |
| step 10320 loss=1.7795 lr=5.81e-04 scale=32768 cv=0.535 tok=84549632 tok/s=9212 elapsed=0.11h | |
| step 10340 loss=1.4053 lr=5.81e-04 scale=32768 cv=0.462 tok=84713472 tok/s=9240 elapsed=0.12h | |
| step 10360 loss=0.8786 lr=5.81e-04 scale=32768 cv=0.630 tok=84877312 tok/s=9226 elapsed=0.12h | |
| step 10380 loss=1.3201 lr=5.81e-04 scale=32768 cv=0.475 tok=85041152 tok/s=9216 elapsed=0.13h | |
| step 10400 loss=0.6458 lr=5.81e-04 scale=65536 cv=0.520 tok=85204992 tok/s=9243 elapsed=0.13h | |
| step 10420 loss=1.5067 lr=5.81e-04 scale=65536 cv=0.536 tok=85368832 tok/s=9152 elapsed=0.14h | |
| step 10440 loss=1.6553 lr=5.81e-04 scale=65536 cv=0.536 tok=85532672 tok/s=9032 elapsed=0.14h | |
| step 10460 loss=1.1901 lr=5.81e-04 scale=65536 cv=0.470 tok=85696512 tok/s=9082 elapsed=0.15h | |
| step 10480 loss=1.1357 lr=5.81e-04 scale=65536 cv=0.457 tok=85860352 tok/s=9170 elapsed=0.15h | |
| step 10500 loss=1.1085 lr=5.81e-04 scale=65536 cv=0.519 tok=86024192 tok/s=9260 elapsed=0.16h | |
| step 10520 loss=1.5448 lr=5.81e-04 scale=65536 cv=0.505 tok=86188032 tok/s=9224 elapsed=0.16h | |
| step 10540 loss=1.7515 lr=5.81e-04 scale=65536 cv=0.518 tok=86351872 tok/s=9205 elapsed=0.17h | |
| step 10560 loss=1.7361 lr=5.81e-04 scale=65536 cv=0.575 tok=86515712 tok/s=9237 elapsed=0.17h | |
| step 10580 loss=1.6343 lr=5.80e-04 scale=65536 cv=0.597 tok=86679552 tok/s=9215 elapsed=0.18h | |
| step 10600 loss=1.7062 lr=5.80e-04 scale=65536 cv=0.529 tok=86843392 tok/s=9214 elapsed=0.18h | |
| step 10620 loss=1.7030 lr=5.80e-04 scale=65536 cv=0.487 tok=87007232 tok/s=8949 elapsed=0.19h | |
| step 10640 loss=0.9366 lr=5.80e-04 scale=65536 cv=0.506 tok=87171072 tok/s=8971 elapsed=0.19h | |
| step 10660 loss=1.5163 lr=5.80e-04 scale=65536 cv=0.467 tok=87334912 tok/s=8966 elapsed=0.20h | |
| step 10680 loss=1.6619 lr=5.80e-04 scale=65536 cv=0.541 tok=87498752 tok/s=8836 elapsed=0.20h | |
| step 10700 loss=1.6309 lr=5.80e-04 scale=65536 cv=0.433 tok=87662592 tok/s=8952 elapsed=0.21h | |
| step 10720 loss=0.5845 lr=5.80e-04 scale=65536 cv=0.469 tok=87826432 tok/s=8910 elapsed=0.22h | |
| step 10740 loss=1.4821 lr=5.80e-04 scale=65536 cv=0.514 tok=87990272 tok/s=9110 elapsed=0.22h | |
| step 10760 loss=0.1896 lr=5.80e-04 scale=65536 cv=0.691 tok=88154112 tok/s=9108 elapsed=0.23h | |
| step 10780 loss=1.3913 lr=5.80e-04 scale=65536 cv=0.571 tok=88317952 tok/s=5889 elapsed=0.23h | |
| step 10800 loss=1.0406 lr=5.80e-04 scale=65536 cv=0.512 tok=88481792 tok/s=5952 elapsed=0.24h | |
| step 10820 loss=0.9801 lr=5.79e-04 scale=65536 cv=0.514 tok=88645632 tok/s=5823 elapsed=0.25h | |
| step 10840 loss=1.4336 lr=5.79e-04 scale=65536 cv=0.508 tok=88809472 tok/s=5891 elapsed=0.26h | |
| step 10860 loss=0.7729 lr=5.79e-04 scale=65536 cv=0.531 tok=88973312 tok/s=5738 elapsed=0.27h | |
| step 10880 loss=1.4084 lr=5.79e-04 scale=65536 cv=0.481 tok=89137152 tok/s=5652 elapsed=0.27h | |
| step 10900 loss=1.6509 lr=5.79e-04 scale=65536 cv=0.492 tok=89300992 tok/s=6196 elapsed=0.28h | |
| step 10920 loss=1.7001 lr=5.79e-04 scale=65536 cv=0.521 tok=89464832 tok/s=6108 elapsed=0.29h | |
| step 10940 loss=1.6633 lr=5.79e-04 scale=65536 cv=0.490 tok=89628672 tok/s=5872 elapsed=0.30h | |
| step 10960 loss=1.1149 lr=5.79e-04 scale=65536 cv=0.493 tok=89792512 tok/s=5934 elapsed=0.30h | |
| step 10980 loss=0.2935 lr=5.79e-04 scale=65536 cv=0.635 tok=89956352 tok/s=11719 elapsed=0.31h | |
| step 11000 loss=0.6276 lr=5.79e-04 scale=65536 cv=0.531 tok=90120192 tok/s=16447 elapsed=0.31h | |
| step 11020 loss=1.5283 lr=5.79e-04 scale=65536 cv=0.541 tok=90284032 tok/s=16772 elapsed=0.33h | |
| step 11040 loss=1.1286 lr=5.79e-04 scale=65536 cv=0.485 tok=90447872 tok/s=16403 elapsed=0.33h | |
| step 11060 loss=1.5895 lr=5.79e-04 scale=65536 cv=0.551 tok=90611712 tok/s=16450 elapsed=0.33h | |
| step 11080 loss=1.8484 lr=5.78e-04 scale=65536 cv=0.516 tok=90775552 tok/s=16460 elapsed=0.34h | |
| step 11100 loss=1.6026 lr=5.78e-04 scale=65536 cv=0.493 tok=90939392 tok/s=16468 elapsed=0.34h | |
| step 11120 loss=0.9718 lr=5.78e-04 scale=65536 cv=0.453 tok=91103232 tok/s=16799 elapsed=0.34h | |
| step 11140 loss=1.2783 lr=5.78e-04 scale=65536 cv=0.487 tok=91267072 tok/s=16840 elapsed=0.34h | |
| step 11160 loss=1.4285 lr=5.78e-04 scale=65536 cv=0.570 tok=91430912 tok/s=16749 elapsed=0.35h | |
| step 11180 loss=1.1860 lr=5.78e-04 scale=65536 cv=0.448 tok=91594752 tok/s=16767 elapsed=0.35h | |
| step 11200 loss=0.7243 lr=5.78e-04 scale=65536 cv=0.559 tok=91758592 tok/s=16756 elapsed=0.35h | |
| step 11220 loss=0.9780 lr=5.78e-04 scale=65536 cv=0.559 tok=91922432 tok/s=16209 elapsed=0.36h | |
| step 11240 loss=1.0562 lr=5.78e-04 scale=65536 cv=0.643 tok=92086272 tok/s=16565 elapsed=0.36h | |
| step 11260 loss=1.5164 lr=5.78e-04 scale=65536 cv=0.510 tok=92250112 tok/s=16374 elapsed=0.36h | |
| step 11280 loss=1.4349 lr=5.78e-04 scale=65536 cv=0.545 tok=92413952 tok/s=16517 elapsed=0.36h | |
| step 11300 loss=1.8893 lr=5.78e-04 scale=65536 cv=0.501 tok=92577792 tok/s=16449 elapsed=0.37h | |
| step 11320 loss=1.6908 lr=5.77e-04 scale=65536 cv=0.452 tok=92741632 tok/s=16455 elapsed=0.37h | |
| step 11340 loss=1.5965 lr=5.77e-04 scale=65536 cv=0.529 tok=92905472 tok/s=16548 elapsed=0.37h | |
| step 11360 loss=1.2182 lr=5.77e-04 scale=65536 cv=0.500 tok=93069312 tok/s=16757 elapsed=0.38h | |
| step 11380 loss=1.4860 lr=5.77e-04 scale=65536 cv=0.481 tok=93233152 tok/s=16715 elapsed=0.38h | |
| step 11400 loss=1.2017 lr=5.77e-04 scale=65536 cv=0.532 tok=93396992 tok/s=16603 elapsed=0.38h | |
| step 11420 loss=1.3912 lr=5.77e-04 scale=65536 cv=0.420 tok=93560832 tok/s=16398 elapsed=0.38h | |
| step 11440 loss=0.9442 lr=5.77e-04 scale=65536 cv=0.622 tok=93724672 tok/s=16375 elapsed=0.39h | |
| step 11460 loss=1.3662 lr=5.77e-04 scale=65536 cv=0.451 tok=93888512 tok/s=16147 elapsed=0.39h | |
| step 11480 loss=0.4979 lr=5.77e-04 scale=65536 cv=0.536 tok=94052352 tok/s=16363 elapsed=0.39h | |
| step 11500 loss=1.5621 lr=5.77e-04 scale=65536 cv=0.552 tok=94216192 tok/s=16416 elapsed=0.40h | |
| step 11520 loss=1.8529 lr=5.77e-04 scale=65536 cv=0.520 tok=94380032 tok/s=16457 elapsed=0.40h | |
| step 11540 loss=1.6867 lr=5.77e-04 scale=65536 cv=0.483 tok=94543872 tok/s=16391 elapsed=0.40h | |
| step 11560 loss=1.3196 lr=5.76e-04 scale=65536 cv=0.509 tok=94707712 tok/s=16441 elapsed=0.40h | |
| step 11580 loss=1.6195 lr=5.76e-04 scale=65536 cv=0.535 tok=94871552 tok/s=16565 elapsed=0.41h | |
| step 11600 loss=1.8275 lr=5.76e-04 scale=65536 cv=0.497 tok=95035392 tok/s=16537 elapsed=0.41h | |
| step 11620 loss=1.6246 lr=5.76e-04 scale=65536 cv=0.463 tok=95199232 tok/s=16510 elapsed=0.41h | |
| step 11640 loss=1.3967 lr=5.76e-04 scale=65536 cv=0.470 tok=95363072 tok/s=16546 elapsed=0.42h | |
| step 11660 loss=1.3543 lr=5.76e-04 scale=65536 cv=0.485 tok=95526912 tok/s=16499 elapsed=0.42h | |
| step 11680 loss=0.7404 lr=5.76e-04 scale=65536 cv=0.526 tok=95690752 tok/s=16430 elapsed=0.42h | |
| step 11700 loss=0.8733 lr=5.76e-04 scale=65536 cv=0.526 tok=95854592 tok/s=16486 elapsed=0.43h | |
| step 11720 loss=1.4103 lr=5.76e-04 scale=65536 cv=0.506 tok=96018432 tok/s=16207 elapsed=0.43h | |
| step 11740 loss=1.5191 lr=5.76e-04 scale=65536 cv=0.539 tok=96182272 tok/s=16435 elapsed=0.43h | |
| step 11760 loss=1.7497 lr=5.76e-04 scale=65536 cv=0.514 tok=96346112 tok/s=16547 elapsed=0.43h | |
| step 11780 loss=1.6638 lr=5.76e-04 scale=65536 cv=0.520 tok=96509952 tok/s=16642 elapsed=0.44h | |
| step 11800 loss=1.1302 lr=5.75e-04 scale=65536 cv=0.455 tok=96673792 tok/s=16454 elapsed=0.44h | |
| step 11820 loss=1.1012 lr=5.75e-04 scale=65536 cv=0.484 tok=96837632 tok/s=16413 elapsed=0.44h | |
| step 11840 loss=0.4337 lr=5.75e-04 scale=65536 cv=0.477 tok=97001472 tok/s=16467 elapsed=0.45h | |
| step 11860 loss=1.4752 lr=5.75e-04 scale=65536 cv=0.527 tok=97165312 tok/s=16492 elapsed=0.45h | |
| step 11880 loss=1.6408 lr=5.75e-04 scale=65536 cv=0.530 tok=97329152 tok/s=16561 elapsed=0.45h | |
| step 11900 loss=1.4488 lr=5.75e-04 scale=65536 cv=0.594 tok=97492992 tok/s=16610 elapsed=0.45h | |
| step 11920 loss=1.3336 lr=5.75e-04 scale=65536 cv=0.540 tok=97656832 tok/s=16525 elapsed=0.46h | |
| step 11940 loss=0.7721 lr=5.75e-04 scale=65536 cv=0.513 tok=97820672 tok/s=16461 elapsed=0.46h | |
| step 11960 loss=1.2003 lr=5.75e-04 scale=65536 cv=0.608 tok=97984512 tok/s=16472 elapsed=0.46h | |
| step 11980 loss=1.6350 lr=5.75e-04 scale=65536 cv=0.584 tok=98148352 tok/s=16780 elapsed=0.47h | |
| step 12000 loss=0.9948 lr=5.75e-04 scale=65536 cv=0.583 tok=98312192 tok/s=16787 elapsed=0.47h | |
| step 12020 loss=1.3850 lr=5.75e-04 scale=65536 cv=0.510 tok=98476032 tok/s=16497 elapsed=0.49h | |
| step 12040 loss=1.7016 lr=5.74e-04 scale=65536 cv=0.541 tok=98639872 tok/s=16607 elapsed=0.49h | |
| step 12060 loss=1.1451 lr=5.74e-04 scale=65536 cv=0.528 tok=98803712 tok/s=16435 elapsed=0.50h | |
| step 12080 loss=1.7045 lr=5.74e-04 scale=65536 cv=0.538 tok=98967552 tok/s=16455 elapsed=0.50h | |
| step 12100 loss=1.1433 lr=5.74e-04 scale=65536 cv=0.551 tok=99131392 tok/s=16454 elapsed=0.50h | |
| step 12120 loss=1.6216 lr=5.74e-04 scale=65536 cv=0.534 tok=99295232 tok/s=16434 elapsed=0.50h | |
| step 12140 loss=1.1082 lr=5.74e-04 scale=65536 cv=0.499 tok=99459072 tok/s=16413 elapsed=0.51h | |
| step 12160 loss=1.7618 lr=5.74e-04 scale=65536 cv=0.545 tok=99622912 tok/s=16456 elapsed=0.51h | |
| step 12165 NaN/Inf grad -> skip; scale=32768 (consec=1 total=1) | |
| step 12166 NaN/Inf grad -> skip; scale=16384 (consec=2 total=2) | |
| step 12167 NaN/Inf grad -> skip; scale=8192 (consec=3 total=3) | |
| step 12168 NaN/Inf grad -> skip; scale=4096 (consec=4 total=4) | |
| step 12169 NaN/Inf grad -> skip; scale=2048 (consec=5 total=5) | |
| step 12170 NaN/Inf grad -> skip; scale=1024 (consec=6 total=6) | |
| step 12171 NaN/Inf grad -> skip; scale=512 (consec=7 total=7) | |
| step 12180 loss=1.6653 lr=5.74e-04 scale=512 cv=0.522 tok=99729408 tok/s=16477 elapsed=0.51h | |
| step 12200 loss=1.5684 lr=5.74e-04 scale=512 cv=0.549 tok=99893248 tok/s=16493 elapsed=0.52h | |
| step 12201 NaN/Inf grad -> skip; scale=256 (consec=1 total=8) | |
| step 12208 NaN/Inf grad -> skip; scale=128 (consec=1 total=9) | |
| step 12209 NaN/Inf grad -> skip; scale=64 (consec=2 total=10) | |
| step 12210 NaN/Inf grad -> skip; scale=32 (consec=3 total=11) | |
| step 12211 NaN/Inf grad -> skip; scale=16 (consec=4 total=12) | |
| step 12212 NaN/Inf grad -> skip; scale=8 (consec=5 total=13) | |
| step 12213 NaN/Inf grad -> skip; scale=4 (consec=6 total=14) | |
| step 12214 NaN/Inf grad -> skip; scale=2 (consec=7 total=15) | |
| step 12215 NaN/Inf grad -> skip; scale=1 (consec=8 total=16) | |
| step 12216 NaN/Inf grad -> skip; scale=1 (consec=9 total=17) | |
| step 12218 NaN/Inf grad -> skip; scale=1 (consec=1 total=18) | |
| step 12219 NaN/Inf grad -> skip; scale=1 (consec=2 total=19) | |
| step 12220 NaN/Inf grad -> skip; scale=1 (consec=3 total=20) | |
| step 12221 NaN/Inf grad -> skip; scale=1 (consec=4 total=21) | |
| step 12222 NaN/Inf grad -> skip; scale=1 (consec=5 total=22) | |
| step 12223 NaN/Inf grad -> skip; scale=1 (consec=6 total=23) | |
| step 12224 NaN/Inf grad -> skip; scale=1 (consec=7 total=24) | |
| step 12225 NaN/Inf grad -> skip; scale=1 (consec=8 total=25) | |
| step 12226 NaN/Inf grad -> skip; scale=1 (consec=9 total=26) | |
| step 12227 NaN/Inf grad -> skip; scale=1 (consec=10 total=27) | |
| step 12228 NaN/Inf grad -> skip; scale=1 (consec=11 total=28) | |
| step 12229 NaN/Inf grad -> skip; scale=1 (consec=12 total=29) | |
| step 12230 NaN/Inf grad -> skip; scale=1 (consec=13 total=30) | |
| step 12231 NaN/Inf grad -> skip; scale=1 (consec=14 total=31) | |
| step 12232 NaN/Inf grad -> skip; scale=1 (consec=15 total=32) | |
| step 12233 NaN/Inf grad -> skip; scale=1 (consec=16 total=33) | |
| step 12234 NaN/Inf grad -> skip; scale=1 (consec=17 total=34) | |
| step 12235 NaN/Inf grad -> skip; scale=1 (consec=18 total=35) | |
| step 12236 NaN/Inf grad -> skip; scale=1 (consec=19 total=36) | |
| step 12237 NaN/Inf grad -> skip; scale=1 (consec=20 total=37) | |
| step 12238 NaN/Inf grad -> skip; scale=1 (consec=21 total=38) | |
| step 12239 NaN/Inf grad -> skip; scale=1 (consec=22 total=39) | |
| step 12240 loss=1.7946 lr=5.74e-04 scale=1 cv=0.529 tok=99958784 tok/s=16548 elapsed=0.52h | |
| step 12242 NaN/Inf grad -> skip; scale=1 (consec=1 total=40) | |
| step 12243 NaN/Inf grad -> skip; scale=1 (consec=2 total=41) | |
| step 12244 NaN/Inf grad -> skip; scale=1 (consec=3 total=42) | |
| step 12245 NaN/Inf grad -> skip; scale=1 (consec=4 total=43) | |
| step 12246 NaN/Inf grad -> skip; scale=1 (consec=5 total=44) | |
| step 12248 NaN/Inf grad -> skip; scale=1 (consec=1 total=45) | |
| step 12249 NaN/Inf grad -> skip; scale=1 (consec=2 total=46) | |
| step 12250 NaN/Inf grad -> skip; scale=1 (consec=3 total=47) | |
| step 12251 NaN/Inf grad -> skip; scale=1 (consec=4 total=48) | |
| step 12252 NaN/Inf grad -> skip; scale=1 (consec=5 total=49) | |
| step 12253 NaN/Inf grad -> skip; scale=1 (consec=6 total=50) | |
| step 12254 NaN/Inf grad -> skip; scale=1 (consec=7 total=51) | |
| step 12255 NaN/Inf grad -> skip; scale=1 (consec=8 total=52) | |
| step 12256 NaN/Inf grad -> skip; scale=1 (consec=9 total=53) | |
| step 12257 NaN/Inf grad -> skip; scale=1 (consec=10 total=54) | |
| step 12258 NaN/Inf grad -> skip; scale=1 (consec=11 total=55) | |
| step 12259 NaN/Inf grad -> skip; scale=1 (consec=12 total=56) | |
| step 12260 NaN/Inf grad -> skip; scale=1 (consec=13 total=57) | |
| step 12261 NaN/Inf grad -> skip; scale=1 (consec=14 total=58) | |
| step 12262 NaN/Inf grad -> skip; scale=1 (consec=15 total=59) | |
| step 12264 NaN/Inf grad -> skip; scale=1 (consec=1 total=60) | |
| step 12265 NaN/Inf grad -> skip; scale=1 (consec=2 total=61) | |
| step 12267 NaN/Inf grad -> skip; scale=1 (consec=1 total=62) | |
| step 12268 NaN/Inf grad -> skip; scale=1 (consec=2 total=63) | |
| step 12270 NaN/Inf grad -> skip; scale=1 (consec=1 total=64) | |
| step 12271 NaN/Inf grad -> skip; scale=1 (consec=2 total=65) | |
| step 12276 NaN/Inf grad -> skip; scale=1 (consec=1 total=66) | |
| step 12277 NaN/Inf grad -> skip; scale=1 (consec=2 total=67) | |
| step 12278 NaN/Inf grad -> skip; scale=1 (consec=3 total=68) | |
| step 12279 NaN/Inf grad -> skip; scale=1 (consec=4 total=69) | |
| step 12280 NaN/Inf grad -> skip; scale=1 (consec=5 total=70) | |
| step 12281 NaN/Inf grad -> skip; scale=1 (consec=6 total=71) | |
| step 12282 NaN/Inf grad -> skip; scale=1 (consec=7 total=72) | |
| step 12283 NaN/Inf grad -> skip; scale=1 (consec=8 total=73) | |
| step 12284 NaN/Inf grad -> skip; scale=1 (consec=9 total=74) | |
| step 12285 NaN/Inf grad -> skip; scale=1 (consec=10 total=75) | |
| step 12286 NaN/Inf grad -> skip; scale=1 (consec=11 total=76) | |
| step 12287 NaN/Inf grad -> skip; scale=1 (consec=12 total=77) | |
| step 12288 NaN/Inf grad -> skip; scale=1 (consec=13 total=78) | |
| step 12289 NaN/Inf grad -> skip; scale=1 (consec=14 total=79) | |
| step 12290 NaN/Inf grad -> skip; scale=1 (consec=15 total=80) | |
| step 12291 NaN/Inf grad -> skip; scale=1 (consec=16 total=81) | |
| step 12292 NaN/Inf grad -> skip; scale=1 (consec=17 total=82) | |
| step 12293 NaN/Inf grad -> skip; scale=1 (consec=18 total=83) | |
| step 12294 NaN/Inf grad -> skip; scale=1 (consec=19 total=84) | |
| step 12295 NaN/Inf grad -> skip; scale=1 (consec=20 total=85) | |
| step 12297 NaN/Inf grad -> skip; scale=1 (consec=1 total=86) | |
| step 12298 NaN/Inf grad -> skip; scale=1 (consec=2 total=87) | |
| step 12299 NaN/Inf grad -> skip; scale=1 (consec=3 total=88) | |
| step 12300 NaN/Inf grad -> skip; scale=1 (consec=4 total=89) | |
| step 12301 NaN/Inf grad -> skip; scale=1 (consec=5 total=90) | |
| step 12302 NaN/Inf grad -> skip; scale=1 (consec=6 total=91) | |
| step 12303 NaN/Inf grad -> skip; scale=1 (consec=7 total=92) | |
| step 12304 NaN/Inf grad -> skip; scale=1 (consec=8 total=93) | |
| step 12305 NaN/Inf grad -> skip; scale=1 (consec=9 total=94) | |
| step 12306 NaN/Inf grad -> skip; scale=1 (consec=10 total=95) | |
| step 12307 NaN/Inf grad -> skip; scale=1 (consec=11 total=96) | |
| step 12308 NaN/Inf grad -> skip; scale=1 (consec=12 total=97) | |
| step 12309 NaN/Inf grad -> skip; scale=1 (consec=13 total=98) | |
| step 12310 NaN/Inf grad -> skip; scale=1 (consec=14 total=99) | |
| step 12311 NaN/Inf grad -> skip; scale=1 (consec=15 total=100) | |
| step 12312 NaN/Inf grad -> skip; scale=1 (consec=16 total=101) | |
| step 12313 NaN/Inf grad -> skip; scale=1 (consec=17 total=102) | |
| step 12314 NaN/Inf grad -> skip; scale=1 (consec=18 total=103) | |
| step 12315 NaN/Inf grad -> skip; scale=1 (consec=19 total=104) | |
| step 12316 NaN/Inf grad -> skip; scale=1 (consec=20 total=105) | |
| step 12317 NaN/Inf grad -> skip; scale=1 (consec=21 total=106) | |
| step 12318 NaN/Inf grad -> skip; scale=1 (consec=22 total=107) | |
| step 12319 NaN/Inf grad -> skip; scale=1 (consec=23 total=108) | |
| step 12320 NaN/Inf grad -> skip; scale=1 (consec=24 total=109) | |
| step 12321 NaN/Inf grad -> skip; scale=1 (consec=25 total=110) | |
| step 12322 NaN/Inf grad -> skip; scale=1 (consec=26 total=111) | |
| step 12323 NaN/Inf grad -> skip; scale=1 (consec=27 total=112) | |
| step 12324 NaN/Inf grad -> skip; scale=1 (consec=28 total=113) | |
| step 12326 NaN/Inf grad -> skip; scale=1 (consec=1 total=114) | |
| step 12327 NaN/Inf grad -> skip; scale=1 (consec=2 total=115) | |
| step 12328 NaN/Inf grad -> skip; scale=1 (consec=3 total=116) | |
| step 12329 NaN/Inf grad -> skip; scale=1 (consec=4 total=117) | |
| step 12330 NaN/Inf grad -> skip; scale=1 (consec=5 total=118) | |
| step 12331 NaN/Inf grad -> skip; scale=1 (consec=6 total=119) | |
| step 12332 NaN/Inf grad -> skip; scale=1 (consec=7 total=120) | |
| step 12333 NaN/Inf grad -> skip; scale=1 (consec=8 total=121) | |
| step 12334 NaN/Inf grad -> skip; scale=1 (consec=9 total=122) | |
| step 12335 NaN/Inf grad -> skip; scale=1 (consec=10 total=123) | |
| step 12336 NaN/Inf grad -> skip; scale=1 (consec=11 total=124) | |
| step 12337 NaN/Inf grad -> skip; scale=1 (consec=12 total=125) | |
| step 12338 NaN/Inf grad -> skip; scale=1 (consec=13 total=126) | |
| step 12339 NaN/Inf grad -> skip; scale=1 (consec=14 total=127) | |
| step 12340 NaN/Inf grad -> skip; scale=1 (consec=15 total=128) | |
| step 12341 NaN/Inf grad -> skip; scale=1 (consec=16 total=129) | |
| step 12342 NaN/Inf grad -> skip; scale=1 (consec=17 total=130) | |
| step 12343 NaN/Inf grad -> skip; scale=1 (consec=18 total=131) | |
| step 12344 NaN/Inf grad -> skip; scale=1 (consec=19 total=132) | |
| step 12345 NaN/Inf grad -> skip; scale=1 (consec=20 total=133) | |
| step 12346 NaN/Inf grad -> skip; scale=1 (consec=21 total=134) | |
| step 12347 NaN/Inf grad -> skip; scale=1 (consec=22 total=135) | |
| step 12348 NaN/Inf grad -> skip; scale=1 (consec=23 total=136) | |
| step 12349 NaN/Inf grad -> skip; scale=1 (consec=24 total=137) | |
| step 12350 NaN/Inf grad -> skip; scale=1 (consec=25 total=138) | |
| step 12360 loss=1.5233 lr=5.73e-04 scale=1 cv=0.594 tok=100130816 tok/s=16549 elapsed=0.53h | |
| step 12369 NaN/Inf grad -> skip; scale=1 (consec=1 total=139) | |
| step 12370 NaN/Inf grad -> skip; scale=1 (consec=2 total=140) | |
| step 12371 NaN/Inf grad -> skip; scale=1 (consec=3 total=141) | |
| step 12372 NaN/Inf grad -> skip; scale=1 (consec=4 total=142) | |
| step 12373 NaN/Inf grad -> skip; scale=1 (consec=5 total=143) | |
| step 12374 NaN/Inf grad -> skip; scale=1 (consec=6 total=144) | |
| step 12375 NaN/Inf grad -> skip; scale=1 (consec=7 total=145) | |
| step 12376 NaN/Inf grad -> skip; scale=1 (consec=8 total=146) | |
| step 12377 NaN/Inf grad -> skip; scale=1 (consec=9 total=147) | |
| step 12378 NaN/Inf grad -> skip; scale=1 (consec=10 total=148) | |
| step 12379 NaN/Inf grad -> skip; scale=1 (consec=11 total=149) | |
| step 12380 NaN/Inf grad -> skip; scale=1 (consec=12 total=150) | |
| step 12381 NaN/Inf grad -> skip; scale=1 (consec=13 total=151) | |
| step 12382 NaN/Inf grad -> skip; scale=1 (consec=14 total=152) | |
| step 12383 NaN/Inf grad -> skip; scale=1 (consec=15 total=153) | |
| step 12384 NaN/Inf grad -> skip; scale=1 (consec=16 total=154) | |
| step 12385 NaN/Inf grad -> skip; scale=1 (consec=17 total=155) | |
| step 12386 NaN/Inf grad -> skip; scale=1 (consec=18 total=156) | |
| step 12396 NaN/Inf grad -> skip; scale=1 (consec=1 total=157) | |
| step 12397 NaN/Inf grad -> skip; scale=1 (consec=2 total=158) | |
| step 12398 NaN/Inf grad -> skip; scale=1 (consec=3 total=159) | |
| step 12399 NaN/Inf grad -> skip; scale=1 (consec=4 total=160) | |
| step 12400 NaN/Inf grad -> skip; scale=1 (consec=5 total=161) | |
| step 12401 NaN/Inf grad -> skip; scale=1 (consec=6 total=162) | |
| step 12402 NaN/Inf grad -> skip; scale=1 (consec=7 total=163) | |
| step 12403 NaN/Inf grad -> skip; scale=1 (consec=8 total=164) | |
| step 12404 NaN/Inf grad -> skip; scale=1 (consec=9 total=165) | |
| step 12405 NaN/Inf grad -> skip; scale=1 (consec=10 total=166) | |
| step 12406 NaN/Inf grad -> skip; scale=1 (consec=11 total=167) | |
| step 12407 NaN/Inf grad -> skip; scale=1 (consec=12 total=168) | |
| step 12408 NaN/Inf grad -> skip; scale=1 (consec=13 total=169) | |
| step 12409 NaN/Inf grad -> skip; scale=1 (consec=14 total=170) | |
| step 12410 NaN/Inf grad -> skip; scale=1 (consec=15 total=171) | |
| step 12411 NaN/Inf grad -> skip; scale=1 (consec=16 total=172) | |
| step 12412 NaN/Inf grad -> skip; scale=1 (consec=17 total=173) | |
| step 12413 NaN/Inf grad -> skip; scale=1 (consec=18 total=174) | |
| step 12414 NaN/Inf grad -> skip; scale=1 (consec=19 total=175) | |
| step 12415 NaN/Inf grad -> skip; scale=1 (consec=20 total=176) | |
| step 12416 NaN/Inf grad -> skip; scale=1 (consec=21 total=177) | |
| step 12417 NaN/Inf grad -> skip; scale=1 (consec=22 total=178) | |
| step 12418 NaN/Inf grad -> skip; scale=1 (consec=23 total=179) | |
| step 12419 NaN/Inf grad -> skip; scale=1 (consec=24 total=180) | |
| step 12420 NaN/Inf grad -> skip; scale=1 (consec=25 total=181) | |
| step 12421 NaN/Inf grad -> skip; scale=1 (consec=26 total=182) | |
| step 12422 NaN/Inf grad -> skip; scale=1 (consec=27 total=183) | |
| step 12423 NaN/Inf grad -> skip; scale=1 (consec=28 total=184) | |
| step 12424 NaN/Inf grad -> skip; scale=1 (consec=29 total=185) | |
| step 12425 NaN/Inf grad -> skip; scale=1 (consec=30 total=186) | |
| step 12426 NaN/Inf grad -> skip; scale=1 (consec=31 total=187) | |
| step 12427 NaN/Inf grad -> skip; scale=1 (consec=32 total=188) | |
| step 12428 NaN/Inf grad -> skip; scale=1 (consec=33 total=189) | |
| step 12429 NaN/Inf grad -> skip; scale=1 (consec=34 total=190) | |
| step 12430 NaN/Inf grad -> skip; scale=1 (consec=35 total=191) | |
| step 12431 NaN/Inf grad -> skip; scale=1 (consec=36 total=192) | |
| step 12432 NaN/Inf grad -> skip; scale=1 (consec=37 total=193) | |
| step 12433 NaN/Inf grad -> skip; scale=1 (consec=38 total=194) | |
| step 12434 NaN/Inf grad -> skip; scale=1 (consec=39 total=195) | |
| step 12435 NaN/Inf grad -> skip; scale=1 (consec=40 total=196) | |
| step 12437 NaN/Inf grad -> skip; scale=1 (consec=1 total=197) | |
| step 12438 NaN/Inf grad -> skip; scale=1 (consec=2 total=198) | |
| step 12439 NaN/Inf grad -> skip; scale=1 (consec=3 total=199) | |
| step 12440 NaN/Inf grad -> skip; scale=1 (consec=4 total=200) | |
| step 12441 NaN/Inf grad -> skip; scale=1 (consec=5 total=201) | |
| step 12442 NaN/Inf grad -> skip; scale=1 (consec=6 total=202) | |
| step 12443 NaN/Inf grad -> skip; scale=1 (consec=7 total=203) | |
| step 12444 NaN/Inf grad -> skip; scale=1 (consec=8 total=204) | |
| step 12445 NaN/Inf grad -> skip; scale=1 (consec=9 total=205) | |
| step 12446 NaN/Inf grad -> skip; scale=1 (consec=10 total=206) | |
| step 12447 NaN/Inf grad -> skip; scale=1 (consec=11 total=207) | |
| step 12448 NaN/Inf grad -> skip; scale=1 (consec=12 total=208) | |
| step 12449 NaN/Inf grad -> skip; scale=1 (consec=13 total=209) | |
| step 12450 NaN/Inf grad -> skip; scale=1 (consec=14 total=210) | |
| step 12451 NaN/Inf grad -> skip; scale=1 (consec=15 total=211) | |
| step 12452 NaN/Inf grad -> skip; scale=1 (consec=16 total=212) | |
| step 12459 NaN/Inf grad -> skip; scale=1 (consec=1 total=213) | |
| step 12460 NaN/Inf grad -> skip; scale=1 (consec=2 total=214) | |
| step 12466 NaN/Inf grad -> skip; scale=1 (consec=1 total=215) | |
| step 12467 NaN/Inf grad -> skip; scale=1 (consec=2 total=216) | |
| step 12468 NaN/Inf grad -> skip; scale=1 (consec=3 total=217) | |
| step 12469 NaN/Inf grad -> skip; scale=1 (consec=4 total=218) | |
| step 12470 NaN/Inf grad -> skip; scale=1 (consec=5 total=219) | |
| step 12471 NaN/Inf grad -> skip; scale=1 (consec=6 total=220) | |
| step 12472 NaN/Inf grad -> skip; scale=1 (consec=7 total=221) | |
| step 12473 NaN/Inf grad -> skip; scale=1 (consec=8 total=222) | |
| step 12474 NaN/Inf grad -> skip; scale=1 (consec=9 total=223) | |
| step 12475 NaN/Inf grad -> skip; scale=1 (consec=10 total=224) | |
| step 12476 NaN/Inf grad -> skip; scale=1 (consec=11 total=225) | |
| step 12477 NaN/Inf grad -> skip; scale=1 (consec=12 total=226) | |
| step 12478 NaN/Inf grad -> skip; scale=1 (consec=13 total=227) | |
| step 12479 NaN/Inf grad -> skip; scale=1 (consec=14 total=228) | |
| step 12480 NaN/Inf grad -> skip; scale=1 (consec=15 total=229) | |
| step 12483 NaN/Inf grad -> skip; scale=1 (consec=1 total=230) | |
| step 12484 NaN/Inf grad -> skip; scale=1 (consec=2 total=231) | |
| step 12485 NaN/Inf grad -> skip; scale=1 (consec=3 total=232) | |
| step 12486 NaN/Inf grad -> skip; scale=1 (consec=4 total=233) | |
| step 12487 NaN/Inf grad -> skip; scale=1 (consec=5 total=234) | |
| step 12488 NaN/Inf grad -> skip; scale=1 (consec=6 total=235) | |
| step 12489 NaN/Inf grad -> skip; scale=1 (consec=7 total=236) | |
| step 12490 NaN/Inf grad -> skip; scale=1 (consec=8 total=237) | |
| step 12491 NaN/Inf grad -> skip; scale=1 (consec=9 total=238) | |
| step 12492 NaN/Inf grad -> skip; scale=1 (consec=10 total=239) | |
| step 12493 NaN/Inf grad -> skip; scale=1 (consec=11 total=240) | |
| step 12495 NaN/Inf grad -> skip; scale=1 (consec=1 total=241) | |
| step 12496 NaN/Inf grad -> skip; scale=1 (consec=2 total=242) | |
| step 12497 NaN/Inf grad -> skip; scale=1 (consec=3 total=243) | |
| step 12498 NaN/Inf grad -> skip; scale=1 (consec=4 total=244) | |
| step 12499 NaN/Inf grad -> skip; scale=1 (consec=5 total=245) | |
| step 12500 NaN/Inf grad -> skip; scale=1 (consec=6 total=246) | |
| step 12501 NaN/Inf grad -> skip; scale=1 (consec=7 total=247) | |
| step 12502 NaN/Inf grad -> skip; scale=1 (consec=8 total=248) | |
| step 12503 NaN/Inf grad -> skip; scale=1 (consec=9 total=249) | |
| step 12504 NaN/Inf grad -> skip; scale=1 (consec=10 total=250) | |
| step 12505 NaN/Inf grad -> skip; scale=1 (consec=11 total=251) | |
| step 12506 NaN/Inf grad -> skip; scale=1 (consec=12 total=252) | |
| step 12507 NaN/Inf grad -> skip; scale=1 (consec=13 total=253) | |
| step 12508 NaN/Inf grad -> skip; scale=1 (consec=14 total=254) | |
| step 12509 NaN/Inf grad -> skip; scale=1 (consec=15 total=255) | |
| step 12510 NaN/Inf grad -> skip; scale=1 (consec=16 total=256) | |
| step 12511 NaN/Inf grad -> skip; scale=1 (consec=17 total=257) | |
| step 12512 NaN/Inf grad -> skip; scale=1 (consec=18 total=258) | |
| step 12513 NaN/Inf grad -> skip; scale=1 (consec=19 total=259) | |
| step 12514 NaN/Inf grad -> skip; scale=1 (consec=20 total=260) | |
| step 12515 NaN/Inf grad -> skip; scale=1 (consec=21 total=261) | |
| step 12516 NaN/Inf grad -> skip; scale=1 (consec=22 total=262) | |
| step 12517 NaN/Inf grad -> skip; scale=1 (consec=23 total=263) | |
| step 12518 NaN/Inf grad -> skip; scale=1 (consec=24 total=264) | |
| step 12519 NaN/Inf grad -> skip; scale=1 (consec=25 total=265) | |
| step 12520 NaN/Inf grad -> skip; scale=1 (consec=26 total=266) | |
| step 12521 NaN/Inf grad -> skip; scale=1 (consec=27 total=267) | |
| step 12522 NaN/Inf grad -> skip; scale=1 (consec=28 total=268) | |
| step 12523 NaN/Inf grad -> skip; scale=1 (consec=29 total=269) | |
| step 12524 NaN/Inf grad -> skip; scale=1 (consec=30 total=270) | |
| step 12525 NaN/Inf grad -> skip; scale=1 (consec=31 total=271) | |
| step 12526 NaN/Inf grad -> skip; scale=1 (consec=32 total=272) | |
| step 12527 NaN/Inf grad -> skip; scale=1 (consec=33 total=273) | |
| step 12528 NaN/Inf grad -> skip; scale=1 (consec=34 total=274) | |
| step 12535 NaN/Inf grad -> skip; scale=1 (consec=1 total=275) | |
| step 12536 NaN/Inf grad -> skip; scale=1 (consec=2 total=276) | |
| step 12537 NaN/Inf grad -> skip; scale=1 (consec=3 total=277) | |
| step 12538 NaN/Inf grad -> skip; scale=1 (consec=4 total=278) | |
| step 12539 NaN/Inf grad -> skip; scale=1 (consec=5 total=279) | |
| step 12540 NaN/Inf grad -> skip; scale=1 (consec=6 total=280) | |
| step 12541 NaN/Inf grad -> skip; scale=1 (consec=7 total=281) | |
| step 12542 NaN/Inf grad -> skip; scale=1 (consec=8 total=282) | |
| step 12543 NaN/Inf grad -> skip; scale=1 (consec=9 total=283) | |
| step 12544 NaN/Inf grad -> skip; scale=1 (consec=10 total=284) | |
| step 12545 NaN/Inf grad -> skip; scale=1 (consec=11 total=285) | |
| step 12546 NaN/Inf grad -> skip; scale=1 (consec=12 total=286) | |
| step 12547 NaN/Inf grad -> skip; scale=1 (consec=13 total=287) | |
| step 12556 NaN/Inf grad -> skip; scale=1 (consec=1 total=288) | |
| step 12557 NaN/Inf grad -> skip; scale=1 (consec=2 total=289) | |
| step 12558 NaN/Inf grad -> skip; scale=1 (consec=3 total=290) | |
| step 12559 NaN/Inf grad -> skip; scale=1 (consec=4 total=291) | |
| step 12560 NaN/Inf grad -> skip; scale=1 (consec=5 total=292) | |
| step 12563 NaN/Inf grad -> skip; scale=1 (consec=1 total=293) | |
| step 12564 NaN/Inf grad -> skip; scale=1 (consec=2 total=294) | |
| step 12565 NaN/Inf grad -> skip; scale=1 (consec=3 total=295) | |
| step 12566 NaN/Inf grad -> skip; scale=1 (consec=4 total=296) | |
| step 12567 NaN/Inf grad -> skip; scale=1 (consec=5 total=297) | |
| step 12568 NaN/Inf grad -> skip; scale=1 (consec=6 total=298) | |
| step 12569 NaN/Inf grad -> skip; scale=1 (consec=7 total=299) | |
| step 12570 NaN/Inf grad -> skip; scale=1 (consec=8 total=300) | |
| step 12571 NaN/Inf grad -> skip; scale=1 (consec=9 total=301) | |
| step 12572 NaN/Inf grad -> skip; scale=1 (consec=10 total=302) | |
| step 12573 NaN/Inf grad -> skip; scale=1 (consec=11 total=303) | |
| step 12574 NaN/Inf grad -> skip; scale=1 (consec=12 total=304) | |
| step 12575 NaN/Inf grad -> skip; scale=1 (consec=13 total=305) | |
| step 12576 NaN/Inf grad -> skip; scale=1 (consec=14 total=306) | |
| step 12577 NaN/Inf grad -> skip; scale=1 (consec=15 total=307) | |
| step 12578 NaN/Inf grad -> skip; scale=1 (consec=16 total=308) | |
| step 12579 NaN/Inf grad -> skip; scale=1 (consec=17 total=309) | |
| step 12580 NaN/Inf grad -> skip; scale=1 (consec=18 total=310) | |
| step 12581 NaN/Inf grad -> skip; scale=1 (consec=19 total=311) | |
| step 12582 NaN/Inf grad -> skip; scale=1 (consec=20 total=312) | |
| step 12583 NaN/Inf grad -> skip; scale=1 (consec=21 total=313) | |
| step 12584 NaN/Inf grad -> skip; scale=1 (consec=22 total=314) | |
| step 12585 NaN/Inf grad -> skip; scale=1 (consec=23 total=315) | |
| step 12586 NaN/Inf grad -> skip; scale=1 (consec=24 total=316) | |
| step 12587 NaN/Inf grad -> skip; scale=1 (consec=25 total=317) | |
| step 12588 NaN/Inf grad -> skip; scale=1 (consec=26 total=318) | |
| step 12589 NaN/Inf grad -> skip; scale=1 (consec=27 total=319) | |
| step 12590 NaN/Inf grad -> skip; scale=1 (consec=28 total=320) | |
| step 12591 NaN/Inf grad -> skip; scale=1 (consec=29 total=321) | |
| step 12592 NaN/Inf grad -> skip; scale=1 (consec=30 total=322) | |
| step 12593 NaN/Inf grad -> skip; scale=1 (consec=31 total=323) | |
| step 12594 NaN/Inf grad -> skip; scale=1 (consec=32 total=324) | |
| step 12595 NaN/Inf grad -> skip; scale=1 (consec=33 total=325) | |
| step 12596 NaN/Inf grad -> skip; scale=1 (consec=34 total=326) | |
| step 12597 NaN/Inf grad -> skip; scale=1 (consec=35 total=327) | |
| step 12598 NaN/Inf grad -> skip; scale=1 (consec=36 total=328) | |
| step 12599 NaN/Inf grad -> skip; scale=1 (consec=37 total=329) | |
| step 12600 NaN/Inf grad -> skip; scale=1 (consec=38 total=330) | |
| step 12601 NaN/Inf grad -> skip; scale=1 (consec=39 total=331) | |
| step 12602 NaN/Inf grad -> skip; scale=1 (consec=40 total=332) | |
| step 12603 NaN/Inf grad -> skip; scale=1 (consec=41 total=333) | |
| step 12604 NaN/Inf grad -> skip; scale=1 (consec=42 total=334) | |
| step 12605 NaN/Inf grad -> skip; scale=1 (consec=43 total=335) | |
| step 12606 NaN/Inf grad -> skip; scale=1 (consec=44 total=336) | |
| step 12607 NaN/Inf grad -> skip; scale=1 (consec=45 total=337) | |
| step 12608 NaN/Inf grad -> skip; scale=1 (consec=46 total=338) | |
| step 12609 NaN/Inf grad -> skip; scale=1 (consec=47 total=339) | |
| step 12610 NaN/Inf grad -> skip; scale=1 (consec=48 total=340) | |
| step 12611 NaN/Inf grad -> skip; scale=1 (consec=49 total=341) | |
| step 12612 NaN/Inf grad -> skip; scale=1 (consec=50 total=342) | |
| step 12613 NaN/Inf grad -> skip; scale=1 (consec=51 total=343) | |
| step 12613 >50 CONSECUTIVE NaN -> ABORT | |
| DONE {"final_train_loss": 0.7157167792320251, "best_eval_loss": 1.7608854333559671, "steps": 12614, "tokens_seen": 100524032, "active_M": 92.829508, "total_M": 246.183748, "wall_hours": 0.5618232626385159, "planned_tokens": 700000000.0, "total_steps": 85449} | |
| step 12000 loss=0.5542 lr=2.88e-04 scale=16384 cv=0.643 tok=98312192 tok/s=5982 elapsed=0.01h | |
| step 12020 loss=0.6959 lr=2.88e-04 scale=16384 cv=0.561 tok=98476032 tok/s=16750 elapsed=0.03h | |
| step 12040 loss=1.0839 lr=2.88e-04 scale=16384 cv=0.544 tok=98639872 tok/s=16647 elapsed=0.04h | |
| step 12060 loss=1.3324 lr=2.88e-04 scale=16384 cv=0.581 tok=98803712 tok/s=16738 elapsed=0.04h | |
| step 12080 loss=1.6827 lr=2.88e-04 scale=16384 cv=0.523 tok=98967552 tok/s=16636 elapsed=0.04h | |
| step 12100 loss=1.0917 lr=2.88e-04 scale=16384 cv=0.588 tok=99131392 tok/s=16673 elapsed=0.04h | |
| step 12120 loss=1.5286 lr=2.88e-04 scale=16384 cv=0.525 tok=99295232 tok/s=16759 elapsed=0.05h | |
| step 12140 loss=1.1626 lr=2.88e-04 scale=16384 cv=0.517 tok=99459072 tok/s=16767 elapsed=0.05h | |
| step 12160 loss=1.4431 lr=2.88e-04 scale=16384 cv=0.516 tok=99622912 tok/s=16792 elapsed=0.05h | |
| step 12180 loss=0.8192 lr=2.88e-04 scale=16384 cv=0.488 tok=99786752 tok/s=16802 elapsed=0.06h | |
| step 12200 loss=1.8088 lr=2.88e-04 scale=32768 cv=0.548 tok=99950592 tok/s=16818 elapsed=0.06h | |
| step 12220 loss=1.4455 lr=2.88e-04 scale=32768 cv=0.472 tok=100114432 tok/s=16793 elapsed=0.06h | |
| step 12240 loss=0.7237 lr=2.87e-04 scale=32768 cv=0.495 tok=100278272 tok/s=16827 elapsed=0.06h | |
| step 12260 loss=0.1734 lr=2.87e-04 scale=32768 cv=0.528 tok=100442112 tok/s=16871 elapsed=0.07h | |
| step 12280 loss=0.2043 lr=2.87e-04 scale=32768 cv=0.525 tok=100605952 tok/s=16778 elapsed=0.07h | |
| step 12300 loss=0.9114 lr=2.87e-04 scale=32768 cv=0.515 tok=100769792 tok/s=16757 elapsed=0.07h | |
| step 12320 loss=1.7540 lr=2.87e-04 scale=32768 cv=0.501 tok=100933632 tok/s=16820 elapsed=0.07h | |
| step 12340 loss=1.3440 lr=2.87e-04 scale=32768 cv=0.459 tok=101097472 tok/s=16844 elapsed=0.08h | |
| step 12360 loss=0.8037 lr=2.87e-04 scale=32768 cv=0.567 tok=101261312 tok/s=16829 elapsed=0.08h | |
| step 12380 loss=1.2519 lr=2.87e-04 scale=32768 cv=0.473 tok=101425152 tok/s=16834 elapsed=0.08h | |
| step 12400 loss=0.4395 lr=2.87e-04 scale=65536 cv=0.515 tok=101588992 tok/s=16852 elapsed=0.09h | |
| step 12420 loss=1.4081 lr=2.87e-04 scale=65536 cv=0.516 tok=101752832 tok/s=16718 elapsed=0.09h | |
| step 12440 loss=1.6134 lr=2.87e-04 scale=65536 cv=0.509 tok=101916672 tok/s=16893 elapsed=0.09h | |
| step 12441 non-finite FORWARD loss -> skip batch (consec=1 total=1) | |
| step 12443 non-finite FORWARD loss -> skip batch (consec=1 total=2) | |
| step 12444 non-finite FORWARD loss -> skip batch (consec=2 total=3) | |
| step 12445 non-finite FORWARD loss -> skip batch (consec=3 total=4) | |
| step 12460 loss=1.0901 lr=2.87e-04 scale=65536 cv=0.470 tok=102047744 tok/s=16539 elapsed=0.09h | |
| step 12480 loss=1.0631 lr=2.87e-04 scale=65536 cv=0.488 tok=102211584 tok/s=16912 elapsed=0.10h | |
| step 12500 loss=0.9943 lr=2.87e-04 scale=65536 cv=0.522 tok=102375424 tok/s=17050 elapsed=0.10h | |
| step 12520 loss=1.4812 lr=2.87e-04 scale=65536 cv=0.485 tok=102539264 tok/s=17068 elapsed=0.10h | |
| step 12530 non-finite FORWARD loss -> skip batch (consec=1 total=5) | |
| step 12533 non-finite FORWARD loss -> skip batch (consec=1 total=6) | |
| step 12534 non-finite FORWARD loss -> skip batch (consec=2 total=7) | |
| step 12535 non-finite FORWARD loss -> skip batch (consec=3 total=8) | |
| step 12540 loss=1.7089 lr=2.87e-04 scale=65536 cv=0.478 tok=102670336 tok/s=17021 elapsed=0.11h | |
| step 12541 non-finite FORWARD loss -> skip batch (consec=1 total=9) | |
| step 12543 non-finite FORWARD loss -> skip batch (consec=1 total=10) | |
| step 12544 non-finite FORWARD loss -> skip batch (consec=2 total=11) | |
| step 12546 non-finite FORWARD loss -> skip batch (consec=1 total=12) | |
| step 12554 non-finite FORWARD loss -> skip batch (consec=1 total=13) | |
| step 12555 non-finite FORWARD loss -> skip batch (consec=2 total=14) | |
| step 12556 non-finite FORWARD loss -> skip batch (consec=3 total=15) | |
| step 12560 loss=1.6523 lr=2.87e-04 scale=65536 cv=0.516 tok=102776832 tok/s=16857 elapsed=0.11h | |
| step 12580 loss=1.5534 lr=2.87e-04 scale=65536 cv=0.546 tok=102940672 tok/s=17105 elapsed=0.11h | |
| step 12600 loss=1.6701 lr=2.87e-04 scale=65536 cv=0.484 tok=103104512 tok/s=17063 elapsed=0.11h | |
| step 12620 loss=1.6718 lr=2.87e-04 scale=65536 cv=0.445 tok=103268352 tok/s=17050 elapsed=0.12h | |
| step 12640 loss=0.8284 lr=2.87e-04 scale=65536 cv=0.471 tok=103432192 tok/s=17079 elapsed=0.12h | |
| step 12660 loss=1.4090 lr=2.87e-04 scale=65536 cv=0.464 tok=103596032 tok/s=17044 elapsed=0.12h | |
| step 12680 loss=1.6217 lr=2.87e-04 scale=65536 cv=0.526 tok=103759872 tok/s=17132 elapsed=0.12h | |
| step 12700 loss=1.5412 lr=2.86e-04 scale=65536 cv=0.484 tok=103923712 tok/s=17081 elapsed=0.13h | |
| step 12720 loss=0.4869 lr=2.86e-04 scale=65536 cv=0.534 tok=104087552 tok/s=16771 elapsed=0.13h | |
| step 12740 loss=1.4074 lr=2.86e-04 scale=65536 cv=0.518 tok=104251392 tok/s=17040 elapsed=0.13h | |
| step 12760 loss=0.1459 lr=2.86e-04 scale=65536 cv=0.599 tok=104415232 tok/s=17019 elapsed=0.14h | |
| step 12780 loss=1.2605 lr=2.86e-04 scale=65536 cv=0.513 tok=104579072 tok/s=16966 elapsed=0.14h | |
| step 12800 loss=0.9145 lr=2.86e-04 scale=65536 cv=0.514 tok=104742912 tok/s=17088 elapsed=0.14h | |
| step 12820 loss=0.8686 lr=2.86e-04 scale=65536 cv=0.469 tok=104906752 tok/s=17037 elapsed=0.14h | |
| step 12840 loss=1.3370 lr=2.86e-04 scale=65536 cv=0.455 tok=105070592 tok/s=17096 elapsed=0.15h | |
| step 12860 loss=0.6729 lr=2.86e-04 scale=65536 cv=0.501 tok=105234432 tok/s=16990 elapsed=0.15h | |
| step 12880 loss=1.2516 lr=2.86e-04 scale=65536 cv=0.420 tok=105398272 tok/s=17071 elapsed=0.15h | |
| step 12900 loss=1.5714 lr=2.86e-04 scale=65536 cv=0.435 tok=105562112 tok/s=17080 elapsed=0.15h | |
| step 12920 loss=1.6663 lr=2.86e-04 scale=65536 cv=0.486 tok=105725952 tok/s=16986 elapsed=0.16h | |
| step 12940 loss=1.6158 lr=2.86e-04 scale=65536 cv=0.434 tok=105889792 tok/s=17081 elapsed=0.16h | |
| step 12960 loss=0.9716 lr=2.86e-04 scale=65536 cv=0.448 tok=106053632 tok/s=17067 elapsed=0.16h | |
| step 12980 loss=0.1707 lr=2.86e-04 scale=65536 cv=0.510 tok=106217472 tok/s=16885 elapsed=0.17h | |
| step 13000 loss=0.5686 lr=2.86e-04 scale=65536 cv=0.394 tok=106381312 tok/s=17099 elapsed=0.17h | |
| step 13020 loss=1.4840 lr=2.86e-04 scale=65536 cv=0.396 tok=106545152 tok/s=16899 elapsed=0.19h | |
| step 13040 loss=1.0521 lr=2.86e-04 scale=65536 cv=0.367 tok=106708992 tok/s=16643 elapsed=0.19h | |
| step 13060 loss=1.5402 lr=2.86e-04 scale=65536 cv=0.477 tok=106872832 tok/s=16516 elapsed=0.19h | |
| step 13080 loss=1.8071 lr=2.86e-04 scale=65536 cv=0.412 tok=107036672 tok/s=16491 elapsed=0.19h | |
| step 13100 loss=1.5576 lr=2.86e-04 scale=65536 cv=0.412 tok=107200512 tok/s=16545 elapsed=0.20h | |
| step 13120 loss=0.7576 lr=2.86e-04 scale=65536 cv=0.364 tok=107364352 tok/s=16554 elapsed=0.20h | |
| step 13140 loss=1.1958 lr=2.86e-04 scale=65536 cv=0.424 tok=107528192 tok/s=16508 elapsed=0.20h | |
| step 13160 loss=1.3767 lr=2.85e-04 scale=65536 cv=0.438 tok=107692032 tok/s=16522 elapsed=0.21h | |
| step 13180 loss=1.0098 lr=2.85e-04 scale=65536 cv=0.419 tok=107855872 tok/s=16257 elapsed=0.21h | |
| step 13200 loss=0.6370 lr=2.85e-04 scale=65536 cv=0.478 tok=108019712 tok/s=16523 elapsed=0.21h | |
| step 13220 loss=0.8989 lr=2.85e-04 scale=65536 cv=0.454 tok=108183552 tok/s=16586 elapsed=0.21h | |
| step 13240 loss=0.9956 lr=2.85e-04 scale=65536 cv=0.492 tok=108347392 tok/s=16652 elapsed=0.22h | |
| step 13260 loss=1.4226 lr=2.85e-04 scale=65536 cv=0.418 tok=108511232 tok/s=16703 elapsed=0.22h | |
| step 13280 loss=1.3665 lr=2.85e-04 scale=65536 cv=0.473 tok=108675072 tok/s=16697 elapsed=0.22h | |
| step 13300 loss=1.8065 lr=2.85e-04 scale=65536 cv=0.445 tok=108838912 tok/s=16738 elapsed=0.23h | |
| step 13320 loss=1.6084 lr=2.85e-04 scale=65536 cv=0.435 tok=109002752 tok/s=16746 elapsed=0.23h | |
| step 13340 loss=1.5498 lr=2.85e-04 scale=65536 cv=0.463 tok=109166592 tok/s=16584 elapsed=0.23h | |
| step 13360 loss=1.1098 lr=2.85e-04 scale=65536 cv=0.451 tok=109330432 tok/s=16512 elapsed=0.23h | |
| step 13380 loss=1.4283 lr=2.85e-04 scale=65536 cv=0.467 tok=109494272 tok/s=16558 elapsed=0.24h | |
| step 13400 loss=1.1338 lr=2.85e-04 scale=65536 cv=0.488 tok=109658112 tok/s=16591 elapsed=0.24h | |
| step 13420 loss=1.2777 lr=2.85e-04 scale=65536 cv=0.416 tok=109821952 tok/s=16566 elapsed=0.24h | |
| step 13440 loss=0.8881 lr=2.85e-04 scale=65536 cv=0.481 tok=109985792 tok/s=16594 elapsed=0.25h | |
| step 13460 loss=1.1986 lr=2.85e-04 scale=65536 cv=0.416 tok=110149632 tok/s=16260 elapsed=0.25h | |
| step 13480 loss=0.3230 lr=2.85e-04 scale=65536 cv=0.439 tok=110313472 tok/s=16562 elapsed=0.25h | |
| step 13500 loss=1.4815 lr=2.85e-04 scale=65536 cv=0.484 tok=110477312 tok/s=16583 elapsed=0.25h | |
| step 13520 loss=1.8040 lr=2.85e-04 scale=65536 cv=0.472 tok=110641152 tok/s=16603 elapsed=0.26h | |
| step 13540 loss=1.6189 lr=2.85e-04 scale=65536 cv=0.478 tok=110804992 tok/s=16786 elapsed=0.26h | |
| step 13560 loss=1.2545 lr=2.85e-04 scale=65536 cv=0.458 tok=110968832 tok/s=16606 elapsed=0.26h | |
| step 13580 loss=1.5854 lr=2.85e-04 scale=65536 cv=0.467 tok=111132672 tok/s=16726 elapsed=0.27h | |
| step 13600 loss=1.7825 lr=2.84e-04 scale=65536 cv=0.447 tok=111296512 tok/s=16633 elapsed=0.27h | |
| step 13620 loss=1.5717 lr=2.84e-04 scale=65536 cv=0.422 tok=111460352 tok/s=16536 elapsed=0.27h | |
| step 13640 loss=1.3077 lr=2.84e-04 scale=65536 cv=0.436 tok=111624192 tok/s=16572 elapsed=0.27h | |
| step 13660 loss=1.1913 lr=2.84e-04 scale=65536 cv=0.448 tok=111788032 tok/s=16584 elapsed=0.28h | |
| step 13680 loss=0.6049 lr=2.84e-04 scale=65536 cv=0.454 tok=111951872 tok/s=16634 elapsed=0.28h | |
| step 13700 loss=0.7956 lr=2.84e-04 scale=65536 cv=0.458 tok=112115712 tok/s=16604 elapsed=0.28h | |
| step 13720 loss=1.2977 lr=2.84e-04 scale=65536 cv=0.459 tok=112279552 tok/s=16379 elapsed=0.29h | |
| step 13740 loss=1.4055 lr=2.84e-04 scale=65536 cv=0.495 tok=112443392 tok/s=16608 elapsed=0.29h | |
| step 13760 loss=1.6511 lr=2.84e-04 scale=65536 cv=0.508 tok=112607232 tok/s=16615 elapsed=0.29h | |
| step 13780 loss=1.5827 lr=2.84e-04 scale=65536 cv=0.485 tok=112771072 tok/s=16710 elapsed=0.29h | |
| step 13800 loss=1.0076 lr=2.84e-04 scale=65536 cv=0.433 tok=112934912 tok/s=16799 elapsed=0.30h | |
| step 13820 loss=0.9792 lr=2.84e-04 scale=65536 cv=0.430 tok=113098752 tok/s=16676 elapsed=0.30h | |
| step 13840 loss=0.2917 lr=2.84e-04 scale=65536 cv=0.445 tok=113262592 tok/s=16577 elapsed=0.30h | |
| step 13860 loss=1.4014 lr=2.84e-04 scale=65536 cv=0.458 tok=113426432 tok/s=16600 elapsed=0.31h | |
| step 13880 loss=1.5817 lr=2.84e-04 scale=65536 cv=0.465 tok=113590272 tok/s=16671 elapsed=0.31h | |
| step 13900 loss=1.3594 lr=2.84e-04 scale=65536 cv=0.489 tok=113754112 tok/s=16765 elapsed=0.31h | |
| step 13920 loss=1.0951 lr=2.84e-04 scale=65536 cv=0.494 tok=113917952 tok/s=17163 elapsed=0.31h | |
| step 13940 loss=0.6853 lr=2.84e-04 scale=65536 cv=0.468 tok=114081792 tok/s=17165 elapsed=0.32h | |
| step 13960 loss=1.0629 lr=2.84e-04 scale=65536 cv=0.557 tok=114245632 tok/s=17218 elapsed=0.32h | |
| step 13980 loss=1.5607 lr=2.84e-04 scale=65536 cv=0.491 tok=114409472 tok/s=17017 elapsed=0.32h | |
| step 14000 loss=0.9808 lr=2.84e-04 scale=65536 cv=0.469 tok=114573312 tok/s=17106 elapsed=0.33h | |
| step 14020 loss=1.3956 lr=2.83e-04 scale=65536 cv=0.462 tok=114737152 tok/s=16803 elapsed=0.34h | |
| step 14040 loss=1.6821 lr=2.83e-04 scale=65536 cv=0.469 tok=114900992 tok/s=17061 elapsed=0.35h | |
| step 14060 loss=1.1034 lr=2.83e-04 scale=65536 cv=0.452 tok=115064832 tok/s=17149 elapsed=0.35h | |
| step 14080 loss=1.6852 lr=2.83e-04 scale=65536 cv=0.450 tok=115228672 tok/s=17146 elapsed=0.35h | |
| step 14100 loss=1.0924 lr=2.83e-04 scale=65536 cv=0.438 tok=115392512 tok/s=17164 elapsed=0.35h | |
| step 14120 loss=1.5954 lr=2.83e-04 scale=65536 cv=0.449 tok=115556352 tok/s=17159 elapsed=0.36h | |
| step 14140 loss=1.1103 lr=2.83e-04 scale=65536 cv=0.390 tok=115720192 tok/s=16769 elapsed=0.36h | |
| step 14160 loss=1.7372 lr=2.83e-04 scale=65536 cv=0.457 tok=115884032 tok/s=16771 elapsed=0.36h | |
| step 14180 loss=1.6434 lr=2.83e-04 scale=65536 cv=0.439 tok=116047872 tok/s=16731 elapsed=0.36h | |
| step 14200 loss=1.5524 lr=2.83e-04 scale=65536 cv=0.456 tok=116211712 tok/s=16812 elapsed=0.37h | |
| step 14220 loss=1.2314 lr=2.83e-04 scale=65536 cv=0.433 tok=116375552 tok/s=16339 elapsed=0.37h | |
| step 14240 loss=1.7074 lr=2.83e-04 scale=65536 cv=0.404 tok=116539392 tok/s=16623 elapsed=0.37h | |
| step 14260 loss=0.9901 lr=2.83e-04 scale=65536 cv=0.413 tok=116703232 tok/s=16704 elapsed=0.38h | |
| step 14280 loss=1.7495 lr=2.83e-04 scale=65536 cv=0.432 tok=116867072 tok/s=16656 elapsed=0.38h | |
| step 14300 loss=1.7511 lr=2.83e-04 scale=65536 cv=0.479 tok=117030912 tok/s=16876 elapsed=0.38h | |
| step 14320 loss=1.0516 lr=2.83e-04 scale=65536 cv=0.498 tok=117194752 tok/s=17183 elapsed=0.38h | |
| step 14340 loss=1.6760 lr=2.83e-04 scale=65536 cv=0.477 tok=117358592 tok/s=17117 elapsed=0.39h | |
| step 14360 loss=1.4971 lr=2.83e-04 scale=65536 cv=0.536 tok=117522432 tok/s=17053 elapsed=0.39h | |
| step 14380 loss=1.3718 lr=2.83e-04 scale=65536 cv=0.491 tok=117686272 tok/s=17035 elapsed=0.39h | |
| step 14400 loss=1.4518 lr=2.83e-04 scale=65536 cv=0.469 tok=117850112 tok/s=16996 elapsed=0.40h | |
| step 14420 loss=1.7958 lr=2.83e-04 scale=65536 cv=0.472 tok=118013952 tok/s=16968 elapsed=0.40h | |
| step 14440 loss=1.3434 lr=2.82e-04 scale=65536 cv=0.468 tok=118177792 tok/s=16958 elapsed=0.40h | |
| step 14460 loss=0.4832 lr=2.82e-04 scale=65536 cv=0.448 tok=118341632 tok/s=16881 elapsed=0.40h | |
| step 14480 loss=1.2335 lr=2.82e-04 scale=65536 cv=0.445 tok=118505472 tok/s=16679 elapsed=0.41h | |
| step 14500 loss=0.9065 lr=2.82e-04 scale=65536 cv=0.443 tok=118669312 tok/s=17089 elapsed=0.41h | |
| step 14520 loss=0.2581 lr=2.82e-04 scale=65536 cv=0.476 tok=118833152 tok/s=17029 elapsed=0.41h | |
| step 14540 loss=1.1093 lr=2.82e-04 scale=65536 cv=0.444 tok=118996992 tok/s=16986 elapsed=0.42h | |
| step 14560 loss=0.8363 lr=2.82e-04 scale=65536 cv=0.445 tok=119160832 tok/s=17063 elapsed=0.42h | |
| step 14580 loss=1.7186 lr=2.82e-04 scale=65536 cv=0.495 tok=119324672 tok/s=16928 elapsed=0.42h | |
| step 14600 loss=1.6941 lr=2.82e-04 scale=65536 cv=0.496 tok=119488512 tok/s=17035 elapsed=0.42h | |
| step 14620 loss=1.1557 lr=2.82e-04 scale=65536 cv=0.439 tok=119652352 tok/s=16516 elapsed=0.43h | |
| step 14640 loss=0.9454 lr=2.82e-04 scale=65536 cv=0.455 tok=119816192 tok/s=17021 elapsed=0.43h | |
| step 14660 loss=1.0707 lr=2.82e-04 scale=65536 cv=0.477 tok=119980032 tok/s=16745 elapsed=0.43h | |
| step 14680 loss=1.6654 lr=2.82e-04 scale=65536 cv=0.523 tok=120143872 tok/s=16787 elapsed=0.44h | |
| step 14700 loss=1.5423 lr=2.82e-04 scale=65536 cv=0.498 tok=120307712 tok/s=16818 elapsed=0.44h | |
| step 14720 loss=0.2732 lr=2.82e-04 scale=65536 cv=0.541 tok=120471552 tok/s=16601 elapsed=0.44h | |
| step 14740 loss=1.4141 lr=2.82e-04 scale=65536 cv=0.477 tok=120635392 tok/s=16873 elapsed=0.44h | |
| step 14760 loss=1.6246 lr=2.82e-04 scale=65536 cv=0.473 tok=120799232 tok/s=16835 elapsed=0.45h | |
| step 14780 loss=1.2562 lr=2.82e-04 scale=65536 cv=0.423 tok=120963072 tok/s=16833 elapsed=0.45h | |
| step 14800 loss=1.5395 lr=2.82e-04 scale=65536 cv=0.425 tok=121126912 tok/s=16808 elapsed=0.45h | |
| step 14820 loss=1.5143 lr=2.82e-04 scale=65536 cv=0.479 tok=121290752 tok/s=16894 elapsed=0.46h | |
| step 14840 loss=1.5188 lr=2.81e-04 scale=65536 cv=0.426 tok=121454592 tok/s=16806 elapsed=0.46h | |
| step 14860 loss=0.2395 lr=2.81e-04 scale=65536 cv=0.417 tok=121618432 tok/s=16785 elapsed=0.46h | |
| step 14880 loss=1.6091 lr=2.81e-04 scale=65536 cv=0.478 tok=121782272 tok/s=16859 elapsed=0.46h | |
| step 14900 loss=1.4815 lr=2.81e-04 scale=65536 cv=0.424 tok=121946112 tok/s=16845 elapsed=0.47h | |
| step 14920 loss=0.9959 lr=2.81e-04 scale=65536 cv=0.444 tok=122109952 tok/s=16856 elapsed=0.47h | |
| step 14940 loss=1.3071 lr=2.81e-04 scale=65536 cv=0.431 tok=122273792 tok/s=16779 elapsed=0.47h | |
| step 14960 loss=1.2472 lr=2.81e-04 scale=65536 cv=0.387 tok=122437632 tok/s=16767 elapsed=0.48h | |
| step 14980 loss=1.1398 lr=2.81e-04 scale=65536 cv=0.399 tok=122601472 tok/s=16793 elapsed=0.48h | |
| step 15000 loss=1.6812 lr=2.81e-04 scale=65536 cv=0.437 tok=122765312 tok/s=16844 elapsed=0.48h | |
| step 15020 loss=1.1906 lr=2.81e-04 scale=65536 cv=0.389 tok=122929152 tok/s=16900 elapsed=0.50h | |
| step 15040 loss=0.5422 lr=2.81e-04 scale=65536 cv=0.415 tok=123092992 tok/s=16893 elapsed=0.50h | |
| step 15060 loss=1.3766 lr=2.81e-04 scale=65536 cv=0.430 tok=123256832 tok/s=16958 elapsed=0.50h | |
| step 15080 loss=1.4465 lr=2.81e-04 scale=65536 cv=0.439 tok=123420672 tok/s=16930 elapsed=0.51h | |
| step 15100 loss=0.5656 lr=2.81e-04 scale=65536 cv=0.537 tok=123584512 tok/s=16944 elapsed=0.51h | |
| step 15120 loss=1.3347 lr=2.81e-04 scale=65536 cv=0.424 tok=123748352 tok/s=16882 elapsed=0.51h | |
| step 15140 loss=1.6433 lr=2.81e-04 scale=65536 cv=0.404 tok=123912192 tok/s=16865 elapsed=0.51h | |
| step 15160 loss=0.7948 lr=2.81e-04 scale=65536 cv=0.421 tok=124076032 tok/s=16836 elapsed=0.52h | |
| step 15180 loss=0.6865 lr=2.81e-04 scale=65536 cv=0.425 tok=124239872 tok/s=16881 elapsed=0.52h | |
| step 15200 loss=0.9281 lr=2.81e-04 scale=65536 cv=0.501 tok=124403712 tok/s=16599 elapsed=0.52h | |
| step 15220 loss=1.0537 lr=2.80e-04 scale=65536 cv=0.392 tok=124567552 tok/s=16932 elapsed=0.53h | |
| step 15240 loss=0.9134 lr=2.80e-04 scale=65536 cv=0.396 tok=124731392 tok/s=16851 elapsed=0.53h | |
| step 15260 loss=1.1854 lr=2.80e-04 scale=65536 cv=0.443 tok=124895232 tok/s=16930 elapsed=0.53h | |
| step 15280 loss=1.5384 lr=2.80e-04 scale=65536 cv=0.435 tok=125059072 tok/s=16915 elapsed=0.53h | |
| step 15300 loss=1.4506 lr=2.80e-04 scale=65536 cv=0.434 tok=125222912 tok/s=16929 elapsed=0.54h | |
| step 15320 loss=1.5115 lr=2.80e-04 scale=65536 cv=0.479 tok=125386752 tok/s=16984 elapsed=0.54h | |
| step 15340 loss=1.5311 lr=2.80e-04 scale=65536 cv=0.525 tok=125550592 tok/s=17028 elapsed=0.54h | |
| step 15360 loss=1.4389 lr=2.80e-04 scale=65536 cv=0.422 tok=125714432 tok/s=16895 elapsed=0.55h | |
| step 15380 loss=0.8970 lr=2.80e-04 scale=65536 cv=0.439 tok=125878272 tok/s=16879 elapsed=0.55h | |
| step 15400 loss=1.0699 lr=2.80e-04 scale=65536 cv=0.421 tok=126042112 tok/s=16836 elapsed=0.55h | |
| step 15420 loss=1.8474 lr=2.80e-04 scale=65536 cv=0.488 tok=126205952 tok/s=16802 elapsed=0.55h | |
| step 15440 loss=1.6047 lr=2.80e-04 scale=65536 cv=0.468 tok=126369792 tok/s=16820 elapsed=0.56h | |
| step 15460 loss=1.6259 lr=2.80e-04 scale=65536 cv=0.450 tok=126533632 tok/s=16845 elapsed=0.56h | |
| step 15480 loss=1.6994 lr=2.80e-04 scale=65536 cv=0.447 tok=126697472 tok/s=16776 elapsed=0.56h | |
| step 15500 loss=1.4763 lr=2.80e-04 scale=65536 cv=0.428 tok=126861312 tok/s=16753 elapsed=0.57h | |
| step 15520 loss=0.6134 lr=2.80e-04 scale=65536 cv=0.470 tok=127025152 tok/s=16766 elapsed=0.57h | |
| step 15540 loss=1.2947 lr=2.80e-04 scale=65536 cv=0.425 tok=127188992 tok/s=16826 elapsed=0.57h | |
| step 15560 loss=0.9595 lr=2.80e-04 scale=65536 cv=0.465 tok=127352832 tok/s=16687 elapsed=0.57h | |
| step 15580 loss=1.4882 lr=2.80e-04 scale=65536 cv=0.392 tok=127516672 tok/s=16798 elapsed=0.58h | |
| step 15600 loss=1.2102 lr=2.79e-04 scale=65536 cv=0.376 tok=127680512 tok/s=16772 elapsed=0.58h | |
| step 15620 loss=0.6568 lr=2.79e-04 scale=65536 cv=0.384 tok=127844352 tok/s=16793 elapsed=0.58h | |
| step 15640 loss=1.1243 lr=2.79e-04 scale=65536 cv=0.357 tok=128008192 tok/s=16762 elapsed=0.59h | |
| step 15660 loss=1.5956 lr=2.79e-04 scale=65536 cv=0.410 tok=128172032 tok/s=16816 elapsed=0.59h | |
| step 15680 loss=1.5690 lr=2.79e-04 scale=65536 cv=0.408 tok=128335872 tok/s=16769 elapsed=0.59h | |
| step 15700 loss=1.5628 lr=2.79e-04 scale=65536 cv=0.386 tok=128499712 tok/s=16602 elapsed=0.59h | |
| step 15720 loss=1.7915 lr=2.79e-04 scale=65536 cv=0.415 tok=128663552 tok/s=16882 elapsed=0.60h | |
| step 15740 loss=1.1809 lr=2.79e-04 scale=65536 cv=0.451 tok=128827392 tok/s=16880 elapsed=0.60h | |
| step 15760 loss=0.7751 lr=2.79e-04 scale=65536 cv=0.428 tok=128991232 tok/s=16987 elapsed=0.60h | |
| step 15780 loss=1.1881 lr=2.79e-04 scale=65536 cv=0.356 tok=129155072 tok/s=17098 elapsed=0.61h | |
| step 15800 loss=0.7951 lr=2.79e-04 scale=65536 cv=0.438 tok=129318912 tok/s=16890 elapsed=0.61h | |
| step 15820 loss=1.0215 lr=2.79e-04 scale=65536 cv=0.395 tok=129482752 tok/s=16900 elapsed=0.61h | |
| step 15840 loss=1.2504 lr=2.79e-04 scale=65536 cv=0.368 tok=129646592 tok/s=16855 elapsed=0.61h | |
| step 15860 loss=1.5157 lr=2.79e-04 scale=65536 cv=0.457 tok=129810432 tok/s=16861 elapsed=0.62h | |
| step 15861 non-finite FORWARD loss -> skip batch (consec=1 total=16) | |
| step 15862 non-finite FORWARD loss -> skip batch (consec=2 total=17) | |
| step 15863 non-finite FORWARD loss -> skip batch (consec=3 total=18) | |
| step 15864 non-finite FORWARD loss -> skip batch (consec=4 total=19) | |
| step 15865 non-finite FORWARD loss -> skip batch (consec=5 total=20) | |
| step 15866 non-finite FORWARD loss -> skip batch (consec=6 total=21) | |
| step 15867 non-finite FORWARD loss -> skip batch (consec=7 total=22) | |
| step 15868 non-finite FORWARD loss -> skip batch (consec=8 total=23) | |
| step 15869 non-finite FORWARD loss -> skip batch (consec=9 total=24) | |
| step 15871 non-finite FORWARD loss -> skip batch (consec=1 total=25) | |
| step 15872 non-finite FORWARD loss -> skip batch (consec=2 total=26) | |
| step 15873 non-finite FORWARD loss -> skip batch (consec=3 total=27) | |
| step 15874 non-finite FORWARD loss -> skip batch (consec=4 total=28) | |
| step 15875 non-finite FORWARD loss -> skip batch (consec=5 total=29) | |
| step 15876 non-finite FORWARD loss -> skip batch (consec=6 total=30) | |
| step 15877 non-finite FORWARD loss -> skip batch (consec=7 total=31) | |
| step 15878 non-finite FORWARD loss -> skip batch (consec=8 total=32) | |
| step 15879 non-finite FORWARD loss -> skip batch (consec=9 total=33) | |
| step 15880 non-finite FORWARD loss -> skip batch (consec=10 total=34) | |
| step 15881 non-finite FORWARD loss -> skip batch (consec=11 total=35) | |
| step 15882 non-finite FORWARD loss -> skip batch (consec=12 total=36) | |
| step 15883 non-finite FORWARD loss -> skip batch (consec=13 total=37) | |
| step 15884 non-finite FORWARD loss -> skip batch (consec=14 total=38) | |
| step 15885 non-finite FORWARD loss -> skip batch (consec=15 total=39) | |
| step 15886 non-finite FORWARD loss -> skip batch (consec=16 total=40) | |
| step 15887 non-finite FORWARD loss -> skip batch (consec=17 total=41) | |
| step 15888 non-finite FORWARD loss -> skip batch (consec=18 total=42) | |
| step 15889 non-finite FORWARD loss -> skip batch (consec=19 total=43) | |
| step 15890 non-finite FORWARD loss -> skip batch (consec=20 total=44) | |
| step 15891 non-finite FORWARD loss -> skip batch (consec=21 total=45) | |
| step 15892 non-finite FORWARD loss -> skip batch (consec=22 total=46) | |
| step 15893 non-finite FORWARD loss -> skip batch (consec=23 total=47) | |
| step 15894 non-finite FORWARD loss -> skip batch (consec=24 total=48) | |
| step 15895 non-finite FORWARD loss -> skip batch (consec=25 total=49) | |
| step 15896 non-finite FORWARD loss -> skip batch (consec=26 total=50) | |
| step 15897 non-finite FORWARD loss -> skip batch (consec=27 total=51) | |
| step 15898 non-finite FORWARD loss -> skip batch (consec=28 total=52) | |
| step 15900 non-finite FORWARD loss -> skip batch (consec=1 total=53) | |
| step 15901 non-finite FORWARD loss -> skip batch (consec=2 total=54) | |
| step 15902 non-finite FORWARD loss -> skip batch (consec=3 total=55) | |
| step 15903 non-finite FORWARD loss -> skip batch (consec=4 total=56) | |
| step 15904 non-finite FORWARD loss -> skip batch (consec=5 total=57) | |
| step 15905 non-finite FORWARD loss -> skip batch (consec=6 total=58) | |
| step 15906 non-finite FORWARD loss -> skip batch (consec=7 total=59) | |
| step 15907 non-finite FORWARD loss -> skip batch (consec=8 total=60) | |
| step 15908 non-finite FORWARD loss -> skip batch (consec=9 total=61) | |
| step 15909 non-finite FORWARD loss -> skip batch (consec=10 total=62) | |
| step 15910 non-finite FORWARD loss -> skip batch (consec=11 total=63) | |
| step 15911 non-finite FORWARD loss -> skip batch (consec=12 total=64) | |
| step 15912 non-finite FORWARD loss -> skip batch (consec=13 total=65) | |
| step 15914 non-finite FORWARD loss -> skip batch (consec=1 total=66) | |
| step 15915 non-finite FORWARD loss -> skip batch (consec=2 total=67) | |
| step 15916 non-finite FORWARD loss -> skip batch (consec=3 total=68) | |
| step 15917 non-finite FORWARD loss -> skip batch (consec=4 total=69) | |
| step 15918 non-finite FORWARD loss -> skip batch (consec=5 total=70) | |
| step 15919 non-finite FORWARD loss -> skip batch (consec=6 total=71) | |
| step 15920 non-finite FORWARD loss -> skip batch (consec=7 total=72) | |
| step 15921 non-finite FORWARD loss -> skip batch (consec=8 total=73) | |
| step 15922 non-finite FORWARD loss -> skip batch (consec=9 total=74) | |
| step 15923 non-finite FORWARD loss -> skip batch (consec=10 total=75) | |
| step 15924 non-finite FORWARD loss -> skip batch (consec=11 total=76) | |
| step 15925 non-finite FORWARD loss -> skip batch (consec=12 total=77) | |
| step 15926 non-finite FORWARD loss -> skip batch (consec=13 total=78) | |
| step 15927 non-finite FORWARD loss -> skip batch (consec=14 total=79) | |
| step 15928 non-finite FORWARD loss -> skip batch (consec=15 total=80) | |
| step 15929 non-finite FORWARD loss -> skip batch (consec=16 total=81) | |
| step 15930 non-finite FORWARD loss -> skip batch (consec=17 total=82) | |
| step 15931 non-finite FORWARD loss -> skip batch (consec=18 total=83) | |
| step 15932 non-finite FORWARD loss -> skip batch (consec=19 total=84) | |
| step 15933 non-finite FORWARD loss -> skip batch (consec=20 total=85) | |
| step 15934 non-finite FORWARD loss -> skip batch (consec=21 total=86) | |
| step 15935 non-finite FORWARD loss -> skip batch (consec=22 total=87) | |
| step 15936 non-finite FORWARD loss -> skip batch (consec=23 total=88) | |
| step 15937 non-finite FORWARD loss -> skip batch (consec=24 total=89) | |
| step 15938 non-finite FORWARD loss -> skip batch (consec=25 total=90) | |
| step 15939 non-finite FORWARD loss -> skip batch (consec=26 total=91) | |
| step 15940 non-finite FORWARD loss -> skip batch (consec=27 total=92) | |
| step 15941 non-finite FORWARD loss -> skip batch (consec=28 total=93) | |
| step 15942 non-finite FORWARD loss -> skip batch (consec=29 total=94) | |
| step 15943 non-finite FORWARD loss -> skip batch (consec=30 total=95) | |
| step 15944 non-finite FORWARD loss -> skip batch (consec=31 total=96) | |
| step 15945 non-finite FORWARD loss -> skip batch (consec=32 total=97) | |
| step 15946 non-finite FORWARD loss -> skip batch (consec=33 total=98) | |
| step 15947 non-finite FORWARD loss -> skip batch (consec=34 total=99) | |
| step 15948 non-finite FORWARD loss -> skip batch (consec=35 total=100) | |
| step 15949 non-finite FORWARD loss -> skip batch (consec=36 total=101) | |
| step 15950 non-finite FORWARD loss -> skip batch (consec=37 total=102) | |
| step 15952 non-finite FORWARD loss -> skip batch (consec=1 total=103) | |
| step 15953 non-finite FORWARD loss -> skip batch (consec=2 total=104) | |
| step 15954 non-finite FORWARD loss -> skip batch (consec=3 total=105) | |
| step 15955 non-finite FORWARD loss -> skip batch (consec=4 total=106) | |
| step 15956 non-finite FORWARD loss -> skip batch (consec=5 total=107) | |
| step 15957 non-finite FORWARD loss -> skip batch (consec=6 total=108) | |
| step 15958 non-finite FORWARD loss -> skip batch (consec=7 total=109) | |
| step 15959 non-finite FORWARD loss -> skip batch (consec=8 total=110) | |
| step 15960 non-finite FORWARD loss -> skip batch (consec=9 total=111) | |
| step 15961 non-finite FORWARD loss -> skip batch (consec=10 total=112) | |
| step 15962 non-finite FORWARD loss -> skip batch (consec=11 total=113) | |
| step 15963 non-finite FORWARD loss -> skip batch (consec=12 total=114) | |
| step 15964 non-finite FORWARD loss -> skip batch (consec=13 total=115) | |
| step 15965 non-finite FORWARD loss -> skip batch (consec=14 total=116) | |
| step 15966 non-finite FORWARD loss -> skip batch (consec=15 total=117) | |
| step 15967 non-finite FORWARD loss -> skip batch (consec=16 total=118) | |
| step 15968 non-finite FORWARD loss -> skip batch (consec=17 total=119) | |
| step 15969 non-finite FORWARD loss -> skip batch (consec=18 total=120) | |
| step 15970 non-finite FORWARD loss -> skip batch (consec=19 total=121) | |
| step 15971 non-finite FORWARD loss -> skip batch (consec=20 total=122) | |
| step 15972 non-finite FORWARD loss -> skip batch (consec=21 total=123) | |
| step 15973 non-finite FORWARD loss -> skip batch (consec=22 total=124) | |
| step 15974 non-finite FORWARD loss -> skip batch (consec=23 total=125) | |
| step 15975 non-finite FORWARD loss -> skip batch (consec=24 total=126) | |
| step 15976 non-finite FORWARD loss -> skip batch (consec=25 total=127) | |
| step 15977 non-finite FORWARD loss -> skip batch (consec=26 total=128) | |
| step 15978 non-finite FORWARD loss -> skip batch (consec=27 total=129) | |
| step 15979 non-finite FORWARD loss -> skip batch (consec=28 total=130) | |
| step 15980 non-finite FORWARD loss -> skip batch (consec=29 total=131) | |
| step 15981 non-finite FORWARD loss -> skip batch (consec=30 total=132) | |
| step 15982 non-finite FORWARD loss -> skip batch (consec=31 total=133) | |
| step 15983 non-finite FORWARD loss -> skip batch (consec=32 total=134) | |
| step 15984 non-finite FORWARD loss -> skip batch (consec=33 total=135) | |
| step 15985 non-finite FORWARD loss -> skip batch (consec=34 total=136) | |
| step 15986 non-finite FORWARD loss -> skip batch (consec=35 total=137) | |
| step 15987 non-finite FORWARD loss -> skip batch (consec=36 total=138) | |
| step 15988 non-finite FORWARD loss -> skip batch (consec=37 total=139) | |
| step 15989 non-finite FORWARD loss -> skip batch (consec=38 total=140) | |
| step 15990 non-finite FORWARD loss -> skip batch (consec=39 total=141) | |
| step 15991 non-finite FORWARD loss -> skip batch (consec=40 total=142) | |
| step 15992 non-finite FORWARD loss -> skip batch (consec=41 total=143) | |
| step 15993 non-finite FORWARD loss -> skip batch (consec=42 total=144) | |
| step 15994 non-finite FORWARD loss -> skip batch (consec=43 total=145) | |
| step 15995 non-finite FORWARD loss -> skip batch (consec=44 total=146) | |
| step 15996 non-finite FORWARD loss -> skip batch (consec=45 total=147) | |
| step 15997 non-finite FORWARD loss -> skip batch (consec=46 total=148) | |
| step 15998 non-finite FORWARD loss -> skip batch (consec=47 total=149) | |
| step 15999 non-finite FORWARD loss -> skip batch (consec=48 total=150) | |
| step 16000 non-finite FORWARD loss -> skip batch (consec=49 total=151) | |
| step 16001 non-finite FORWARD loss -> skip batch (consec=50 total=152) | |
| step 16002 non-finite FORWARD loss -> skip batch (consec=51 total=153) | |
| step 16002 >50 CONSECUTIVE bad -> ABORT | |
| DONE {"final_train_loss": 1.0577669143676758, "best_eval_loss": 1.707251207033793, "steps": 16003, "tokens_seen": 129843200, "active_M": 92.829508, "total_M": 246.183748, "wall_hours": 0.6253413446082009, "planned_tokens": 700000000.0, "total_steps": 85449} | |
| step 15000 loss=0.4845 lr=1.88e-04 scale=16384 cv=0.468 tok=122888192 tok/s=6407 elapsed=0.01h | |
| step 15020 loss=0.6334 lr=1.88e-04 scale=16384 cv=0.393 tok=123052032 tok/s=16701 elapsed=0.03h | |
| step 15040 loss=1.0038 lr=1.88e-04 scale=16384 cv=0.422 tok=123215872 tok/s=16698 elapsed=0.03h | |
| step 15060 loss=1.2637 lr=1.88e-04 scale=16384 cv=0.438 tok=123379712 tok/s=16685 elapsed=0.03h | |
| step 15080 loss=1.6484 lr=1.88e-04 scale=16384 cv=0.430 tok=123543552 tok/s=16723 elapsed=0.04h | |
| step 15100 loss=1.0305 lr=1.88e-04 scale=16384 cv=0.505 tok=123707392 tok/s=16805 elapsed=0.04h | |
| step 15120 loss=1.4862 lr=1.88e-04 scale=16384 cv=0.436 tok=123871232 tok/s=16728 elapsed=0.04h | |
| step 15140 loss=1.0962 lr=1.88e-04 scale=16384 cv=0.428 tok=124035072 tok/s=16707 elapsed=0.05h | |
| step 15160 loss=1.3734 lr=1.88e-04 scale=16384 cv=0.446 tok=124198912 tok/s=16821 elapsed=0.05h | |
| step 15180 loss=0.7424 lr=1.88e-04 scale=16384 cv=0.430 tok=124362752 tok/s=16757 elapsed=0.05h | |
| step 15200 loss=1.7877 lr=1.88e-04 scale=32768 cv=0.466 tok=124526592 tok/s=16838 elapsed=0.05h | |
| step 15220 loss=1.3543 lr=1.88e-04 scale=32768 cv=0.441 tok=124690432 tok/s=16823 elapsed=0.06h | |
| step 15240 loss=0.6638 lr=1.88e-04 scale=32768 cv=0.428 tok=124854272 tok/s=16400 elapsed=0.06h | |
| step 15260 loss=0.1256 lr=1.88e-04 scale=32768 cv=0.479 tok=125018112 tok/s=16308 elapsed=0.06h | |
| step 15280 loss=0.1287 lr=1.88e-04 scale=32768 cv=0.471 tok=125181952 tok/s=16250 elapsed=0.07h | |
| step 15300 loss=0.8572 lr=1.88e-04 scale=32768 cv=0.434 tok=125345792 tok/s=16321 elapsed=0.07h | |
| step 15320 loss=1.7203 lr=1.88e-04 scale=32768 cv=0.447 tok=125509632 tok/s=16637 elapsed=0.07h | |
| step 15340 loss=1.2526 lr=1.88e-04 scale=32768 cv=0.415 tok=125673472 tok/s=16824 elapsed=0.07h | |
| step 15360 loss=0.7632 lr=1.87e-04 scale=32768 cv=0.492 tok=125837312 tok/s=16834 elapsed=0.08h | |
| step 15380 loss=1.1817 lr=1.87e-04 scale=32768 cv=0.393 tok=126001152 tok/s=16833 elapsed=0.08h | |
| step 15400 loss=0.4012 lr=1.87e-04 scale=65536 cv=0.464 tok=126164992 tok/s=16794 elapsed=0.08h | |
| step 15420 loss=1.3488 lr=1.87e-04 scale=65536 cv=0.469 tok=126328832 tok/s=16840 elapsed=0.09h | |
| step 15440 loss=1.5866 lr=1.87e-04 scale=65536 cv=0.471 tok=126492672 tok/s=16736 elapsed=0.09h | |
| step 15460 loss=1.0445 lr=1.87e-04 scale=65536 cv=0.406 tok=126656512 tok/s=16538 elapsed=0.09h | |
| step 15480 loss=1.0077 lr=1.87e-04 scale=65536 cv=0.401 tok=126820352 tok/s=16769 elapsed=0.09h | |
| step 15500 loss=0.9300 lr=1.87e-04 scale=65536 cv=0.448 tok=126984192 tok/s=16717 elapsed=0.10h | |
| step 15520 loss=1.4323 lr=1.87e-04 scale=65536 cv=0.443 tok=127148032 tok/s=16785 elapsed=0.10h | |
| step 15540 loss=1.6682 lr=1.87e-04 scale=65536 cv=0.446 tok=127311872 tok/s=16845 elapsed=0.10h | |
| step 15560 loss=1.5895 lr=1.87e-04 scale=65536 cv=0.485 tok=127475712 tok/s=16783 elapsed=0.11h | |
| step 15580 loss=1.4863 lr=1.87e-04 scale=65536 cv=0.500 tok=127639552 tok/s=16782 elapsed=0.11h | |
| step 15600 loss=1.6420 lr=1.87e-04 scale=65536 cv=0.435 tok=127803392 tok/s=16871 elapsed=0.11h | |
| step 15620 loss=1.6208 lr=1.87e-04 scale=65536 cv=0.439 tok=127967232 tok/s=16805 elapsed=0.11h | |
| step 15640 loss=0.7571 lr=1.87e-04 scale=65536 cv=0.420 tok=128131072 tok/s=16809 elapsed=0.12h | |
| step 15660 loss=1.3322 lr=1.87e-04 scale=65536 cv=0.386 tok=128294912 tok/s=16788 elapsed=0.12h | |
| step 15680 loss=1.5637 lr=1.87e-04 scale=65536 cv=0.414 tok=128458752 tok/s=16854 elapsed=0.12h | |
| step 15700 loss=1.3889 lr=1.87e-04 scale=65536 cv=0.424 tok=128622592 tok/s=16739 elapsed=0.12h | |
| step 15720 loss=0.3753 lr=1.87e-04 scale=65536 cv=0.430 tok=128786432 tok/s=16619 elapsed=0.13h | |
| step 15740 loss=1.3412 lr=1.87e-04 scale=65536 cv=0.411 tok=128950272 tok/s=16458 elapsed=0.13h | |
| step 15760 loss=0.1244 lr=1.87e-04 scale=65536 cv=0.512 tok=129114112 tok/s=16475 elapsed=0.13h | |
| step 15780 loss=1.2094 lr=1.87e-04 scale=65536 cv=0.424 tok=129277952 tok/s=16559 elapsed=0.14h | |
| step 15800 loss=0.8526 lr=1.87e-04 scale=65536 cv=0.422 tok=129441792 tok/s=16878 elapsed=0.14h | |
| step 15820 loss=0.8095 lr=1.87e-04 scale=65536 cv=0.393 tok=129605632 tok/s=16874 elapsed=0.14h | |
| step 15840 loss=1.2883 lr=1.87e-04 scale=65536 cv=0.387 tok=129769472 tok/s=16866 elapsed=0.14h | |
| step 15860 loss=0.6043 lr=1.87e-04 scale=65536 cv=0.423 tok=129933312 tok/s=16925 elapsed=0.15h | |
| step 15880 loss=1.0995 lr=1.87e-04 scale=65536 cv=0.402 tok=130097152 tok/s=16923 elapsed=0.15h | |
| step 15900 loss=1.4726 lr=1.87e-04 scale=65536 cv=0.419 tok=130260992 tok/s=16984 elapsed=0.15h | |
| step 15920 loss=1.6395 lr=1.87e-04 scale=65536 cv=0.442 tok=130424832 tok/s=17092 elapsed=0.16h | |
| step 15940 loss=1.5244 lr=1.87e-04 scale=65536 cv=0.420 tok=130588672 tok/s=17021 elapsed=0.16h | |
| step 15960 loss=0.8915 lr=1.86e-04 scale=65536 cv=0.407 tok=130752512 tok/s=17075 elapsed=0.16h | |
| step 15980 loss=0.1157 lr=1.86e-04 scale=65536 cv=0.506 tok=130916352 tok/s=17021 elapsed=0.16h | |
| step 16000 loss=0.5067 lr=1.86e-04 scale=65536 cv=0.401 tok=131080192 tok/s=17018 elapsed=0.17h | |
| step 16020 loss=1.4179 lr=1.86e-04 scale=65536 cv=0.391 tok=131244032 tok/s=17089 elapsed=0.19h | |
| step 16040 loss=0.9749 lr=1.86e-04 scale=65536 cv=0.380 tok=131407872 tok/s=17183 elapsed=0.19h | |
| step 16060 loss=1.5020 lr=1.86e-04 scale=65536 cv=0.416 tok=131571712 tok/s=17152 elapsed=0.19h | |
| step 16080 loss=1.7635 lr=1.86e-04 scale=65536 cv=0.408 tok=131735552 tok/s=17228 elapsed=0.19h | |
| step 16100 loss=1.5013 lr=1.86e-04 scale=65536 cv=0.418 tok=131899392 tok/s=17206 elapsed=0.20h | |
| step 16120 loss=0.5353 lr=1.86e-04 scale=65536 cv=0.365 tok=132063232 tok/s=17144 elapsed=0.20h | |
| step 16140 loss=1.1476 lr=1.86e-04 scale=65536 cv=0.383 tok=132227072 tok/s=17254 elapsed=0.20h | |
| step 16160 loss=1.3268 lr=1.86e-04 scale=65536 cv=0.384 tok=132390912 tok/s=16985 elapsed=0.20h | |
| step 16180 loss=0.8568 lr=1.86e-04 scale=65536 cv=0.426 tok=132554752 tok/s=16788 elapsed=0.21h | |
| step 16200 loss=0.5981 lr=1.86e-04 scale=65536 cv=0.433 tok=132718592 tok/s=16531 elapsed=0.21h | |
| step 16220 loss=0.8451 lr=1.86e-04 scale=65536 cv=0.409 tok=132882432 tok/s=16813 elapsed=0.21h | |
| step 16240 loss=0.9754 lr=1.86e-04 scale=65536 cv=0.412 tok=133046272 tok/s=16804 elapsed=0.22h | |
| step 16260 loss=1.3365 lr=1.86e-04 scale=65536 cv=0.409 tok=133210112 tok/s=16827 elapsed=0.22h | |
| step 16280 loss=1.3153 lr=1.86e-04 scale=65536 cv=0.410 tok=133373952 tok/s=16840 elapsed=0.22h | |
| step 16300 loss=1.7327 lr=1.86e-04 scale=65536 cv=0.402 tok=133537792 tok/s=16822 elapsed=0.22h | |
| step 16320 loss=1.5256 lr=1.86e-04 scale=65536 cv=0.443 tok=133701632 tok/s=16825 elapsed=0.23h | |
| step 16340 loss=1.5045 lr=1.86e-04 scale=65536 cv=0.459 tok=133865472 tok/s=16836 elapsed=0.23h | |
| step 16360 loss=1.0434 lr=1.86e-04 scale=65536 cv=0.453 tok=134029312 tok/s=16581 elapsed=0.23h | |
| step 16380 loss=1.3779 lr=1.86e-04 scale=65536 cv=0.456 tok=134193152 tok/s=17081 elapsed=0.24h | |
| step 16400 loss=1.1053 lr=1.86e-04 scale=65536 cv=0.441 tok=134356992 tok/s=17072 elapsed=0.24h | |
| step 16420 loss=1.2003 lr=1.86e-04 scale=65536 cv=0.401 tok=134520832 tok/s=17011 elapsed=0.24h | |
| step 16440 loss=0.8354 lr=1.86e-04 scale=65536 cv=0.419 tok=134684672 tok/s=16946 elapsed=0.24h | |
| step 16460 loss=1.0490 lr=1.86e-04 scale=65536 cv=0.426 tok=134848512 tok/s=16951 elapsed=0.25h | |
| step 16480 loss=0.2476 lr=1.86e-04 scale=65536 cv=0.431 tok=135012352 tok/s=16934 elapsed=0.25h | |
| step 16500 loss=1.4425 lr=1.86e-04 scale=65536 cv=0.444 tok=135176192 tok/s=16975 elapsed=0.25h | |
| step 16520 loss=1.7797 lr=1.86e-04 scale=65536 cv=0.408 tok=135340032 tok/s=17042 elapsed=0.26h | |
| step 16540 loss=1.5650 lr=1.85e-04 scale=65536 cv=0.430 tok=135503872 tok/s=16981 elapsed=0.26h | |
| step 16560 loss=1.1961 lr=1.85e-04 scale=65536 cv=0.441 tok=135667712 tok/s=17034 elapsed=0.26h | |
| step 16580 loss=1.5641 lr=1.85e-04 scale=65536 cv=0.413 tok=135831552 tok/s=16916 elapsed=0.26h | |
| step 16600 loss=1.7545 lr=1.85e-04 scale=65536 cv=0.407 tok=135995392 tok/s=16285 elapsed=0.27h | |
| step 16620 loss=1.5239 lr=1.85e-04 scale=65536 cv=0.413 tok=136159232 tok/s=16784 elapsed=0.27h | |
| step 16640 loss=1.2448 lr=1.85e-04 scale=65536 cv=0.404 tok=136323072 tok/s=16747 elapsed=0.27h | |
| step 16660 loss=1.1109 lr=1.85e-04 scale=65536 cv=0.392 tok=136486912 tok/s=16781 elapsed=0.28h | |
| step 16680 loss=0.5190 lr=1.85e-04 scale=65536 cv=0.371 tok=136650752 tok/s=16808 elapsed=0.28h | |
| step 16700 loss=0.7211 lr=1.85e-04 scale=65536 cv=0.373 tok=136814592 tok/s=16788 elapsed=0.28h | |
| step 16720 loss=1.2394 lr=1.85e-04 scale=65536 cv=0.399 tok=136978432 tok/s=16744 elapsed=0.28h | |
| step 16740 loss=1.3600 lr=1.85e-04 scale=65536 cv=0.400 tok=137142272 tok/s=16710 elapsed=0.29h | |
| step 16760 loss=1.5539 lr=1.85e-04 scale=65536 cv=0.462 tok=137306112 tok/s=16739 elapsed=0.29h | |
| step 16780 loss=1.5451 lr=1.85e-04 scale=65536 cv=0.407 tok=137469952 tok/s=16812 elapsed=0.29h | |
| step 16800 loss=0.8878 lr=1.85e-04 scale=65536 cv=0.336 tok=137633792 tok/s=16779 elapsed=0.30h | |
| step 16820 loss=0.9131 lr=1.85e-04 scale=65536 cv=0.370 tok=137797632 tok/s=16793 elapsed=0.30h | |
| step 16840 loss=0.1917 lr=1.85e-04 scale=65536 cv=0.383 tok=137961472 tok/s=16856 elapsed=0.30h | |
| step 16860 loss=1.3654 lr=1.85e-04 scale=65536 cv=0.379 tok=138125312 tok/s=16874 elapsed=0.30h | |
| step 16880 loss=1.5272 lr=1.85e-04 scale=65536 cv=0.350 tok=138289152 tok/s=16760 elapsed=0.31h | |
| step 16900 loss=1.3067 lr=1.85e-04 scale=65536 cv=0.359 tok=138452992 tok/s=16867 elapsed=0.31h | |
| step 16920 loss=0.8523 lr=1.85e-04 scale=65536 cv=0.376 tok=138616832 tok/s=16797 elapsed=0.31h | |
| step 16940 loss=0.5785 lr=1.85e-04 scale=65536 cv=0.362 tok=138780672 tok/s=16732 elapsed=0.31h | |
| step 16960 loss=0.9687 lr=1.85e-04 scale=65536 cv=0.442 tok=138944512 tok/s=16327 elapsed=0.32h | |
| step 16980 loss=1.4204 lr=1.85e-04 scale=65536 cv=0.404 tok=139108352 tok/s=16342 elapsed=0.32h | |
| step 17000 loss=0.8526 lr=1.85e-04 scale=65536 cv=0.379 tok=139272192 tok/s=16775 elapsed=0.32h | |
| step 17020 loss=1.2295 lr=1.85e-04 scale=65536 cv=0.397 tok=139436032 tok/s=16082 elapsed=0.34h | |
| step 17040 loss=1.6363 lr=1.85e-04 scale=65536 cv=0.387 tok=139599872 tok/s=16788 elapsed=0.34h | |
| step 17060 loss=1.0252 lr=1.85e-04 scale=65536 cv=0.331 tok=139763712 tok/s=16698 elapsed=0.34h | |
| step 17080 loss=1.6301 lr=1.85e-04 scale=65536 cv=0.364 tok=139927552 tok/s=16729 elapsed=0.35h | |
| step 17100 loss=1.0383 lr=1.84e-04 scale=65536 cv=0.345 tok=140091392 tok/s=16721 elapsed=0.35h | |
| step 17120 loss=1.5318 lr=1.84e-04 scale=65536 cv=0.389 tok=140255232 tok/s=16767 elapsed=0.35h | |
| step 17140 loss=0.9737 lr=1.84e-04 scale=65536 cv=0.308 tok=140419072 tok/s=16707 elapsed=0.35h | |
| step 17160 loss=1.7041 lr=1.84e-04 scale=65536 cv=0.372 tok=140582912 tok/s=16753 elapsed=0.36h | |
| step 17180 loss=1.5401 lr=1.84e-04 scale=65536 cv=0.370 tok=140746752 tok/s=16769 elapsed=0.36h | |
| step 17200 loss=1.4628 lr=1.84e-04 scale=65536 cv=0.378 tok=140910592 tok/s=16484 elapsed=0.36h | |
| step 17220 loss=1.1523 lr=1.84e-04 scale=65536 cv=0.355 tok=141074432 tok/s=16684 elapsed=0.37h | |
| step 17240 loss=1.6000 lr=1.84e-04 scale=65536 cv=0.370 tok=141238272 tok/s=16733 elapsed=0.37h | |
| step 17260 loss=0.9097 lr=1.84e-04 scale=65536 cv=0.343 tok=141402112 tok/s=16634 elapsed=0.37h | |
| step 17280 loss=1.6984 lr=1.84e-04 scale=65536 cv=0.369 tok=141565952 tok/s=16667 elapsed=0.37h | |
| step 17300 loss=1.6789 lr=1.84e-04 scale=65536 cv=0.381 tok=141729792 tok/s=16770 elapsed=0.38h | |
| step 17320 loss=0.9394 lr=1.84e-04 scale=65536 cv=0.369 tok=141893632 tok/s=16746 elapsed=0.38h | |
| step 17340 loss=1.5527 lr=1.84e-04 scale=65536 cv=0.355 tok=142057472 tok/s=16807 elapsed=0.38h | |
| step 17360 loss=1.4358 lr=1.84e-04 scale=65536 cv=0.410 tok=142221312 tok/s=16793 elapsed=0.39h | |
| step 17380 loss=1.2476 lr=1.84e-04 scale=65536 cv=0.380 tok=142385152 tok/s=16445 elapsed=0.39h | |
| step 17400 loss=1.3929 lr=1.84e-04 scale=65536 cv=0.385 tok=142548992 tok/s=16307 elapsed=0.39h | |
| step 17407 non-finite FORWARD loss -> skip batch (consec=1 total=1) | |
| step 17408 non-finite FORWARD loss -> skip batch (consec=2 total=2) | |
| step 17409 non-finite FORWARD loss -> skip batch (consec=3 total=3) | |
| step 17410 non-finite FORWARD loss -> skip batch (consec=4 total=4) | |
| step 17411 non-finite FORWARD loss -> skip batch (consec=5 total=5) | |
| step 17412 non-finite FORWARD loss -> skip batch (consec=6 total=6) | |
| step 17413 non-finite FORWARD loss -> skip batch (consec=7 total=7) | |
| step 17414 non-finite FORWARD loss -> skip batch (consec=8 total=8) | |
| step 17415 non-finite FORWARD loss -> skip batch (consec=9 total=9) | |
| step 17416 non-finite FORWARD loss -> skip batch (consec=10 total=10) | |
| step 17417 non-finite FORWARD loss -> skip batch (consec=11 total=11) | |
| step 17418 non-finite FORWARD loss -> skip batch (consec=12 total=12) | |
| step 17419 non-finite FORWARD loss -> skip batch (consec=13 total=13) | |
| step 17420 non-finite FORWARD loss -> skip batch (consec=14 total=14) | |
| step 17421 non-finite FORWARD loss -> skip batch (consec=15 total=15) | |
| step 17422 non-finite FORWARD loss -> skip batch (consec=16 total=16) | |
| step 17423 non-finite FORWARD loss -> skip batch (consec=17 total=17) | |
| step 17424 non-finite FORWARD loss -> skip batch (consec=18 total=18) | |
| step 17425 non-finite FORWARD loss -> skip batch (consec=19 total=19) | |
| step 17426 non-finite FORWARD loss -> skip batch (consec=20 total=20) | |
| step 17427 non-finite FORWARD loss -> skip batch (consec=21 total=21) | |
| step 17428 non-finite FORWARD loss -> skip batch (consec=22 total=22) | |
| step 17429 non-finite FORWARD loss -> skip batch (consec=23 total=23) | |
| step 17430 non-finite FORWARD loss -> skip batch (consec=24 total=24) | |
| step 17431 non-finite FORWARD loss -> skip batch (consec=25 total=25) | |
| step 17432 non-finite FORWARD loss -> skip batch (consec=26 total=26) | |
| step 17433 non-finite FORWARD loss -> skip batch (consec=27 total=27) | |
| step 17434 non-finite FORWARD loss -> skip batch (consec=28 total=28) | |
| step 17435 non-finite FORWARD loss -> skip batch (consec=29 total=29) | |
| step 17436 non-finite FORWARD loss -> skip batch (consec=30 total=30) | |
| step 17437 non-finite FORWARD loss -> skip batch (consec=31 total=31) | |
| step 17438 non-finite FORWARD loss -> skip batch (consec=32 total=32) | |
| step 17439 non-finite FORWARD loss -> skip batch (consec=33 total=33) | |
| step 17440 non-finite FORWARD loss -> skip batch (consec=34 total=34) | |
| step 17441 non-finite FORWARD loss -> skip batch (consec=35 total=35) | |
| step 17442 non-finite FORWARD loss -> skip batch (consec=36 total=36) | |
| step 17443 non-finite FORWARD loss -> skip batch (consec=37 total=37) | |
| step 17444 non-finite FORWARD loss -> skip batch (consec=38 total=38) | |
| step 17445 non-finite FORWARD loss -> skip batch (consec=39 total=39) | |
| step 17446 non-finite FORWARD loss -> skip batch (consec=40 total=40) | |
| step 17447 non-finite FORWARD loss -> skip batch (consec=41 total=41) | |
| step 17448 non-finite FORWARD loss -> skip batch (consec=42 total=42) | |
| step 17449 non-finite FORWARD loss -> skip batch (consec=43 total=43) | |
| step 17450 non-finite FORWARD loss -> skip batch (consec=44 total=44) | |
| step 17451 non-finite FORWARD loss -> skip batch (consec=45 total=45) | |
| step 17452 non-finite FORWARD loss -> skip batch (consec=46 total=46) | |
| step 17453 non-finite FORWARD loss -> skip batch (consec=47 total=47) | |
| step 17454 non-finite FORWARD loss -> skip batch (consec=48 total=48) | |
| step 17455 non-finite FORWARD loss -> skip batch (consec=49 total=49) | |
| step 17456 non-finite FORWARD loss -> skip batch (consec=50 total=50) | |
| step 17457 non-finite FORWARD loss -> skip batch (consec=51 total=51) | |
| step 17458 non-finite FORWARD loss -> skip batch (consec=52 total=52) | |
| step 17459 non-finite FORWARD loss -> skip batch (consec=53 total=53) | |
| step 17460 non-finite FORWARD loss -> skip batch (consec=54 total=54) | |
| step 17461 non-finite FORWARD loss -> skip batch (consec=55 total=55) | |
| step 17462 non-finite FORWARD loss -> skip batch (consec=56 total=56) | |
| step 17463 non-finite FORWARD loss -> skip batch (consec=57 total=57) | |
| step 17464 non-finite FORWARD loss -> skip batch (consec=58 total=58) | |
| step 17465 non-finite FORWARD loss -> skip batch (consec=59 total=59) | |
| step 17466 non-finite FORWARD loss -> skip batch (consec=60 total=60) | |
| step 17467 non-finite FORWARD loss -> skip batch (consec=61 total=61) | |
| step 17468 non-finite FORWARD loss -> skip batch (consec=62 total=62) | |
| step 17469 non-finite FORWARD loss -> skip batch (consec=63 total=63) | |
| step 17470 non-finite FORWARD loss -> skip batch (consec=64 total=64) | |
| step 17471 non-finite FORWARD loss -> skip batch (consec=65 total=65) | |
| step 17472 non-finite FORWARD loss -> skip batch (consec=66 total=66) | |
| step 17473 non-finite FORWARD loss -> skip batch (consec=67 total=67) | |
| step 17474 non-finite FORWARD loss -> skip batch (consec=68 total=68) | |
| step 17475 non-finite FORWARD loss -> skip batch (consec=69 total=69) | |
| step 17476 non-finite FORWARD loss -> skip batch (consec=70 total=70) | |
| step 17477 non-finite FORWARD loss -> skip batch (consec=71 total=71) | |
| step 17478 non-finite FORWARD loss -> skip batch (consec=72 total=72) | |
| step 17479 non-finite FORWARD loss -> skip batch (consec=73 total=73) | |
| step 17480 non-finite FORWARD loss -> skip batch (consec=74 total=74) | |
| step 17481 non-finite FORWARD loss -> skip batch (consec=75 total=75) | |
| step 17482 non-finite FORWARD loss -> skip batch (consec=76 total=76) | |
| step 17483 non-finite FORWARD loss -> skip batch (consec=77 total=77) | |
| step 17484 non-finite FORWARD loss -> skip batch (consec=78 total=78) | |
| step 17485 non-finite FORWARD loss -> skip batch (consec=79 total=79) | |
| step 17486 non-finite FORWARD loss -> skip batch (consec=80 total=80) | |
| step 17487 non-finite FORWARD loss -> skip batch (consec=81 total=81) | |
| step 17488 non-finite FORWARD loss -> skip batch (consec=82 total=82) | |
| step 17489 non-finite FORWARD loss -> skip batch (consec=83 total=83) | |
| step 17490 non-finite FORWARD loss -> skip batch (consec=84 total=84) | |
| step 17491 non-finite FORWARD loss -> skip batch (consec=85 total=85) | |
| step 17492 non-finite FORWARD loss -> skip batch (consec=86 total=86) | |
| step 17493 non-finite FORWARD loss -> skip batch (consec=87 total=87) | |
| step 17494 non-finite FORWARD loss -> skip batch (consec=88 total=88) | |
| step 17495 non-finite FORWARD loss -> skip batch (consec=89 total=89) | |
| step 17496 non-finite FORWARD loss -> skip batch (consec=90 total=90) | |
| step 17497 non-finite FORWARD loss -> skip batch (consec=91 total=91) | |
| step 17498 non-finite FORWARD loss -> skip batch (consec=92 total=92) | |
| step 17499 non-finite FORWARD loss -> skip batch (consec=93 total=93) | |
| step 17500 non-finite FORWARD loss -> skip batch (consec=94 total=94) | |
| step 17501 non-finite FORWARD loss -> skip batch (consec=95 total=95) | |
| step 17502 non-finite FORWARD loss -> skip batch (consec=96 total=96) | |
| step 17503 non-finite FORWARD loss -> skip batch (consec=97 total=97) | |
| step 17504 non-finite FORWARD loss -> skip batch (consec=98 total=98) | |
| step 17505 non-finite FORWARD loss -> skip batch (consec=99 total=99) | |
| step 17506 non-finite FORWARD loss -> skip batch (consec=100 total=100) | |
| step 17507 non-finite FORWARD loss -> skip batch (consec=101 total=101) | |
| step 17508 non-finite FORWARD loss -> skip batch (consec=102 total=102) | |
| step 17509 non-finite FORWARD loss -> skip batch (consec=103 total=103) | |
| step 17510 non-finite FORWARD loss -> skip batch (consec=104 total=104) | |
| step 17511 non-finite FORWARD loss -> skip batch (consec=105 total=105) | |
| step 17512 non-finite FORWARD loss -> skip batch (consec=106 total=106) | |
| step 17513 non-finite FORWARD loss -> skip batch (consec=107 total=107) | |
| step 17514 non-finite FORWARD loss -> skip batch (consec=108 total=108) | |
| step 17515 non-finite FORWARD loss -> skip batch (consec=109 total=109) | |
| step 17516 non-finite FORWARD loss -> skip batch (consec=110 total=110) | |
| step 17517 non-finite FORWARD loss -> skip batch (consec=111 total=111) | |
| step 17518 non-finite FORWARD loss -> skip batch (consec=112 total=112) | |
| step 17519 non-finite FORWARD loss -> skip batch (consec=113 total=113) | |
| step 17520 non-finite FORWARD loss -> skip batch (consec=114 total=114) | |
| step 17521 non-finite FORWARD loss -> skip batch (consec=115 total=115) | |
| step 17522 non-finite FORWARD loss -> skip batch (consec=116 total=116) | |
| step 17523 non-finite FORWARD loss -> skip batch (consec=117 total=117) | |
| step 17524 non-finite FORWARD loss -> skip batch (consec=118 total=118) | |
| step 17525 non-finite FORWARD loss -> skip batch (consec=119 total=119) | |
| step 17526 non-finite FORWARD loss -> skip batch (consec=120 total=120) | |
| step 17527 non-finite FORWARD loss -> skip batch (consec=121 total=121) | |
| step 17528 non-finite FORWARD loss -> skip batch (consec=122 total=122) | |
| step 17529 non-finite FORWARD loss -> skip batch (consec=123 total=123) | |
| step 17530 non-finite FORWARD loss -> skip batch (consec=124 total=124) | |
| step 17531 non-finite FORWARD loss -> skip batch (consec=125 total=125) | |
| step 17532 non-finite FORWARD loss -> skip batch (consec=126 total=126) | |
| step 17533 non-finite FORWARD loss -> skip batch (consec=127 total=127) | |
| step 17534 non-finite FORWARD loss -> skip batch (consec=128 total=128) | |
| step 17535 non-finite FORWARD loss -> skip batch (consec=129 total=129) | |
| step 17536 non-finite FORWARD loss -> skip batch (consec=130 total=130) | |
| step 17537 non-finite FORWARD loss -> skip batch (consec=131 total=131) | |
| step 17538 non-finite FORWARD loss -> skip batch (consec=132 total=132) | |
| step 17539 non-finite FORWARD loss -> skip batch (consec=133 total=133) | |
| step 17540 non-finite FORWARD loss -> skip batch (consec=134 total=134) | |
| step 17541 non-finite FORWARD loss -> skip batch (consec=135 total=135) | |
| step 17542 non-finite FORWARD loss -> skip batch (consec=136 total=136) | |
| step 17543 non-finite FORWARD loss -> skip batch (consec=137 total=137) | |
| step 17544 non-finite FORWARD loss -> skip batch (consec=138 total=138) | |
| step 17545 non-finite FORWARD loss -> skip batch (consec=139 total=139) | |
| step 17546 non-finite FORWARD loss -> skip batch (consec=140 total=140) | |
| step 17547 non-finite FORWARD loss -> skip batch (consec=141 total=141) | |
| step 17548 non-finite FORWARD loss -> skip batch (consec=142 total=142) | |
| step 17549 non-finite FORWARD loss -> skip batch (consec=143 total=143) | |
| step 17550 non-finite FORWARD loss -> skip batch (consec=144 total=144) | |
| step 17551 non-finite FORWARD loss -> skip batch (consec=145 total=145) | |
| step 17552 non-finite FORWARD loss -> skip batch (consec=146 total=146) | |
| step 17553 non-finite FORWARD loss -> skip batch (consec=147 total=147) | |
| step 17554 non-finite FORWARD loss -> skip batch (consec=148 total=148) | |
| step 17555 non-finite FORWARD loss -> skip batch (consec=149 total=149) | |
| step 17556 non-finite FORWARD loss -> skip batch (consec=150 total=150) | |
| step 17557 non-finite FORWARD loss -> skip batch (consec=151 total=151) | |
| step 17558 non-finite FORWARD loss -> skip batch (consec=152 total=152) | |
| step 17559 non-finite FORWARD loss -> skip batch (consec=153 total=153) | |
| step 17560 non-finite FORWARD loss -> skip batch (consec=154 total=154) | |
| step 17561 non-finite FORWARD loss -> skip batch (consec=155 total=155) | |
| step 17562 non-finite FORWARD loss -> skip batch (consec=156 total=156) | |
| step 17563 non-finite FORWARD loss -> skip batch (consec=157 total=157) | |
| step 17564 non-finite FORWARD loss -> skip batch (consec=158 total=158) | |
| step 17565 non-finite FORWARD loss -> skip batch (consec=159 total=159) | |
| step 17566 non-finite FORWARD loss -> skip batch (consec=160 total=160) | |
| step 17567 non-finite FORWARD loss -> skip batch (consec=161 total=161) | |
| step 17568 non-finite FORWARD loss -> skip batch (consec=162 total=162) | |
| step 17569 non-finite FORWARD loss -> skip batch (consec=163 total=163) | |
| step 17570 non-finite FORWARD loss -> skip batch (consec=164 total=164) | |
| step 17571 non-finite FORWARD loss -> skip batch (consec=165 total=165) | |
| step 17572 non-finite FORWARD loss -> skip batch (consec=166 total=166) | |
| step 17573 non-finite FORWARD loss -> skip batch (consec=167 total=167) | |
| step 17574 non-finite FORWARD loss -> skip batch (consec=168 total=168) | |
| step 17575 non-finite FORWARD loss -> skip batch (consec=169 total=169) | |
| step 17576 non-finite FORWARD loss -> skip batch (consec=170 total=170) | |
| step 17577 non-finite FORWARD loss -> skip batch (consec=171 total=171) | |
| step 17578 non-finite FORWARD loss -> skip batch (consec=172 total=172) | |
| step 17579 non-finite FORWARD loss -> skip batch (consec=173 total=173) | |
| step 17580 non-finite FORWARD loss -> skip batch (consec=174 total=174) | |
| step 17581 non-finite FORWARD loss -> skip batch (consec=175 total=175) | |
| step 17582 non-finite FORWARD loss -> skip batch (consec=176 total=176) | |
| step 17583 non-finite FORWARD loss -> skip batch (consec=177 total=177) | |
| step 17584 non-finite FORWARD loss -> skip batch (consec=178 total=178) | |
| step 17585 non-finite FORWARD loss -> skip batch (consec=179 total=179) | |
| step 17586 non-finite FORWARD loss -> skip batch (consec=180 total=180) | |
| step 17587 non-finite FORWARD loss -> skip batch (consec=181 total=181) | |
| step 17588 non-finite FORWARD loss -> skip batch (consec=182 total=182) | |
| step 17589 non-finite FORWARD loss -> skip batch (consec=183 total=183) | |
| step 17590 non-finite FORWARD loss -> skip batch (consec=184 total=184) | |
| step 17591 non-finite FORWARD loss -> skip batch (consec=185 total=185) | |
| step 17592 non-finite FORWARD loss -> skip batch (consec=186 total=186) | |
| step 17593 non-finite FORWARD loss -> skip batch (consec=187 total=187) | |
| step 17594 non-finite FORWARD loss -> skip batch (consec=188 total=188) | |
| step 17595 non-finite FORWARD loss -> skip batch (consec=189 total=189) | |
| step 17596 non-finite FORWARD loss -> skip batch (consec=190 total=190) | |
| step 17597 non-finite FORWARD loss -> skip batch (consec=191 total=191) | |
| step 17598 non-finite FORWARD loss -> skip batch (consec=192 total=192) | |
| step 17599 non-finite FORWARD loss -> skip batch (consec=193 total=193) | |
| step 17600 non-finite FORWARD loss -> skip batch (consec=194 total=194) | |
| step 17601 non-finite FORWARD loss -> skip batch (consec=195 total=195) | |
| step 17602 non-finite FORWARD loss -> skip batch (consec=196 total=196) | |
| step 17603 non-finite FORWARD loss -> skip batch (consec=197 total=197) | |
| step 17604 non-finite FORWARD loss -> skip batch (consec=198 total=198) | |
| step 17605 non-finite FORWARD loss -> skip batch (consec=199 total=199) | |
| step 17606 non-finite FORWARD loss -> skip batch (consec=200 total=200) | |
| step 17607 non-finite FORWARD loss -> skip batch (consec=201 total=201) | |
| step 17608 non-finite FORWARD loss -> skip batch (consec=202 total=202) | |
| step 17609 non-finite FORWARD loss -> skip batch (consec=203 total=203) | |
| step 17610 non-finite FORWARD loss -> skip batch (consec=204 total=204) | |
| step 17611 non-finite FORWARD loss -> skip batch (consec=205 total=205) | |
| step 17612 non-finite FORWARD loss -> skip batch (consec=206 total=206) | |
| step 17613 non-finite FORWARD loss -> skip batch (consec=207 total=207) | |
| step 17614 non-finite FORWARD loss -> skip batch (consec=208 total=208) | |
| step 17615 non-finite FORWARD loss -> skip batch (consec=209 total=209) | |
| step 17616 non-finite FORWARD loss -> skip batch (consec=210 total=210) | |
| step 17617 non-finite FORWARD loss -> skip batch (consec=211 total=211) | |
| step 17618 non-finite FORWARD loss -> skip batch (consec=212 total=212) | |
| step 17619 non-finite FORWARD loss -> skip batch (consec=213 total=213) | |
| step 17620 non-finite FORWARD loss -> skip batch (consec=214 total=214) | |
| step 17621 non-finite FORWARD loss -> skip batch (consec=215 total=215) | |
| step 17622 non-finite FORWARD loss -> skip batch (consec=216 total=216) | |
| step 17623 non-finite FORWARD loss -> skip batch (consec=217 total=217) | |
| step 17624 non-finite FORWARD loss -> skip batch (consec=218 total=218) | |
| step 17625 non-finite FORWARD loss -> skip batch (consec=219 total=219) | |
| step 17626 non-finite FORWARD loss -> skip batch (consec=220 total=220) | |
| step 17627 non-finite FORWARD loss -> skip batch (consec=221 total=221) | |
| step 17628 non-finite FORWARD loss -> skip batch (consec=222 total=222) | |
| step 17629 non-finite FORWARD loss -> skip batch (consec=223 total=223) | |
| step 17630 non-finite FORWARD loss -> skip batch (consec=224 total=224) | |
| step 17631 non-finite FORWARD loss -> skip batch (consec=225 total=225) | |
| step 17632 non-finite FORWARD loss -> skip batch (consec=226 total=226) | |
| step 17633 non-finite FORWARD loss -> skip batch (consec=227 total=227) | |
| step 17634 non-finite FORWARD loss -> skip batch (consec=228 total=228) | |
| step 17635 non-finite FORWARD loss -> skip batch (consec=229 total=229) | |
| step 17636 non-finite FORWARD loss -> skip batch (consec=230 total=230) | |
| step 17637 non-finite FORWARD loss -> skip batch (consec=231 total=231) | |
| step 17638 non-finite FORWARD loss -> skip batch (consec=232 total=232) | |
| step 17639 non-finite FORWARD loss -> skip batch (consec=233 total=233) | |
| step 17640 non-finite FORWARD loss -> skip batch (consec=234 total=234) | |
| step 17641 non-finite FORWARD loss -> skip batch (consec=235 total=235) | |
| step 17642 non-finite FORWARD loss -> skip batch (consec=236 total=236) | |
| step 17643 non-finite FORWARD loss -> skip batch (consec=237 total=237) | |
| step 17644 non-finite FORWARD loss -> skip batch (consec=238 total=238) | |
| step 17645 non-finite FORWARD loss -> skip batch (consec=239 total=239) | |
| step 17646 non-finite FORWARD loss -> skip batch (consec=240 total=240) | |
| step 17647 non-finite FORWARD loss -> skip batch (consec=241 total=241) | |
| step 17648 non-finite FORWARD loss -> skip batch (consec=242 total=242) | |
| step 17649 non-finite FORWARD loss -> skip batch (consec=243 total=243) | |
| step 17650 non-finite FORWARD loss -> skip batch (consec=244 total=244) | |
| step 17651 non-finite FORWARD loss -> skip batch (consec=245 total=245) | |
| step 17652 non-finite FORWARD loss -> skip batch (consec=246 total=246) | |
| step 17653 non-finite FORWARD loss -> skip batch (consec=247 total=247) | |
| step 17654 non-finite FORWARD loss -> skip batch (consec=248 total=248) | |
| step 17655 non-finite FORWARD loss -> skip batch (consec=249 total=249) | |
| step 17656 non-finite FORWARD loss -> skip batch (consec=250 total=250) | |
| step 17657 non-finite FORWARD loss -> skip batch (consec=251 total=251) | |
| step 17657 >250 CONSECUTIVE bad -> ABORT | |
| DONE {"final_train_loss": 1.0443034172058105, "best_eval_loss": 1.7043960571289063, "steps": 17658, "tokens_seen": 142598144, "active_M": 92.829508, "total_M": 246.183748, "wall_hours": 0.4055342619286643, "planned_tokens": 700000000.0, "total_steps": 85449} | |
| step 16000 loss=0.8550 lr=1.86e-04 scale=16384 cv=0.434 tok=131080192 tok/s=6579 elapsed=0.01h | |
| step 16020 loss=1.7184 lr=1.86e-04 scale=16384 cv=0.414 tok=131244032 tok/s=16709 elapsed=0.03h | |
| step 16040 loss=1.6021 lr=1.86e-04 scale=16384 cv=0.393 tok=131407872 tok/s=16816 elapsed=0.03h | |
| step 16060 loss=0.9136 lr=1.86e-04 scale=16384 cv=0.369 tok=131571712 tok/s=16839 elapsed=0.04h | |
| step 16080 loss=1.1130 lr=1.86e-04 scale=16384 cv=0.413 tok=131735552 tok/s=16942 elapsed=0.04h | |
| step 16100 loss=1.5564 lr=1.86e-04 scale=16384 cv=0.389 tok=131899392 tok/s=17067 elapsed=0.04h | |
| step 16120 loss=1.1021 lr=1.86e-04 scale=16384 cv=0.351 tok=132063232 tok/s=17156 elapsed=0.04h | |
| step 16140 loss=1.2844 lr=1.86e-04 scale=16384 cv=0.376 tok=132227072 tok/s=17173 elapsed=0.05h | |
| step 16160 loss=1.5433 lr=1.86e-04 scale=16384 cv=0.395 tok=132390912 tok/s=17074 elapsed=0.05h | |
| step 16180 loss=1.1109 lr=1.86e-04 scale=16384 cv=0.344 tok=132554752 tok/s=17112 elapsed=0.05h | |
| step 16200 loss=1.0467 lr=1.86e-04 scale=32768 cv=0.423 tok=132718592 tok/s=17144 elapsed=0.05h | |
| step 16220 loss=1.2712 lr=1.86e-04 scale=32768 cv=0.350 tok=132882432 tok/s=17067 elapsed=0.06h | |
| step 16240 loss=1.6092 lr=1.86e-04 scale=32768 cv=0.378 tok=133046272 tok/s=17066 elapsed=0.06h | |
| step 16260 loss=1.8050 lr=1.86e-04 scale=32768 cv=0.372 tok=133210112 tok/s=17036 elapsed=0.06h | |
| step 16280 loss=1.4491 lr=1.86e-04 scale=32768 cv=0.363 tok=133373952 tok/s=17126 elapsed=0.07h | |
| step 16300 loss=1.5867 lr=1.86e-04 scale=32768 cv=0.382 tok=133537792 tok/s=17110 elapsed=0.07h | |
| step 16320 loss=0.9495 lr=1.86e-04 scale=32768 cv=0.397 tok=133701632 tok/s=17092 elapsed=0.07h | |
| step 16340 loss=1.5850 lr=1.86e-04 scale=32768 cv=0.422 tok=133865472 tok/s=17066 elapsed=0.07h | |
| step 16360 loss=1.4001 lr=1.86e-04 scale=32768 cv=0.345 tok=134029312 tok/s=17060 elapsed=0.08h | |
| step 16380 loss=1.4231 lr=1.86e-04 scale=32768 cv=0.365 tok=134193152 tok/s=17201 elapsed=0.08h | |
| step 16400 loss=1.6048 lr=1.86e-04 scale=65536 cv=0.354 tok=134356992 tok/s=17154 elapsed=0.08h | |
| step 16420 loss=0.6220 lr=1.86e-04 scale=65536 cv=0.387 tok=134520832 tok/s=17152 elapsed=0.09h | |
| step 16440 loss=1.0033 lr=1.86e-04 scale=65536 cv=0.380 tok=134684672 tok/s=17057 elapsed=0.09h | |
| step 16460 loss=0.4720 lr=1.86e-04 scale=65536 cv=0.393 tok=134848512 tok/s=16886 elapsed=0.09h | |
| step 16480 loss=0.9511 lr=1.86e-04 scale=65536 cv=0.374 tok=135012352 tok/s=17114 elapsed=0.09h | |
| step 16500 loss=0.1120 lr=1.86e-04 scale=65536 cv=0.476 tok=135176192 tok/s=16996 elapsed=0.10h | |
| step 16520 loss=1.3231 lr=1.86e-04 scale=65536 cv=0.384 tok=135340032 tok/s=17094 elapsed=0.10h | |
| step 16540 loss=1.3109 lr=1.85e-04 scale=65536 cv=0.373 tok=135503872 tok/s=17168 elapsed=0.10h | |
| step 16560 loss=0.4497 lr=1.85e-04 scale=65536 cv=0.371 tok=135667712 tok/s=17160 elapsed=0.10h | |
| step 16580 loss=1.2850 lr=1.85e-04 scale=65536 cv=0.381 tok=135831552 tok/s=17169 elapsed=0.11h | |
| step 16600 loss=1.4677 lr=1.85e-04 scale=65536 cv=0.464 tok=135995392 tok/s=17140 elapsed=0.11h | |
| step 16620 loss=1.1217 lr=1.85e-04 scale=65536 cv=0.416 tok=136159232 tok/s=17185 elapsed=0.11h | |
| step 16640 loss=0.1890 lr=1.85e-04 scale=65536 cv=0.390 tok=136323072 tok/s=17134 elapsed=0.12h | |
| step 16660 loss=1.0223 lr=1.85e-04 scale=65536 cv=0.397 tok=136486912 tok/s=17122 elapsed=0.12h | |
| step 16680 loss=1.5771 lr=1.85e-04 scale=65536 cv=0.395 tok=136650752 tok/s=17150 elapsed=0.12h | |
| step 16700 loss=1.4933 lr=1.85e-04 scale=65536 cv=0.363 tok=136814592 tok/s=17159 elapsed=0.12h | |
| step 16720 loss=1.6683 lr=1.85e-04 scale=65536 cv=0.450 tok=136978432 tok/s=17130 elapsed=0.13h | |
| step 16740 loss=0.8447 lr=1.85e-04 scale=65536 cv=0.336 tok=137142272 tok/s=16982 elapsed=0.13h | |
| step 16760 loss=1.3083 lr=1.85e-04 scale=65536 cv=0.331 tok=137306112 tok/s=17096 elapsed=0.13h | |
| step 16780 loss=0.9842 lr=1.85e-04 scale=65536 cv=0.396 tok=137469952 tok/s=17069 elapsed=0.14h | |
| step 16800 loss=0.7723 lr=1.85e-04 scale=65536 cv=0.346 tok=137633792 tok/s=17068 elapsed=0.14h | |
| step 16820 loss=1.2367 lr=1.85e-04 scale=65536 cv=0.338 tok=137797632 tok/s=17056 elapsed=0.14h | |
| step 16840 loss=1.8254 lr=1.85e-04 scale=65536 cv=0.372 tok=137961472 tok/s=17083 elapsed=0.14h | |
| step 16860 loss=1.5461 lr=1.85e-04 scale=65536 cv=0.351 tok=138125312 tok/s=17054 elapsed=0.15h | |
| step 16880 loss=1.0839 lr=1.85e-04 scale=65536 cv=0.321 tok=138289152 tok/s=17034 elapsed=0.15h | |
| step 16900 loss=1.5906 lr=1.85e-04 scale=65536 cv=0.392 tok=138452992 tok/s=17048 elapsed=0.15h | |
| step 16920 loss=1.7563 lr=1.85e-04 scale=65536 cv=0.388 tok=138616832 tok/s=17065 elapsed=0.15h | |
| step 16940 loss=1.5504 lr=1.85e-04 scale=65536 cv=0.376 tok=138780672 tok/s=17118 elapsed=0.16h | |
| step 16960 loss=1.1678 lr=1.85e-04 scale=65536 cv=0.346 tok=138944512 tok/s=17086 elapsed=0.16h | |
| step 16980 loss=1.7103 lr=1.85e-04 scale=65536 cv=0.368 tok=139108352 tok/s=16510 elapsed=0.16h | |
| step 17000 loss=1.0120 lr=1.85e-04 scale=65536 cv=0.343 tok=139272192 tok/s=15696 elapsed=0.17h | |
| step 17020 loss=0.6040 lr=1.85e-04 scale=65536 cv=0.328 tok=139436032 tok/s=16815 elapsed=0.18h | |
| step 17029 non-finite FORWARD loss -> skip batch (consec=1 total=1) | |
| step 17030 non-finite FORWARD loss -> skip batch (consec=2 total=2) | |
| step 17031 non-finite FORWARD loss -> skip batch (consec=3 total=3) | |
| step 17032 non-finite FORWARD loss -> skip batch (consec=4 total=4) | |
| step 17033 non-finite FORWARD loss -> skip batch (consec=5 total=5) | |
| step 17034 non-finite FORWARD loss -> skip batch (consec=6 total=6) | |
| step 17035 non-finite FORWARD loss -> skip batch (consec=7 total=7) | |
| step 17036 non-finite FORWARD loss -> skip batch (consec=8 total=8) | |
| step 17037 non-finite FORWARD loss -> skip batch (consec=9 total=9) | |
| step 17038 non-finite FORWARD loss -> skip batch (consec=10 total=10) | |
| step 17039 non-finite FORWARD loss -> skip batch (consec=11 total=11) | |
| step 17040 non-finite FORWARD loss -> skip batch (consec=12 total=12) | |
| step 17041 non-finite FORWARD loss -> skip batch (consec=13 total=13) | |
| step 17042 non-finite FORWARD loss -> skip batch (consec=14 total=14) | |
| step 17043 non-finite FORWARD loss -> skip batch (consec=15 total=15) | |
| step 17044 non-finite FORWARD loss -> skip batch (consec=16 total=16) | |
| step 17045 non-finite FORWARD loss -> skip batch (consec=17 total=17) | |
| step 17046 non-finite FORWARD loss -> skip batch (consec=18 total=18) | |
| step 17047 non-finite FORWARD loss -> skip batch (consec=19 total=19) | |
| step 17048 non-finite FORWARD loss -> skip batch (consec=20 total=20) | |
| step 17049 non-finite FORWARD loss -> skip batch (consec=21 total=21) | |
| step 17050 non-finite FORWARD loss -> skip batch (consec=22 total=22) | |
| step 17051 non-finite FORWARD loss -> skip batch (consec=23 total=23) | |
| step 17052 non-finite FORWARD loss -> skip batch (consec=24 total=24) | |
| step 17053 non-finite FORWARD loss -> skip batch (consec=25 total=25) | |
| step 17054 non-finite FORWARD loss -> skip batch (consec=26 total=26) | |
| step 17055 non-finite FORWARD loss -> skip batch (consec=27 total=27) | |
| step 17056 non-finite FORWARD loss -> skip batch (consec=28 total=28) | |
| step 17057 non-finite FORWARD loss -> skip batch (consec=29 total=29) | |
| step 17058 non-finite FORWARD loss -> skip batch (consec=30 total=30) | |
| step 17059 non-finite FORWARD loss -> skip batch (consec=31 total=31) | |
| step 17060 non-finite FORWARD loss -> skip batch (consec=32 total=32) | |
| step 17061 non-finite FORWARD loss -> skip batch (consec=33 total=33) | |
| step 17062 non-finite FORWARD loss -> skip batch (consec=34 total=34) | |
| step 17063 non-finite FORWARD loss -> skip batch (consec=35 total=35) | |
| step 17064 non-finite FORWARD loss -> skip batch (consec=36 total=36) | |
| step 17065 non-finite FORWARD loss -> skip batch (consec=37 total=37) | |
| step 17066 non-finite FORWARD loss -> skip batch (consec=38 total=38) | |
| step 17067 non-finite FORWARD loss -> skip batch (consec=39 total=39) | |
| step 17068 non-finite FORWARD loss -> skip batch (consec=40 total=40) | |
| step 17069 non-finite FORWARD loss -> skip batch (consec=41 total=41) | |
| step 17070 non-finite FORWARD loss -> skip batch (consec=42 total=42) | |
| step 17071 non-finite FORWARD loss -> skip batch (consec=43 total=43) | |
| step 17072 non-finite FORWARD loss -> skip batch (consec=44 total=44) | |
| step 17073 non-finite FORWARD loss -> skip batch (consec=45 total=45) | |
| step 17074 non-finite FORWARD loss -> skip batch (consec=46 total=46) | |
| step 17075 non-finite FORWARD loss -> skip batch (consec=47 total=47) | |
| step 17076 non-finite FORWARD loss -> skip batch (consec=48 total=48) | |
| step 17077 non-finite FORWARD loss -> skip batch (consec=49 total=49) | |
| step 17078 non-finite FORWARD loss -> skip batch (consec=50 total=50) | |
| step 17079 non-finite FORWARD loss -> skip batch (consec=51 total=51) | |
| step 17080 non-finite FORWARD loss -> skip batch (consec=52 total=52) | |
| step 17081 non-finite FORWARD loss -> skip batch (consec=53 total=53) | |
| step 17082 non-finite FORWARD loss -> skip batch (consec=54 total=54) | |
| step 17083 non-finite FORWARD loss -> skip batch (consec=55 total=55) | |
| step 17084 non-finite FORWARD loss -> skip batch (consec=56 total=56) | |
| step 17085 non-finite FORWARD loss -> skip batch (consec=57 total=57) | |
| step 17086 non-finite FORWARD loss -> skip batch (consec=58 total=58) | |
| step 17087 non-finite FORWARD loss -> skip batch (consec=59 total=59) | |
| step 17088 non-finite FORWARD loss -> skip batch (consec=60 total=60) | |
| step 17089 non-finite FORWARD loss -> skip batch (consec=61 total=61) | |
| step 17090 non-finite FORWARD loss -> skip batch (consec=62 total=62) | |
| step 17091 non-finite FORWARD loss -> skip batch (consec=63 total=63) | |
| step 17092 non-finite FORWARD loss -> skip batch (consec=64 total=64) | |
| step 17093 non-finite FORWARD loss -> skip batch (consec=65 total=65) | |
| step 17094 non-finite FORWARD loss -> skip batch (consec=66 total=66) | |
| step 17095 non-finite FORWARD loss -> skip batch (consec=67 total=67) | |
| step 17096 non-finite FORWARD loss -> skip batch (consec=68 total=68) | |
| step 17097 non-finite FORWARD loss -> skip batch (consec=69 total=69) | |
| step 17098 non-finite FORWARD loss -> skip batch (consec=70 total=70) | |
| step 17099 non-finite FORWARD loss -> skip batch (consec=71 total=71) | |
| step 17100 non-finite FORWARD loss -> skip batch (consec=72 total=72) | |
| step 17101 non-finite FORWARD loss -> skip batch (consec=73 total=73) | |
| step 17102 non-finite FORWARD loss -> skip batch (consec=74 total=74) | |
| step 17103 non-finite FORWARD loss -> skip batch (consec=75 total=75) | |
| step 17104 non-finite FORWARD loss -> skip batch (consec=76 total=76) | |
| step 17105 non-finite FORWARD loss -> skip batch (consec=77 total=77) | |
| step 17106 non-finite FORWARD loss -> skip batch (consec=78 total=78) | |
| step 17107 non-finite FORWARD loss -> skip batch (consec=79 total=79) | |
| step 17108 non-finite FORWARD loss -> skip batch (consec=80 total=80) | |
| step 17109 non-finite FORWARD loss -> skip batch (consec=81 total=81) | |
| step 17110 non-finite FORWARD loss -> skip batch (consec=82 total=82) | |
| step 17111 non-finite FORWARD loss -> skip batch (consec=83 total=83) | |
| step 17112 non-finite FORWARD loss -> skip batch (consec=84 total=84) | |
| step 17113 non-finite FORWARD loss -> skip batch (consec=85 total=85) | |
| step 17114 non-finite FORWARD loss -> skip batch (consec=86 total=86) | |
| step 17115 non-finite FORWARD loss -> skip batch (consec=87 total=87) | |
| step 17116 non-finite FORWARD loss -> skip batch (consec=88 total=88) | |
| step 17117 non-finite FORWARD loss -> skip batch (consec=89 total=89) | |
| step 17118 non-finite FORWARD loss -> skip batch (consec=90 total=90) | |
| step 17119 non-finite FORWARD loss -> skip batch (consec=91 total=91) | |
| step 17120 non-finite FORWARD loss -> skip batch (consec=92 total=92) | |
| step 17121 non-finite FORWARD loss -> skip batch (consec=93 total=93) | |
| step 17122 non-finite FORWARD loss -> skip batch (consec=94 total=94) | |
| step 17123 non-finite FORWARD loss -> skip batch (consec=95 total=95) | |
| step 17124 non-finite FORWARD loss -> skip batch (consec=96 total=96) | |
| step 17125 non-finite FORWARD loss -> skip batch (consec=97 total=97) | |
| step 17126 non-finite FORWARD loss -> skip batch (consec=98 total=98) | |
| step 17127 non-finite FORWARD loss -> skip batch (consec=99 total=99) | |
| step 17128 non-finite FORWARD loss -> skip batch (consec=100 total=100) | |
| step 17129 non-finite FORWARD loss -> skip batch (consec=101 total=101) | |
| step 17130 non-finite FORWARD loss -> skip batch (consec=102 total=102) | |
| step 17131 non-finite FORWARD loss -> skip batch (consec=103 total=103) | |
| step 17132 non-finite FORWARD loss -> skip batch (consec=104 total=104) | |
| step 17133 non-finite FORWARD loss -> skip batch (consec=105 total=105) | |
| step 17134 non-finite FORWARD loss -> skip batch (consec=106 total=106) | |
| step 17135 non-finite FORWARD loss -> skip batch (consec=107 total=107) | |
| step 17136 non-finite FORWARD loss -> skip batch (consec=108 total=108) | |
| step 17137 non-finite FORWARD loss -> skip batch (consec=109 total=109) | |
| step 17138 non-finite FORWARD loss -> skip batch (consec=110 total=110) | |
| step 17139 non-finite FORWARD loss -> skip batch (consec=111 total=111) | |
| step 17140 non-finite FORWARD loss -> skip batch (consec=112 total=112) | |
| step 17141 non-finite FORWARD loss -> skip batch (consec=113 total=113) | |
| step 17142 non-finite FORWARD loss -> skip batch (consec=114 total=114) | |
| step 17143 non-finite FORWARD loss -> skip batch (consec=115 total=115) | |
| step 17144 non-finite FORWARD loss -> skip batch (consec=116 total=116) | |
| step 17145 non-finite FORWARD loss -> skip batch (consec=117 total=117) | |
| step 17146 non-finite FORWARD loss -> skip batch (consec=118 total=118) | |
| step 17147 non-finite FORWARD loss -> skip batch (consec=119 total=119) | |
| step 17148 non-finite FORWARD loss -> skip batch (consec=120 total=120) | |
| step 17149 non-finite FORWARD loss -> skip batch (consec=121 total=121) | |
| step 17150 non-finite FORWARD loss -> skip batch (consec=122 total=122) | |
| step 17151 non-finite FORWARD loss -> skip batch (consec=123 total=123) | |
| step 17152 non-finite FORWARD loss -> skip batch (consec=124 total=124) | |
| step 17153 non-finite FORWARD loss -> skip batch (consec=125 total=125) | |
| step 17154 non-finite FORWARD loss -> skip batch (consec=126 total=126) | |
| step 17155 non-finite FORWARD loss -> skip batch (consec=127 total=127) | |
| step 17156 non-finite FORWARD loss -> skip batch (consec=128 total=128) | |
| step 17157 non-finite FORWARD loss -> skip batch (consec=129 total=129) | |
| step 17158 non-finite FORWARD loss -> skip batch (consec=130 total=130) | |
| step 17159 non-finite FORWARD loss -> skip batch (consec=131 total=131) | |
| step 17160 non-finite FORWARD loss -> skip batch (consec=132 total=132) | |
| step 17161 non-finite FORWARD loss -> skip batch (consec=133 total=133) | |
| step 17162 non-finite FORWARD loss -> skip batch (consec=134 total=134) | |
| step 17163 non-finite FORWARD loss -> skip batch (consec=135 total=135) | |
| step 17164 non-finite FORWARD loss -> skip batch (consec=136 total=136) | |
| step 17165 non-finite FORWARD loss -> skip batch (consec=137 total=137) | |
| step 17166 non-finite FORWARD loss -> skip batch (consec=138 total=138) | |
| step 17167 non-finite FORWARD loss -> skip batch (consec=139 total=139) | |
| step 17168 non-finite FORWARD loss -> skip batch (consec=140 total=140) | |
| step 17169 non-finite FORWARD loss -> skip batch (consec=141 total=141) | |
| step 17170 non-finite FORWARD loss -> skip batch (consec=142 total=142) | |
| step 17171 non-finite FORWARD loss -> skip batch (consec=143 total=143) | |
| step 17172 non-finite FORWARD loss -> skip batch (consec=144 total=144) | |
| step 17173 non-finite FORWARD loss -> skip batch (consec=145 total=145) | |
| step 17174 non-finite FORWARD loss -> skip batch (consec=146 total=146) | |
| step 17175 non-finite FORWARD loss -> skip batch (consec=147 total=147) | |
| step 17176 non-finite FORWARD loss -> skip batch (consec=148 total=148) | |
| step 17177 non-finite FORWARD loss -> skip batch (consec=149 total=149) | |
| step 17178 non-finite FORWARD loss -> skip batch (consec=150 total=150) | |
| step 17179 non-finite FORWARD loss -> skip batch (consec=151 total=151) | |
| step 17180 non-finite FORWARD loss -> skip batch (consec=152 total=152) | |
| step 17181 non-finite FORWARD loss -> skip batch (consec=153 total=153) | |
| step 17182 non-finite FORWARD loss -> skip batch (consec=154 total=154) | |
| step 17183 non-finite FORWARD loss -> skip batch (consec=155 total=155) | |
| step 17184 non-finite FORWARD loss -> skip batch (consec=156 total=156) | |
| step 17185 non-finite FORWARD loss -> skip batch (consec=157 total=157) | |
| step 17186 non-finite FORWARD loss -> skip batch (consec=158 total=158) | |
| step 17187 non-finite FORWARD loss -> skip batch (consec=159 total=159) | |
| step 17188 non-finite FORWARD loss -> skip batch (consec=160 total=160) | |
| step 17189 non-finite FORWARD loss -> skip batch (consec=161 total=161) | |
| step 17190 non-finite FORWARD loss -> skip batch (consec=162 total=162) | |
| step 17191 non-finite FORWARD loss -> skip batch (consec=163 total=163) | |
| step 17192 non-finite FORWARD loss -> skip batch (consec=164 total=164) | |
| step 17193 non-finite FORWARD loss -> skip batch (consec=165 total=165) | |
| step 17194 non-finite FORWARD loss -> skip batch (consec=166 total=166) | |
| step 17195 non-finite FORWARD loss -> skip batch (consec=167 total=167) | |
| step 17196 non-finite FORWARD loss -> skip batch (consec=168 total=168) | |
| step 17197 non-finite FORWARD loss -> skip batch (consec=169 total=169) | |
| step 17198 non-finite FORWARD loss -> skip batch (consec=170 total=170) | |
| step 17199 non-finite FORWARD loss -> skip batch (consec=171 total=171) | |
| step 17200 non-finite FORWARD loss -> skip batch (consec=172 total=172) | |
| step 17201 non-finite FORWARD loss -> skip batch (consec=173 total=173) | |
| step 17202 non-finite FORWARD loss -> skip batch (consec=174 total=174) | |
| step 17203 non-finite FORWARD loss -> skip batch (consec=175 total=175) | |
| step 17204 non-finite FORWARD loss -> skip batch (consec=176 total=176) | |
| step 17205 non-finite FORWARD loss -> skip batch (consec=177 total=177) | |
| step 17206 non-finite FORWARD loss -> skip batch (consec=178 total=178) | |
| step 17207 non-finite FORWARD loss -> skip batch (consec=179 total=179) | |
| step 17208 non-finite FORWARD loss -> skip batch (consec=180 total=180) | |
| step 17209 non-finite FORWARD loss -> skip batch (consec=181 total=181) | |
| step 17210 non-finite FORWARD loss -> skip batch (consec=182 total=182) | |
| step 17211 non-finite FORWARD loss -> skip batch (consec=183 total=183) | |
| step 17212 non-finite FORWARD loss -> skip batch (consec=184 total=184) | |
| step 17213 non-finite FORWARD loss -> skip batch (consec=185 total=185) | |
| step 17214 non-finite FORWARD loss -> skip batch (consec=186 total=186) | |
| step 17215 non-finite FORWARD loss -> skip batch (consec=187 total=187) | |
| step 17216 non-finite FORWARD loss -> skip batch (consec=188 total=188) | |
| step 17217 non-finite FORWARD loss -> skip batch (consec=189 total=189) | |
| step 17218 non-finite FORWARD loss -> skip batch (consec=190 total=190) | |
| step 17219 non-finite FORWARD loss -> skip batch (consec=191 total=191) | |
| step 17220 non-finite FORWARD loss -> skip batch (consec=192 total=192) | |
| step 17221 non-finite FORWARD loss -> skip batch (consec=193 total=193) | |
| step 17222 non-finite FORWARD loss -> skip batch (consec=194 total=194) | |
| step 17223 non-finite FORWARD loss -> skip batch (consec=195 total=195) | |
| step 17224 non-finite FORWARD loss -> skip batch (consec=196 total=196) | |
| step 17225 non-finite FORWARD loss -> skip batch (consec=197 total=197) | |
| step 17226 non-finite FORWARD loss -> skip batch (consec=198 total=198) | |
| step 17227 non-finite FORWARD loss -> skip batch (consec=199 total=199) | |
| step 17228 non-finite FORWARD loss -> skip batch (consec=200 total=200) | |
| step 17229 non-finite FORWARD loss -> skip batch (consec=201 total=201) | |
| step 17230 non-finite FORWARD loss -> skip batch (consec=202 total=202) | |
| step 17231 non-finite FORWARD loss -> skip batch (consec=203 total=203) | |
| step 17232 non-finite FORWARD loss -> skip batch (consec=204 total=204) | |
| step 17233 non-finite FORWARD loss -> skip batch (consec=205 total=205) | |
| step 17234 non-finite FORWARD loss -> skip batch (consec=206 total=206) | |
| step 17235 non-finite FORWARD loss -> skip batch (consec=207 total=207) | |
| step 17236 non-finite FORWARD loss -> skip batch (consec=208 total=208) | |
| step 17237 non-finite FORWARD loss -> skip batch (consec=209 total=209) | |
| step 17238 non-finite FORWARD loss -> skip batch (consec=210 total=210) | |
| step 17239 non-finite FORWARD loss -> skip batch (consec=211 total=211) | |
| step 17240 non-finite FORWARD loss -> skip batch (consec=212 total=212) | |
| step 17241 non-finite FORWARD loss -> skip batch (consec=213 total=213) | |
| step 17242 non-finite FORWARD loss -> skip batch (consec=214 total=214) | |
| step 17243 non-finite FORWARD loss -> skip batch (consec=215 total=215) | |
| step 17244 non-finite FORWARD loss -> skip batch (consec=216 total=216) | |
| step 17245 non-finite FORWARD loss -> skip batch (consec=217 total=217) | |
| step 17246 non-finite FORWARD loss -> skip batch (consec=218 total=218) | |
| step 17247 non-finite FORWARD loss -> skip batch (consec=219 total=219) | |
| step 17248 non-finite FORWARD loss -> skip batch (consec=220 total=220) | |
| step 17249 non-finite FORWARD loss -> skip batch (consec=221 total=221) | |
| step 17250 non-finite FORWARD loss -> skip batch (consec=222 total=222) | |
| step 17251 non-finite FORWARD loss -> skip batch (consec=223 total=223) | |
| step 17252 non-finite FORWARD loss -> skip batch (consec=224 total=224) | |
| step 17253 non-finite FORWARD loss -> skip batch (consec=225 total=225) | |
| step 17254 non-finite FORWARD loss -> skip batch (consec=226 total=226) | |
| step 17255 non-finite FORWARD loss -> skip batch (consec=227 total=227) | |
| step 17256 non-finite FORWARD loss -> skip batch (consec=228 total=228) | |
| step 17257 non-finite FORWARD loss -> skip batch (consec=229 total=229) | |
| step 17258 non-finite FORWARD loss -> skip batch (consec=230 total=230) | |
| step 17259 non-finite FORWARD loss -> skip batch (consec=231 total=231) | |
| step 17260 non-finite FORWARD loss -> skip batch (consec=232 total=232) | |
| step 17261 non-finite FORWARD loss -> skip batch (consec=233 total=233) | |
| step 17262 non-finite FORWARD loss -> skip batch (consec=234 total=234) | |
| step 17263 non-finite FORWARD loss -> skip batch (consec=235 total=235) | |
| step 17264 non-finite FORWARD loss -> skip batch (consec=236 total=236) | |
| step 17265 non-finite FORWARD loss -> skip batch (consec=237 total=237) | |
| step 17266 non-finite FORWARD loss -> skip batch (consec=238 total=238) | |
| step 17267 non-finite FORWARD loss -> skip batch (consec=239 total=239) | |
| step 17268 non-finite FORWARD loss -> skip batch (consec=240 total=240) | |
| step 17269 non-finite FORWARD loss -> skip batch (consec=241 total=241) | |
| step 17270 non-finite FORWARD loss -> skip batch (consec=242 total=242) | |
| step 17271 non-finite FORWARD loss -> skip batch (consec=243 total=243) | |
| step 17272 non-finite FORWARD loss -> skip batch (consec=244 total=244) | |
| step 17273 non-finite FORWARD loss -> skip batch (consec=245 total=245) | |
| step 17274 non-finite FORWARD loss -> skip batch (consec=246 total=246) | |
| step 17275 non-finite FORWARD loss -> skip batch (consec=247 total=247) | |
| step 17276 non-finite FORWARD loss -> skip batch (consec=248 total=248) | |
| step 17277 non-finite FORWARD loss -> skip batch (consec=249 total=249) | |
| step 17278 non-finite FORWARD loss -> skip batch (consec=250 total=250) | |
| step 17279 non-finite FORWARD loss -> skip batch (consec=251 total=251) | |
| step 17279 >250 CONSECUTIVE bad -> ABORT | |
| DONE {"final_train_loss": 1.7106709480285645, "best_eval_loss": 1.6925994396209716, "steps": 17280, "tokens_seen": 139501568, "active_M": 92.829508, "total_M": 246.183748, "wall_hours": 0.19512769089804755, "planned_tokens": 700000000.0, "total_steps": 85449} | |
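
The training script that produced this log is not shown, so the following is only a minimal sketch of the kind of guard the messages above suggest: skip any batch whose forward loss is non-finite, keep `consec` and `total` counters, and abort once more than 250 bad batches arrive in a row. The names `NonFiniteLossGuard`, `TooManyBadBatches`, and `should_skip` are hypothetical, and the starting step 17029 is inferred from the counters (`consec=3` at step 17031), not read from the log directly.

```python
import math


class TooManyBadBatches(RuntimeError):
    """Raised when the run of consecutive non-finite losses exceeds the limit."""


class NonFiniteLossGuard:
    """Hypothetical helper; the real training script may differ."""

    def __init__(self, max_consecutive: int = 250) -> None:
        self.max_consecutive = max_consecutive
        self.consec = 0  # length of the current run of non-finite losses
        self.total = 0   # all non-finite losses seen so far

    def should_skip(self, step: int, loss: float) -> bool:
        """Return True if this batch should be skipped; raise on abort."""
        if math.isfinite(loss):
            self.consec = 0  # a finite loss ends the current bad run
            return False
        self.consec += 1
        self.total += 1
        if self.consec > self.max_consecutive:
            raise TooManyBadBatches(
                f"step {step} >{self.max_consecutive} CONSECUTIVE bad -> ABORT"
            )
        return True


# Replay the failure pattern: every loss from step 17029 onward is NaN
# (the starting step is an assumption inferred from the counters).
guard = NonFiniteLossGuard(max_consecutive=250)
aborted_at = None
for step in range(17029, 17400):
    try:
        guard.should_skip(step, float("nan"))
    except TooManyBadBatches:
        aborted_at = step
        break
```

Under these assumptions the replay aborts at step 17279 with `consec=251`, the same step as the `ABORT` row. Because `consec` and `total` are equal throughout the logged section, no finite loss was seen once the run of bad batches began, which usually means the weights or optimizer state were already non-finite and skipping batches could not recover the run.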