Device: cuda
Loading tokenizer: /tmp/eval/multilingual_32k.model
Loading base model: /tmp/eval/best_model.pt
Model loaded: 3.04B parameters
Loading SFT data from: /tmp/sft_data_v2
Train: 3949348 tokens, Val: 201020 tokens
Using 8-bit AdamW (bitsandbytes)
Starting SFT training for 4000 steps...
Batch size: 1 x 4 accum = 4 effective, Seq len: 2048, LR: 2e-05
Step 10/4000 | Loss: 2.3791 | LR: 0.000001 | TPS: 1196 | 68s
Step 20/4000 | Loss: 2.5346 | LR: 0.000002 | TPS: 1418 | 116s
Step 30/4000 | Loss: 2.7910 | LR: 0.000003 | TPS: 1511 | 163s
Step 40/4000 | Loss: 2.5189 | LR: 0.000004 | TPS: 1562 | 210s
Step 50/4000 | Loss: 2.5049 | LR: 0.000005 | TPS: 1594 | 257s
Step 60/4000 | Loss: 2.5417 | LR: 0.000006 | TPS: 1616 | 304s
Step 70/4000 | Loss: 2.2374 | LR: 0.000007 | TPS: 1633 | 351s
Step 80/4000 | Loss: 2.5328 | LR: 0.000008 | TPS: 1645 | 398s
Step 90/4000 | Loss: 2.5359 | LR: 0.000009 | TPS: 1655 | 445s
Step 100/4000 | Loss: 2.4830 | LR: 0.000010 | TPS: 1663 | 493s
Step 110/4000 | Loss: 2.3015 | LR: 0.000011 | TPS: 1669 | 540s
Step 120/4000 | Loss: 2.4667 | LR: 0.000012 | TPS: 1675 | 587s
Step 130/4000 | Loss: 2.3792 | LR: 0.000013 | TPS: 1680 | 634s
Step 140/4000 | Loss: 2.3918 | LR: 0.000014 | TPS: 1684 | 681s
Step 150/4000 | Loss: 2.3368 | LR: 0.000015 | TPS: 1687 | 728s
Step 160/4000 | Loss: 2.4838 | LR: 0.000016 | TPS: 1690 | 775s
Step 170/4000 | Loss: 2.3578 | LR: 0.000017 | TPS: 1693 | 823s
Step 180/4000 | Loss: 2.5485 | LR: 0.000018 | TPS: 1695 | 870s
Step 190/4000 | Loss: 2.0834 | LR: 0.000019 | TPS: 1698 | 917s
Step 200/4000 | Loss: 1.9784 | LR: 0.000020 | TPS: 1699 | 964s
Step 210/4000 | Loss: 2.4826 | LR: 0.000020 | TPS: 1701 | 1011s
Step 220/4000 | Loss: 2.3540 | LR: 0.000020 | TPS: 1703 | 1058s
Step 230/4000 | Loss: 2.2093 | LR: 0.000020 | TPS: 1704 | 1105s
Step 240/4000 | Loss: 2.2137 | LR: 0.000020 | TPS: 1706 | 1153s
Step 250/4000 | Loss: 2.2151 | LR: 0.000020 | TPS: 1707 | 1200s
Step 260/4000 | Loss: 2.2535 | LR: 0.000020 | TPS: 1708 | 1247s
Step 270/4000 | Loss: 2.2235 | LR: 0.000020 | TPS: 1709 | 1294s
Step 280/4000 | Loss: 2.0449 | LR: 0.000020 | TPS: 1710 | 1341s
Step 290/4000 | Loss: 2.1502 | LR: 0.000020 | TPS: 1711 | 1388s
Step 300/4000 | Loss: 2.3716 | LR: 0.000020 | TPS: 1712 | 1435s
Step 310/4000 | Loss: 2.1591 | LR: 0.000020 | TPS: 1713 | 1483s
Step 320/4000 | Loss: 2.2153 | LR: 0.000020 | TPS: 1714 | 1530s
Step 330/4000 | Loss: 2.2023 | LR: 0.000020 | TPS: 1714 | 1577s
Step 340/4000 | Loss: 2.3968 | LR: 0.000020 | TPS: 1715 | 1624s
Step 350/4000 | Loss: 2.1146 | LR: 0.000020 | TPS: 1716 | 1671s
Step 360/4000 | Loss: 2.1857 | LR: 0.000020 | TPS: 1716 | 1718s
Step 370/4000 | Loss: 2.1965 | LR: 0.000020 | TPS: 1717 | 1765s
Step 380/4000 | Loss: 2.1613 | LR: 0.000020 | TPS: 1717 | 1813s
Step 390/4000 | Loss: 2.3080 | LR: 0.000020 | TPS: 1718 | 1860s
Step 400/4000 | Loss: 2.2964 | LR: 0.000020 | TPS: 1718 | 1907s
📊 Val loss: 2.2256 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 410/4000 | Loss: 2.2859 | LR: 0.000020 | TPS: 1703 | 1973s
Step 420/4000 | Loss: 2.1711 | LR: 0.000020 | TPS: 1703 | 2020s
Step 430/4000 | Loss: 2.1434 | LR: 0.000020 | TPS: 1704 | 2067s
Step 440/4000 | Loss: 2.2115 | LR: 0.000020 | TPS: 1705 | 2114s
Step 450/4000 | Loss: 2.2985 | LR: 0.000020 | TPS: 1706 | 2161s
Step 460/4000 | Loss: 1.9845 | LR: 0.000020 | TPS: 1707 | 2208s
Step 470/4000 | Loss: 2.3135 | LR: 0.000020 | TPS: 1707 | 2255s
Step 480/4000 | Loss: 2.3004 | LR: 0.000020 | TPS: 1708 | 2302s
Step 490/4000 | Loss: 2.1841 | LR: 0.000020 | TPS: 1709 | 2349s
Step 500/4000 | Loss: 2.3647 | LR: 0.000020 | TPS: 1709 | 2396s
Step 510/4000 | Loss: 2.1587 | LR: 0.000020 | TPS: 1710 | 2443s
Step 520/4000 | Loss: 2.0790 | LR: 0.000020 | TPS: 1711 | 2490s
Step 530/4000 | Loss: 2.0842 | LR: 0.000020 | TPS: 1711 | 2537s
Step 540/4000 | Loss: 2.4031 | LR: 0.000020 | TPS: 1712 | 2584s
Step 550/4000 | Loss: 2.3037 | LR: 0.000020 | TPS: 1712 | 2632s
Step 560/4000 | Loss: 2.2433 | LR: 0.000020 | TPS: 1713 | 2679s
Step 570/4000 | Loss: 2.1670 | LR: 0.000020 | TPS: 1713 | 2726s
Step 580/4000 | Loss: 2.1579 | LR: 0.000020 | TPS: 1714 | 2773s
Step 590/4000 | Loss: 1.9392 | LR: 0.000020 | TPS: 1714 | 2820s
Step 600/4000 | Loss: 2.1226 | LR: 0.000020 | TPS: 1715 | 2867s
Step 610/4000 | Loss: 2.2641 | LR: 0.000019 | TPS: 1715 | 2914s
Step 620/4000 | Loss: 2.0771 | LR: 0.000019 | TPS: 1715 | 2961s
Step 630/4000 | Loss: 2.4527 | LR: 0.000019 | TPS: 1716 | 3008s
Step 640/4000 | Loss: 2.2605 | LR: 0.000019 | TPS: 1716 | 3055s
Step 650/4000 | Loss: 1.9801 | LR: 0.000019 | TPS: 1717 | 3102s
Step 660/4000 | Loss: 2.4208 | LR: 0.000019 | TPS: 1717 | 3149s
Step 670/4000 | Loss: 2.3331 | LR: 0.000019 | TPS: 1717 | 3196s
Step 680/4000 | Loss: 2.1299 | LR: 0.000019 | TPS: 1718 | 3243s
Step 690/4000 | Loss: 2.1551 | LR: 0.000019 | TPS: 1718 | 3290s
Step 700/4000 | Loss: 2.0940 | LR: 0.000019 | TPS: 1718 | 3337s
Step 710/4000 | Loss: 2.0533 | LR: 0.000019 | TPS: 1719 | 3384s
Step 720/4000 | Loss: 2.2076 | LR: 0.000019 | TPS: 1719 | 3431s
Step 730/4000 | Loss: 1.9816 | LR: 0.000019 | TPS: 1719 | 3478s
Step 740/4000 | Loss: 2.1420 | LR: 0.000019 | TPS: 1719 | 3526s
Step 750/4000 | Loss: 2.2928 | LR: 0.000019 | TPS: 1720 | 3573s
Step 760/4000 | Loss: 2.1035 | LR: 0.000019 | TPS: 1720 | 3620s
Step 770/4000 | Loss: 2.1663 | LR: 0.000019 | TPS: 1720 | 3667s
Step 780/4000 | Loss: 2.2270 | LR: 0.000019 | TPS: 1721 | 3714s
Step 790/4000 | Loss: 2.1436 | LR: 0.000019 | TPS: 1721 | 3761s
Step 800/4000 | Loss: 2.3599 | LR: 0.000019 | TPS: 1721 | 3808s
📊 Val loss: 2.1960 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 810/4000 | Loss: 2.2325 | LR: 0.000019 | TPS: 1696 | 3912s
Step 820/4000 | Loss: 2.0798 | LR: 0.000019 | TPS: 1696 | 3960s
Step 830/4000 | Loss: 2.1527 | LR: 0.000019 | TPS: 1697 | 4007s
Step 840/4000 | Loss: 2.2046 | LR: 0.000019 | TPS: 1697 | 4054s
Step 850/4000 | Loss: 2.0648 | LR: 0.000019 | TPS: 1698 | 4101s
Step 860/4000 | Loss: 2.1708 | LR: 0.000019 | TPS: 1698 | 4148s
Step 870/4000 | Loss: 2.3088 | LR: 0.000019 | TPS: 1699 | 4195s
Step 880/4000 | Loss: 1.9936 | LR: 0.000019 | TPS: 1699 | 4242s
Step 890/4000 | Loss: 2.1869 | LR: 0.000019 | TPS: 1700 | 4290s
Step 900/4000 | Loss: 2.4199 | LR: 0.000019 | TPS: 1700 | 4337s
Step 910/4000 | Loss: 2.3803 | LR: 0.000018 | TPS: 1700 | 4384s
Step 920/4000 | Loss: 2.0193 | LR: 0.000018 | TPS: 1701 | 4431s
Step 930/4000 | Loss: 2.1047 | LR: 0.000018 | TPS: 1701 | 4478s
Step 940/4000 | Loss: 2.1449 | LR: 0.000018 | TPS: 1702 | 4525s
Step 950/4000 | Loss: 2.1521 | LR: 0.000018 | TPS: 1702 | 4572s
Step 960/4000 | Loss: 2.2820 | LR: 0.000018 | TPS: 1702 | 4620s
Step 970/4000 | Loss: 2.2996 | LR: 0.000018 | TPS: 1703 | 4667s
Step 980/4000 | Loss: 2.3187 | LR: 0.000018 | TPS: 1703 | 4714s
Step 990/4000 | Loss: 2.1756 | LR: 0.000018 | TPS: 1703 | 4761s
Step 1000/4000 | Loss: 1.9765 | LR: 0.000018 | TPS: 1704 | 4808s
🔤 Generation samples (step 1000):
[EN] The capital of France is located in Normandy.
[HE] מלזיה.
[AR] باريس.
[FA] پاریس یکی از شهرهای بزرگ و تاریخی جهان است که دارای جاذبه های طبیعی، فرهنگی و اقتصادی متعددی می باشد. شهر پاریس در غرب کشورمان قرار دارد و به عنوان یکی از مهم ترین مراکز تجاری و مالی دنیا شناخته شده ا
[TRANSLATE] "תודה על הכול, אבא. אני כאן איתך בכל רגע נתון."
Step 1010/4000 | Loss: 2.1665 | LR: 0.000018 | TPS: 1703 | 4859s
Step 1020/4000 | Loss: 2.1047 | LR: 0.000018 | TPS: 1703 | 4906s
Step 1030/4000 | Loss: 2.2359 | LR: 0.000018 | TPS: 1704 | 4953s
Step 1040/4000 | Loss: 2.0109 | LR: 0.000018 | TPS: 1704 | 5000s
Step 1050/4000 | Loss: 2.1515 | LR: 0.000018 | TPS: 1704 | 5047s
Step 1060/4000 | Loss: 2.0880 | LR: 0.000018 | TPS: 1705 | 5094s
Step 1070/4000 | Loss: 2.2460 | LR: 0.000018 | TPS: 1705 | 5142s
Step 1080/4000 | Loss: 1.9325 | LR: 0.000018 | TPS: 1705 | 5189s
Step 1090/4000 | Loss: 2.2283 | LR: 0.000018 | TPS: 1705 | 5236s
Step 1100/4000 | Loss: 2.3303 | LR: 0.000018 | TPS: 1706 | 5283s
Step 1110/4000 | Loss: 2.1772 | LR: 0.000018 | TPS: 1706 | 5330s
Step 1120/4000 | Loss: 2.1615 | LR: 0.000018 | TPS: 1706 | 5377s
Step 1130/4000 | Loss: 2.1470 | LR: 0.000017 | TPS: 1707 | 5424s
Step 1140/4000 | Loss: 1.9640 | LR: 0.000017 | TPS: 1707 | 5472s
Step 1150/4000 | Loss: 2.1891 | LR: 0.000017 | TPS: 1707 | 5519s
Step 1160/4000 | Loss: 2.2183 | LR: 0.000017 | TPS: 1707 | 5566s
Step 1170/4000 | Loss: 2.0268 | LR: 0.000017 | TPS: 1708 | 5613s
Step 1180/4000 | Loss: 2.2234 | LR: 0.000017 | TPS: 1708 | 5660s
Step 1190/4000 | Loss: 2.1961 | LR: 0.000017 | TPS: 1708 | 5707s
Step 1200/4000 | Loss: 2.2019 | LR: 0.000017 | TPS: 1708 | 5754s
📊 Val loss: 2.2238
Step 1210/4000 | Loss: 2.0809 | LR: 0.000017 | TPS: 1707 | 5807s
Step 1220/4000 | Loss: 2.1716 | LR: 0.000017 | TPS: 1707 | 5854s
Step 1230/4000 | Loss: 2.2607 | LR: 0.000017 | TPS: 1707 | 5901s
Step 1240/4000 | Loss: 2.1838 | LR: 0.000017 | TPS: 1708 | 5949s
Step 1250/4000 | Loss: 2.0725 | LR: 0.000017 | TPS: 1708 | 5996s
Step 1260/4000 | Loss: 2.2797 | LR: 0.000017 | TPS: 1708 | 6043s
Step 1270/4000 | Loss: 2.0366 | LR: 0.000017 | TPS: 1708 | 6090s
Step 1280/4000 | Loss: 2.1469 | LR: 0.000017 | TPS: 1709 | 6137s
Step 1290/4000 | Loss: 2.1541 | LR: 0.000017 | TPS: 1709 | 6184s
Step 1300/4000 | Loss: 2.0311 | LR: 0.000017 | TPS: 1709 | 6231s
Step 1310/4000 | Loss: 2.1828 | LR: 0.000016 | TPS: 1709 | 6279s
Step 1320/4000 | Loss: 2.2004 | LR: 0.000016 | TPS: 1709 | 6326s
Step 1330/4000 | Loss: 2.2589 | LR: 0.000016 | TPS: 1710 | 6373s
Step 1340/4000 | Loss: 2.1475 | LR: 0.000016 | TPS: 1710 | 6420s
Step 1350/4000 | Loss: 2.1672 | LR: 0.000016 | TPS: 1710 | 6467s
Step 1360/4000 | Loss: 2.1921 | LR: 0.000016 | TPS: 1710 | 6514s
Step 1370/4000 | Loss: 2.0689 | LR: 0.000016 | TPS: 1710 | 6561s
Step 1380/4000 | Loss: 2.2560 | LR: 0.000016 | TPS: 1711 | 6609s
Step 1390/4000 | Loss: 1.9519 | LR: 0.000016 | TPS: 1711 | 6656s
Step 1400/4000 | Loss: 1.9671 | LR: 0.000016 | TPS: 1711 | 6703s
Step 1410/4000 | Loss: 2.1535 | LR: 0.000016 | TPS: 1711 | 6750s
Step 1420/4000 | Loss: 2.1726 | LR: 0.000016 | TPS: 1711 | 6797s
Step 1430/4000 | Loss: 2.0854 | LR: 0.000016 | TPS: 1712 | 6844s
Step 1440/4000 | Loss: 2.0955 | LR: 0.000016 | TPS: 1712 | 6891s
Step 1450/4000 | Loss: 2.1260 | LR: 0.000016 | TPS: 1712 | 6939s
Step 1460/4000 | Loss: 2.2860 | LR: 0.000016 | TPS: 1712 | 6986s
Step 1470/4000 | Loss: 1.6098 | LR: 0.000015 | TPS: 1712 | 7033s
Step 1480/4000 | Loss: 2.1327 | LR: 0.000015 | TPS: 1712 | 7080s
Step 1490/4000 | Loss: 2.0506 | LR: 0.000015 | TPS: 1713 | 7127s
Step 1500/4000 | Loss: 2.0568 | LR: 0.000015 | TPS: 1713 | 7174s
Step 1510/4000 | Loss: 2.0177 | LR: 0.000015 | TPS: 1713 | 7221s
Step 1520/4000 | Loss: 2.0383 | LR: 0.000015 | TPS: 1713 | 7269s
Step 1530/4000 | Loss: 2.0994 | LR: 0.000015 | TPS: 1713 | 7316s
Step 1540/4000 | Loss: 2.0863 | LR: 0.000015 | TPS: 1713 | 7363s
Step 1550/4000 | Loss: 2.3287 | LR: 0.000015 | TPS: 1714 | 7410s
Step 1560/4000 | Loss: 2.1585 | LR: 0.000015 | TPS: 1714 | 7457s
Step 1570/4000 | Loss: 1.9781 | LR: 0.000015 | TPS: 1714 | 7504s
Step 1580/4000 | Loss: 1.9344 | LR: 0.000015 | TPS: 1714 | 7551s
Step 1590/4000 | Loss: 2.1031 | LR: 0.000015 | TPS: 1714 | 7599s
Step 1600/4000 | Loss: 2.2633 | LR: 0.000015 | TPS: 1714 | 7646s
📊 Val loss: 2.1164 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 1610/4000 | Loss: 2.0217 | LR: 0.000015 | TPS: 1702 | 7750s
Step 1620/4000 | Loss: 2.0437 | LR: 0.000014 | TPS: 1702 | 7797s
Step 1630/4000 | Loss: 2.3588 | LR: 0.000014 | TPS: 1702 | 7844s
Step 1640/4000 | Loss: 2.1927 | LR: 0.000014 | TPS: 1702 | 7892s
Step 1650/4000 | Loss: 1.9298 | LR: 0.000014 | TPS: 1703 | 7939s
Step 1660/4000 | Loss: 2.1604 | LR: 0.000014 | TPS: 1703 | 7986s
Step 1670/4000 | Loss: 2.0326 | LR: 0.000014 | TPS: 1703 | 8033s
Step 1680/4000 | Loss: 2.1872 | LR: 0.000014 | TPS: 1703 | 8080s
Step 1690/4000 | Loss: 2.0633 | LR: 0.000014 | TPS: 1703 | 8127s
Step 1700/4000 | Loss: 2.2547 | LR: 0.000014 | TPS: 1704 | 8174s
Step 1710/4000 | Loss: 1.8940 | LR: 0.000014 | TPS: 1704 | 8221s
Step 1720/4000 | Loss: 2.0726 | LR: 0.000014 | TPS: 1704 | 8269s
Step 1730/4000 | Loss: 2.0857 | LR: 0.000014 | TPS: 1704 | 8316s
Step 1740/4000 | Loss: 2.0686 | LR: 0.000014 | TPS: 1704 | 8363s
Step 1750/4000 | Loss: 2.1306 | LR: 0.000014 | TPS: 1705 | 8410s
Step 1760/4000 | Loss: 2.0932 | LR: 0.000013 | TPS: 1705 | 8457s
Step 1770/4000 | Loss: 2.0751 | LR: 0.000013 | TPS: 1705 | 8504s
Step 1780/4000 | Loss: 2.1802 | LR: 0.000013 | TPS: 1705 | 8551s
Step 1790/4000 | Loss: 1.6657 | LR: 0.000013 | TPS: 1705 | 8599s
Step 1800/4000 | Loss: 2.1290 | LR: 0.000013 | TPS: 1706 | 8646s
Step 1810/4000 | Loss: 2.1032 | LR: 0.000013 | TPS: 1706 | 8693s
Step 1820/4000 | Loss: 2.1255 | LR: 0.000013 | TPS: 1706 | 8740s
Step 1830/4000 | Loss: 2.1091 | LR: 0.000013 | TPS: 1706 | 8787s
Step 1840/4000 | Loss: 1.9875 | LR: 0.000013 | TPS: 1706 | 8834s
Step 1850/4000 | Loss: 1.9615 | LR: 0.000013 | TPS: 1706 | 8881s
Step 1860/4000 | Loss: 2.0189 | LR: 0.000013 | TPS: 1707 | 8929s
Step 1870/4000 | Loss: 2.1387 | LR: 0.000013 | TPS: 1707 | 8976s
Step 1880/4000 | Loss: 2.0963 | LR: 0.000013 | TPS: 1707 | 9023s
Step 1890/4000 | Loss: 2.1750 | LR: 0.000013 | TPS: 1707 | 9070s
Step 1900/4000 | Loss: 2.3945 | LR: 0.000012 | TPS: 1707 | 9117s
Step 1910/4000 | Loss: 2.1515 | LR: 0.000012 | TPS: 1707 | 9164s
Step 1920/4000 | Loss: 2.2224 | LR: 0.000012 | TPS: 1708 | 9211s
Step 1930/4000 | Loss: 2.3160 | LR: 0.000012 | TPS: 1708 | 9259s
Step 1940/4000 | Loss: 2.0126 | LR: 0.000012 | TPS: 1708 | 9306s
Step 1950/4000 | Loss: 2.2443 | LR: 0.000012 | TPS: 1708 | 9353s
Step 1960/4000 | Loss: 1.9590 | LR: 0.000012 | TPS: 1708 | 9400s
Step 1970/4000 | Loss: 2.2280 | LR: 0.000012 | TPS: 1708 | 9447s
Step 1980/4000 | Loss: 1.9723 | LR: 0.000012 | TPS: 1708 | 9494s
Step 1990/4000 | Loss: 2.0697 | LR: 0.000012 | TPS: 1709 | 9541s
Step 2000/4000 | Loss: 2.0568 | LR: 0.000012 | TPS: 1709 | 9589s
📊 Val loss: 2.1674
🔤 Generation samples (step 2000):
[EN] Paris (pronounced "Paris") is a city located in northeastern France. It borders Germany to the east, with Belgium and Luxembourg as its easternmost provinces.
[HE] בצרפת, העיר העתיקה היא אזור התיירות העיקרי.
[AR] باريس
[FA] پاریس، پایتخت کشور فرانسه است.
[TRANSLATE] The answer is YES.
Step 2010/4000 | Loss: 1.9474 | LR: 0.000012 | TPS: 1708 | 9643s
Step 2020/4000 | Loss: 2.1131 | LR: 0.000012 | TPS: 1708 | 9690s
Step 2030/4000 | Loss: 2.0446 | LR: 0.000012 | TPS: 1708 | 9737s
Step 2040/4000 | Loss: 2.2229 | LR: 0.000011 | TPS: 1708 | 9784s
Step 2050/4000 | Loss: 2.1576 | LR: 0.000011 | TPS: 1708 | 9832s
Step 2060/4000 | Loss: 2.1899 | LR: 0.000011 | TPS: 1708 | 9879s
Step 2070/4000 | Loss: 2.0957 | LR: 0.000011 | TPS: 1708 | 9926s
Step 2080/4000 | Loss: 2.2643 | LR: 0.000011 | TPS: 1709 | 9973s
Step 2090/4000 | Loss: 2.0676 | LR: 0.000011 | TPS: 1709 | 10020s
Step 2100/4000 | Loss: 2.1386 | LR: 0.000011 | TPS: 1709 | 10067s
Step 2110/4000 | Loss: 2.1891 | LR: 0.000011 | TPS: 1709 | 10114s
Step 2120/4000 | Loss: 1.9532 | LR: 0.000011 | TPS: 1709 | 10162s
Step 2130/4000 | Loss: 1.9766 | LR: 0.000011 | TPS: 1709 | 10209s
Step 2140/4000 | Loss: 2.3656 | LR: 0.000011 | TPS: 1709 | 10256s
Step 2150/4000 | Loss: 2.0545 | LR: 0.000011 | TPS: 1709 | 10303s
Step 2160/4000 | Loss: 1.9706 | LR: 0.000011 | TPS: 1710 | 10350s
Step 2170/4000 | Loss: 2.0302 | LR: 0.000010 | TPS: 1710 | 10397s
Step 2180/4000 | Loss: 2.1752 | LR: 0.000010 | TPS: 1710 | 10444s
Step 2190/4000 | Loss: 2.1455 | LR: 0.000010 | TPS: 1710 | 10492s
Step 2200/4000 | Loss: 2.2238 | LR: 0.000010 | TPS: 1710 | 10539s
Step 2210/4000 | Loss: 2.1010 | LR: 0.000010 | TPS: 1710 | 10586s
Step 2220/4000 | Loss: 2.1831 | LR: 0.000010 | TPS: 1710 | 10633s
Step 2230/4000 | Loss: 1.6542 | LR: 0.000010 | TPS: 1710 | 10680s
Step 2240/4000 | Loss: 2.1102 | LR: 0.000010 | TPS: 1711 | 10727s
Step 2250/4000 | Loss: 2.2099 | LR: 0.000010 | TPS: 1711 | 10774s
Step 2260/4000 | Loss: 2.1750 | LR: 0.000010 | TPS: 1711 | 10821s
Step 2270/4000 | Loss: 2.2369 | LR: 0.000010 | TPS: 1711 | 10869s
Step 2280/4000 | Loss: 2.0393 | LR: 0.000010 | TPS: 1711 | 10916s
Step 2290/4000 | Loss: 2.3140 | LR: 0.000010 | TPS: 1711 | 10963s
Step 2300/4000 | Loss: 2.0601 | LR: 0.000010 | TPS: 1711 | 11010s
Step 2310/4000 | Loss: 2.1472 | LR: 0.000009 | TPS: 1711 | 11057s
Step 2320/4000 | Loss: 2.0987 | LR: 0.000009 | TPS: 1712 | 11104s
Step 2330/4000 | Loss: 2.0354 | LR: 0.000009 | TPS: 1712 | 11152s
Step 2340/4000 | Loss: 1.9309 | LR: 0.000009 | TPS: 1712 | 11199s
Step 2350/4000 | Loss: 2.1222 | LR: 0.000009 | TPS: 1712 | 11246s
Step 2360/4000 | Loss: 1.9861 | LR: 0.000009 | TPS: 1712 | 11293s
Step 2370/4000 | Loss: 2.1986 | LR: 0.000009 | TPS: 1712 | 11340s
Step 2380/4000 | Loss: 2.0335 | LR: 0.000009 | TPS: 1712 | 11387s
Step 2390/4000 | Loss: 2.2123 | LR: 0.000009 | TPS: 1712 | 11434s
Step 2400/4000 | Loss: 2.0287 | LR: 0.000009 | TPS: 1712 | 11482s
📊 Val loss: 2.1943
Step 2410/4000 | Loss: 2.0483 | LR: 0.000009 | TPS: 1712 | 11534s
Step 2420/4000 | Loss: 2.0710 | LR: 0.000009 | TPS: 1712 | 11581s
Step 2430/4000 | Loss: 2.3005 | LR: 0.000009 | TPS: 1712 | 11629s
Step 2440/4000 | Loss: 2.0617 | LR: 0.000009 | TPS: 1712 | 11676s
Step 2450/4000 | Loss: 2.2063 | LR: 0.000008 | TPS: 1712 | 11723s
Step 2460/4000 | Loss: 2.0405 | LR: 0.000008 | TPS: 1712 | 11770s
Step 2470/4000 | Loss: 2.2280 | LR: 0.000008 | TPS: 1712 | 11817s
Step 2480/4000 | Loss: 2.3856 | LR: 0.000008 | TPS: 1712 | 11864s
Step 2490/4000 | Loss: 1.9853 | LR: 0.000008 | TPS: 1712 | 11911s
Step 2500/4000 | Loss: 2.0673 | LR: 0.000008 | TPS: 1713 | 11959s
Step 2510/4000 | Loss: 2.1777 | LR: 0.000008 | TPS: 1713 | 12006s
Step 2520/4000 | Loss: 1.9846 | LR: 0.000008 | TPS: 1713 | 12053s
Step 2530/4000 | Loss: 2.1922 | LR: 0.000008 | TPS: 1713 | 12100s
Step 2540/4000 | Loss: 2.0542 | LR: 0.000008 | TPS: 1713 | 12147s
Step 2550/4000 | Loss: 2.1041 | LR: 0.000008 | TPS: 1713 | 12194s
Step 2560/4000 | Loss: 2.0099 | LR: 0.000008 | TPS: 1713 | 12241s
Step 2570/4000 | Loss: 1.8186 | LR: 0.000008 | TPS: 1713 | 12289s
Step 2580/4000 | Loss: 2.2079 | LR: 0.000008 | TPS: 1713 | 12336s
Step 2590/4000 | Loss: 1.9931 | LR: 0.000007 | TPS: 1713 | 12383s
Step 2600/4000 | Loss: 2.0986 | LR: 0.000007 | TPS: 1714 | 12430s
Step 2610/4000 | Loss: 2.0439 | LR: 0.000007 | TPS: 1714 | 12477s
Step 2620/4000 | Loss: 1.9408 | LR: 0.000007 | TPS: 1714 | 12524s
Step 2630/4000 | Loss: 2.1992 | LR: 0.000007 | TPS: 1714 | 12571s
Step 2640/4000 | Loss: 2.0929 | LR: 0.000007 | TPS: 1714 | 12619s
Step 2650/4000 | Loss: 1.9728 | LR: 0.000007 | TPS: 1714 | 12666s
Step 2660/4000 | Loss: 1.8369 | LR: 0.000007 | TPS: 1714 | 12713s
Step 2670/4000 | Loss: 1.9926 | LR: 0.000007 | TPS: 1714 | 12760s
Step 2680/4000 | Loss: 2.0414 | LR: 0.000007 | TPS: 1714 | 12807s
Step 2690/4000 | Loss: 2.1368 | LR: 0.000007 | TPS: 1714 | 12854s
Step 2700/4000 | Loss: 2.0254 | LR: 0.000007 | TPS: 1714 | 12901s
Step 2710/4000 | Loss: 2.1572 | LR: 0.000007 | TPS: 1715 | 12948s
Step 2720/4000 | Loss: 2.0418 | LR: 0.000007 | TPS: 1715 | 12996s
Step 2730/4000 | Loss: 2.1235 | LR: 0.000007 | TPS: 1715 | 13043s
Step 2740/4000 | Loss: 2.0756 | LR: 0.000006 | TPS: 1715 | 13090s
Step 2750/4000 | Loss: 2.1417 | LR: 0.000006 | TPS: 1715 | 13137s
Step 2760/4000 | Loss: 1.9427 | LR: 0.000006 | TPS: 1715 | 13184s
Step 2770/4000 | Loss: 2.1166 | LR: 0.000006 | TPS: 1715 | 13231s
Step 2780/4000 | Loss: 1.9711 | LR: 0.000006 | TPS: 1715 | 13278s
Step 2790/4000 | Loss: 2.1390 | LR: 0.000006 | TPS: 1715 | 13326s
Step 2800/4000 | Loss: 2.0557 | LR: 0.000006 | TPS: 1715 | 13373s
📊 Val loss: 2.1839
Step 2810/4000 | Loss: 2.0581 | LR: 0.000006 | TPS: 1715 | 13425s
Step 2820/4000 | Loss: 2.1139 | LR: 0.000006 | TPS: 1715 | 13473s
Step 2830/4000 | Loss: 2.1228 | LR: 0.000006 | TPS: 1715 | 13520s
Step 2840/4000 | Loss: 1.9685 | LR: 0.000006 | TPS: 1715 | 13567s
Step 2850/4000 | Loss: 2.1206 | LR: 0.000006 | TPS: 1715 | 13614s
Step 2860/4000 | Loss: 2.1942 | LR: 0.000006 | TPS: 1715 | 13661s
Step 2870/4000 | Loss: 1.9068 | LR: 0.000006 | TPS: 1715 | 13708s
Step 2880/4000 | Loss: 2.2099 | LR: 0.000006 | TPS: 1715 | 13755s
Step 2890/4000 | Loss: 2.0948 | LR: 0.000006 | TPS: 1715 | 13803s
Step 2900/4000 | Loss: 2.0630 | LR: 0.000005 | TPS: 1715 | 13850s
Step 2910/4000 | Loss: 1.9867 | LR: 0.000005 | TPS: 1715 | 13897s
Step 2920/4000 | Loss: 2.0602 | LR: 0.000005 | TPS: 1715 | 13944s
Step 2930/4000 | Loss: 2.0163 | LR: 0.000005 | TPS: 1716 | 13991s
Step 2940/4000 | Loss: 2.0337 | LR: 0.000005 | TPS: 1716 | 14038s
Step 2950/4000 | Loss: 2.2476 | LR: 0.000005 | TPS: 1716 | 14085s
Step 2960/4000 | Loss: 2.0430 | LR: 0.000005 | TPS: 1716 | 14133s
Step 2970/4000 | Loss: 2.3037 | LR: 0.000005 | TPS: 1716 | 14180s
Step 2980/4000 | Loss: 2.0831 | LR: 0.000005 | TPS: 1716 | 14227s
Step 2990/4000 | Loss: 2.1781 | LR: 0.000005 | TPS: 1716 | 14274s
Step 3000/4000 | Loss: 2.0784 | LR: 0.000005 | TPS: 1716 | 14321s
🔤 Generation samples (step 3000):
[EN] The city of Paris is a metropolitan area in Europe, consisting of 57 counties. Its main cities include Lyons, Bordeaux and Valence.
[HE] איטליה.
[AR] باريس.
[FA] پاریس پایتخت کشور فرانسه و یکی از شهرهای بزرگ این کشور است. شهر پاریس در شمال غربی قاره اروپا قرار دارد.
[TRANSLATE] You are the first one in the world to learn how to think.
Step 3010/4000 | Loss: 2.1244 | LR: 0.000005 | TPS: 1716 | 14370s
Step 3020/4000 | Loss: 2.1107 | LR: 0.000005 | TPS: 1716 | 14417s
Step 3030/4000 | Loss: 2.3589 | LR: 0.000005 | TPS: 1716 | 14464s
Step 3040/4000 | Loss: 2.0592 | LR: 0.000005 | TPS: 1716 | 14511s
Step 3050/4000 | Loss: 2.0730 | LR: 0.000005 | TPS: 1716 | 14559s
Step 3060/4000 | Loss: 2.1365 | LR: 0.000005 | TPS: 1716 | 14606s
Step 3070/4000 | Loss: 1.9819 | LR: 0.000005 | TPS: 1716 | 14653s
Step 3080/4000 | Loss: 2.2175 | LR: 0.000004 | TPS: 1716 | 14700s
Step 3090/4000 | Loss: 2.1442 | LR: 0.000004 | TPS: 1716 | 14747s
Step 3100/4000 | Loss: 2.0811 | LR: 0.000004 | TPS: 1717 | 14794s
Step 3110/4000 | Loss: 2.1427 | LR: 0.000004 | TPS: 1717 | 14841s
Step 3120/4000 | Loss: 2.1722 | LR: 0.000004 | TPS: 1717 | 14889s
Step 3130/4000 | Loss: 2.0577 | LR: 0.000004 | TPS: 1717 | 14936s
Step 3140/4000 | Loss: 2.0873 | LR: 0.000004 | TPS: 1717 | 14983s
Step 3150/4000 | Loss: 2.2920 | LR: 0.000004 | TPS: 1717 | 15030s
Step 3160/4000 | Loss: 1.8839 | LR: 0.000004 | TPS: 1717 | 15077s
Step 3170/4000 | Loss: 2.0144 | LR: 0.000004 | TPS: 1717 | 15124s
Step 3180/4000 | Loss: 1.9689 | LR: 0.000004 | TPS: 1717 | 15171s
Step 3190/4000 | Loss: 2.2123 | LR: 0.000004 | TPS: 1717 | 15219s
Step 3200/4000 | Loss: 2.0510 | LR: 0.000004 | TPS: 1717 | 15266s
📊 Val loss: 2.1269
Step 3210/4000 | Loss: 2.4087 | LR: 0.000004 | TPS: 1717 | 15318s
Step 3220/4000 | Loss: 2.2608 | LR: 0.000004 | TPS: 1717 | 15365s
Step 3230/4000 | Loss: 2.1930 | LR: 0.000004 | TPS: 1717 | 15413s
Step 3240/4000 | Loss: 2.0713 | LR: 0.000004 | TPS: 1717 | 15460s
Step 3250/4000 | Loss: 2.2660 | LR: 0.000004 | TPS: 1717 | 15507s
Step 3260/4000 | Loss: 1.9479 | LR: 0.000004 | TPS: 1717 | 15554s
Step 3270/4000 | Loss: 1.9657 | LR: 0.000004 | TPS: 1717 | 15601s
Step 3280/4000 | Loss: 2.1884 | LR: 0.000004 | TPS: 1717 | 15648s
Step 3290/4000 | Loss: 2.0927 | LR: 0.000004 | TPS: 1717 | 15695s
Step 3300/4000 | Loss: 2.0393 | LR: 0.000003 | TPS: 1717 | 15743s
Step 3310/4000 | Loss: 2.1302 | LR: 0.000003 | TPS: 1717 | 15790s
Step 3320/4000 | Loss: 2.0059 | LR: 0.000003 | TPS: 1717 | 15837s
Step 3330/4000 | Loss: 1.8687 | LR: 0.000003 | TPS: 1717 | 15884s
Step 3340/4000 | Loss: 2.0293 | LR: 0.000003 | TPS: 1717 | 15931s
Step 3350/4000 | Loss: 2.1500 | LR: 0.000003 | TPS: 1718 | 15978s
Step 3360/4000 | Loss: 1.9667 | LR: 0.000003 | TPS: 1718 | 16025s
Step 3370/4000 | Loss: 2.1206 | LR: 0.000003 | TPS: 1718 | 16073s
Step 3380/4000 | Loss: 2.3028 | LR: 0.000003 | TPS: 1718 | 16120s
Step 3390/4000 | Loss: 2.0075 | LR: 0.000003 | TPS: 1718 | 16167s
Step 3400/4000 | Loss: 2.0562 | LR: 0.000003 | TPS: 1718 | 16214s
Step 3410/4000 | Loss: 1.9977 | LR: 0.000003 | TPS: 1718 | 16261s
Step 3420/4000 | Loss: 2.1680 | LR: 0.000003 | TPS: 1718 | 16308s
Step 3430/4000 | Loss: 2.0009 | LR: 0.000003 | TPS: 1718 | 16355s
Step 3440/4000 | Loss: 1.8301 | LR: 0.000003 | TPS: 1718 | 16403s
Step 3450/4000 | Loss: 2.0239 | LR: 0.000003 | TPS: 1718 | 16450s
Step 3460/4000 | Loss: 2.0535 | LR: 0.000003 | TPS: 1718 | 16497s
Step 3470/4000 | Loss: 2.1348 | LR: 0.000003 | TPS: 1718 | 16544s
Step 3480/4000 | Loss: 2.0337 | LR: 0.000003 | TPS: 1718 | 16591s
Step 3490/4000 | Loss: 1.9342 | LR: 0.000003 | TPS: 1718 | 16638s
Step 3500/4000 | Loss: 2.0052 | LR: 0.000003 | TPS: 1718 | 16685s
Step 3510/4000 | Loss: 1.9902 | LR: 0.000003 | TPS: 1718 | 16732s
Step 3520/4000 | Loss: 2.1567 | LR: 0.000003 | TPS: 1719 | 16780s
Step 3530/4000 | Loss: 2.0515 | LR: 0.000003 | TPS: 1719 | 16827s
Step 3540/4000 | Loss: 2.1572 | LR: 0.000003 | TPS: 1719 | 16874s
Step 3550/4000 | Loss: 2.1381 | LR: 0.000003 | TPS: 1719 | 16921s
Step 3560/4000 | Loss: 2.0383 | LR: 0.000003 | TPS: 1719 | 16968s
Step 3570/4000 | Loss: 2.3566 | LR: 0.000003 | TPS: 1719 | 17015s
Step 3580/4000 | Loss: 1.9773 | LR: 0.000003 | TPS: 1719 | 17062s
Step 3590/4000 | Loss: 2.0418 | LR: 0.000003 | TPS: 1719 | 17110s
Step 3600/4000 | Loss: 2.1756 | LR: 0.000002 | TPS: 1719 | 17157s
📊 Val loss: 2.1478
Step 3610/4000 | Loss: 2.0761 | LR: 0.000002 | TPS: 1718 | 17209s
Step 3620/4000 | Loss: 2.1353 | LR: 0.000002 | TPS: 1718 | 17257s
Step 3630/4000 | Loss: 2.1856 | LR: 0.000002 | TPS: 1719 | 17304s
Step 3640/4000 | Loss: 2.1298 | LR: 0.000002 | TPS: 1719 | 17351s
Step 3650/4000 | Loss: 2.0784 | LR: 0.000002 | TPS: 1719 | 17398s
Step 3660/4000 | Loss: 2.0533 | LR: 0.000002 | TPS: 1719 | 17445s
Step 3670/4000 | Loss: 2.2151 | LR: 0.000002 | TPS: 1719 | 17492s
Step 3680/4000 | Loss: 2.0177 | LR: 0.000002 | TPS: 1719 | 17539s
Step 3690/4000 | Loss: 2.1048 | LR: 0.000002 | TPS: 1719 | 17587s
Step 3700/4000 | Loss: 2.0629 | LR: 0.000002 | TPS: 1719 | 17634s
Step 3710/4000 | Loss: 2.0375 | LR: 0.000002 | TPS: 1719 | 17681s
Step 3720/4000 | Loss: 2.2282 | LR: 0.000002 | TPS: 1719 | 17728s
Step 3730/4000 | Loss: 2.2049 | LR: 0.000002 | TPS: 1719 | 17775s
Step 3740/4000 | Loss: 2.0247 | LR: 0.000002 | TPS: 1719 | 17822s
Step 3750/4000 | Loss: 2.0337 | LR: 0.000002 | TPS: 1719 | 17869s
Step 3760/4000 | Loss: 2.0922 | LR: 0.000002 | TPS: 1719 | 17917s
Step 3770/4000 | Loss: 2.1018 | LR: 0.000002 | TPS: 1719 | 17964s
Step 3780/4000 | Loss: 2.1183 | LR: 0.000002 | TPS: 1719 | 18011s
Step 3790/4000 | Loss: 2.2469 | LR: 0.000002 | TPS: 1719 | 18058s
Step 3800/4000 | Loss: 2.1373 | LR: 0.000002 | TPS: 1719 | 18105s
Step 3810/4000 | Loss: 2.1103 | LR: 0.000002 | TPS: 1719 | 18152s
Step 3820/4000 | Loss: 2.0317 | LR: 0.000002 | TPS: 1719 | 18199s
Step 3830/4000 | Loss: 2.0022 | LR: 0.000002 | TPS: 1720 | 18247s
Step 3840/4000 | Loss: 2.1618 | LR: 0.000002 | TPS: 1720 | 18294s
Step 3850/4000 | Loss: 2.1421 | LR: 0.000002 | TPS: 1720 | 18341s
Step 3860/4000 | Loss: 1.9279 | LR: 0.000002 | TPS: 1720 | 18388s
Step 3870/4000 | Loss: 2.1657 | LR: 0.000002 | TPS: 1720 | 18435s
Step 3880/4000 | Loss: 2.1433 | LR: 0.000002 | TPS: 1720 | 18482s
Step 3890/4000 | Loss: 2.0893 | LR: 0.000002 | TPS: 1720 | 18529s
Step 3900/4000 | Loss: 2.0036 | LR: 0.000002 | TPS: 1720 | 18576s
Step 3910/4000 | Loss: 2.0691 | LR: 0.000002 | TPS: 1720 | 18624s
Step 3920/4000 | Loss: 2.0282 | LR: 0.000002 | TPS: 1720 | 18671s
Step 3930/4000 | Loss: 1.9818 | LR: 0.000002 | TPS: 1720 | 18718s
Step 3940/4000 | Loss: 2.1466 | LR: 0.000002 | TPS: 1720 | 18765s
Step 3950/4000 | Loss: 2.0455 | LR: 0.000002 | TPS: 1720 | 18812s
Step 3960/4000 | Loss: 2.1226 | LR: 0.000002 | TPS: 1720 | 18859s
Step 3970/4000 | Loss: 1.9890 | LR: 0.000002 | TPS: 1720 | 18906s
Step 3980/4000 | Loss: 2.1891 | LR: 0.000002 | TPS: 1720 | 18954s
Step 3990/4000 | Loss: 1.8920 | LR: 0.000002 | TPS: 1720 | 19001s
Step 4000/4000 | Loss: 2.0073 | LR: 0.000002 | TPS: 1720 | 19048s
📊 Val loss: 2.1472
🔤 Generation samples (step 4000):
[EN] The capital of France consists of 38 cities, 26.9% (14) of which are in the metropolitan area.
[HE] צרפת היא אחת מיעדי התיירות הפופולאריים ביותר בעולם, בשל היותה מוקד משיכה תיירותי משמעותי עבור תיירים מכל רחבי העולם. העיר בנויה משני חלקים עיקריים - כיכר ד'ארסאן (Droite Sud) ורחוב ד'ארסאן (De La Roch
[AR] باريس.
[FA] پاریس شهری بزرگ و تاریخی در شمال غربی اروپا است.
[TRANSLATE] It’s very short.
============================================================
SFT TRAINING COMPLETE
Steps: 4000, Time: 19057s (317.6min)
Best val loss: 2.1164
Model saved to: /tmp/sft/sft_model_v2.pt
============================================================
Uploading to S3...