| [2026-04-07 15:25:15] === Multilingual 3.14B Training — FSDP (Arabic-Rebalanced) === |
| [2026-04-07 15:25:15] World size: 8, Batch/GPU: 32, Grad accum: 1 |
| [2026-04-07 15:25:15] Effective batch: 256 seqs = 524,288 tokens/step |
| [2026-04-07 15:25:15] Total steps: 20000 = 10,485,760,000 tokens |
| [2026-04-07 15:25:15] Schedule: WSD-LINEAR | warmup=600 | stable_end=14000 | total=20000 |
| [2026-04-07 15:25:15] AdamW LR=0.0003, betas=(0.9, 0.98), WD=0.02 |
| [2026-04-07 15:25:15] Label smoothing=0.06, min_lr=0.03, grad_clip=1.0 |
| [2026-04-07 15:25:15] SWA: start=8000, freq=40 |
| [2026-04-07 15:25:15] Model: dim=3072, depth=26, heads=24, dropout=0.05 |
| [2026-04-07 15:25:15] FSDP: FULL_SHARD, MixedPrecision(bf16/fp32), Block-level wrapping |
| [2026-04-07 15:25:15] Loading training data... |
| [2026-04-07 15:25:15] Train tokens: 4,479,984,561 |
| [2026-04-07 15:25:15] Loading validation data... |
| [2026-04-07 15:25:15] val_en: 20 batches |
| [2026-04-07 15:25:15] val_ar: 20 batches |
| [2026-04-07 15:25:15] val_he: 20 batches |
| [2026-04-07 15:25:15] val_fa: 20 batches |
| [2026-04-07 15:25:15] Creating model... |
| [2026-04-07 15:25:45] Model params: 3,042,868,224 (non-embedding: 2,944,564,224) |
| [2026-04-07 15:25:45] Wrapping model with FSDP... |
| [2026-04-07 15:25:47] FSDP wrapped. GPU memory: 1.5 GB |
| [2026-04-07 15:25:47] Applied FSDP activation checkpointing to Block layers |
| [2026-04-07 15:25:47] Using standard AdamW (bitsandbytes not available) |
| [2026-04-07 15:25:54] Loading checkpoint from step 11500... |
| [2026-04-07 15:26:03] Resumed from step 11500. Tokens: 6,029,312,000, Best BPB: 3.2654 |
| [2026-04-07 15:26:03] NOTE: Optimizer state reset (fresh AdamW). LR schedule continues from step 11501. |
| [2026-04-07 15:26:04] Starting training from step 11501... |
| [2026-04-07 15:29:51] Step 11550/20000 [stable] | Loss: 2.8941 | BPB: 4.1753 | LR: 0.000300 | Tokens: 6,055,526,400 | TPS: 26,710,866 | SWA: 1 | Mem: 46.7GB | 3.8min |
| [2026-04-07 15:34:25] Step 11600/20000 [stable] | Loss: 3.0006 | BPB: 4.3289 | LR: 0.000300 | Tokens: 6,081,740,800 | TPS: 12,149,351 | SWA: 3 | Mem: 46.7GB | 8.3min |
| [2026-04-07 15:38:25] Step 11650/20000 [stable] | Loss: 2.9979 | BPB: 4.3250 | LR: 0.000300 | Tokens: 6,107,955,200 | TPS: 8,243,253 | SWA: 4 | Mem: 46.7GB | 12.3min |
| [2026-04-07 15:42:26] Step 11700/20000 [stable] | Loss: 2.8086 | BPB: 4.0520 | LR: 0.000300 | Tokens: 6,134,169,600 | TPS: 6,249,806 | SWA: 5 | Mem: 46.7GB | 16.4min |
| [2026-04-07 15:46:26] Step 11750/20000 [stable] | Loss: 2.9363 | BPB: 4.2361 | LR: 0.000300 | Tokens: 6,160,384,000 | TPS: 5,040,765 | SWA: 6 | Mem: 46.7GB | 20.4min |
| [2026-04-07 15:46:50] Saved checkpoint at step 11750 |
| [2026-04-07 15:51:23] Step 11800/20000 [stable] | Loss: 3.1766 | BPB: 4.5828 | LR: 0.000300 | Tokens: 6,186,598,400 | TPS: 4,072,372 | SWA: 8 | Mem: 46.7GB | 25.3min |
| [2026-04-07 15:55:24] Step 11850/20000 [stable] | Loss: 2.9082 | BPB: 4.1957 | LR: 0.000300 | Tokens: 6,212,812,800 | TPS: 3,530,105 | SWA: 9 | Mem: 46.7GB | 29.3min |
| [2026-04-07 15:59:25] Step 11900/20000 [stable] | Loss: 3.0306 | BPB: 4.3722 | LR: 0.000300 | Tokens: 6,239,027,200 | TPS: 3,118,964 | SWA: 10 | Mem: 46.7GB | 33.3min |
| [2026-04-07 16:03:26] Step 11950/20000 [stable] | Loss: 3.0494 | BPB: 4.3993 | LR: 0.000300 | Tokens: 6,265,241,600 | TPS: 2,795,440 | SWA: 11 | Mem: 46.7GB | 37.4min |
| [2026-04-07 16:08:01] Step 12000/20000 [stable] | Loss: 2.7197 | BPB: 3.9236 | LR: 0.000300 | Tokens: 6,291,456,000 | TPS: 2,499,902 | SWA: 13 | Mem: 46.7GB | 41.9min |
| [2026-04-07 16:08:25] Saved checkpoint at step 12000 |
| [2026-04-07 16:08:31] --- Evaluation at step 12000 --- |
| [2026-04-07 16:08:31] Combined val BPB: 3.2512 |
| [2026-04-07 16:08:36] en val BPB: 4.0340 |
| [2026-04-07 16:08:41] ar val BPB: 3.9566 |
| [2026-04-07 16:08:46] he val BPB: 1.8995 |
| [2026-04-07 16:08:52] fa val BPB: 3.3992 |
| [2026-04-07 16:08:52] New best! BPB: 3.2512 |
| [2026-04-07 16:09:45] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 16:13:46] Step 12050/20000 [stable] | Loss: 3.1092 | BPB: 4.4856 | LR: 0.000300 | Tokens: 6,317,670,400 | TPS: 2,207,326 | SWA: 14 | Mem: 46.7GB | 47.7min |
| [2026-04-07 16:17:48] Step 12100/20000 [stable] | Loss: 3.0531 | BPB: 4.4046 | LR: 0.000300 | Tokens: 6,343,884,800 | TPS: 2,044,072 | SWA: 15 | Mem: 46.7GB | 51.7min |
| [2026-04-07 16:21:49] Step 12150/20000 [stable] | Loss: 3.0343 | BPB: 4.3776 | LR: 0.000300 | Tokens: 6,370,099,200 | TPS: 1,904,576 | SWA: 16 | Mem: 46.7GB | 55.7min |
| [2026-04-07 16:26:24] Step 12200/20000 [stable] | Loss: 3.0501 | BPB: 4.4004 | LR: 0.000300 | Tokens: 6,396,313,600 | TPS: 1,767,257 | SWA: 18 | Mem: 46.7GB | 60.3min |
| [2026-04-07 16:30:25] Step 12250/20000 [stable] | Loss: 3.1364 | BPB: 4.5249 | LR: 0.000300 | Tokens: 6,422,528,000 | TPS: 1,663,761 | SWA: 19 | Mem: 46.7GB | 64.3min |
| [2026-04-07 16:30:48] Saved checkpoint at step 12250 |
| [2026-04-07 16:34:49] Step 12300/20000 [stable] | Loss: 3.1457 | BPB: 4.5383 | LR: 0.000300 | Tokens: 6,448,742,400 | TPS: 1,563,292 | SWA: 20 | Mem: 46.7GB | 68.8min |
| [2026-04-07 16:38:51] Step 12350/20000 [stable] | Loss: 3.0414 | BPB: 4.3879 | LR: 0.000300 | Tokens: 6,474,956,800 | TPS: 1,482,944 | SWA: 21 | Mem: 46.7GB | 72.8min |
| [2026-04-07 16:43:25] Step 12400/20000 [stable] | Loss: 3.0295 | BPB: 4.3707 | LR: 0.000300 | Tokens: 6,501,171,200 | TPS: 1,400,772 | SWA: 23 | Mem: 46.7GB | 77.4min |
| [2026-04-07 16:47:26] Step 12450/20000 [stable] | Loss: 2.9572 | BPB: 4.2663 | LR: 0.000300 | Tokens: 6,527,385,600 | TPS: 1,337,132 | SWA: 24 | Mem: 46.7GB | 81.4min |
| [2026-04-07 16:51:27] Step 12500/20000 [stable] | Loss: 3.0041 | BPB: 4.3339 | LR: 0.000300 | Tokens: 6,553,600,000 | TPS: 1,279,217 | SWA: 25 | Mem: 46.7GB | 85.4min |
| [2026-04-07 16:51:51] Saved checkpoint at step 12500 |
| [2026-04-07 16:51:56] --- Evaluation at step 12500 --- |
| [2026-04-07 16:51:56] Combined val BPB: 3.2389 |
| [2026-04-07 16:52:02] en val BPB: 4.0211 |
| [2026-04-07 16:52:07] ar val BPB: 3.9414 |
| [2026-04-07 16:52:12] he val BPB: 1.8939 |
| [2026-04-07 16:52:17] fa val BPB: 3.3833 |
| [2026-04-07 16:52:17] New best! BPB: 3.2389 |
| [2026-04-07 16:53:11] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 16:57:12] Step 12550/20000 [stable] | Loss: 3.0963 | BPB: 4.4670 | LR: 0.000300 | Tokens: 6,579,814,400 | TPS: 1,203,315 | SWA: 26 | Mem: 46.7GB | 91.1min |
| [2026-04-07 17:01:47] Step 12600/20000 [stable] | Loss: 3.0497 | BPB: 4.3998 | LR: 0.000300 | Tokens: 6,606,028,800 | TPS: 1,150,394 | SWA: 28 | Mem: 46.7GB | 95.7min |
| [2026-04-07 17:05:47] Step 12650/20000 [stable] | Loss: 3.0359 | BPB: 4.3798 | LR: 0.000300 | Tokens: 6,632,243,200 | TPS: 1,108,491 | SWA: 29 | Mem: 46.7GB | 99.7min |
| [2026-04-07 17:09:48] Step 12700/20000 [stable] | Loss: 3.1734 | BPB: 4.5783 | LR: 0.000300 | Tokens: 6,658,457,600 | TPS: 1,069,814 | SWA: 30 | Mem: 46.7GB | 103.7min |
| [2026-04-07 17:13:50] Step 12750/20000 [stable] | Loss: 2.9372 | BPB: 4.2374 | LR: 0.000300 | Tokens: 6,684,672,000 | TPS: 1,033,934 | SWA: 31 | Mem: 46.7GB | 107.8min |
| [2026-04-07 17:14:13] Saved checkpoint at step 12750 |
| [2026-04-07 17:18:48] Step 12800/20000 [stable] | Loss: 2.6958 | BPB: 3.8893 | LR: 0.000300 | Tokens: 6,710,886,400 | TPS: 992,175 | SWA: 33 | Mem: 46.7GB | 112.7min |
| [2026-04-07 17:22:49] Step 12850/20000 [stable] | Loss: 3.3123 | BPB: 4.7786 | LR: 0.000300 | Tokens: 6,737,100,800 | TPS: 961,854 | SWA: 34 | Mem: 46.7GB | 116.7min |
| [2026-04-07 17:26:49] Step 12900/20000 [stable] | Loss: 2.9599 | BPB: 4.2702 | LR: 0.000300 | Tokens: 6,763,315,200 | TPS: 933,516 | SWA: 35 | Mem: 46.7GB | 120.7min |
| [2026-04-07 17:30:51] Step 12950/20000 [stable] | Loss: 2.8022 | BPB: 4.0428 | LR: 0.000300 | Tokens: 6,789,529,600 | TPS: 906,909 | SWA: 36 | Mem: 46.7GB | 124.8min |
| [2026-04-07 17:35:26] Step 13000/20000 [stable] | Loss: 2.9235 | BPB: 4.2177 | LR: 0.000300 | Tokens: 6,815,744,000 | TPS: 878,094 | SWA: 38 | Mem: 46.7GB | 129.4min |
| [2026-04-07 17:35:50] Saved checkpoint at step 13000 |
| [2026-04-07 17:35:55] --- Evaluation at step 13000 --- |
| [2026-04-07 17:35:55] Combined val BPB: 3.2196 |
| [2026-04-07 17:36:01] en val BPB: 4.0101 |
| [2026-04-07 17:36:06] ar val BPB: 3.9315 |
| [2026-04-07 17:36:11] he val BPB: 1.8794 |
| [2026-04-07 17:36:16] fa val BPB: 3.3678 |
| [2026-04-07 17:36:16] New best! BPB: 3.2196 |
| [2026-04-07 17:37:10] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 17:41:13] Step 13050/20000 [stable] | Loss: 2.9555 | BPB: 4.2639 | LR: 0.000300 | Tokens: 6,841,958,400 | TPS: 843,812 | SWA: 39 | Mem: 46.7GB | 135.1min |
| [2026-04-07 17:45:14] Step 13100/20000 [stable] | Loss: 2.9740 | BPB: 4.2906 | LR: 0.000300 | Tokens: 6,868,172,800 | TPS: 822,576 | SWA: 40 | Mem: 46.7GB | 139.2min |
| [2026-04-07 17:49:15] Step 13150/20000 [stable] | Loss: 3.1133 | BPB: 4.4915 | LR: 0.000300 | Tokens: 6,894,387,200 | TPS: 802,504 | SWA: 41 | Mem: 46.7GB | 143.2min |
| [2026-04-07 17:53:50] Step 13200/20000 [stable] | Loss: 3.0578 | BPB: 4.4115 | LR: 0.000300 | Tokens: 6,920,601,600 | TPS: 780,574 | SWA: 43 | Mem: 46.7GB | 147.8min |
| [2026-04-07 17:57:52] Step 13250/20000 [stable] | Loss: 3.0051 | BPB: 4.3354 | LR: 0.000300 | Tokens: 6,946,816,000 | TPS: 762,773 | SWA: 44 | Mem: 46.7GB | 151.8min |
| [2026-04-07 17:58:15] Saved checkpoint at step 13250 |
| [2026-04-07 18:02:17] Step 13300/20000 [stable] | Loss: 3.0474 | BPB: 4.3965 | LR: 0.000300 | Tokens: 6,973,030,400 | TPS: 743,988 | SWA: 45 | Mem: 46.7GB | 156.2min |
| [2026-04-07 18:06:18] Step 13350/20000 [stable] | Loss: 3.0762 | BPB: 4.4380 | LR: 0.000300 | Tokens: 6,999,244,800 | TPS: 728,028 | SWA: 46 | Mem: 46.7GB | 160.2min |
| [2026-04-07 18:10:54] Step 13400/20000 [stable] | Loss: 2.9245 | BPB: 4.2191 | LR: 0.000300 | Tokens: 7,025,459,200 | TPS: 710,379 | SWA: 48 | Mem: 46.7GB | 164.8min |
| [2026-04-07 18:14:55] Step 13450/20000 [stable] | Loss: 2.9225 | BPB: 4.2163 | LR: 0.000300 | Tokens: 7,051,673,600 | TPS: 696,044 | SWA: 49 | Mem: 46.7GB | 168.9min |
| [2026-04-07 18:18:56] Step 13500/20000 [stable] | Loss: 3.0370 | BPB: 4.3815 | LR: 0.000300 | Tokens: 7,077,888,000 | TPS: 682,398 | SWA: 50 | Mem: 46.7GB | 172.9min |
| [2026-04-07 18:19:20] Saved checkpoint at step 13500 |
| [2026-04-07 18:19:25] --- Evaluation at step 13500 --- |
| [2026-04-07 18:19:25] Combined val BPB: 3.2094 |
| [2026-04-07 18:19:31] en val BPB: 4.0007 |
| [2026-04-07 18:19:36] ar val BPB: 3.9208 |
| [2026-04-07 18:19:41] he val BPB: 1.8821 |
| [2026-04-07 18:19:46] fa val BPB: 3.3442 |
| [2026-04-07 18:19:46] New best! BPB: 3.2094 |
| [2026-04-07 18:20:41] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 18:24:42] Step 13550/20000 [stable] | Loss: 3.1894 | BPB: 4.6014 | LR: 0.000300 | Tokens: 7,104,102,400 | TPS: 662,824 | SWA: 51 | Mem: 46.7GB | 178.6min |
| [2026-04-07 18:29:18] Step 13600/20000 [stable] | Loss: 2.8078 | BPB: 4.0508 | LR: 0.000300 | Tokens: 7,130,316,800 | TPS: 648,578 | SWA: 53 | Mem: 46.7GB | 183.2min |
| [2026-04-07 18:33:19] Step 13650/20000 [stable] | Loss: 2.8769 | BPB: 4.1505 | LR: 0.000300 | Tokens: 7,156,531,200 | TPS: 636,984 | SWA: 54 | Mem: 46.7GB | 187.3min |
| [2026-04-07 18:37:21] Step 13700/20000 [stable] | Loss: 2.9699 | BPB: 4.2846 | LR: 0.000300 | Tokens: 7,182,745,600 | TPS: 625,875 | SWA: 55 | Mem: 46.7GB | 191.3min |
| [2026-04-07 18:41:23] Step 13750/20000 [stable] | Loss: 2.9296 | BPB: 4.2265 | LR: 0.000300 | Tokens: 7,208,960,000 | TPS: 615,186 | SWA: 56 | Mem: 46.7GB | 195.3min |
| [2026-04-07 18:41:47] Saved checkpoint at step 13750 |
| [2026-04-07 18:46:24] Step 13800/20000 [stable] | Loss: 3.0591 | BPB: 4.4133 | LR: 0.000300 | Tokens: 7,235,174,400 | TPS: 601,926 | SWA: 58 | Mem: 46.7GB | 200.3min |
| [2026-04-07 18:50:26] Step 13850/20000 [stable] | Loss: 3.2637 | BPB: 4.7085 | LR: 0.000300 | Tokens: 7,261,388,800 | TPS: 592,200 | SWA: 59 | Mem: 46.7GB | 204.4min |
| [2026-04-07 18:54:27] Step 13900/20000 [stable] | Loss: 3.2091 | BPB: 4.6297 | LR: 0.000300 | Tokens: 7,287,603,200 | TPS: 582,867 | SWA: 60 | Mem: 46.7GB | 208.4min |
| [2026-04-07 18:58:28] Step 13950/20000 [stable] | Loss: 2.9200 | BPB: 4.2127 | LR: 0.000300 | Tokens: 7,313,817,600 | TPS: 573,894 | SWA: 61 | Mem: 46.7GB | 212.4min |
| [2026-04-07 19:03:05] Step 14000/20000 [decay] | Loss: 2.7259 | BPB: 3.9326 | LR: 0.000300 | Tokens: 7,340,032,000 | TPS: 563,718 | SWA: 63 | Mem: 46.7GB | 217.0min |
| [2026-04-07 19:03:29] Saved checkpoint at step 14000 |
| [2026-04-07 19:03:34] --- Evaluation at step 14000 --- |
| [2026-04-07 19:03:34] Combined val BPB: 3.2072 |
| [2026-04-07 19:03:39] en val BPB: 3.9933 |
| [2026-04-07 19:03:44] ar val BPB: 3.9105 |
| [2026-04-07 19:03:50] he val BPB: 1.8704 |
| [2026-04-07 19:03:55] fa val BPB: 3.3445 |
| [2026-04-07 19:03:55] New best! BPB: 3.2072 |
| [2026-04-07 19:04:48] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 19:08:51] Step 14050/20000 [decay] | Loss: 2.9732 | BPB: 4.2894 | LR: 0.000298 | Tokens: 7,366,246,400 | TPS: 551,076 | SWA: 64 | Mem: 46.7GB | 222.8min |
| [2026-04-07 19:12:54] Step 14100/20000 [decay] | Loss: 2.9500 | BPB: 4.2560 | LR: 0.000295 | Tokens: 7,392,460,800 | TPS: 543,194 | SWA: 65 | Mem: 46.7GB | 226.8min |
| [2026-04-07 19:16:56] Step 14150/20000 [decay] | Loss: 2.9680 | BPB: 4.2820 | LR: 0.000293 | Tokens: 7,418,675,200 | TPS: 535,593 | SWA: 66 | Mem: 46.7GB | 230.9min |
| [2026-04-07 19:21:31] Step 14200/20000 [decay] | Loss: 3.0077 | BPB: 4.3392 | LR: 0.000290 | Tokens: 7,444,889,600 | TPS: 527,012 | SWA: 68 | Mem: 46.7GB | 235.4min |
| [2026-04-07 19:25:32] Step 14250/20000 [decay] | Loss: 3.0809 | BPB: 4.4448 | LR: 0.000288 | Tokens: 7,471,104,000 | TPS: 519,984 | SWA: 69 | Mem: 46.7GB | 239.5min |
| [2026-04-07 19:25:57] Saved checkpoint at step 14250 |
| [2026-04-07 19:29:59] Step 14300/20000 [decay] | Loss: 3.0979 | BPB: 4.4693 | LR: 0.000285 | Tokens: 7,497,318,400 | TPS: 512,307 | SWA: 70 | Mem: 46.7GB | 243.9min |
| [2026-04-07 19:34:01] Step 14350/20000 [decay] | Loss: 3.1389 | BPB: 4.5284 | LR: 0.000283 | Tokens: 7,523,532,800 | TPS: 505,738 | SWA: 71 | Mem: 46.7GB | 247.9min |
| [2026-04-07 19:38:36] Step 14400/20000 [decay] | Loss: 2.7258 | BPB: 3.9325 | LR: 0.000281 | Tokens: 7,549,747,200 | TPS: 498,262 | SWA: 73 | Mem: 46.7GB | 252.5min |
| [2026-04-07 19:42:38] Step 14450/20000 [decay] | Loss: 3.0346 | BPB: 4.3781 | LR: 0.000278 | Tokens: 7,575,961,600 | TPS: 492,141 | SWA: 74 | Mem: 46.7GB | 256.6min |
| [2026-04-07 19:46:41] Step 14500/20000 [decay] | Loss: 2.8087 | BPB: 4.0521 | LR: 0.000276 | Tokens: 7,602,176,000 | TPS: 486,179 | SWA: 75 | Mem: 46.7GB | 260.6min |
| [2026-04-07 19:47:05] Saved checkpoint at step 14500 |
| [2026-04-07 19:47:10] --- Evaluation at step 14500 --- |
| [2026-04-07 19:47:10] Combined val BPB: 3.1949 |
| [2026-04-07 19:47:16] en val BPB: 3.9780 |
| [2026-04-07 19:47:21] ar val BPB: 3.8893 |
| [2026-04-07 19:47:26] he val BPB: 1.8582 |
| [2026-04-07 19:47:31] fa val BPB: 3.3323 |
| [2026-04-07 19:47:31] New best! BPB: 3.1949 |
| [2026-04-07 19:48:25] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 19:52:27] Step 14550/20000 [decay] | Loss: 2.9876 | BPB: 4.3103 | LR: 0.000273 | Tokens: 7,628,390,400 | TPS: 477,294 | SWA: 76 | Mem: 46.7GB | 266.4min |
| [2026-04-07 19:57:02] Step 14600/20000 [decay] | Loss: 2.9462 | BPB: 4.2504 | LR: 0.000271 | Tokens: 7,654,604,800 | TPS: 470,841 | SWA: 78 | Mem: 46.7GB | 271.0min |
| [2026-04-07 20:01:03] Step 14650/20000 [decay] | Loss: 2.9918 | BPB: 4.3162 | LR: 0.000268 | Tokens: 7,680,819,200 | TPS: 465,549 | SWA: 79 | Mem: 46.7GB | 275.0min |
| [2026-04-07 20:05:04] Step 14700/20000 [decay] | Loss: 2.8854 | BPB: 4.1628 | LR: 0.000266 | Tokens: 7,707,033,600 | TPS: 460,412 | SWA: 80 | Mem: 46.7GB | 279.0min |
| [2026-04-07 20:09:06] Step 14750/20000 [decay] | Loss: 3.2876 | BPB: 4.7430 | LR: 0.000264 | Tokens: 7,733,248,000 | TPS: 455,389 | SWA: 81 | Mem: 46.7GB | 283.0min |
| [2026-04-07 20:09:29] Saved checkpoint at step 14750 |
| [2026-04-07 20:14:04] Step 14800/20000 [decay] | Loss: 2.8727 | BPB: 4.1444 | LR: 0.000261 | Tokens: 7,759,462,400 | TPS: 449,043 | SWA: 83 | Mem: 46.7GB | 288.0min |
| [2026-04-07 20:18:06] Step 14850/20000 [decay] | Loss: 3.0526 | BPB: 4.4040 | LR: 0.000259 | Tokens: 7,785,676,800 | TPS: 444,357 | SWA: 84 | Mem: 46.7GB | 292.0min |
| [2026-04-07 20:22:07] Step 14900/20000 [decay] | Loss: 2.6748 | BPB: 3.8590 | LR: 0.000256 | Tokens: 7,811,891,200 | TPS: 439,802 | SWA: 85 | Mem: 46.7GB | 296.0min |
| [2026-04-07 20:26:09] Step 14950/20000 [decay] | Loss: 2.9423 | BPB: 4.2449 | LR: 0.000254 | Tokens: 7,838,105,600 | TPS: 435,348 | SWA: 86 | Mem: 46.7GB | 300.1min |
| [2026-04-07 20:30:44] Step 15000/20000 [decay] | Loss: 2.7843 | BPB: 4.0169 | LR: 0.000251 | Tokens: 7,864,320,000 | TPS: 430,231 | SWA: 88 | Mem: 46.7GB | 304.7min |
| [2026-04-07 20:31:08] Saved checkpoint at step 15000 |
| [2026-04-07 20:31:13] --- Evaluation at step 15000 --- |
| [2026-04-07 20:31:13] Combined val BPB: 3.1722 |
| [2026-04-07 20:31:19] en val BPB: 3.9594 |
| [2026-04-07 20:31:24] ar val BPB: 3.8730 |
| [2026-04-07 20:31:29] he val BPB: 1.8443 |
| [2026-04-07 20:31:34] fa val BPB: 3.3170 |
| [2026-04-07 20:31:34] New best! BPB: 3.1722 |
| [2026-04-07 20:32:28] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 20:36:31] Step 15050/20000 [decay] | Loss: 2.9149 | BPB: 4.2053 | LR: 0.000249 | Tokens: 7,890,534,400 | TPS: 423,622 | SWA: 89 | Mem: 46.7GB | 310.4min |
| [2026-04-07 20:40:32] Step 15100/20000 [decay] | Loss: 2.8421 | BPB: 4.1003 | LR: 0.000247 | Tokens: 7,916,748,800 | TPS: 419,582 | SWA: 90 | Mem: 46.7GB | 314.5min |
| [2026-04-07 20:44:34] Step 15150/20000 [decay] | Loss: 3.0003 | BPB: 4.3285 | LR: 0.000244 | Tokens: 7,942,963,200 | TPS: 415,645 | SWA: 91 | Mem: 46.7GB | 318.5min |
| [2026-04-07 20:49:10] Step 15200/20000 [decay] | Loss: 2.9927 | BPB: 4.3175 | LR: 0.000242 | Tokens: 7,969,177,600 | TPS: 411,081 | SWA: 93 | Mem: 46.7GB | 323.1min |
| [2026-04-07 20:53:12] Step 15250/20000 [decay] | Loss: 2.9789 | BPB: 4.2977 | LR: 0.000239 | Tokens: 7,995,392,000 | TPS: 407,346 | SWA: 94 | Mem: 46.7GB | 327.1min |
| [2026-04-07 20:53:36] Saved checkpoint at step 15250 |
| [2026-04-07 20:57:38] Step 15300/20000 [decay] | Loss: 2.7740 | BPB: 4.0020 | LR: 0.000237 | Tokens: 8,021,606,400 | TPS: 403,221 | SWA: 95 | Mem: 46.7GB | 331.6min |
| [2026-04-07 21:01:40] Step 15350/20000 [decay] | Loss: 2.8095 | BPB: 4.0533 | LR: 0.000235 | Tokens: 8,047,820,800 | TPS: 399,685 | SWA: 96 | Mem: 46.7GB | 335.6min |
| [2026-04-07 21:06:15] Step 15400/20000 [decay] | Loss: 3.0071 | BPB: 4.3383 | LR: 0.000232 | Tokens: 8,074,035,200 | TPS: 395,569 | SWA: 98 | Mem: 46.7GB | 340.2min |
| [2026-04-07 21:10:18] Step 15450/20000 [decay] | Loss: 2.8658 | BPB: 4.1345 | LR: 0.000230 | Tokens: 8,100,249,600 | TPS: 392,187 | SWA: 99 | Mem: 46.7GB | 344.2min |
| [2026-04-07 21:14:21] Step 15500/20000 [decay] | Loss: 2.9056 | BPB: 4.1919 | LR: 0.000227 | Tokens: 8,126,464,000 | TPS: 388,892 | SWA: 100 | Mem: 46.7GB | 348.3min |
| [2026-04-07 21:14:45] Saved checkpoint at step 15500 |
| [2026-04-07 21:14:50] --- Evaluation at step 15500 --- |
| [2026-04-07 21:14:50] Combined val BPB: 3.1581 |
| [2026-04-07 21:14:56] en val BPB: 3.9423 |
| [2026-04-07 21:15:01] ar val BPB: 3.8530 |
| [2026-04-07 21:15:06] he val BPB: 1.8316 |
| [2026-04-07 21:15:11] fa val BPB: 3.2893 |
| [2026-04-07 21:15:11] New best! BPB: 3.1581 |
| [2026-04-07 21:16:05] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 21:20:07] Step 15550/20000 [decay] | Loss: 3.0142 | BPB: 4.3485 | LR: 0.000225 | Tokens: 8,152,678,400 | TPS: 383,789 | SWA: 101 | Mem: 46.7GB | 354.0min |
| [2026-04-07 21:24:44] Step 15600/20000 [decay] | Loss: 3.0668 | BPB: 4.4245 | LR: 0.000222 | Tokens: 8,178,892,800 | TPS: 380,074 | SWA: 103 | Mem: 46.7GB | 358.7min |
| [2026-04-07 21:28:45] Step 15650/20000 [decay] | Loss: 2.9354 | BPB: 4.2348 | LR: 0.000220 | Tokens: 8,205,107,200 | TPS: 377,066 | SWA: 104 | Mem: 46.7GB | 362.7min |
| [2026-04-07 21:32:46] Step 15700/20000 [decay] | Loss: 2.8455 | BPB: 4.1051 | LR: 0.000218 | Tokens: 8,231,321,600 | TPS: 374,120 | SWA: 105 | Mem: 46.7GB | 366.7min |
| [2026-04-07 21:36:48] Step 15750/20000 [decay] | Loss: 2.9204 | BPB: 4.2133 | LR: 0.000215 | Tokens: 8,257,536,000 | TPS: 371,223 | SWA: 106 | Mem: 46.7GB | 370.7min |
| [2026-04-07 21:37:13] Saved checkpoint at step 15750 |
| [2026-04-07 21:41:49] Step 15800/20000 [decay] | Loss: 3.0251 | BPB: 4.3643 | LR: 0.000213 | Tokens: 8,283,750,400 | TPS: 367,438 | SWA: 108 | Mem: 46.7GB | 375.7min |
| [2026-04-07 21:45:51] Step 15850/20000 [decay] | Loss: 2.7579 | BPB: 3.9788 | LR: 0.000210 | Tokens: 8,309,964,800 | TPS: 364,691 | SWA: 109 | Mem: 46.7GB | 379.8min |
| [2026-04-07 21:49:53] Step 15900/20000 [decay] | Loss: 3.0697 | BPB: 4.4286 | LR: 0.000208 | Tokens: 8,336,179,200 | TPS: 361,997 | SWA: 110 | Mem: 46.7GB | 383.8min |
| [2026-04-07 21:53:55] Step 15950/20000 [decay] | Loss: 3.0842 | BPB: 4.4495 | LR: 0.000205 | Tokens: 8,362,393,600 | TPS: 359,355 | SWA: 111 | Mem: 46.7GB | 387.8min |
| [2026-04-07 21:58:31] Step 16000/20000 [decay] | Loss: 2.7730 | BPB: 4.0006 | LR: 0.000203 | Tokens: 8,388,608,000 | TPS: 356,248 | SWA: 113 | Mem: 46.7GB | 392.5min |
| [2026-04-07 21:58:56] Saved checkpoint at step 16000 |
| [2026-04-07 21:59:01] --- Evaluation at step 16000 --- |
| [2026-04-07 21:59:01] Combined val BPB: 3.1337 |
| [2026-04-07 21:59:06] en val BPB: 3.9261 |
| [2026-04-07 21:59:12] ar val BPB: 3.8328 |
| [2026-04-07 21:59:17] he val BPB: 1.8194 |
| [2026-04-07 21:59:22] fa val BPB: 3.2746 |
| [2026-04-07 21:59:22] New best! BPB: 3.1337 |
| [2026-04-07 22:00:16] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 22:04:18] Step 16050/20000 [decay] | Loss: 2.9140 | BPB: 4.2040 | LR: 0.000201 | Tokens: 8,414,822,400 | TPS: 352,177 | SWA: 114 | Mem: 46.7GB | 398.2min |
| [2026-04-07 22:08:20] Step 16100/20000 [decay] | Loss: 3.1789 | BPB: 4.5862 | LR: 0.000198 | Tokens: 8,441,036,800 | TPS: 349,734 | SWA: 115 | Mem: 46.7GB | 402.3min |
| [2026-04-07 22:12:21] Step 16150/20000 [decay] | Loss: 2.8371 | BPB: 4.0931 | LR: 0.000196 | Tokens: 8,467,251,200 | TPS: 347,345 | SWA: 116 | Mem: 46.7GB | 406.3min |
| [2026-04-07 22:16:57] Step 16200/20000 [decay] | Loss: 2.9875 | BPB: 4.3100 | LR: 0.000193 | Tokens: 8,493,465,600 | TPS: 344,518 | SWA: 118 | Mem: 46.7GB | 410.9min |
| [2026-04-07 22:21:00] Step 16250/20000 [decay] | Loss: 2.7889 | BPB: 4.0235 | LR: 0.000191 | Tokens: 8,519,680,000 | TPS: 342,221 | SWA: 119 | Mem: 46.7GB | 414.9min |
| [2026-04-07 22:21:24] Saved checkpoint at step 16250 |
| [2026-04-07 22:25:25] Step 16300/20000 [decay] | Loss: 3.0371 | BPB: 4.3816 | LR: 0.000188 | Tokens: 8,545,894,400 | TPS: 339,646 | SWA: 120 | Mem: 46.7GB | 419.4min |
| [2026-04-07 22:29:27] Step 16350/20000 [decay] | Loss: 2.8663 | BPB: 4.1352 | LR: 0.000186 | Tokens: 8,572,108,800 | TPS: 337,451 | SWA: 121 | Mem: 46.7GB | 423.4min |
| [2026-04-07 22:34:03] Step 16400/20000 [decay] | Loss: 2.9042 | BPB: 4.1899 | LR: 0.000184 | Tokens: 8,598,323,200 | TPS: 334,840 | SWA: 123 | Mem: 46.7GB | 428.0min |
| [2026-04-07 22:38:05] Step 16450/20000 [decay] | Loss: 2.8025 | BPB: 4.0432 | LR: 0.000181 | Tokens: 8,624,537,600 | TPS: 332,728 | SWA: 124 | Mem: 46.7GB | 432.0min |
| [2026-04-07 22:42:06] Step 16500/20000 [decay] | Loss: 2.7192 | BPB: 3.9230 | LR: 0.000179 | Tokens: 8,650,752,000 | TPS: 330,660 | SWA: 125 | Mem: 46.7GB | 436.0min |
| [2026-04-07 22:42:30] Saved checkpoint at step 16500 |
| [2026-04-07 22:42:36] --- Evaluation at step 16500 --- |
| [2026-04-07 22:42:36] Combined val BPB: 3.1256 |
| [2026-04-07 22:42:41] en val BPB: 3.9089 |
| [2026-04-07 22:42:46] ar val BPB: 3.8154 |
| [2026-04-07 22:42:51] he val BPB: 1.8087 |
| [2026-04-07 22:42:57] fa val BPB: 3.2425 |
| [2026-04-07 22:42:57] New best! BPB: 3.1256 |
| [2026-04-07 22:43:51] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 22:47:52] Step 16550/20000 [decay] | Loss: 2.9323 | BPB: 4.2304 | LR: 0.000176 | Tokens: 8,676,966,400 | TPS: 327,332 | SWA: 126 | Mem: 46.7GB | 441.8min |
| [2026-04-07 22:52:29] Step 16600/20000 [decay] | Loss: 3.1493 | BPB: 4.5435 | LR: 0.000174 | Tokens: 8,703,180,800 | TPS: 324,931 | SWA: 128 | Mem: 46.7GB | 446.4min |
| [2026-04-07 22:56:31] Step 16650/20000 [decay] | Loss: 2.8272 | BPB: 4.0789 | LR: 0.000171 | Tokens: 8,729,395,200 | TPS: 322,991 | SWA: 129 | Mem: 46.7GB | 450.4min |
| [2026-04-07 23:00:33] Step 16700/20000 [decay] | Loss: 2.7261 | BPB: 3.9330 | LR: 0.000169 | Tokens: 8,755,609,600 | TPS: 321,087 | SWA: 130 | Mem: 46.7GB | 454.5min |
| [2026-04-07 23:04:35] Step 16750/20000 [decay] | Loss: 2.8333 | BPB: 4.0876 | LR: 0.000167 | Tokens: 8,781,824,000 | TPS: 319,218 | SWA: 131 | Mem: 46.7GB | 458.5min |
| [2026-04-07 23:04:58] Saved checkpoint at step 16750 |
| [2026-04-07 23:09:34] Step 16800/20000 [decay] | Loss: 3.0743 | BPB: 4.4353 | LR: 0.000164 | Tokens: 8,808,038,400 | TPS: 316,729 | SWA: 133 | Mem: 46.7GB | 463.5min |
| [2026-04-07 23:13:35] Step 16850/20000 [decay] | Loss: 2.8271 | BPB: 4.0787 | LR: 0.000162 | Tokens: 8,834,252,800 | TPS: 314,938 | SWA: 134 | Mem: 46.7GB | 467.5min |
| [2026-04-07 23:17:36] Step 16900/20000 [decay] | Loss: 2.6806 | BPB: 3.8673 | LR: 0.000159 | Tokens: 8,860,467,200 | TPS: 313,180 | SWA: 135 | Mem: 46.7GB | 471.5min |
| [2026-04-07 23:21:38] Step 16950/20000 [decay] | Loss: 3.0622 | BPB: 4.4178 | LR: 0.000157 | Tokens: 8,886,681,600 | TPS: 311,442 | SWA: 136 | Mem: 46.7GB | 475.6min |
| [2026-04-07 23:26:14] Step 17000/20000 [decay] | Loss: 2.6824 | BPB: 3.8699 | LR: 0.000154 | Tokens: 8,912,896,000 | TPS: 309,369 | SWA: 138 | Mem: 46.7GB | 480.2min |
| [2026-04-07 23:26:38] Saved checkpoint at step 17000 |
| [2026-04-07 23:26:44] --- Evaluation at step 17000 --- |
| [2026-04-07 23:26:44] Combined val BPB: 3.1151 |
| [2026-04-07 23:26:49] en val BPB: 3.8959 |
| [2026-04-07 23:26:54] ar val BPB: 3.7989 |
| [2026-04-07 23:27:00] he val BPB: 1.7978 |
| [2026-04-07 23:27:05] fa val BPB: 3.2179 |
| [2026-04-07 23:27:05] New best! BPB: 3.1151 |
| [2026-04-07 23:27:59] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-07 23:32:01] Step 17050/20000 [decay] | Loss: 2.7147 | BPB: 3.9164 | LR: 0.000152 | Tokens: 8,939,110,400 | TPS: 306,584 | SWA: 139 | Mem: 46.7GB | 486.0min |
| [2026-04-07 23:36:04] Step 17100/20000 [decay] | Loss: 2.7153 | BPB: 3.9174 | LR: 0.000150 | Tokens: 8,965,324,800 | TPS: 304,951 | SWA: 140 | Mem: 46.7GB | 490.0min |
| [2026-04-07 23:40:06] Step 17150/20000 [decay] | Loss: 2.8889 | BPB: 4.1678 | LR: 0.000147 | Tokens: 8,991,539,200 | TPS: 303,346 | SWA: 141 | Mem: 46.7GB | 494.0min |
| [2026-04-07 23:44:43] Step 17200/20000 [decay] | Loss: 2.7390 | BPB: 3.9516 | LR: 0.000145 | Tokens: 9,017,753,600 | TPS: 301,406 | SWA: 143 | Mem: 46.7GB | 498.7min |
| [2026-04-07 23:48:46] Step 17250/20000 [decay] | Loss: 2.9655 | BPB: 4.2784 | LR: 0.000142 | Tokens: 9,043,968,000 | TPS: 299,847 | SWA: 144 | Mem: 46.7GB | 502.7min |
| [2026-04-07 23:49:11] Saved checkpoint at step 17250 |
| [2026-04-07 23:53:13] Step 17300/20000 [decay] | Loss: 2.9317 | BPB: 4.2296 | LR: 0.000140 | Tokens: 9,070,182,400 | TPS: 298,081 | SWA: 145 | Mem: 46.7GB | 507.1min |
| [2026-04-07 23:57:15] Step 17350/20000 [decay] | Loss: 2.8900 | BPB: 4.1694 | LR: 0.000138 | Tokens: 9,096,396,800 | TPS: 296,584 | SWA: 146 | Mem: 46.7GB | 511.2min |
| [2026-04-08 00:01:51] Step 17400/20000 [decay] | Loss: 2.7505 | BPB: 3.9681 | LR: 0.000135 | Tokens: 9,122,611,200 | TPS: 294,786 | SWA: 148 | Mem: 46.7GB | 515.8min |
| [2026-04-08 00:05:53] Step 17450/20000 [decay] | Loss: 2.7415 | BPB: 3.9551 | LR: 0.000133 | Tokens: 9,148,825,600 | TPS: 293,342 | SWA: 149 | Mem: 46.7GB | 519.8min |
| [2026-04-08 00:09:54] Step 17500/20000 [decay] | Loss: 3.1388 | BPB: 4.5283 | LR: 0.000130 | Tokens: 9,175,040,000 | TPS: 291,924 | SWA: 150 | Mem: 46.7GB | 523.8min |
| [2026-04-08 00:10:18] Saved checkpoint at step 17500 |
| [2026-04-08 00:10:23] --- Evaluation at step 17500 --- |
| [2026-04-08 00:10:23] Combined val BPB: 3.1000 |
| [2026-04-08 00:10:28] en val BPB: 3.8792 |
| [2026-04-08 00:10:34] ar val BPB: 3.7830 |
| [2026-04-08 00:10:39] he val BPB: 1.7858 |
| [2026-04-08 00:10:44] fa val BPB: 3.2099 |
| [2026-04-08 00:10:44] New best! BPB: 3.1000 |
| [2026-04-08 00:11:38] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 00:15:39] Step 17550/20000 [decay] | Loss: 2.9232 | BPB: 4.2173 | LR: 0.000128 | Tokens: 9,201,254,400 | TPS: 289,576 | SWA: 151 | Mem: 46.7GB | 529.6min |
| [2026-04-08 00:20:16] Step 17600/20000 [decay] | Loss: 2.9519 | BPB: 4.2587 | LR: 0.000125 | Tokens: 9,227,468,800 | TPS: 287,897 | SWA: 153 | Mem: 46.7GB | 534.2min |
| [2026-04-08 00:24:17] Step 17650/20000 [decay] | Loss: 2.8536 | BPB: 4.1168 | LR: 0.000123 | Tokens: 9,253,683,200 | TPS: 286,553 | SWA: 154 | Mem: 46.7GB | 538.2min |
| [2026-04-08 00:28:20] Step 17700/20000 [decay] | Loss: 3.0683 | BPB: 4.4266 | LR: 0.000121 | Tokens: 9,279,897,600 | TPS: 285,219 | SWA: 155 | Mem: 46.7GB | 542.3min |
| [2026-04-08 00:32:23] Step 17750/20000 [decay] | Loss: 2.6592 | BPB: 3.8365 | LR: 0.000118 | Tokens: 9,306,112,000 | TPS: 283,907 | SWA: 156 | Mem: 46.7GB | 546.3min |
| [2026-04-08 00:32:47] Saved checkpoint at step 17750 |
| [2026-04-08 00:37:24] Step 17800/20000 [decay] | Loss: 2.8725 | BPB: 4.1441 | LR: 0.000116 | Tokens: 9,332,326,400 | TPS: 282,115 | SWA: 158 | Mem: 46.7GB | 551.3min |
| [2026-04-08 00:41:26] Step 17850/20000 [decay] | Loss: 2.8233 | BPB: 4.0732 | LR: 0.000113 | Tokens: 9,358,540,800 | TPS: 280,852 | SWA: 159 | Mem: 46.7GB | 555.4min |
| [2026-04-08 00:45:29] Step 17900/20000 [decay] | Loss: 2.5007 | BPB: 3.6078 | LR: 0.000111 | Tokens: 9,384,755,200 | TPS: 279,602 | SWA: 160 | Mem: 46.7GB | 559.4min |
| [2026-04-08 00:49:32] Step 17950/20000 [decay] | Loss: 2.6195 | BPB: 3.7792 | LR: 0.000108 | Tokens: 9,410,969,600 | TPS: 278,370 | SWA: 161 | Mem: 46.7GB | 563.5min |
| [2026-04-08 00:54:08] Step 18000/20000 [decay] | Loss: 2.6110 | BPB: 3.7668 | LR: 0.000106 | Tokens: 9,437,184,000 | TPS: 276,879 | SWA: 163 | Mem: 46.7GB | 568.1min |
| [2026-04-08 00:54:32] Saved checkpoint at step 18000 |
| [2026-04-08 00:54:38] --- Evaluation at step 18000 --- |
| [2026-04-08 00:54:38] Combined val BPB: 3.0849 |
| [2026-04-08 00:54:43] en val BPB: 3.8684 |
| [2026-04-08 00:54:48] ar val BPB: 3.7672 |
| [2026-04-08 00:54:53] he val BPB: 1.7790 |
| [2026-04-08 00:54:59] fa val BPB: 3.1862 |
| [2026-04-08 00:54:59] New best! BPB: 3.0849 |
| [2026-04-08 00:55:53] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 00:59:55] Step 18050/20000 [decay] | Loss: 2.7159 | BPB: 3.9182 | LR: 0.000104 | Tokens: 9,463,398,400 | TPS: 274,856 | SWA: 164 | Mem: 46.7GB | 573.8min |
| [2026-04-08 01:03:56] Step 18100/20000 [decay] | Loss: 2.8846 | BPB: 4.1617 | LR: 0.000101 | Tokens: 9,489,612,800 | TPS: 273,696 | SWA: 165 | Mem: 46.7GB | 577.9min |
| [2026-04-08 01:07:59] Step 18150/20000 [decay] | Loss: 3.0780 | BPB: 4.4406 | LR: 0.000099 | Tokens: 9,515,827,200 | TPS: 272,547 | SWA: 166 | Mem: 46.7GB | 581.9min |
| [2026-04-08 01:12:35] Step 18200/20000 [decay] | Loss: 2.9512 | BPB: 4.2576 | LR: 0.000096 | Tokens: 9,542,041,600 | TPS: 271,152 | SWA: 168 | Mem: 46.7GB | 586.5min |
| [2026-04-08 01:16:36] Step 18250/20000 [decay] | Loss: 2.6801 | BPB: 3.8666 | LR: 0.000094 | Tokens: 9,568,256,000 | TPS: 270,045 | SWA: 169 | Mem: 46.7GB | 590.5min |
| [2026-04-08 01:17:00] Saved checkpoint at step 18250 |
| [2026-04-08 01:21:03] Step 18300/20000 [decay] | Loss: 2.5476 | BPB: 3.6754 | LR: 0.000091 | Tokens: 9,594,470,400 | TPS: 268,766 | SWA: 170 | Mem: 46.7GB | 595.0min |
| [2026-04-08 01:25:05] Step 18350/20000 [decay] | Loss: 2.8480 | BPB: 4.1087 | LR: 0.000089 | Tokens: 9,620,684,800 | TPS: 267,685 | SWA: 171 | Mem: 46.7GB | 599.0min |
| [2026-04-08 01:29:41] Step 18400/20000 [decay] | Loss: 2.9482 | BPB: 4.2534 | LR: 0.000087 | Tokens: 9,646,899,200 | TPS: 266,364 | SWA: 173 | Mem: 46.7GB | 603.6min |
| [2026-04-08 01:33:43] Step 18450/20000 [decay] | Loss: 2.9971 | BPB: 4.3239 | LR: 0.000084 | Tokens: 9,673,113,600 | TPS: 265,317 | SWA: 174 | Mem: 46.7GB | 607.6min |
| [2026-04-08 01:37:45] Step 18500/20000 [decay] | Loss: 2.8399 | BPB: 4.0972 | LR: 0.000082 | Tokens: 9,699,328,000 | TPS: 264,279 | SWA: 175 | Mem: 46.7GB | 611.7min |
| [2026-04-08 01:38:10] Saved checkpoint at step 18500 |
| [2026-04-08 01:38:15] --- Evaluation at step 18500 --- |
| [2026-04-08 01:38:15] Combined val BPB: 3.0756 |
| [2026-04-08 01:38:20] en val BPB: 3.8558 |
| [2026-04-08 01:38:25] ar val BPB: 3.7564 |
| [2026-04-08 01:38:31] he val BPB: 1.7712 |
| [2026-04-08 01:38:36] fa val BPB: 3.1739 |
| [2026-04-08 01:38:36] New best! BPB: 3.0756 |
| [2026-04-08 01:39:30] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 01:43:33] Step 18550/20000 [decay] | Loss: 3.0638 | BPB: 4.4201 | LR: 0.000079 | Tokens: 9,725,542,400 | TPS: 262,510 | SWA: 176 | Mem: 46.7GB | 617.5min |
| [2026-04-08 01:48:08] Step 18600/20000 [decay] | Loss: 3.0227 | BPB: 4.3608 | LR: 0.000077 | Tokens: 9,751,756,800 | TPS: 261,275 | SWA: 178 | Mem: 46.7GB | 622.1min |
| [2026-04-08 01:52:10] Step 18650/20000 [decay] | Loss: 2.7842 | BPB: 4.0167 | LR: 0.000074 | Tokens: 9,777,971,200 | TPS: 260,290 | SWA: 179 | Mem: 46.7GB | 626.1min |
| [2026-04-08 01:56:12] Step 18700/20000 [decay] | Loss: 2.8273 | BPB: 4.0789 | LR: 0.000072 | Tokens: 9,804,185,600 | TPS: 259,317 | SWA: 180 | Mem: 46.7GB | 630.1min |
| [2026-04-08 02:00:14] Step 18750/20000 [decay] | Loss: 2.7938 | BPB: 4.0306 | LR: 0.000070 | Tokens: 9,830,400,000 | TPS: 258,356 | SWA: 181 | Mem: 46.7GB | 634.2min |
| [2026-04-08 02:00:38] Saved checkpoint at step 18750 |
| [2026-04-08 02:05:16] Step 18800/20000 [decay] | Loss: 2.9880 | BPB: 4.3108 | LR: 0.000067 | Tokens: 9,856,614,400 | TPS: 257,006 | SWA: 183 | Mem: 46.7GB | 639.2min |
| [2026-04-08 02:09:18] Step 18850/20000 [decay] | Loss: 2.7168 | BPB: 3.9195 | LR: 0.000065 | Tokens: 9,882,828,800 | TPS: 256,072 | SWA: 184 | Mem: 46.7GB | 643.2min |
| [2026-04-08 02:13:20] Step 18900/20000 [decay] | Loss: 2.9967 | BPB: 4.3233 | LR: 0.000062 | Tokens: 9,909,043,200 | TPS: 255,152 | SWA: 185 | Mem: 46.7GB | 647.3min |
| [2026-04-08 02:17:23] Step 18950/20000 [decay] | Loss: 2.8269 | BPB: 4.0784 | LR: 0.000060 | Tokens: 9,935,257,600 | TPS: 254,238 | SWA: 186 | Mem: 46.7GB | 651.3min |
| [2026-04-08 02:22:00] Step 19000/20000 [decay] | Loss: 3.1107 | BPB: 4.4878 | LR: 0.000057 | Tokens: 9,961,472,000 | TPS: 253,117 | SWA: 188 | Mem: 46.7GB | 655.9min |
| [2026-04-08 02:22:24] Saved checkpoint at step 19000 |
| [2026-04-08 02:22:30] --- Evaluation at step 19000 --- |
| [2026-04-08 02:22:30] Combined val BPB: 3.0642 |
| [2026-04-08 02:22:35] en val BPB: 3.8451 |
| [2026-04-08 02:22:40] ar val BPB: 3.7425 |
| [2026-04-08 02:22:45] he val BPB: 1.7616 |
| [2026-04-08 02:22:50] fa val BPB: 3.1646 |
| [2026-04-08 02:22:50] New best! BPB: 3.0642 |
| [2026-04-08 02:23:45] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 02:27:47] Step 19050/20000 [decay] | Loss: 2.8345 | BPB: 4.0894 | LR: 0.000055 | Tokens: 9,987,686,400 | TPS: 251,562 | SWA: 189 | Mem: 46.7GB | 661.7min |
| [2026-04-08 02:31:50] Step 19100/20000 [decay] | Loss: 2.7320 | BPB: 3.9415 | LR: 0.000053 | Tokens: 10,013,900,800 | TPS: 250,691 | SWA: 190 | Mem: 46.7GB | 665.8min |
| [2026-04-08 02:35:52] Step 19150/20000 [decay] | Loss: 2.6194 | BPB: 3.7790 | LR: 0.000050 | Tokens: 10,040,115,200 | TPS: 249,833 | SWA: 191 | Mem: 46.7GB | 669.8min |
| [2026-04-08 02:40:28] Step 19200/20000 [decay] | Loss: 2.7017 | BPB: 3.8977 | LR: 0.000048 | Tokens: 10,066,329,600 | TPS: 248,772 | SWA: 193 | Mem: 46.7GB | 674.4min |
| [2026-04-08 02:44:30] Step 19250/20000 [decay] | Loss: 2.9662 | BPB: 4.2794 | LR: 0.000045 | Tokens: 10,092,544,000 | TPS: 247,937 | SWA: 194 | Mem: 46.7GB | 678.4min |
| [2026-04-08 02:44:55] Saved checkpoint at step 19250 |
| [2026-04-08 02:48:58] Step 19300/20000 [decay] | Loss: 2.9888 | BPB: 4.3120 | LR: 0.000043 | Tokens: 10,118,758,400 | TPS: 246,955 | SWA: 195 | Mem: 46.7GB | 682.9min |
| [2026-04-08 02:53:00] Step 19350/20000 [decay] | Loss: 2.9994 | BPB: 4.3272 | LR: 0.000041 | Tokens: 10,144,972,800 | TPS: 246,145 | SWA: 196 | Mem: 46.7GB | 686.9min |
| [2026-04-08 02:57:36] Step 19400/20000 [decay] | Loss: 2.5938 | BPB: 3.7421 | LR: 0.000038 | Tokens: 10,171,187,200 | TPS: 245,139 | SWA: 198 | Mem: 46.7GB | 691.5min |
| [2026-04-08 03:01:38] Step 19450/20000 [decay] | Loss: 2.4771 | BPB: 3.5737 | LR: 0.000036 | Tokens: 10,197,401,600 | TPS: 244,344 | SWA: 199 | Mem: 46.7GB | 695.6min |
| [2026-04-08 03:05:41] Step 19500/20000 [decay] | Loss: 2.8016 | BPB: 4.0419 | LR: 0.000033 | Tokens: 10,223,616,000 | TPS: 243,552 | SWA: 200 | Mem: 46.7GB | 699.6min |
| [2026-04-08 03:06:06] Saved checkpoint at step 19500 |
| [2026-04-08 03:06:11] --- Evaluation at step 19500 --- |
| [2026-04-08 03:06:11] Combined val BPB: 3.0561 |
| [2026-04-08 03:06:16] en val BPB: 3.8371 |
| [2026-04-08 03:06:22] ar val BPB: 3.7319 |
| [2026-04-08 03:06:27] he val BPB: 1.7534 |
| [2026-04-08 03:06:32] fa val BPB: 3.1473 |
| [2026-04-08 03:06:32] New best! BPB: 3.0561 |
| [2026-04-08 03:07:26] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 03:11:29] Step 19550/20000 [decay] | Loss: 3.0456 | BPB: 4.3939 | LR: 0.000031 | Tokens: 10,249,830,400 | TPS: 242,173 | SWA: 201 | Mem: 46.7GB | 705.4min |
| [2026-04-08 03:16:06] Step 19600/20000 [decay] | Loss: 2.7994 | BPB: 4.0387 | LR: 0.000028 | Tokens: 10,276,044,800 | TPS: 241,214 | SWA: 203 | Mem: 46.7GB | 710.0min |
| [2026-04-08 03:20:07] Step 19650/20000 [decay] | Loss: 2.5570 | BPB: 3.6890 | LR: 0.000026 | Tokens: 10,302,259,200 | TPS: 240,464 | SWA: 204 | Mem: 46.7GB | 714.1min |
| [2026-04-08 03:24:10] Step 19700/20000 [decay] | Loss: 2.7050 | BPB: 3.9026 | LR: 0.000024 | Tokens: 10,328,473,600 | TPS: 239,722 | SWA: 205 | Mem: 46.7GB | 718.1min |
| [2026-04-08 03:28:12] Step 19750/20000 [decay] | Loss: 2.6749 | BPB: 3.8590 | LR: 0.000021 | Tokens: 10,354,688,000 | TPS: 238,987 | SWA: 206 | Mem: 46.7GB | 722.1min |
| [2026-04-08 03:28:36] Saved checkpoint at step 19750 |
| [2026-04-08 03:33:13] Step 19800/20000 [decay] | Loss: 2.7954 | BPB: 4.0329 | LR: 0.000019 | Tokens: 10,380,902,400 | TPS: 237,937 | SWA: 208 | Mem: 46.7GB | 727.1min |
| [2026-04-08 03:37:15] Step 19850/20000 [decay] | Loss: 2.6950 | BPB: 3.8880 | LR: 0.000016 | Tokens: 10,407,116,800 | TPS: 237,225 | SWA: 209 | Mem: 46.7GB | 731.2min |
| [2026-04-08 03:41:17] Step 19900/20000 [decay] | Loss: 2.8130 | BPB: 4.0583 | LR: 0.000014 | Tokens: 10,433,331,200 | TPS: 236,516 | SWA: 210 | Mem: 46.7GB | 735.2min |
| [2026-04-08 03:45:19] Step 19950/20000 [decay] | Loss: 2.8208 | BPB: 4.0696 | LR: 0.000011 | Tokens: 10,459,545,600 | TPS: 235,815 | SWA: 211 | Mem: 46.7GB | 739.2min |
| [2026-04-08 03:49:56] Step 20000/20000 [decay] | Loss: 2.6108 | BPB: 3.7665 | LR: 0.000009 | Tokens: 10,485,760,000 | TPS: 234,937 | SWA: 213 | Mem: 46.7GB | 743.9min |
| [2026-04-08 03:50:21] Saved checkpoint at step 20000 |
| [2026-04-08 03:50:26] --- Evaluation at step 20000 --- |
| [2026-04-08 03:50:26] Combined val BPB: 3.0510 |
| [2026-04-08 03:50:31] en val BPB: 3.8314 |
| [2026-04-08 03:50:37] ar val BPB: 3.7262 |
| [2026-04-08 03:50:42] he val BPB: 1.7495 |
| [2026-04-08 03:50:47] fa val BPB: 3.1401 |
| [2026-04-08 03:50:47] New best! BPB: 3.0510 |
| [2026-04-08 03:51:42] Saved model to /tmp/checkpoints/best_model.pt |
| [2026-04-08 03:52:05] Saved model to /tmp/checkpoints/final_model.pt |
| [2026-04-08 03:52:05] Evaluating SWA model (213 checkpoints)... |
| [2026-04-08 03:52:46] SWA model combined BPB: 3.0635 (vs best raw: 3.0510) |
| [2026-04-08 03:52:52] SWA en BPB: 3.8503 |
| [2026-04-08 03:52:57] SWA ar BPB: 3.7420 |
| [2026-04-08 03:53:03] SWA he BPB: 1.7553 |
| [2026-04-08 03:53:08] SWA fa BPB: 3.1727 |
| [2026-04-08 03:53:24] Uploading all artifacts to S3... |