ronnengmail committed on
Commit 76de2b6 · verified · 1 Parent(s): af130fd

Upload training_logs/pretraining.log with huggingface_hub

Files changed (1)
  1. training_logs/pretraining.log +375 -0
training_logs/pretraining.log ADDED
@@ -0,0 +1,375 @@
+ [2026-04-07 15:25:15] === Multilingual 3.14B Training — FSDP (Arabic-Rebalanced) ===
+ [2026-04-07 15:25:15] World size: 8, Batch/GPU: 32, Grad accum: 1
+ [2026-04-07 15:25:15] Effective batch: 256 seqs = 524,288 tokens/step
+ [2026-04-07 15:25:15] Total steps: 20000 = 10,485,760,000 tokens
+ [2026-04-07 15:25:15] Schedule: WSD-LINEAR | warmup=600 | stable_end=14000 | total=20000
+ [2026-04-07 15:25:15] AdamW LR=0.0003, betas=(0.9, 0.98), WD=0.02
+ [2026-04-07 15:25:15] Label smoothing=0.06, min_lr=0.03, grad_clip=1.0
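Read together, the schedule, LR and min_lr lines above describe a warmup-stable-decay schedule with a linear decay phase. The sketch below is a reconstruction under stated assumptions, not the training script's code: the name `wsd_linear_lr` is hypothetical, the linear warmup from zero is a guess, and `min_lr=0.03` is read as a fraction of the peak LR because that reading reproduces LR values printed later in the decay phase (0.000298 at step 14050, 0.000203 at step 16000, 0.000082 at step 18500).

```python
def wsd_linear_lr(step, peak=3e-4, warmup=600, stable_end=14000,
                  total=20000, min_ratio=0.03):
    """Hypothetical WSD-LINEAR schedule rebuilt from the logged header values."""
    if step < warmup:
        # Assumption: linear warmup from 0 to peak (the log does not say).
        return peak * step / warmup
    if step <= stable_end:
        # Stable phase: constant peak LR.
        return peak
    # Decay phase: linear from peak down to min_ratio * peak at `total`.
    frac = (step - stable_end) / (total - stable_end)
    return peak * (1.0 - (1.0 - min_ratio) * frac)


for s in (12000, 14000, 14050, 16000, 18500):
    print(s, f"{wsd_linear_lr(s):.6f}")
```

Under this reading the decay fraction is zero at step 14000, which fits the log labelling that step `[decay]` while still printing LR 0.000300.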
+ [2026-04-07 15:25:15] SWA: start=8000, freq=40
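The `SWA:` field in the step lines further down can be related to `freq=40` with a small calculation. This is an inference from the printed values, not something the log states: the counter matches the number of multiples of 40 passed since the resume at step 11500, which suggests (but does not prove) that the snapshot count restarted on resume. The function names and the plain-list running mean are illustrative assumptions.

```python
def swa_snapshots_since(step, resume_step=11500, freq=40):
    """Multiples of `freq` in the half-open range (resume_step, step]."""
    return step // freq - resume_step // freq


def update_running_mean(mean, new, n):
    """Fold one new snapshot into an equal-weight average of n earlier ones."""
    return [m + (x - m) / (n + 1) for m, x in zip(mean, new)]


print(swa_snapshots_since(11550), swa_snapshots_since(11600))
print(update_running_mean([1.0, 2.0], [3.0, 4.0], n=1))
```

The running mean is the usual equal-weight form of stochastic weight averaging; whether the training script averages this way is not shown in the log.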
+ [2026-04-07 15:25:15] Model: dim=3072, depth=26, heads=24, dropout=0.05
+ [2026-04-07 15:25:15] FSDP: FULL_SHARD, MixedPrecision(bf16/fp32), Block-level wrapping
+ [2026-04-07 15:25:15] Loading training data...
+ [2026-04-07 15:25:15] Train tokens: 4,479,984,561
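The batch and token figures in the header can be cross-checked with a few lines of arithmetic. The sketch only uses numbers printed above; the sequence length and the number of passes over the training set are derived here and are not logged values.

```python
# Values copied from the log header.
world_size = 8            # "World size: 8"
batch_per_gpu = 32        # "Batch/GPU: 32"
grad_accum = 1            # "Grad accum: 1"
tokens_per_step = 524_288
total_steps = 20_000
train_tokens = 4_479_984_561

seqs_per_step = world_size * batch_per_gpu * grad_accum
# Derived, not logged: tokens per sequence implied by the figures above.
seq_len = tokens_per_step // seqs_per_step
total_tokens = total_steps * tokens_per_step
# Derived, not logged: how many times the schedule would cycle the training set.
passes = total_tokens / train_tokens

print(seqs_per_step, seq_len, total_tokens, round(passes, 2))
```

The same per-step figure also matches the resume line later in the log: 11,500 steps times 524,288 tokens is 6,029,312,000.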
+ [2026-04-07 15:25:15] Loading validation data...
+ [2026-04-07 15:25:15] val_en: 20 batches
+ [2026-04-07 15:25:15] val_ar: 20 batches
+ [2026-04-07 15:25:15] val_he: 20 batches
+ [2026-04-07 15:25:15] val_fa: 20 batches
+ [2026-04-07 15:25:15] Creating model...
+ [2026-04-07 15:25:45] Model params: 3,042,868,224 (non-embedding: 2,944,564,224)
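The logged parameter counts can be reproduced from `dim=3072` and `depth=26` under one plausible layout. Everything about that layout is an assumption, since the log does not describe the block: bias-free attention with four d×d projections, a bias-free MLP with 4× expansion, two norm gain vectors per block, and one final norm. It is shown only because it happens to match the printed number exactly.

```python
dim, depth = 3072, 26
logged_total = 3_042_868_224
logged_non_embedding = 2_944_564_224

# Assumed block layout (not stated in the log):
#   attention: 4 * dim^2, MLP with 4x expansion: 8 * dim^2, two norm gains: 2 * dim
per_block = 12 * dim * dim + 2 * dim
non_embedding = depth * per_block + dim   # + dim for an assumed final norm

embedding = logged_total - logged_non_embedding
# If the remainder is a single tied token-embedding table, this is its row count.
implied_vocab = embedding // dim

print(non_embedding, embedding, implied_vocab)
```

If the remaining 98,304,000 parameters are one tied embedding table, the implied vocabulary size is 32,000; other splits (for example separate input and output tables) would give a different figure.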
+ [2026-04-07 15:25:45] Wrapping model with FSDP...
+ [2026-04-07 15:25:47] FSDP wrapped. GPU memory: 1.5 GB
+ [2026-04-07 15:25:47] Applied FSDP activation checkpointing to Block layers
+ [2026-04-07 15:25:47] Using standard AdamW (bitsandbytes not available)
+ [2026-04-07 15:25:54] Loading checkpoint from step 11500...
+ [2026-04-07 15:26:03] Resumed from step 11500. Tokens: 6,029,312,000, Best BPB: 3.2654
+ [2026-04-07 15:26:03] NOTE: Optimizer state reset (fresh AdamW). LR schedule continues from step 11501.
+ [2026-04-07 15:26:04] Starting training from step 11501...
+ [2026-04-07 15:29:51] Step 11550/20000 [stable] | Loss: 2.8941 | BPB: 4.1753 | LR: 0.000300 | Tokens: 6,055,526,400 | TPS: 26,710,866 | SWA: 1 | Mem: 46.7GB | 3.8min
+ [2026-04-07 15:34:25] Step 11600/20000 [stable] | Loss: 3.0006 | BPB: 4.3289 | LR: 0.000300 | Tokens: 6,081,740,800 | TPS: 12,149,351 | SWA: 3 | Mem: 46.7GB | 8.3min
+ [2026-04-07 15:38:25] Step 11650/20000 [stable] | Loss: 2.9979 | BPB: 4.3250 | LR: 0.000300 | Tokens: 6,107,955,200 | TPS: 8,243,253 | SWA: 4 | Mem: 46.7GB | 12.3min
+ [2026-04-07 15:42:26] Step 11700/20000 [stable] | Loss: 2.8086 | BPB: 4.0520 | LR: 0.000300 | Tokens: 6,134,169,600 | TPS: 6,249,806 | SWA: 5 | Mem: 46.7GB | 16.4min
+ [2026-04-07 15:46:26] Step 11750/20000 [stable] | Loss: 2.9363 | BPB: 4.2361 | LR: 0.000300 | Tokens: 6,160,384,000 | TPS: 5,040,765 | SWA: 6 | Mem: 46.7GB | 20.4min
+ [2026-04-07 15:46:50] Saved checkpoint at step 11750
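In the step lines above, the `BPB` field tracks the `Loss` field by a constant factor: it equals the loss divided by ln 2, that is, the loss converted from nats to bits. The sketch below parses one step line and checks this. The regular expression and the name `parse_step_line` are illustrative assumptions; the relation is observed in the printed values, and the log does not say how the training script defines either field.

```python
import math
import re

STEP_RE = re.compile(
    r"Step (?P<step>\d+)/\d+ \[(?P<phase>\w+)\] \| "
    r"Loss: (?P<loss>[\d.]+) \| BPB: (?P<bpb>[\d.]+)"
)


def parse_step_line(line):
    """Return step, phase, loss and BPB from a step line, or None for other lines."""
    m = STEP_RE.search(line)
    if m is None:
        return None
    return {"step": int(m["step"]), "phase": m["phase"],
            "loss": float(m["loss"]), "bpb": float(m["bpb"])}


line = ("+ [2026-04-07 15:29:51] Step 11550/20000 [stable] | Loss: 2.8941 | "
        "BPB: 4.1753 | LR: 0.000300 | Tokens: 6,055,526,400 | TPS: 26,710,866 | "
        "SWA: 1 | Mem: 46.7GB | 3.8min")
rec = parse_step_line(line)
bits = rec["loss"] / math.log(2)   # nats -> bits
print(rec["step"], round(bits, 4), rec["bpb"])
```

The validation BPB values in the evaluation blocks sit on a visibly different scale, so they are presumably computed differently from this per-step figure.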
+ [2026-04-07 15:51:23] Step 11800/20000 [stable] | Loss: 3.1766 | BPB: 4.5828 | LR: 0.000300 | Tokens: 6,186,598,400 | TPS: 4,072,372 | SWA: 8 | Mem: 46.7GB | 25.3min
+ [2026-04-07 15:55:24] Step 11850/20000 [stable] | Loss: 2.9082 | BPB: 4.1957 | LR: 0.000300 | Tokens: 6,212,812,800 | TPS: 3,530,105 | SWA: 9 | Mem: 46.7GB | 29.3min
+ [2026-04-07 15:59:25] Step 11900/20000 [stable] | Loss: 3.0306 | BPB: 4.3722 | LR: 0.000300 | Tokens: 6,239,027,200 | TPS: 3,118,964 | SWA: 10 | Mem: 46.7GB | 33.3min
+ [2026-04-07 16:03:26] Step 11950/20000 [stable] | Loss: 3.0494 | BPB: 4.3993 | LR: 0.000300 | Tokens: 6,265,241,600 | TPS: 2,795,440 | SWA: 11 | Mem: 46.7GB | 37.4min
+ [2026-04-07 16:08:01] Step 12000/20000 [stable] | Loss: 2.7197 | BPB: 3.9236 | LR: 0.000300 | Tokens: 6,291,456,000 | TPS: 2,499,902 | SWA: 13 | Mem: 46.7GB | 41.9min
+ [2026-04-07 16:08:25] Saved checkpoint at step 12000
+ [2026-04-07 16:08:31] --- Evaluation at step 12000 ---
+ [2026-04-07 16:08:31] Combined val BPB: 3.2512
+ [2026-04-07 16:08:36] en val BPB: 4.0340
+ [2026-04-07 16:08:41] ar val BPB: 3.9566
+ [2026-04-07 16:08:46] he val BPB: 1.8995
+ [2026-04-07 16:08:52] fa val BPB: 3.3992
+ [2026-04-07 16:08:52] New best! BPB: 3.2512
+ [2026-04-07 16:09:45] Saved model to /tmp/checkpoints/best_model.pt
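The `TPS` field falls steadily through the run even though each 50-step interval takes about four minutes. The printed values are consistent with TPS being the cumulative token count (including tokens from before the resume) divided by wall-clock time since this run started, so it is not a per-interval throughput. That reading is an inference from the numbers, not something the log states. A per-interval estimate can be taken from two consecutive step lines; the helper name `interval_tps` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def interval_tps(t0, tokens0, t1, tokens1):
    """Tokens processed per second between two logged step lines."""
    seconds = (datetime.strptime(t1, FMT) - datetime.strptime(t0, FMT)).total_seconds()
    return (tokens1 - tokens0) / seconds


# Steps 11650 and 11700, timestamps and token counts copied from the log.
tps = interval_tps("2026-04-07 15:38:25", 6_107_955_200,
                   "2026-04-07 15:42:26", 6_134_169_600)

# Cumulative reading of the logged field at step 11700 (16.4 min elapsed).
cumulative = 6_134_169_600 / (16.4 * 60)

print(round(tps), round(cumulative))
```

On this interval the estimate is roughly 109 thousand tokens per second across the 8 GPUs, while the cumulative reading lands within about one percent of the logged 6,249,806.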
+ [2026-04-07 16:13:46] Step 12050/20000 [stable] | Loss: 3.1092 | BPB: 4.4856 | LR: 0.000300 | Tokens: 6,317,670,400 | TPS: 2,207,326 | SWA: 14 | Mem: 46.7GB | 47.7min
+ [2026-04-07 16:17:48] Step 12100/20000 [stable] | Loss: 3.0531 | BPB: 4.4046 | LR: 0.000300 | Tokens: 6,343,884,800 | TPS: 2,044,072 | SWA: 15 | Mem: 46.7GB | 51.7min
+ [2026-04-07 16:21:49] Step 12150/20000 [stable] | Loss: 3.0343 | BPB: 4.3776 | LR: 0.000300 | Tokens: 6,370,099,200 | TPS: 1,904,576 | SWA: 16 | Mem: 46.7GB | 55.7min
+ [2026-04-07 16:26:24] Step 12200/20000 [stable] | Loss: 3.0501 | BPB: 4.4004 | LR: 0.000300 | Tokens: 6,396,313,600 | TPS: 1,767,257 | SWA: 18 | Mem: 46.7GB | 60.3min
+ [2026-04-07 16:30:25] Step 12250/20000 [stable] | Loss: 3.1364 | BPB: 4.5249 | LR: 0.000300 | Tokens: 6,422,528,000 | TPS: 1,663,761 | SWA: 19 | Mem: 46.7GB | 64.3min
+ [2026-04-07 16:30:48] Saved checkpoint at step 12250
+ [2026-04-07 16:34:49] Step 12300/20000 [stable] | Loss: 3.1457 | BPB: 4.5383 | LR: 0.000300 | Tokens: 6,448,742,400 | TPS: 1,563,292 | SWA: 20 | Mem: 46.7GB | 68.8min
+ [2026-04-07 16:38:51] Step 12350/20000 [stable] | Loss: 3.0414 | BPB: 4.3879 | LR: 0.000300 | Tokens: 6,474,956,800 | TPS: 1,482,944 | SWA: 21 | Mem: 46.7GB | 72.8min
+ [2026-04-07 16:43:25] Step 12400/20000 [stable] | Loss: 3.0295 | BPB: 4.3707 | LR: 0.000300 | Tokens: 6,501,171,200 | TPS: 1,400,772 | SWA: 23 | Mem: 46.7GB | 77.4min
+ [2026-04-07 16:47:26] Step 12450/20000 [stable] | Loss: 2.9572 | BPB: 4.2663 | LR: 0.000300 | Tokens: 6,527,385,600 | TPS: 1,337,132 | SWA: 24 | Mem: 46.7GB | 81.4min
+ [2026-04-07 16:51:27] Step 12500/20000 [stable] | Loss: 3.0041 | BPB: 4.3339 | LR: 0.000300 | Tokens: 6,553,600,000 | TPS: 1,279,217 | SWA: 25 | Mem: 46.7GB | 85.4min
+ [2026-04-07 16:51:51] Saved checkpoint at step 12500
+ [2026-04-07 16:51:56] --- Evaluation at step 12500 ---
+ [2026-04-07 16:51:56] Combined val BPB: 3.2389
+ [2026-04-07 16:52:02] en val BPB: 4.0211
+ [2026-04-07 16:52:07] ar val BPB: 3.9414
+ [2026-04-07 16:52:12] he val BPB: 1.8939
+ [2026-04-07 16:52:17] fa val BPB: 3.3833
+ [2026-04-07 16:52:17] New best! BPB: 3.2389
+ [2026-04-07 16:53:11] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 16:57:12] Step 12550/20000 [stable] | Loss: 3.0963 | BPB: 4.4670 | LR: 0.000300 | Tokens: 6,579,814,400 | TPS: 1,203,315 | SWA: 26 | Mem: 46.7GB | 91.1min
+ [2026-04-07 17:01:47] Step 12600/20000 [stable] | Loss: 3.0497 | BPB: 4.3998 | LR: 0.000300 | Tokens: 6,606,028,800 | TPS: 1,150,394 | SWA: 28 | Mem: 46.7GB | 95.7min
+ [2026-04-07 17:05:47] Step 12650/20000 [stable] | Loss: 3.0359 | BPB: 4.3798 | LR: 0.000300 | Tokens: 6,632,243,200 | TPS: 1,108,491 | SWA: 29 | Mem: 46.7GB | 99.7min
+ [2026-04-07 17:09:48] Step 12700/20000 [stable] | Loss: 3.1734 | BPB: 4.5783 | LR: 0.000300 | Tokens: 6,658,457,600 | TPS: 1,069,814 | SWA: 30 | Mem: 46.7GB | 103.7min
+ [2026-04-07 17:13:50] Step 12750/20000 [stable] | Loss: 2.9372 | BPB: 4.2374 | LR: 0.000300 | Tokens: 6,684,672,000 | TPS: 1,033,934 | SWA: 31 | Mem: 46.7GB | 107.8min
+ [2026-04-07 17:14:13] Saved checkpoint at step 12750
+ [2026-04-07 17:18:48] Step 12800/20000 [stable] | Loss: 2.6958 | BPB: 3.8893 | LR: 0.000300 | Tokens: 6,710,886,400 | TPS: 992,175 | SWA: 33 | Mem: 46.7GB | 112.7min
+ [2026-04-07 17:22:49] Step 12850/20000 [stable] | Loss: 3.3123 | BPB: 4.7786 | LR: 0.000300 | Tokens: 6,737,100,800 | TPS: 961,854 | SWA: 34 | Mem: 46.7GB | 116.7min
+ [2026-04-07 17:26:49] Step 12900/20000 [stable] | Loss: 2.9599 | BPB: 4.2702 | LR: 0.000300 | Tokens: 6,763,315,200 | TPS: 933,516 | SWA: 35 | Mem: 46.7GB | 120.7min
+ [2026-04-07 17:30:51] Step 12950/20000 [stable] | Loss: 2.8022 | BPB: 4.0428 | LR: 0.000300 | Tokens: 6,789,529,600 | TPS: 906,909 | SWA: 36 | Mem: 46.7GB | 124.8min
+ [2026-04-07 17:35:26] Step 13000/20000 [stable] | Loss: 2.9235 | BPB: 4.2177 | LR: 0.000300 | Tokens: 6,815,744,000 | TPS: 878,094 | SWA: 38 | Mem: 46.7GB | 129.4min
+ [2026-04-07 17:35:50] Saved checkpoint at step 13000
+ [2026-04-07 17:35:55] --- Evaluation at step 13000 ---
+ [2026-04-07 17:35:55] Combined val BPB: 3.2196
+ [2026-04-07 17:36:01] en val BPB: 4.0101
+ [2026-04-07 17:36:06] ar val BPB: 3.9315
+ [2026-04-07 17:36:11] he val BPB: 1.8794
+ [2026-04-07 17:36:16] fa val BPB: 3.3678
+ [2026-04-07 17:36:16] New best! BPB: 3.2196
+ [2026-04-07 17:37:10] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 17:41:13] Step 13050/20000 [stable] | Loss: 2.9555 | BPB: 4.2639 | LR: 0.000300 | Tokens: 6,841,958,400 | TPS: 843,812 | SWA: 39 | Mem: 46.7GB | 135.1min
+ [2026-04-07 17:45:14] Step 13100/20000 [stable] | Loss: 2.9740 | BPB: 4.2906 | LR: 0.000300 | Tokens: 6,868,172,800 | TPS: 822,576 | SWA: 40 | Mem: 46.7GB | 139.2min
+ [2026-04-07 17:49:15] Step 13150/20000 [stable] | Loss: 3.1133 | BPB: 4.4915 | LR: 0.000300 | Tokens: 6,894,387,200 | TPS: 802,504 | SWA: 41 | Mem: 46.7GB | 143.2min
+ [2026-04-07 17:53:50] Step 13200/20000 [stable] | Loss: 3.0578 | BPB: 4.4115 | LR: 0.000300 | Tokens: 6,920,601,600 | TPS: 780,574 | SWA: 43 | Mem: 46.7GB | 147.8min
+ [2026-04-07 17:57:52] Step 13250/20000 [stable] | Loss: 3.0051 | BPB: 4.3354 | LR: 0.000300 | Tokens: 6,946,816,000 | TPS: 762,773 | SWA: 44 | Mem: 46.7GB | 151.8min
+ [2026-04-07 17:58:15] Saved checkpoint at step 13250
+ [2026-04-07 18:02:17] Step 13300/20000 [stable] | Loss: 3.0474 | BPB: 4.3965 | LR: 0.000300 | Tokens: 6,973,030,400 | TPS: 743,988 | SWA: 45 | Mem: 46.7GB | 156.2min
+ [2026-04-07 18:06:18] Step 13350/20000 [stable] | Loss: 3.0762 | BPB: 4.4380 | LR: 0.000300 | Tokens: 6,999,244,800 | TPS: 728,028 | SWA: 46 | Mem: 46.7GB | 160.2min
+ [2026-04-07 18:10:54] Step 13400/20000 [stable] | Loss: 2.9245 | BPB: 4.2191 | LR: 0.000300 | Tokens: 7,025,459,200 | TPS: 710,379 | SWA: 48 | Mem: 46.7GB | 164.8min
+ [2026-04-07 18:14:55] Step 13450/20000 [stable] | Loss: 2.9225 | BPB: 4.2163 | LR: 0.000300 | Tokens: 7,051,673,600 | TPS: 696,044 | SWA: 49 | Mem: 46.7GB | 168.9min
+ [2026-04-07 18:18:56] Step 13500/20000 [stable] | Loss: 3.0370 | BPB: 4.3815 | LR: 0.000300 | Tokens: 7,077,888,000 | TPS: 682,398 | SWA: 50 | Mem: 46.7GB | 172.9min
+ [2026-04-07 18:19:20] Saved checkpoint at step 13500
+ [2026-04-07 18:19:25] --- Evaluation at step 13500 ---
+ [2026-04-07 18:19:25] Combined val BPB: 3.2094
+ [2026-04-07 18:19:31] en val BPB: 4.0007
+ [2026-04-07 18:19:36] ar val BPB: 3.9208
+ [2026-04-07 18:19:41] he val BPB: 1.8821
+ [2026-04-07 18:19:46] fa val BPB: 3.3442
+ [2026-04-07 18:19:46] New best! BPB: 3.2094
+ [2026-04-07 18:20:41] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 18:24:42] Step 13550/20000 [stable] | Loss: 3.1894 | BPB: 4.6014 | LR: 0.000300 | Tokens: 7,104,102,400 | TPS: 662,824 | SWA: 51 | Mem: 46.7GB | 178.6min
+ [2026-04-07 18:29:18] Step 13600/20000 [stable] | Loss: 2.8078 | BPB: 4.0508 | LR: 0.000300 | Tokens: 7,130,316,800 | TPS: 648,578 | SWA: 53 | Mem: 46.7GB | 183.2min
+ [2026-04-07 18:33:19] Step 13650/20000 [stable] | Loss: 2.8769 | BPB: 4.1505 | LR: 0.000300 | Tokens: 7,156,531,200 | TPS: 636,984 | SWA: 54 | Mem: 46.7GB | 187.3min
+ [2026-04-07 18:37:21] Step 13700/20000 [stable] | Loss: 2.9699 | BPB: 4.2846 | LR: 0.000300 | Tokens: 7,182,745,600 | TPS: 625,875 | SWA: 55 | Mem: 46.7GB | 191.3min
+ [2026-04-07 18:41:23] Step 13750/20000 [stable] | Loss: 2.9296 | BPB: 4.2265 | LR: 0.000300 | Tokens: 7,208,960,000 | TPS: 615,186 | SWA: 56 | Mem: 46.7GB | 195.3min
+ [2026-04-07 18:41:47] Saved checkpoint at step 13750
+ [2026-04-07 18:46:24] Step 13800/20000 [stable] | Loss: 3.0591 | BPB: 4.4133 | LR: 0.000300 | Tokens: 7,235,174,400 | TPS: 601,926 | SWA: 58 | Mem: 46.7GB | 200.3min
+ [2026-04-07 18:50:26] Step 13850/20000 [stable] | Loss: 3.2637 | BPB: 4.7085 | LR: 0.000300 | Tokens: 7,261,388,800 | TPS: 592,200 | SWA: 59 | Mem: 46.7GB | 204.4min
+ [2026-04-07 18:54:27] Step 13900/20000 [stable] | Loss: 3.2091 | BPB: 4.6297 | LR: 0.000300 | Tokens: 7,287,603,200 | TPS: 582,867 | SWA: 60 | Mem: 46.7GB | 208.4min
+ [2026-04-07 18:58:28] Step 13950/20000 [stable] | Loss: 2.9200 | BPB: 4.2127 | LR: 0.000300 | Tokens: 7,313,817,600 | TPS: 573,894 | SWA: 61 | Mem: 46.7GB | 212.4min
+ [2026-04-07 19:03:05] Step 14000/20000 [decay] | Loss: 2.7259 | BPB: 3.9326 | LR: 0.000300 | Tokens: 7,340,032,000 | TPS: 563,718 | SWA: 63 | Mem: 46.7GB | 217.0min
+ [2026-04-07 19:03:29] Saved checkpoint at step 14000
+ [2026-04-07 19:03:34] --- Evaluation at step 14000 ---
+ [2026-04-07 19:03:34] Combined val BPB: 3.2072
+ [2026-04-07 19:03:39] en val BPB: 3.9933
+ [2026-04-07 19:03:44] ar val BPB: 3.9105
+ [2026-04-07 19:03:50] he val BPB: 1.8704
+ [2026-04-07 19:03:55] fa val BPB: 3.3445
+ [2026-04-07 19:03:55] New best! BPB: 3.2072
+ [2026-04-07 19:04:48] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 19:08:51] Step 14050/20000 [decay] | Loss: 2.9732 | BPB: 4.2894 | LR: 0.000298 | Tokens: 7,366,246,400 | TPS: 551,076 | SWA: 64 | Mem: 46.7GB | 222.8min
+ [2026-04-07 19:12:54] Step 14100/20000 [decay] | Loss: 2.9500 | BPB: 4.2560 | LR: 0.000295 | Tokens: 7,392,460,800 | TPS: 543,194 | SWA: 65 | Mem: 46.7GB | 226.8min
+ [2026-04-07 19:16:56] Step 14150/20000 [decay] | Loss: 2.9680 | BPB: 4.2820 | LR: 0.000293 | Tokens: 7,418,675,200 | TPS: 535,593 | SWA: 66 | Mem: 46.7GB | 230.9min
+ [2026-04-07 19:21:31] Step 14200/20000 [decay] | Loss: 3.0077 | BPB: 4.3392 | LR: 0.000290 | Tokens: 7,444,889,600 | TPS: 527,012 | SWA: 68 | Mem: 46.7GB | 235.4min
+ [2026-04-07 19:25:32] Step 14250/20000 [decay] | Loss: 3.0809 | BPB: 4.4448 | LR: 0.000288 | Tokens: 7,471,104,000 | TPS: 519,984 | SWA: 69 | Mem: 46.7GB | 239.5min
+ [2026-04-07 19:25:57] Saved checkpoint at step 14250
+ [2026-04-07 19:29:59] Step 14300/20000 [decay] | Loss: 3.0979 | BPB: 4.4693 | LR: 0.000285 | Tokens: 7,497,318,400 | TPS: 512,307 | SWA: 70 | Mem: 46.7GB | 243.9min
+ [2026-04-07 19:34:01] Step 14350/20000 [decay] | Loss: 3.1389 | BPB: 4.5284 | LR: 0.000283 | Tokens: 7,523,532,800 | TPS: 505,738 | SWA: 71 | Mem: 46.7GB | 247.9min
+ [2026-04-07 19:38:36] Step 14400/20000 [decay] | Loss: 2.7258 | BPB: 3.9325 | LR: 0.000281 | Tokens: 7,549,747,200 | TPS: 498,262 | SWA: 73 | Mem: 46.7GB | 252.5min
+ [2026-04-07 19:42:38] Step 14450/20000 [decay] | Loss: 3.0346 | BPB: 4.3781 | LR: 0.000278 | Tokens: 7,575,961,600 | TPS: 492,141 | SWA: 74 | Mem: 46.7GB | 256.6min
+ [2026-04-07 19:46:41] Step 14500/20000 [decay] | Loss: 2.8087 | BPB: 4.0521 | LR: 0.000276 | Tokens: 7,602,176,000 | TPS: 486,179 | SWA: 75 | Mem: 46.7GB | 260.6min
+ [2026-04-07 19:47:05] Saved checkpoint at step 14500
+ [2026-04-07 19:47:10] --- Evaluation at step 14500 ---
+ [2026-04-07 19:47:10] Combined val BPB: 3.1949
+ [2026-04-07 19:47:16] en val BPB: 3.9780
+ [2026-04-07 19:47:21] ar val BPB: 3.8893
+ [2026-04-07 19:47:26] he val BPB: 1.8582
+ [2026-04-07 19:47:31] fa val BPB: 3.3323
+ [2026-04-07 19:47:31] New best! BPB: 3.1949
+ [2026-04-07 19:48:25] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 19:52:27] Step 14550/20000 [decay] | Loss: 2.9876 | BPB: 4.3103 | LR: 0.000273 | Tokens: 7,628,390,400 | TPS: 477,294 | SWA: 76 | Mem: 46.7GB | 266.4min
+ [2026-04-07 19:57:02] Step 14600/20000 [decay] | Loss: 2.9462 | BPB: 4.2504 | LR: 0.000271 | Tokens: 7,654,604,800 | TPS: 470,841 | SWA: 78 | Mem: 46.7GB | 271.0min
+ [2026-04-07 20:01:03] Step 14650/20000 [decay] | Loss: 2.9918 | BPB: 4.3162 | LR: 0.000268 | Tokens: 7,680,819,200 | TPS: 465,549 | SWA: 79 | Mem: 46.7GB | 275.0min
+ [2026-04-07 20:05:04] Step 14700/20000 [decay] | Loss: 2.8854 | BPB: 4.1628 | LR: 0.000266 | Tokens: 7,707,033,600 | TPS: 460,412 | SWA: 80 | Mem: 46.7GB | 279.0min
+ [2026-04-07 20:09:06] Step 14750/20000 [decay] | Loss: 3.2876 | BPB: 4.7430 | LR: 0.000264 | Tokens: 7,733,248,000 | TPS: 455,389 | SWA: 81 | Mem: 46.7GB | 283.0min
+ [2026-04-07 20:09:29] Saved checkpoint at step 14750
+ [2026-04-07 20:14:04] Step 14800/20000 [decay] | Loss: 2.8727 | BPB: 4.1444 | LR: 0.000261 | Tokens: 7,759,462,400 | TPS: 449,043 | SWA: 83 | Mem: 46.7GB | 288.0min
+ [2026-04-07 20:18:06] Step 14850/20000 [decay] | Loss: 3.0526 | BPB: 4.4040 | LR: 0.000259 | Tokens: 7,785,676,800 | TPS: 444,357 | SWA: 84 | Mem: 46.7GB | 292.0min
+ [2026-04-07 20:22:07] Step 14900/20000 [decay] | Loss: 2.6748 | BPB: 3.8590 | LR: 0.000256 | Tokens: 7,811,891,200 | TPS: 439,802 | SWA: 85 | Mem: 46.7GB | 296.0min
+ [2026-04-07 20:26:09] Step 14950/20000 [decay] | Loss: 2.9423 | BPB: 4.2449 | LR: 0.000254 | Tokens: 7,838,105,600 | TPS: 435,348 | SWA: 86 | Mem: 46.7GB | 300.1min
+ [2026-04-07 20:30:44] Step 15000/20000 [decay] | Loss: 2.7843 | BPB: 4.0169 | LR: 0.000251 | Tokens: 7,864,320,000 | TPS: 430,231 | SWA: 88 | Mem: 46.7GB | 304.7min
+ [2026-04-07 20:31:08] Saved checkpoint at step 15000
+ [2026-04-07 20:31:13] --- Evaluation at step 15000 ---
+ [2026-04-07 20:31:13] Combined val BPB: 3.1722
+ [2026-04-07 20:31:19] en val BPB: 3.9594
+ [2026-04-07 20:31:24] ar val BPB: 3.8730
+ [2026-04-07 20:31:29] he val BPB: 1.8443
+ [2026-04-07 20:31:34] fa val BPB: 3.3170
+ [2026-04-07 20:31:34] New best! BPB: 3.1722
+ [2026-04-07 20:32:28] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 20:36:31] Step 15050/20000 [decay] | Loss: 2.9149 | BPB: 4.2053 | LR: 0.000249 | Tokens: 7,890,534,400 | TPS: 423,622 | SWA: 89 | Mem: 46.7GB | 310.4min
+ [2026-04-07 20:40:32] Step 15100/20000 [decay] | Loss: 2.8421 | BPB: 4.1003 | LR: 0.000247 | Tokens: 7,916,748,800 | TPS: 419,582 | SWA: 90 | Mem: 46.7GB | 314.5min
+ [2026-04-07 20:44:34] Step 15150/20000 [decay] | Loss: 3.0003 | BPB: 4.3285 | LR: 0.000244 | Tokens: 7,942,963,200 | TPS: 415,645 | SWA: 91 | Mem: 46.7GB | 318.5min
+ [2026-04-07 20:49:10] Step 15200/20000 [decay] | Loss: 2.9927 | BPB: 4.3175 | LR: 0.000242 | Tokens: 7,969,177,600 | TPS: 411,081 | SWA: 93 | Mem: 46.7GB | 323.1min
+ [2026-04-07 20:53:12] Step 15250/20000 [decay] | Loss: 2.9789 | BPB: 4.2977 | LR: 0.000239 | Tokens: 7,995,392,000 | TPS: 407,346 | SWA: 94 | Mem: 46.7GB | 327.1min
+ [2026-04-07 20:53:36] Saved checkpoint at step 15250
+ [2026-04-07 20:57:38] Step 15300/20000 [decay] | Loss: 2.7740 | BPB: 4.0020 | LR: 0.000237 | Tokens: 8,021,606,400 | TPS: 403,221 | SWA: 95 | Mem: 46.7GB | 331.6min
+ [2026-04-07 21:01:40] Step 15350/20000 [decay] | Loss: 2.8095 | BPB: 4.0533 | LR: 0.000235 | Tokens: 8,047,820,800 | TPS: 399,685 | SWA: 96 | Mem: 46.7GB | 335.6min
+ [2026-04-07 21:06:15] Step 15400/20000 [decay] | Loss: 3.0071 | BPB: 4.3383 | LR: 0.000232 | Tokens: 8,074,035,200 | TPS: 395,569 | SWA: 98 | Mem: 46.7GB | 340.2min
+ [2026-04-07 21:10:18] Step 15450/20000 [decay] | Loss: 2.8658 | BPB: 4.1345 | LR: 0.000230 | Tokens: 8,100,249,600 | TPS: 392,187 | SWA: 99 | Mem: 46.7GB | 344.2min
+ [2026-04-07 21:14:21] Step 15500/20000 [decay] | Loss: 2.9056 | BPB: 4.1919 | LR: 0.000227 | Tokens: 8,126,464,000 | TPS: 388,892 | SWA: 100 | Mem: 46.7GB | 348.3min
+ [2026-04-07 21:14:45] Saved checkpoint at step 15500
+ [2026-04-07 21:14:50] --- Evaluation at step 15500 ---
+ [2026-04-07 21:14:50] Combined val BPB: 3.1581
+ [2026-04-07 21:14:56] en val BPB: 3.9423
+ [2026-04-07 21:15:01] ar val BPB: 3.8530
+ [2026-04-07 21:15:06] he val BPB: 1.8316
+ [2026-04-07 21:15:11] fa val BPB: 3.2893
+ [2026-04-07 21:15:11] New best! BPB: 3.1581
+ [2026-04-07 21:16:05] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 21:20:07] Step 15550/20000 [decay] | Loss: 3.0142 | BPB: 4.3485 | LR: 0.000225 | Tokens: 8,152,678,400 | TPS: 383,789 | SWA: 101 | Mem: 46.7GB | 354.0min
+ [2026-04-07 21:24:44] Step 15600/20000 [decay] | Loss: 3.0668 | BPB: 4.4245 | LR: 0.000222 | Tokens: 8,178,892,800 | TPS: 380,074 | SWA: 103 | Mem: 46.7GB | 358.7min
+ [2026-04-07 21:28:45] Step 15650/20000 [decay] | Loss: 2.9354 | BPB: 4.2348 | LR: 0.000220 | Tokens: 8,205,107,200 | TPS: 377,066 | SWA: 104 | Mem: 46.7GB | 362.7min
+ [2026-04-07 21:32:46] Step 15700/20000 [decay] | Loss: 2.8455 | BPB: 4.1051 | LR: 0.000218 | Tokens: 8,231,321,600 | TPS: 374,120 | SWA: 105 | Mem: 46.7GB | 366.7min
+ [2026-04-07 21:36:48] Step 15750/20000 [decay] | Loss: 2.9204 | BPB: 4.2133 | LR: 0.000215 | Tokens: 8,257,536,000 | TPS: 371,223 | SWA: 106 | Mem: 46.7GB | 370.7min
+ [2026-04-07 21:37:13] Saved checkpoint at step 15750
+ [2026-04-07 21:41:49] Step 15800/20000 [decay] | Loss: 3.0251 | BPB: 4.3643 | LR: 0.000213 | Tokens: 8,283,750,400 | TPS: 367,438 | SWA: 108 | Mem: 46.7GB | 375.7min
+ [2026-04-07 21:45:51] Step 15850/20000 [decay] | Loss: 2.7579 | BPB: 3.9788 | LR: 0.000210 | Tokens: 8,309,964,800 | TPS: 364,691 | SWA: 109 | Mem: 46.7GB | 379.8min
+ [2026-04-07 21:49:53] Step 15900/20000 [decay] | Loss: 3.0697 | BPB: 4.4286 | LR: 0.000208 | Tokens: 8,336,179,200 | TPS: 361,997 | SWA: 110 | Mem: 46.7GB | 383.8min
+ [2026-04-07 21:53:55] Step 15950/20000 [decay] | Loss: 3.0842 | BPB: 4.4495 | LR: 0.000205 | Tokens: 8,362,393,600 | TPS: 359,355 | SWA: 111 | Mem: 46.7GB | 387.8min
+ [2026-04-07 21:58:31] Step 16000/20000 [decay] | Loss: 2.7730 | BPB: 4.0006 | LR: 0.000203 | Tokens: 8,388,608,000 | TPS: 356,248 | SWA: 113 | Mem: 46.7GB | 392.5min
+ [2026-04-07 21:58:56] Saved checkpoint at step 16000
+ [2026-04-07 21:59:01] --- Evaluation at step 16000 ---
+ [2026-04-07 21:59:01] Combined val BPB: 3.1337
+ [2026-04-07 21:59:06] en val BPB: 3.9261
+ [2026-04-07 21:59:12] ar val BPB: 3.8328
+ [2026-04-07 21:59:17] he val BPB: 1.8194
+ [2026-04-07 21:59:22] fa val BPB: 3.2746
+ [2026-04-07 21:59:22] New best! BPB: 3.1337
+ [2026-04-07 22:00:16] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 22:04:18] Step 16050/20000 [decay] | Loss: 2.9140 | BPB: 4.2040 | LR: 0.000201 | Tokens: 8,414,822,400 | TPS: 352,177 | SWA: 114 | Mem: 46.7GB | 398.2min
+ [2026-04-07 22:08:20] Step 16100/20000 [decay] | Loss: 3.1789 | BPB: 4.5862 | LR: 0.000198 | Tokens: 8,441,036,800 | TPS: 349,734 | SWA: 115 | Mem: 46.7GB | 402.3min
+ [2026-04-07 22:12:21] Step 16150/20000 [decay] | Loss: 2.8371 | BPB: 4.0931 | LR: 0.000196 | Tokens: 8,467,251,200 | TPS: 347,345 | SWA: 116 | Mem: 46.7GB | 406.3min
+ [2026-04-07 22:16:57] Step 16200/20000 [decay] | Loss: 2.9875 | BPB: 4.3100 | LR: 0.000193 | Tokens: 8,493,465,600 | TPS: 344,518 | SWA: 118 | Mem: 46.7GB | 410.9min
+ [2026-04-07 22:21:00] Step 16250/20000 [decay] | Loss: 2.7889 | BPB: 4.0235 | LR: 0.000191 | Tokens: 8,519,680,000 | TPS: 342,221 | SWA: 119 | Mem: 46.7GB | 414.9min
+ [2026-04-07 22:21:24] Saved checkpoint at step 16250
+ [2026-04-07 22:25:25] Step 16300/20000 [decay] | Loss: 3.0371 | BPB: 4.3816 | LR: 0.000188 | Tokens: 8,545,894,400 | TPS: 339,646 | SWA: 120 | Mem: 46.7GB | 419.4min
+ [2026-04-07 22:29:27] Step 16350/20000 [decay] | Loss: 2.8663 | BPB: 4.1352 | LR: 0.000186 | Tokens: 8,572,108,800 | TPS: 337,451 | SWA: 121 | Mem: 46.7GB | 423.4min
+ [2026-04-07 22:34:03] Step 16400/20000 [decay] | Loss: 2.9042 | BPB: 4.1899 | LR: 0.000184 | Tokens: 8,598,323,200 | TPS: 334,840 | SWA: 123 | Mem: 46.7GB | 428.0min
+ [2026-04-07 22:38:05] Step 16450/20000 [decay] | Loss: 2.8025 | BPB: 4.0432 | LR: 0.000181 | Tokens: 8,624,537,600 | TPS: 332,728 | SWA: 124 | Mem: 46.7GB | 432.0min
+ [2026-04-07 22:42:06] Step 16500/20000 [decay] | Loss: 2.7192 | BPB: 3.9230 | LR: 0.000179 | Tokens: 8,650,752,000 | TPS: 330,660 | SWA: 125 | Mem: 46.7GB | 436.0min
+ [2026-04-07 22:42:30] Saved checkpoint at step 16500
+ [2026-04-07 22:42:36] --- Evaluation at step 16500 ---
+ [2026-04-07 22:42:36] Combined val BPB: 3.1256
+ [2026-04-07 22:42:41] en val BPB: 3.9089
+ [2026-04-07 22:42:46] ar val BPB: 3.8154
+ [2026-04-07 22:42:51] he val BPB: 1.8087
+ [2026-04-07 22:42:57] fa val BPB: 3.2425
+ [2026-04-07 22:42:57] New best! BPB: 3.1256
+ [2026-04-07 22:43:51] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 22:47:52] Step 16550/20000 [decay] | Loss: 2.9323 | BPB: 4.2304 | LR: 0.000176 | Tokens: 8,676,966,400 | TPS: 327,332 | SWA: 126 | Mem: 46.7GB | 441.8min
+ [2026-04-07 22:52:29] Step 16600/20000 [decay] | Loss: 3.1493 | BPB: 4.5435 | LR: 0.000174 | Tokens: 8,703,180,800 | TPS: 324,931 | SWA: 128 | Mem: 46.7GB | 446.4min
+ [2026-04-07 22:56:31] Step 16650/20000 [decay] | Loss: 2.8272 | BPB: 4.0789 | LR: 0.000171 | Tokens: 8,729,395,200 | TPS: 322,991 | SWA: 129 | Mem: 46.7GB | 450.4min
+ [2026-04-07 23:00:33] Step 16700/20000 [decay] | Loss: 2.7261 | BPB: 3.9330 | LR: 0.000169 | Tokens: 8,755,609,600 | TPS: 321,087 | SWA: 130 | Mem: 46.7GB | 454.5min
+ [2026-04-07 23:04:35] Step 16750/20000 [decay] | Loss: 2.8333 | BPB: 4.0876 | LR: 0.000167 | Tokens: 8,781,824,000 | TPS: 319,218 | SWA: 131 | Mem: 46.7GB | 458.5min
+ [2026-04-07 23:04:58] Saved checkpoint at step 16750
+ [2026-04-07 23:09:34] Step 16800/20000 [decay] | Loss: 3.0743 | BPB: 4.4353 | LR: 0.000164 | Tokens: 8,808,038,400 | TPS: 316,729 | SWA: 133 | Mem: 46.7GB | 463.5min
+ [2026-04-07 23:13:35] Step 16850/20000 [decay] | Loss: 2.8271 | BPB: 4.0787 | LR: 0.000162 | Tokens: 8,834,252,800 | TPS: 314,938 | SWA: 134 | Mem: 46.7GB | 467.5min
+ [2026-04-07 23:17:36] Step 16900/20000 [decay] | Loss: 2.6806 | BPB: 3.8673 | LR: 0.000159 | Tokens: 8,860,467,200 | TPS: 313,180 | SWA: 135 | Mem: 46.7GB | 471.5min
+ [2026-04-07 23:21:38] Step 16950/20000 [decay] | Loss: 3.0622 | BPB: 4.4178 | LR: 0.000157 | Tokens: 8,886,681,600 | TPS: 311,442 | SWA: 136 | Mem: 46.7GB | 475.6min
+ [2026-04-07 23:26:14] Step 17000/20000 [decay] | Loss: 2.6824 | BPB: 3.8699 | LR: 0.000154 | Tokens: 8,912,896,000 | TPS: 309,369 | SWA: 138 | Mem: 46.7GB | 480.2min
+ [2026-04-07 23:26:38] Saved checkpoint at step 17000
+ [2026-04-07 23:26:44] --- Evaluation at step 17000 ---
+ [2026-04-07 23:26:44] Combined val BPB: 3.1151
+ [2026-04-07 23:26:49] en val BPB: 3.8959
+ [2026-04-07 23:26:54] ar val BPB: 3.7989
+ [2026-04-07 23:27:00] he val BPB: 1.7978
+ [2026-04-07 23:27:05] fa val BPB: 3.2179
+ [2026-04-07 23:27:05] New best! BPB: 3.1151
+ [2026-04-07 23:27:59] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-07 23:32:01] Step 17050/20000 [decay] | Loss: 2.7147 | BPB: 3.9164 | LR: 0.000152 | Tokens: 8,939,110,400 | TPS: 306,584 | SWA: 139 | Mem: 46.7GB | 486.0min
+ [2026-04-07 23:36:04] Step 17100/20000 [decay] | Loss: 2.7153 | BPB: 3.9174 | LR: 0.000150 | Tokens: 8,965,324,800 | TPS: 304,951 | SWA: 140 | Mem: 46.7GB | 490.0min
+ [2026-04-07 23:40:06] Step 17150/20000 [decay] | Loss: 2.8889 | BPB: 4.1678 | LR: 0.000147 | Tokens: 8,991,539,200 | TPS: 303,346 | SWA: 141 | Mem: 46.7GB | 494.0min
+ [2026-04-07 23:44:43] Step 17200/20000 [decay] | Loss: 2.7390 | BPB: 3.9516 | LR: 0.000145 | Tokens: 9,017,753,600 | TPS: 301,406 | SWA: 143 | Mem: 46.7GB | 498.7min
+ [2026-04-07 23:48:46] Step 17250/20000 [decay] | Loss: 2.9655 | BPB: 4.2784 | LR: 0.000142 | Tokens: 9,043,968,000 | TPS: 299,847 | SWA: 144 | Mem: 46.7GB | 502.7min
+ [2026-04-07 23:49:11] Saved checkpoint at step 17250
+ [2026-04-07 23:53:13] Step 17300/20000 [decay] | Loss: 2.9317 | BPB: 4.2296 | LR: 0.000140 | Tokens: 9,070,182,400 | TPS: 298,081 | SWA: 145 | Mem: 46.7GB | 507.1min
+ [2026-04-07 23:57:15] Step 17350/20000 [decay] | Loss: 2.8900 | BPB: 4.1694 | LR: 0.000138 | Tokens: 9,096,396,800 | TPS: 296,584 | SWA: 146 | Mem: 46.7GB | 511.2min
+ [2026-04-08 00:01:51] Step 17400/20000 [decay] | Loss: 2.7505 | BPB: 3.9681 | LR: 0.000135 | Tokens: 9,122,611,200 | TPS: 294,786 | SWA: 148 | Mem: 46.7GB | 515.8min
+ [2026-04-08 00:05:53] Step 17450/20000 [decay] | Loss: 2.7415 | BPB: 3.9551 | LR: 0.000133 | Tokens: 9,148,825,600 | TPS: 293,342 | SWA: 149 | Mem: 46.7GB | 519.8min
+ [2026-04-08 00:09:54] Step 17500/20000 [decay] | Loss: 3.1388 | BPB: 4.5283 | LR: 0.000130 | Tokens: 9,175,040,000 | TPS: 291,924 | SWA: 150 | Mem: 46.7GB | 523.8min
+ [2026-04-08 00:10:18] Saved checkpoint at step 17500
+ [2026-04-08 00:10:23] --- Evaluation at step 17500 ---
+ [2026-04-08 00:10:23] Combined val BPB: 3.1000
+ [2026-04-08 00:10:28] en val BPB: 3.8792
+ [2026-04-08 00:10:34] ar val BPB: 3.7830
+ [2026-04-08 00:10:39] he val BPB: 1.7858
+ [2026-04-08 00:10:44] fa val BPB: 3.2099
+ [2026-04-08 00:10:44] New best! BPB: 3.1000
+ [2026-04-08 00:11:38] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 00:15:39] Step 17550/20000 [decay] | Loss: 2.9232 | BPB: 4.2173 | LR: 0.000128 | Tokens: 9,201,254,400 | TPS: 289,576 | SWA: 151 | Mem: 46.7GB | 529.6min
+ [2026-04-08 00:20:16] Step 17600/20000 [decay] | Loss: 2.9519 | BPB: 4.2587 | LR: 0.000125 | Tokens: 9,227,468,800 | TPS: 287,897 | SWA: 153 | Mem: 46.7GB | 534.2min
+ [2026-04-08 00:24:17] Step 17650/20000 [decay] | Loss: 2.8536 | BPB: 4.1168 | LR: 0.000123 | Tokens: 9,253,683,200 | TPS: 286,553 | SWA: 154 | Mem: 46.7GB | 538.2min
+ [2026-04-08 00:28:20] Step 17700/20000 [decay] | Loss: 3.0683 | BPB: 4.4266 | LR: 0.000121 | Tokens: 9,279,897,600 | TPS: 285,219 | SWA: 155 | Mem: 46.7GB | 542.3min
+ [2026-04-08 00:32:23] Step 17750/20000 [decay] | Loss: 2.6592 | BPB: 3.8365 | LR: 0.000118 | Tokens: 9,306,112,000 | TPS: 283,907 | SWA: 156 | Mem: 46.7GB | 546.3min
+ [2026-04-08 00:32:47] Saved checkpoint at step 17750
+ [2026-04-08 00:37:24] Step 17800/20000 [decay] | Loss: 2.8725 | BPB: 4.1441 | LR: 0.000116 | Tokens: 9,332,326,400 | TPS: 282,115 | SWA: 158 | Mem: 46.7GB | 551.3min
+ [2026-04-08 00:41:26] Step 17850/20000 [decay] | Loss: 2.8233 | BPB: 4.0732 | LR: 0.000113 | Tokens: 9,358,540,800 | TPS: 280,852 | SWA: 159 | Mem: 46.7GB | 555.4min
+ [2026-04-08 00:45:29] Step 17900/20000 [decay] | Loss: 2.5007 | BPB: 3.6078 | LR: 0.000111 | Tokens: 9,384,755,200 | TPS: 279,602 | SWA: 160 | Mem: 46.7GB | 559.4min
+ [2026-04-08 00:49:32] Step 17950/20000 [decay] | Loss: 2.6195 | BPB: 3.7792 | LR: 0.000108 | Tokens: 9,410,969,600 | TPS: 278,370 | SWA: 161 | Mem: 46.7GB | 563.5min
+ [2026-04-08 00:54:08] Step 18000/20000 [decay] | Loss: 2.6110 | BPB: 3.7668 | LR: 0.000106 | Tokens: 9,437,184,000 | TPS: 276,879 | SWA: 163 | Mem: 46.7GB | 568.1min
+ [2026-04-08 00:54:32] Saved checkpoint at step 18000
+ [2026-04-08 00:54:38] --- Evaluation at step 18000 ---
+ [2026-04-08 00:54:38] Combined val BPB: 3.0849
+ [2026-04-08 00:54:43] en val BPB: 3.8684
+ [2026-04-08 00:54:48] ar val BPB: 3.7672
+ [2026-04-08 00:54:53] he val BPB: 1.7790
+ [2026-04-08 00:54:59] fa val BPB: 3.1862
+ [2026-04-08 00:54:59] New best! BPB: 3.0849
+ [2026-04-08 00:55:53] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 00:59:55] Step 18050/20000 [decay] | Loss: 2.7159 | BPB: 3.9182 | LR: 0.000104 | Tokens: 9,463,398,400 | TPS: 274,856 | SWA: 164 | Mem: 46.7GB | 573.8min
+ [2026-04-08 01:03:56] Step 18100/20000 [decay] | Loss: 2.8846 | BPB: 4.1617 | LR: 0.000101 | Tokens: 9,489,612,800 | TPS: 273,696 | SWA: 165 | Mem: 46.7GB | 577.9min
+ [2026-04-08 01:07:59] Step 18150/20000 [decay] | Loss: 3.0780 | BPB: 4.4406 | LR: 0.000099 | Tokens: 9,515,827,200 | TPS: 272,547 | SWA: 166 | Mem: 46.7GB | 581.9min
+ [2026-04-08 01:12:35] Step 18200/20000 [decay] | Loss: 2.9512 | BPB: 4.2576 | LR: 0.000096 | Tokens: 9,542,041,600 | TPS: 271,152 | SWA: 168 | Mem: 46.7GB | 586.5min
+ [2026-04-08 01:16:36] Step 18250/20000 [decay] | Loss: 2.6801 | BPB: 3.8666 | LR: 0.000094 | Tokens: 9,568,256,000 | TPS: 270,045 | SWA: 169 | Mem: 46.7GB | 590.5min
+ [2026-04-08 01:17:00] Saved checkpoint at step 18250
+ [2026-04-08 01:21:03] Step 18300/20000 [decay] | Loss: 2.5476 | BPB: 3.6754 | LR: 0.000091 | Tokens: 9,594,470,400 | TPS: 268,766 | SWA: 170 | Mem: 46.7GB | 595.0min
+ [2026-04-08 01:25:05] Step 18350/20000 [decay] | Loss: 2.8480 | BPB: 4.1087 | LR: 0.000089 | Tokens: 9,620,684,800 | TPS: 267,685 | SWA: 171 | Mem: 46.7GB | 599.0min
+ [2026-04-08 01:29:41] Step 18400/20000 [decay] | Loss: 2.9482 | BPB: 4.2534 | LR: 0.000087 | Tokens: 9,646,899,200 | TPS: 266,364 | SWA: 173 | Mem: 46.7GB | 603.6min
+ [2026-04-08 01:33:43] Step 18450/20000 [decay] | Loss: 2.9971 | BPB: 4.3239 | LR: 0.000084 | Tokens: 9,673,113,600 | TPS: 265,317 | SWA: 174 | Mem: 46.7GB | 607.6min
+ [2026-04-08 01:37:45] Step 18500/20000 [decay] | Loss: 2.8399 | BPB: 4.0972 | LR: 0.000082 | Tokens: 9,699,328,000 | TPS: 264,279 | SWA: 175 | Mem: 46.7GB | 611.7min
+ [2026-04-08 01:38:10] Saved checkpoint at step 18500
+ [2026-04-08 01:38:15] --- Evaluation at step 18500 ---
+ [2026-04-08 01:38:15] Combined val BPB: 3.0756
+ [2026-04-08 01:38:20] en val BPB: 3.8558
+ [2026-04-08 01:38:25] ar val BPB: 3.7564
+ [2026-04-08 01:38:31] he val BPB: 1.7712
+ [2026-04-08 01:38:36] fa val BPB: 3.1739
+ [2026-04-08 01:38:36] New best! BPB: 3.0756
+ [2026-04-08 01:39:30] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 01:43:33] Step 18550/20000 [decay] | Loss: 3.0638 | BPB: 4.4201 | LR: 0.000079 | Tokens: 9,725,542,400 | TPS: 262,510 | SWA: 176 | Mem: 46.7GB | 617.5min
+ [2026-04-08 01:48:08] Step 18600/20000 [decay] | Loss: 3.0227 | BPB: 4.3608 | LR: 0.000077 | Tokens: 9,751,756,800 | TPS: 261,275 | SWA: 178 | Mem: 46.7GB | 622.1min
+ [2026-04-08 01:52:10] Step 18650/20000 [decay] | Loss: 2.7842 | BPB: 4.0167 | LR: 0.000074 | Tokens: 9,777,971,200 | TPS: 260,290 | SWA: 179 | Mem: 46.7GB | 626.1min
+ [2026-04-08 01:56:12] Step 18700/20000 [decay] | Loss: 2.8273 | BPB: 4.0789 | LR: 0.000072 | Tokens: 9,804,185,600 | TPS: 259,317 | SWA: 180 | Mem: 46.7GB | 630.1min
+ [2026-04-08 02:00:14] Step 18750/20000 [decay] | Loss: 2.7938 | BPB: 4.0306 | LR: 0.000070 | Tokens: 9,830,400,000 | TPS: 258,356 | SWA: 181 | Mem: 46.7GB | 634.2min
+ [2026-04-08 02:00:38] Saved checkpoint at step 18750
+ [2026-04-08 02:05:16] Step 18800/20000 [decay] | Loss: 2.9880 | BPB: 4.3108 | LR: 0.000067 | Tokens: 9,856,614,400 | TPS: 257,006 | SWA: 183 | Mem: 46.7GB | 639.2min
+ [2026-04-08 02:09:18] Step 18850/20000 [decay] | Loss: 2.7168 | BPB: 3.9195 | LR: 0.000065 | Tokens: 9,882,828,800 | TPS: 256,072 | SWA: 184 | Mem: 46.7GB | 643.2min
+ [2026-04-08 02:13:20] Step 18900/20000 [decay] | Loss: 2.9967 | BPB: 4.3233 | LR: 0.000062 | Tokens: 9,909,043,200 | TPS: 255,152 | SWA: 185 | Mem: 46.7GB | 647.3min
+ [2026-04-08 02:17:23] Step 18950/20000 [decay] | Loss: 2.8269 | BPB: 4.0784 | LR: 0.000060 | Tokens: 9,935,257,600 | TPS: 254,238 | SWA: 186 | Mem: 46.7GB | 651.3min
+ [2026-04-08 02:22:00] Step 19000/20000 [decay] | Loss: 3.1107 | BPB: 4.4878 | LR: 0.000057 | Tokens: 9,961,472,000 | TPS: 253,117 | SWA: 188 | Mem: 46.7GB | 655.9min
+ [2026-04-08 02:22:24] Saved checkpoint at step 19000
+ [2026-04-08 02:22:30] --- Evaluation at step 19000 ---
+ [2026-04-08 02:22:30] Combined val BPB: 3.0642
+ [2026-04-08 02:22:35] en val BPB: 3.8451
+ [2026-04-08 02:22:40] ar val BPB: 3.7425
+ [2026-04-08 02:22:45] he val BPB: 1.7616
+ [2026-04-08 02:22:50] fa val BPB: 3.1646
+ [2026-04-08 02:22:50] New best! BPB: 3.0642
+ [2026-04-08 02:23:45] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 02:27:47] Step 19050/20000 [decay] | Loss: 2.8345 | BPB: 4.0894 | LR: 0.000055 | Tokens: 9,987,686,400 | TPS: 251,562 | SWA: 189 | Mem: 46.7GB | 661.7min
+ [2026-04-08 02:31:50] Step 19100/20000 [decay] | Loss: 2.7320 | BPB: 3.9415 | LR: 0.000053 | Tokens: 10,013,900,800 | TPS: 250,691 | SWA: 190 | Mem: 46.7GB | 665.8min
+ [2026-04-08 02:35:52] Step 19150/20000 [decay] | Loss: 2.6194 | BPB: 3.7790 | LR: 0.000050 | Tokens: 10,040,115,200 | TPS: 249,833 | SWA: 191 | Mem: 46.7GB | 669.8min
+ [2026-04-08 02:40:28] Step 19200/20000 [decay] | Loss: 2.7017 | BPB: 3.8977 | LR: 0.000048 | Tokens: 10,066,329,600 | TPS: 248,772 | SWA: 193 | Mem: 46.7GB | 674.4min
+ [2026-04-08 02:44:30] Step 19250/20000 [decay] | Loss: 2.9662 | BPB: 4.2794 | LR: 0.000045 | Tokens: 10,092,544,000 | TPS: 247,937 | SWA: 194 | Mem: 46.7GB | 678.4min
+ [2026-04-08 02:44:55] Saved checkpoint at step 19250
+ [2026-04-08 02:48:58] Step 19300/20000 [decay] | Loss: 2.9888 | BPB: 4.3120 | LR: 0.000043 | Tokens: 10,118,758,400 | TPS: 246,955 | SWA: 195 | Mem: 46.7GB | 682.9min
+ [2026-04-08 02:53:00] Step 19350/20000 [decay] | Loss: 2.9994 | BPB: 4.3272 | LR: 0.000041 | Tokens: 10,144,972,800 | TPS: 246,145 | SWA: 196 | Mem: 46.7GB | 686.9min
+ [2026-04-08 02:57:36] Step 19400/20000 [decay] | Loss: 2.5938 | BPB: 3.7421 | LR: 0.000038 | Tokens: 10,171,187,200 | TPS: 245,139 | SWA: 198 | Mem: 46.7GB | 691.5min
+ [2026-04-08 03:01:38] Step 19450/20000 [decay] | Loss: 2.4771 | BPB: 3.5737 | LR: 0.000036 | Tokens: 10,197,401,600 | TPS: 244,344 | SWA: 199 | Mem: 46.7GB | 695.6min
+ [2026-04-08 03:05:41] Step 19500/20000 [decay] | Loss: 2.8016 | BPB: 4.0419 | LR: 0.000033 | Tokens: 10,223,616,000 | TPS: 243,552 | SWA: 200 | Mem: 46.7GB | 699.6min
+ [2026-04-08 03:06:06] Saved checkpoint at step 19500
+ [2026-04-08 03:06:11] --- Evaluation at step 19500 ---
+ [2026-04-08 03:06:11] Combined val BPB: 3.0561
+ [2026-04-08 03:06:16] en val BPB: 3.8371
+ [2026-04-08 03:06:22] ar val BPB: 3.7319
+ [2026-04-08 03:06:27] he val BPB: 1.7534
+ [2026-04-08 03:06:32] fa val BPB: 3.1473
+ [2026-04-08 03:06:32] New best! BPB: 3.0561
+ [2026-04-08 03:07:26] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 03:11:29] Step 19550/20000 [decay] | Loss: 3.0456 | BPB: 4.3939 | LR: 0.000031 | Tokens: 10,249,830,400 | TPS: 242,173 | SWA: 201 | Mem: 46.7GB | 705.4min
+ [2026-04-08 03:16:06] Step 19600/20000 [decay] | Loss: 2.7994 | BPB: 4.0387 | LR: 0.000028 | Tokens: 10,276,044,800 | TPS: 241,214 | SWA: 203 | Mem: 46.7GB | 710.0min
+ [2026-04-08 03:20:07] Step 19650/20000 [decay] | Loss: 2.5570 | BPB: 3.6890 | LR: 0.000026 | Tokens: 10,302,259,200 | TPS: 240,464 | SWA: 204 | Mem: 46.7GB | 714.1min
+ [2026-04-08 03:24:10] Step 19700/20000 [decay] | Loss: 2.7050 | BPB: 3.9026 | LR: 0.000024 | Tokens: 10,328,473,600 | TPS: 239,722 | SWA: 205 | Mem: 46.7GB | 718.1min
+ [2026-04-08 03:28:12] Step 19750/20000 [decay] | Loss: 2.6749 | BPB: 3.8590 | LR: 0.000021 | Tokens: 10,354,688,000 | TPS: 238,987 | SWA: 206 | Mem: 46.7GB | 722.1min
+ [2026-04-08 03:28:36] Saved checkpoint at step 19750
+ [2026-04-08 03:33:13] Step 19800/20000 [decay] | Loss: 2.7954 | BPB: 4.0329 | LR: 0.000019 | Tokens: 10,380,902,400 | TPS: 237,937 | SWA: 208 | Mem: 46.7GB | 727.1min
+ [2026-04-08 03:37:15] Step 19850/20000 [decay] | Loss: 2.6950 | BPB: 3.8880 | LR: 0.000016 | Tokens: 10,407,116,800 | TPS: 237,225 | SWA: 209 | Mem: 46.7GB | 731.2min
+ [2026-04-08 03:41:17] Step 19900/20000 [decay] | Loss: 2.8130 | BPB: 4.0583 | LR: 0.000014 | Tokens: 10,433,331,200 | TPS: 236,516 | SWA: 210 | Mem: 46.7GB | 735.2min
+ [2026-04-08 03:45:19] Step 19950/20000 [decay] | Loss: 2.8208 | BPB: 4.0696 | LR: 0.000011 | Tokens: 10,459,545,600 | TPS: 235,815 | SWA: 211 | Mem: 46.7GB | 739.2min
+ [2026-04-08 03:49:56] Step 20000/20000 [decay] | Loss: 2.6108 | BPB: 3.7665 | LR: 0.000009 | Tokens: 10,485,760,000 | TPS: 234,937 | SWA: 213 | Mem: 46.7GB | 743.9min
+ [2026-04-08 03:50:21] Saved checkpoint at step 20000
+ [2026-04-08 03:50:26] --- Evaluation at step 20000 ---
+ [2026-04-08 03:50:26] Combined val BPB: 3.0510
+ [2026-04-08 03:50:31] en val BPB: 3.8314
+ [2026-04-08 03:50:37] ar val BPB: 3.7262
+ [2026-04-08 03:50:42] he val BPB: 1.7495
+ [2026-04-08 03:50:47] fa val BPB: 3.1401
+ [2026-04-08 03:50:47] New best! BPB: 3.0510
+ [2026-04-08 03:51:42] Saved model to /tmp/checkpoints/best_model.pt
+ [2026-04-08 03:52:05] Saved model to /tmp/checkpoints/final_model.pt
+ [2026-04-08 03:52:05] Evaluating SWA model (213 checkpoints)...
+ [2026-04-08 03:52:46] SWA model combined BPB: 3.0635 (vs best raw: 3.0510)
+ [2026-04-08 03:52:52] SWA en BPB: 3.8503
+ [2026-04-08 03:52:57] SWA ar BPB: 3.7420
+ [2026-04-08 03:53:03] SWA he BPB: 1.7553
+ [2026-04-08 03:53:08] SWA fa BPB: 3.1727
+ [2026-04-08 03:53:24] Uploading all artifacts to S3...