NLPGroupProject committed on
Commit 24876ef · verified · 1 Parent(s): 7ced15e

Upload pretraining_20260507_063420.log

GPU_Run_Checkpoints/final_c2_muon_bs512_lr12_seed3_mix3to1/pretraining_20260507_063420.log ADDED
@@ -0,0 +1,547 @@
+ 2026-05-07 06:34:20,302 | INFO | === Raw Model ===
+ GPT(
+   (transformer): ModuleDict(
+     (drop): Dropout(p=0.0, inplace=False)
+     (h): ModuleList(
+       (0-17): 18 x Block(
+         (ln_1): RMSNorm()
+         (attn): CausalSelfAttention(
+           (rotary): RotaryEmbedding()
+           (q_proj): Linear(in_features=320, out_features=320, bias=False)
+           (k_proj): Linear(in_features=320, out_features=64, bias=False)
+           (v_proj): Linear(in_features=320, out_features=64, bias=False)
+           (c_proj): Linear(in_features=320, out_features=320, bias=False)
+           (resid_dropout): Dropout(p=0.0, inplace=False)
+         )
+         (ln_2): RMSNorm()
+         (mlp): MLP(
+           (c_fc): Linear(in_features=320, out_features=2048, bias=False)
+           (c_proj): Linear(in_features=1024, out_features=320, bias=False)
+           (dropout): Dropout(p=0.0, inplace=False)
+         )
+       )
+     )
+     (ln_f): RMSNorm()
+     (wte): Embedding(50304, 320)
+   )
+   (lm_head): Linear(in_features=320, out_features=50304, bias=False)
+ )
+
+ === Forward Summary (torchinfo, uncompiled model) ===
+ ====================================================================================================
+ Layer (type:depth-idx) Output Shape Param #
+ ====================================================================================================
+ GPT [1, 1, 50304] --
+ ├─ModuleDict: 1-1 -- --
+ │ └─Embedding: 2-1 [1, 1024, 320] 16,097,280
+ │ └─Dropout: 2-2 [1, 1024, 320] --
+ │ └─ModuleList: 2-3 -- --
+ │ │ └─Block: 3-1 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-1 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-2 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-1 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-2 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-3 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-4 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-5 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-6 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-3 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-4 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-7 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-8 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-9 [1, 1024, 320] --
+ │ │ └─Block: 3-2 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-5 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-6 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-10 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-11 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-12 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-13 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-14 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-15 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-7 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-8 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-16 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-17 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-18 [1, 1024, 320] --
+ │ │ └─Block: 3-3 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-9 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-10 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-19 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-20 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-21 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-22 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-23 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-24 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-11 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-12 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-25 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-26 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-27 [1, 1024, 320] --
+ │ │ └─Block: 3-4 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-13 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-14 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-28 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-29 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-30 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-31 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-32 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-33 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-15 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-16 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-34 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-35 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-36 [1, 1024, 320] --
+ │ │ └─Block: 3-5 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-17 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-18 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-37 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-38 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-39 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-40 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-41 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-42 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-19 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-20 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-43 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-44 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-45 [1, 1024, 320] --
+ │ │ └─Block: 3-6 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-21 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-22 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-46 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-47 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-48 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-49 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-50 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-51 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-23 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-24 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-52 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-53 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-54 [1, 1024, 320] --
+ │ │ └─Block: 3-7 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-25 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-26 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-55 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-56 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-57 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-58 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-59 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-60 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-27 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-28 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-61 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-62 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-63 [1, 1024, 320] --
+ │ │ └─Block: 3-8 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-29 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-30 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-64 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-65 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-66 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-67 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-68 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-69 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-31 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-32 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-70 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-71 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-72 [1, 1024, 320] --
+ │ │ └─Block: 3-9 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-33 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-34 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-73 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-74 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-75 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-76 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-77 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-78 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-35 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-36 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-79 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-80 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-81 [1, 1024, 320] --
+ │ │ └─Block: 3-10 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-37 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-38 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-82 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-83 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-84 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-85 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-86 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-87 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-39 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-40 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-88 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-89 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-90 [1, 1024, 320] --
+ │ │ └─Block: 3-11 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-41 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-42 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-91 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-92 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-93 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-94 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-95 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-96 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-43 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-44 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-97 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-98 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-99 [1, 1024, 320] --
+ │ │ └─Block: 3-12 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-45 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-46 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-100 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-101 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-102 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-103 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-104 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-105 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-47 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-48 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-106 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-107 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-108 [1, 1024, 320] --
+ │ │ └─Block: 3-13 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-49 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-50 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-109 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-110 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-111 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-112 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-113 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-114 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-51 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-52 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-115 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-116 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-117 [1, 1024, 320] --
+ │ │ └─Block: 3-14 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-53 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-54 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-118 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-119 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-120 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-121 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-122 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-123 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-55 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-56 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-124 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-125 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-126 [1, 1024, 320] --
+ │ │ └─Block: 3-15 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-57 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-58 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-127 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-128 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-129 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-130 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-131 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-132 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-59 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-60 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-133 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-134 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-135 [1, 1024, 320] --
+ │ │ └─Block: 3-16 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-61 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-62 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-136 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-137 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-138 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-139 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-140 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-141 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-63 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-64 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-142 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-143 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-144 [1, 1024, 320] --
+ │ │ └─Block: 3-17 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-65 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-66 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-145 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-146 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-147 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-148 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-149 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-150 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-67 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-68 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-151 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-152 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-153 [1, 1024, 320] --
+ │ │ └─Block: 3-18 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-69 [1, 1024, 320] 320
+ │ │ │ └─CausalSelfAttention: 4-70 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-154 [1, 1024, 320] 102,400
+ │ │ │ │ └─Linear: 5-155 [1, 1024, 64] 20,480
+ │ │ │ │ └─Linear: 5-156 [1, 1024, 64] 20,480
+ │ │ │ │ └─RotaryEmbedding: 5-157 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-158 [1, 1024, 320] 102,400
+ │ │ │ │ └─Dropout: 5-159 [1, 1024, 320] --
+ │ │ │ └─RMSNorm: 4-71 [1, 1024, 320] 320
+ │ │ │ └─MLP: 4-72 [1, 1024, 320] --
+ │ │ │ │ └─Linear: 5-160 [1, 1024, 2048] 655,360
+ │ │ │ │ └─Linear: 5-161 [1, 1024, 320] 327,680
+ │ │ │ │ └─Dropout: 5-162 [1, 1024, 320] --
+ │ └─RMSNorm: 2-4 [1, 1024, 320] 320
+ ├─Linear: 1-2 [1, 1, 50304] 16,097,280
+ ====================================================================================================
+
+ === Parameter Counts (unique tensors) ===
+ Total params: 38,227,520
+ Trainable params: 38,227,520
+ Weight tying (wte = lm_head): True
+ Embedding mode: standard tied token embedding
+ Note: module-level torchinfo totals may double-count the tied LM head; use the unique counts above.
+ 2026-05-07 06:34:20,360 | INFO | === Pretraining Started ===
+ 2026-05-07 06:34:20,360 | INFO | Device: cuda | dtype: bfloat16 | distributed: False (world_size=1)
+ 2026-05-07 06:34:20,360 | INFO | Model: 18 layers, 5 heads, 320 embd, context_len=1024
+ 2026-05-07 06:34:20,360 | INFO | Training: max_iters=15259, batch_size=4, grad_accum=128, lr=1.11e-02, warmup=153 steps
+ 2026-05-07 06:34:20,360 | INFO | Data: 7062542559 train tokens | tokens/step=524288
+ 2026-05-07 06:34:20,360 | INFO | Data mix: data/processed_owt/train=75.0%, data/processed_nonwiki_2b/train=25.0%
+ 2026-05-07 06:34:26,584 | INFO | Resumed from checkpoint artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/latest_ckpt.pt at step 13200
+ 2026-05-07 06:35:49,994 | INFO | step 13210/15259 | epoch 0 | loss 3.5143 | ppl 33.59 | lr 1.00e-02 | grad_norm 0.234 | 62857 tok/s | dt 83.41s | ETA 4:44:50
+ 2026-05-07 06:36:24,112 | INFO | step 13220/15259 | epoch 0 | loss 3.4970 | ppl 33.02 | lr 1.00e-02 | grad_norm 0.214 | 153670 tok/s | dt 34.12s | ETA 3:19:41
+ 2026-05-07 06:36:58,549 | INFO | step 13230/15259 | epoch 0 | loss 3.5149 | ppl 33.61 | lr 9.95e-03 | grad_norm 0.191 | 152248 tok/s | dt 34.44s | ETA 2:51:17
+ 2026-05-07 06:37:32,417 | INFO | step 13240/15259 | epoch 0 | loss 3.5049 | ppl 33.28 | lr 9.91e-03 | grad_norm 0.172 | 154801 tok/s | dt 33.87s | ETA 2:36:19
+ 2026-05-07 06:38:07,047 | INFO | step 13250/15259 | epoch 0 | loss 3.5104 | ppl 33.46 | lr 9.87e-03 | grad_norm 0.209 | 151398 tok/s | dt 34.63s | ETA 2:27:37
+ 2026-05-07 06:38:41,493 | INFO | step 13260/15259 | epoch 0 | loss 3.4893 | ppl 32.76 | lr 9.82e-03 | grad_norm 0.254 | 152207 tok/s | dt 34.45s | ETA 1:54:16
+ 2026-05-07 06:39:15,212 | INFO | step 13270/15259 | epoch 0 | loss 3.5077 | ppl 33.37 | lr 9.78e-03 | grad_norm 0.221 | 155484 tok/s | dt 33.72s | ETA 1:53:26
+ 2026-05-07 06:39:49,389 | INFO | step 13280/15259 | epoch 0 | loss 3.5351 | ppl 34.30 | lr 9.74e-03 | grad_norm 0.201 | 153405 tok/s | dt 34.18s | ETA 1:52:41
+ 2026-05-07 06:40:23,601 | INFO | step 13290/15259 | epoch 0 | loss 3.5499 | ppl 34.81 | lr 9.69e-03 | grad_norm 0.204 | 153248 tok/s | dt 34.21s | ETA 1:52:21
+ 2026-05-07 06:40:58,004 | INFO | step 13300/15259 | epoch 0 | loss 3.5136 | ppl 33.57 | lr 9.65e-03 | grad_norm 0.173 | 152395 tok/s | dt 34.40s | ETA 1:51:37
+ 2026-05-07 06:41:19,954 | INFO | step 13300 | val_loss 3.6002 | val_ppl 36.61
+ 2026-05-07 06:42:04,308 | INFO | step 13310/15259 | epoch 0 | loss 3.5649 | ppl 35.34 | lr 9.61e-03 | grad_norm 0.193 | 79074 tok/s | dt 66.30s | ETA 1:51:04
+ 2026-05-07 06:42:38,896 | INFO | step 13320/15259 | epoch 0 | loss 3.5135 | ppl 33.56 | lr 9.56e-03 | grad_norm 0.256 | 151578 tok/s | dt 34.59s | ETA 1:51:03
+ 2026-05-07 06:43:13,449 | INFO | step 13330/15259 | epoch 0 | loss 3.5171 | ppl 33.69 | lr 9.52e-03 | grad_norm 0.206 | 151737 tok/s | dt 34.55s | ETA 1:50:43
+ 2026-05-07 06:43:47,812 | INFO | step 13340/15259 | epoch 0 | loss 3.5745 | ppl 35.68 | lr 9.47e-03 | grad_norm 0.241 | 152574 tok/s | dt 34.36s | ETA 1:50:15
+ 2026-05-07 06:44:22,826 | INFO | step 13350/15259 | epoch 0 | loss 3.4603 | ppl 31.83 | lr 9.43e-03 | grad_norm 0.207 | 149734 tok/s | dt 35.01s | ETA 1:50:04
+ 2026-05-07 06:44:57,160 | INFO | step 13360/15259 | epoch 0 | loss 3.4766 | ppl 32.35 | lr 9.39e-03 | grad_norm 0.147 | 152701 tok/s | dt 34.33s | ETA 1:49:24
+ 2026-05-07 06:45:31,551 | INFO | step 13370/15259 | epoch 0 | loss 3.4897 | ppl 32.78 | lr 9.34e-03 | grad_norm 0.201 | 152448 tok/s | dt 34.39s | ETA 1:48:42
+ 2026-05-07 06:46:05,911 | INFO | step 13380/15259 | epoch 0 | loss 3.5253 | ppl 33.96 | lr 9.30e-03 | grad_norm 0.196 | 152591 tok/s | dt 34.36s | ETA 1:48:00
+ 2026-05-07 06:46:40,259 | INFO | step 13390/15259 | epoch 0 | loss 3.5304 | ppl 34.14 | lr 9.26e-03 | grad_norm 0.162 | 152636 tok/s | dt 34.35s | ETA 1:47:25
+ 2026-05-07 06:47:14,770 | INFO | step 13400/15259 | epoch 0 | loss 3.5191 | ppl 33.75 | lr 9.21e-03 | grad_norm 0.162 | 151920 tok/s | dt 34.51s | ETA 1:46:32
+ 2026-05-07 06:47:14,957 | INFO | step 13400 | val_loss 3.3677 | val_ppl 29.01 ** New best validation loss! **
+ 2026-05-07 06:47:30,555 | WARNING | New best checkpoint at step 13400 | val_loss=3.3677 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
+ 2026-05-07 06:48:05,293 | INFO | step 13410/15259 | epoch 0 | loss 3.4872 | ppl 32.69 | lr 9.17e-03 | grad_norm 0.137 | 103772 tok/s | dt 50.52s | ETA 1:46:13
+ 2026-05-07 06:48:39,903 | INFO | step 13420/15259 | epoch 0 | loss 3.5143 | ppl 33.59 | lr 9.13e-03 | grad_norm 0.200 | 151484 tok/s | dt 34.61s | ETA 1:45:46
+ 2026-05-07 06:49:14,544 | INFO | step 13430/15259 | epoch 0 | loss 3.4930 | ppl 32.89 | lr 9.08e-03 | grad_norm 0.153 | 151349 tok/s | dt 34.64s | ETA 1:45:22
+ 2026-05-07 06:49:49,662 | INFO | step 13440/15259 | epoch 0 | loss 3.5093 | ppl 33.43 | lr 9.04e-03 | grad_norm 0.182 | 149296 tok/s | dt 35.12s | ETA 1:45:16
+ 2026-05-07 06:50:24,600 | INFO | step 13450/15259 | epoch 0 | loss 3.5218 | ppl 33.85 | lr 9.00e-03 | grad_norm 0.174 | 150061 tok/s | dt 34.94s | ETA 1:44:56
+ 2026-05-07 06:50:59,541 | INFO | step 13460/15259 | epoch 0 | loss 3.5333 | ppl 34.24 | lr 8.95e-03 | grad_norm 0.161 | 150049 tok/s | dt 34.94s | ETA 1:44:29
+ 2026-05-07 06:51:34,376 | INFO | step 13470/15259 | epoch 0 | loss 3.4957 | ppl 32.97 | lr 8.91e-03 | grad_norm 0.175 | 150506 tok/s | dt 34.84s | ETA 1:44:02
+ 2026-05-07 06:52:09,097 | INFO | step 13480/15259 | epoch 1 | loss 3.5247 | ppl 33.94 | lr 8.86e-03 | grad_norm 0.153 | 151003 tok/s | dt 34.72s | ETA 1:43:30
+ 2026-05-07 06:52:43,742 | INFO | step 13490/15259 | epoch 1 | loss 3.5071 | ppl 33.35 | lr 8.82e-03 | grad_norm 0.267 | 151329 tok/s | dt 34.65s | ETA 1:42:38
+ 2026-05-07 06:53:18,381 | INFO | step 13500/15259 | epoch 1 | loss 3.7590 | ppl 42.90 | lr 8.78e-03 | grad_norm 0.192 | 151359 tok/s | dt 34.64s | ETA 1:41:53
+ 2026-05-07 06:53:18,567 | INFO | step 13500 | val_loss 3.4374 | val_ppl 31.11
+ 2026-05-07 06:54:02,683 | INFO | step 13510/15259 | epoch 1 | loss 3.5038 | ppl 33.24 | lr 8.73e-03 | grad_norm 0.221 | 118344 tok/s | dt 44.30s | ETA 1:41:32
+ 2026-05-07 06:54:37,248 | INFO | step 13520/15259 | epoch 1 | loss 3.5296 | ppl 34.11 | lr 8.69e-03 | grad_norm 0.171 | 151680 tok/s | dt 34.57s | ETA 1:40:48
+ 2026-05-07 06:55:11,594 | INFO | step 13530/15259 | epoch 1 | loss 3.5474 | ppl 34.72 | lr 8.65e-03 | grad_norm 0.150 | 152653 tok/s | dt 34.35s | ETA 1:40:00
+ 2026-05-07 06:55:45,716 | INFO | step 13540/15259 | epoch 1 | loss 3.5175 | ppl 33.70 | lr 8.60e-03 | grad_norm 0.156 | 153649 tok/s | dt 34.12s | ETA 1:39:08
+ 2026-05-07 06:56:19,861 | INFO | step 13550/15259 | epoch 1 | loss 3.5147 | ppl 33.60 | lr 8.56e-03 | grad_norm 0.156 | 153547 tok/s | dt 34.15s | ETA 1:38:16
+ 2026-05-07 06:56:54,398 | INFO | step 13560/15259 | epoch 1 | loss 3.4620 | ppl 31.88 | lr 8.52e-03 | grad_norm 0.189 | 151807 tok/s | dt 34.54s | ETA 1:37:14
+ 2026-05-07 06:57:29,031 | INFO | step 13570/15259 | epoch 1 | loss 3.5114 | ppl 33.50 | lr 8.47e-03 | grad_norm 0.157 | 151382 tok/s | dt 34.63s | ETA 1:36:42
+ 2026-05-07 06:58:03,664 | INFO | step 13580/15259 | epoch 1 | loss 3.5009 | ppl 33.14 | lr 8.43e-03 | grad_norm 0.176 | 151383 tok/s | dt 34.63s | ETA 1:36:17
+ 2026-05-07 06:58:38,286 | INFO | step 13590/15259 | epoch 1 | loss 3.4892 | ppl 32.76 | lr 8.39e-03 | grad_norm 0.141 | 151435 tok/s | dt 34.62s | ETA 1:36:00
+ 2026-05-07 06:59:12,820 | INFO | step 13600/15259 | epoch 1 | loss 3.5015 | ppl 33.16 | lr 8.34e-03 | grad_norm 0.182 | 151818 tok/s | dt 34.53s | ETA 1:35:38
+ 2026-05-07 06:59:13,002 | INFO | step 13600 | val_loss 3.4226 | val_ppl 30.65
+ 2026-05-07 06:59:59,485 | INFO | step 13610/15259 | epoch 1 | loss 3.4709 | ppl 32.16 | lr 8.30e-03 | grad_norm 0.180 | 112351 tok/s | dt 46.67s | ETA 1:35:03
+ 2026-05-07 07:00:34,150 | INFO | step 13620/15259 | epoch 1 | loss 3.4918 | ppl 32.84 | lr 8.25e-03 | grad_norm 0.202 | 151243 tok/s | dt 34.67s | ETA 1:34:29
+ 2026-05-07 07:01:08,673 | INFO | step 13630/15259 | epoch 1 | loss 3.4872 | ppl 32.69 | lr 8.21e-03 | grad_norm 0.150 | 151868 tok/s | dt 34.52s | ETA 1:33:51
+ 2026-05-07 07:01:43,133 | INFO | step 13640/15259 | epoch 1 | loss 3.4941 | ppl 32.92 | lr 8.17e-03 | grad_norm 0.162 | 152140 tok/s | dt 34.46s | ETA 1:33:11
+ 2026-05-07 07:02:17,770 | INFO | step 13650/15259 | epoch 1 | loss 3.5046 | ppl 33.27 | lr 8.12e-03 | grad_norm 0.166 | 151366 tok/s | dt 34.64s | ETA 1:32:40
+ 2026-05-07 07:02:52,733 | INFO | step 13660/15259 | epoch 1 | loss 3.4701 | ppl 32.14 | lr 8.08e-03 | grad_norm 0.162 | 149956 tok/s | dt 34.96s | ETA 1:32:20
+ 2026-05-07 07:03:27,186 | INFO | step 13670/15259 | epoch 1 | loss 3.4949 | ppl 32.95 | lr 8.04e-03 | grad_norm 0.180 | 152174 tok/s | dt 34.45s | ETA 1:31:38
+ 2026-05-07 07:04:01,790 | INFO | step 13680/15259 | epoch 1 | loss 3.4496 | ppl 31.49 | lr 7.99e-03 | grad_norm 0.210 | 151514 tok/s | dt 34.60s | ETA 1:31:06
+ 2026-05-07 07:04:36,144 | INFO | step 13690/15259 | epoch 1 | loss 3.4082 | ppl 30.21 | lr 7.95e-03 | grad_norm 0.179 | 152613 tok/s | dt 34.35s | ETA 1:30:28
+ 2026-05-07 07:05:10,705 | INFO | step 13700/15259 | epoch 1 | loss 3.4926 | ppl 32.87 | lr 7.91e-03 | grad_norm 0.137 | 151697 tok/s | dt 34.56s | ETA 1:29:51
+ 2026-05-07 07:05:10,895 | INFO | step 13700 | val_loss 3.4340 | val_ppl 31.00
+ 2026-05-07 07:05:55,925 | INFO | step 13710/15259 | epoch 1 | loss 3.4980 | ppl 33.05 | lr 7.86e-03 | grad_norm 0.124 | 115943 tok/s | dt 45.22s | ETA 1:29:08
+ 2026-05-07 07:06:30,118 | INFO | step 13720/15259 | epoch 1 | loss 3.4797 | ppl 32.45 | lr 7.82e-03 | grad_norm 0.131 | 153333 tok/s | dt 34.19s | ETA 1:28:25
+ 2026-05-07 07:07:04,421 | INFO | step 13730/15259 | epoch 1 | loss 3.4600 | ppl 31.82 | lr 7.77e-03 | grad_norm 0.185 | 152836 tok/s | dt 34.30s | ETA 1:27:41
+ 2026-05-07 07:07:38,398 | INFO | step 13740/15259 | epoch 1 | loss 3.5195 | ppl 33.77 | lr 7.73e-03 | grad_norm 0.193 | 154307 tok/s | dt 33.98s | ETA 1:26:56
+ 2026-05-07 07:08:12,346 | INFO | step 13750/15259 | epoch 1 | loss 3.5011 | ppl 33.15 | lr 7.69e-03 | grad_norm 0.153 | 154438 tok/s | dt 33.95s | ETA 1:26:03
+ 2026-05-07 07:08:46,773 | INFO | step 13760/15259 | epoch 1 | loss 3.5008 | ppl 33.14 | lr 7.64e-03 | grad_norm 0.178 | 152290 tok/s | dt 34.43s | ETA 1:25:21
+ 2026-05-07 07:09:21,595 | INFO | step 13770/15259 | epoch 1 | loss 3.5206 | ppl 33.80 | lr 7.60e-03 | grad_norm 0.153 | 150564 tok/s | dt 34.82s | ETA 1:25:06
+ 2026-05-07 07:09:56,166 | INFO | step 13780/15259 | epoch 1 | loss 3.4685 | ppl 32.09 | lr 7.56e-03 | grad_norm 0.170 | 151656 tok/s | dt 34.57s | ETA 1:24:40
+ 2026-05-07 07:10:30,407 | INFO | step 13790/15259 | epoch 1 | loss 3.4459 | ppl 31.37 | lr 7.51e-03 | grad_norm 0.143 | 153118 tok/s | dt 34.24s | ETA 1:24:13
+ 2026-05-07 07:11:04,411 | INFO | step 13800/15259 | epoch 1 | loss 3.4981 | ppl 33.05 | lr 7.47e-03 | grad_norm 0.199 | 154181 tok/s | dt 34.00s | ETA 1:23:40
+ 2026-05-07 07:11:04,596 | INFO | step 13800 | val_loss 3.5191 | val_ppl 33.75
+ 2026-05-07 07:11:48,981 | INFO | step 13810/15259 | epoch 1 | loss 3.4680 | ppl 32.07 | lr 7.43e-03 | grad_norm 0.156 | 117634 tok/s | dt 44.57s | ETA 1:23:07
+ 2026-05-07 07:12:23,670 | INFO | step 13820/15259 | epoch 1 | loss 3.5082 | ppl 33.39 | lr 7.38e-03 | grad_norm 0.194 | 151137 tok/s | dt 34.69s | ETA 1:22:29
+ 2026-05-07 07:12:58,506 | INFO | step 13830/15259 | epoch 1 | loss 3.4734 | ppl 32.25 | lr 7.34e-03 | grad_norm 0.164 | 150502 tok/s | dt 34.84s | ETA 1:22:02
+ 2026-05-07 07:13:32,857 | INFO | step 13840/15259 | epoch 1 | loss 3.4963 | ppl 32.99 | lr 7.30e-03 | grad_norm 0.171 | 152626 tok/s | dt 34.35s | ETA 1:21:31
+ 2026-05-07 07:14:07,342 | INFO | step 13850/15259 | epoch 1 | loss 3.5225 | ppl 33.87 | lr 7.25e-03 | grad_norm 0.201 | 152034 tok/s | dt 34.48s | ETA 1:21:10
+ 2026-05-07 07:14:41,507 | INFO | step 13860/15259 | epoch 1 | loss 3.4580 | ppl 31.75 | lr 7.21e-03 | grad_norm 0.155 | 153459 tok/s | dt 34.16s | ETA 1:20:27
+ 2026-05-07 07:15:15,643 | INFO | step 13870/15259 | epoch 1 | loss 3.4459 | ppl 31.37 | lr 7.16e-03 | grad_norm 0.153 | 153586 tok/s | dt 34.14s | ETA 1:19:37
+ 2026-05-07 07:15:49,903 | INFO | step 13880/15259 | epoch 1 | loss 3.4870 | ppl 32.69 | lr 7.12e-03 | grad_norm 0.139 | 153034 tok/s | dt 34.26s | ETA 1:18:46
+ 2026-05-07 07:16:24,175 | INFO | step 13890/15259 | epoch 1 | loss 3.4775 | ppl 32.38 | lr 7.08e-03 | grad_norm 0.185 | 152978 tok/s | dt 34.27s | ETA 1:18:10
+ 2026-05-07 07:16:58,430 | INFO | step 13900/15259 | epoch 1 | loss 3.4699 | ppl 32.13 | lr 7.03e-03 | grad_norm 0.138 | 153054 tok/s | dt 34.26s | ETA 1:17:30
+ 2026-05-07 07:16:58,612 | WARNING | Step 13900: val_loss has not improved for 5 consecutive evals (best=3.3677, current=3.4687).
+ 2026-05-07 07:16:58,613 | INFO | step 13900 | val_loss 3.4687 | val_ppl 32.10
+ 2026-05-07 07:17:42,975 | INFO | step 13910/15259 | epoch 1 | loss 3.4850 | ppl 32.62 | lr 6.99e-03 | grad_norm 0.144 | 117699 tok/s | dt 44.54s | ETA 1:17:05
+ 2026-05-07 07:18:17,412 | INFO | step 13920/15259 | epoch 1 | loss 3.4360 | ppl 31.06 | lr 6.95e-03 | grad_norm 0.139 | 152245 tok/s | dt 34.44s | ETA 1:16:38
+ 2026-05-07 07:18:51,847 | INFO | step 13930/15259 | epoch 1 | loss 3.4891 | ppl 32.76 | lr 6.90e-03 | grad_norm 0.137 | 152256 tok/s | dt 34.43s | ETA 1:16:09
+ 2026-05-07 07:19:26,193 | INFO | step 13940/15259 | epoch 1 | loss 3.4434 | ppl 31.29 | lr 6.86e-03 | grad_norm 0.149 | 152648 tok/s | dt 34.35s | ETA 1:15:36
+ 2026-05-07 07:20:00,474 | INFO | step 13950/15259 | epoch 1 | loss 3.4170 | ppl 30.48 | lr 6.82e-03 | grad_norm 0.198 | 152938 tok/s | dt 34.28s | ETA 1:15:03
+ 2026-05-07 07:20:34,950 | INFO | step 13960/15259 | epoch 1 | loss 3.4158 | ppl 30.44 | lr 6.77e-03 | grad_norm 0.165 | 152071 tok/s | dt 34.48s | ETA 1:14:27
+ 2026-05-07 07:21:09,574 | INFO | step 13970/15259 | epoch 1 | loss 3.4779 | ppl 32.39 | lr 6.73e-03 | grad_norm 0.124 | 151425 tok/s | dt 34.62s | ETA 1:13:58
+ 2026-05-07 07:21:44,417 | INFO | step 13980/15259 | epoch 1 | loss 3.4564 | ppl 31.70 | lr 6.68e-03 | grad_norm 0.132 | 150473 tok/s | dt 34.84s | ETA 1:13:34
+ 2026-05-07 07:22:18,789 | INFO | step 13990/15259 | epoch 1 | loss 3.4518 | ppl 31.56 | lr 6.64e-03 | grad_norm 0.133 | 152531 tok/s | dt 34.37s | ETA 1:13:00
+ 2026-05-07 07:22:53,228 | INFO | step 14000/15259 | epoch 1 | loss 3.4841 | ppl 32.59 | lr 6.60e-03 | grad_norm 0.174 | 152238 tok/s | dt 34.44s | ETA 1:12:29
+ 2026-05-07 07:22:53,412 | WARNING | Step 14000: val_loss has not improved for 6 consecutive evals (best=3.3677, current=3.4683).
+ 2026-05-07 07:22:53,412 | INFO | step 14000 | val_loss 3.4683 | val_ppl 32.08
+ 2026-05-07 07:23:37,500 | INFO | step 14010/15259 | epoch 1 | loss 3.4702 | ppl 32.14 | lr 6.55e-03 | grad_norm 0.132 | 118424 tok/s | dt 44.27s | ETA 1:11:57
+ 2026-05-07 07:24:12,037 | INFO | step 14020/15259 | epoch 1 | loss 3.4517 | ppl 31.55 | lr 6.51e-03 | grad_norm 0.143 | 151804 tok/s | dt 34.54s | ETA 1:11:20
+ 2026-05-07 07:24:46,343 | INFO | step 14030/15259 | epoch 1 | loss 3.4394 | ppl 31.17 | lr 6.47e-03 | grad_norm 0.164 | 152830 tok/s | dt 34.31s | ETA 1:10:33
+ 2026-05-07 07:25:20,692 | INFO | step 14040/15259 | epoch 1 | loss 3.4303 | ppl 30.89 | lr 6.42e-03 | grad_norm 0.152 | 152635 tok/s | dt 34.35s | ETA 1:09:58
+ 2026-05-07 07:25:55,235 | INFO | step 14050/15259 | epoch 1 | loss 3.4534 | ppl 31.61 | lr 6.38e-03 | grad_norm 0.142 | 151776 tok/s | dt 34.54s | ETA 1:09:26
+ 2026-05-07 07:26:29,511 | INFO | step 14060/15259 | epoch 1 | loss 3.4545 | ppl 31.64 | lr 6.34e-03 | grad_norm 0.174 | 152962 tok/s | dt 34.28s | ETA 1:08:44
405
+ 2026-05-07 07:27:03,852 | INFO | step 14070/15259 | epoch 1 | loss 3.4559 | ppl 31.69 | lr 6.29e-03 | grad_norm 0.140 | 152673 tok/s | dt 34.34s | ETA 1:08:05
406
+ 2026-05-07 07:27:38,106 | INFO | step 14080/15259 | epoch 1 | loss 3.4684 | ppl 32.09 | lr 6.25e-03 | grad_norm 0.123 | 153056 tok/s | dt 34.25s | ETA 1:07:30
407
+ 2026-05-07 07:28:12,503 | INFO | step 14090/15259 | epoch 1 | loss 3.4567 | ppl 31.71 | lr 6.21e-03 | grad_norm 0.137 | 152422 tok/s | dt 34.40s | ETA 1:06:56
408
+ 2026-05-07 07:28:46,898 | INFO | step 14100/15259 | epoch 1 | loss 3.5002 | ppl 33.12 | lr 6.16e-03 | grad_norm 0.174 | 152435 tok/s | dt 34.39s | ETA 1:06:19
409
+ 2026-05-07 07:28:47,081 | WARNING | Step 14100: val_loss has not improved for 7 consecutive evals (best=3.3677, current=3.5897).
410
+ 2026-05-07 07:28:47,081 | INFO | step 14100 | val_loss 3.5897 | val_ppl 36.22
411
+ 2026-05-07 07:29:36,279 | INFO | step 14110/15259 | epoch 1 | loss 3.4514 | ppl 31.54 | lr 6.12e-03 | grad_norm 0.161 | 106171 tok/s | dt 49.38s | ETA 1:05:51
412
+ 2026-05-07 07:30:10,712 | INFO | step 14120/15259 | epoch 1 | loss 3.4740 | ppl 32.26 | lr 6.07e-03 | grad_norm 0.158 | 152263 tok/s | dt 34.43s | ETA 1:05:19
413
+ 2026-05-07 07:30:45,091 | INFO | step 14130/15259 | epoch 1 | loss 3.4807 | ppl 32.48 | lr 6.03e-03 | grad_norm 0.141 | 152506 tok/s | dt 34.38s | ETA 1:04:47
414
+ 2026-05-07 07:31:19,576 | INFO | step 14140/15259 | epoch 1 | loss 3.4537 | ppl 31.62 | lr 5.99e-03 | grad_norm 0.140 | 152032 tok/s | dt 34.49s | ETA 1:04:15
415
+ 2026-05-07 07:31:53,858 | INFO | step 14150/15259 | epoch 1 | loss 3.4652 | ppl 31.98 | lr 5.94e-03 | grad_norm 0.131 | 152932 tok/s | dt 34.28s | ETA 1:03:38
416
+ 2026-05-07 07:32:28,206 | INFO | step 14160/15259 | epoch 1 | loss 3.4323 | ppl 30.95 | lr 5.90e-03 | grad_norm 0.176 | 152641 tok/s | dt 34.35s | ETA 1:02:58
417
+ 2026-05-07 07:33:02,704 | INFO | step 14170/15259 | epoch 1 | loss 3.4628 | ppl 31.90 | lr 5.86e-03 | grad_norm 0.132 | 151978 tok/s | dt 34.50s | ETA 1:02:25
418
+ 2026-05-07 07:33:36,941 | INFO | step 14180/15259 | epoch 1 | loss 3.4168 | ppl 30.47 | lr 5.81e-03 | grad_norm 0.153 | 153132 tok/s | dt 34.24s | ETA 1:01:48
419
+ 2026-05-07 07:34:11,761 | INFO | step 14190/15259 | epoch 1 | loss 3.4523 | ppl 31.57 | lr 5.77e-03 | grad_norm 0.128 | 150572 tok/s | dt 34.82s | ETA 1:01:21
420
+ 2026-05-07 07:34:46,331 | INFO | step 14200/15259 | epoch 1 | loss 3.4197 | ppl 30.56 | lr 5.73e-03 | grad_norm 0.125 | 151661 tok/s | dt 34.57s | ETA 1:00:52
421
+ 2026-05-07 07:34:46,517 | WARNING | Step 14200: val_loss has not improved for 8 consecutive evals (best=3.3677, current=3.3797).
422
+ 2026-05-07 07:34:46,517 | INFO | step 14200 | val_loss 3.3797 | val_ppl 29.36
423
+ 2026-05-07 07:35:30,460 | INFO | step 14210/15259 | epoch 1 | loss 3.4574 | ppl 31.74 | lr 5.68e-03 | grad_norm 0.128 | 118807 tok/s | dt 44.13s | ETA 1:00:27
424
+ 2026-05-07 07:36:04,895 | INFO | step 14220/15259 | epoch 1 | loss 3.4256 | ppl 30.74 | lr 5.64e-03 | grad_norm 0.122 | 152252 tok/s | dt 34.44s | ETA 0:59:51
425
+ 2026-05-07 07:36:39,059 | INFO | step 14230/15259 | epoch 1 | loss 3.4247 | ppl 30.71 | lr 5.59e-03 | grad_norm 0.132 | 153466 tok/s | dt 34.16s | ETA 0:59:15
426
+ 2026-05-07 07:37:13,612 | INFO | step 14240/15259 | epoch 1 | loss 3.4440 | ppl 31.31 | lr 5.55e-03 | grad_norm 0.133 | 151733 tok/s | dt 34.55s | ETA 0:58:35
427
+ 2026-05-07 07:37:47,959 | INFO | step 14250/15259 | epoch 1 | loss 3.4323 | ppl 30.95 | lr 5.51e-03 | grad_norm 0.159 | 152645 tok/s | dt 34.35s | ETA 0:57:56
428
+ 2026-05-07 07:38:21,933 | INFO | step 14260/15259 | epoch 1 | loss 3.4846 | ppl 32.61 | lr 5.46e-03 | grad_norm 0.144 | 154318 tok/s | dt 33.97s | ETA 0:57:05
429
+ 2026-05-07 07:38:56,413 | INFO | step 14270/15259 | epoch 1 | loss 3.4778 | ppl 32.39 | lr 5.42e-03 | grad_norm 0.156 | 152058 tok/s | dt 34.48s | ETA 0:56:32
430
+ 2026-05-07 07:39:31,031 | INFO | step 14280/15259 | epoch 1 | loss 3.4491 | ppl 31.47 | lr 5.38e-03 | grad_norm 0.163 | 151448 tok/s | dt 34.62s | ETA 0:56:07
431
+ 2026-05-07 07:40:05,142 | INFO | step 14290/15259 | epoch 1 | loss 3.4691 | ppl 32.11 | lr 5.33e-03 | grad_norm 0.132 | 153703 tok/s | dt 34.11s | ETA 0:55:24
432
+ 2026-05-07 07:40:39,244 | INFO | step 14300/15259 | epoch 1 | loss 3.4538 | ppl 31.62 | lr 5.29e-03 | grad_norm 0.126 | 153741 tok/s | dt 34.10s | ETA 0:54:45
433
+ 2026-05-07 07:40:39,426 | WARNING | Step 14300: val_loss has not improved for 9 consecutive evals (best=3.3677, current=3.4289).
434
+ 2026-05-07 07:40:39,426 | INFO | step 14300 | val_loss 3.4289 | val_ppl 30.84
435
+ 2026-05-07 07:41:23,185 | INFO | step 14310/15259 | epoch 1 | loss 3.4575 | ppl 31.74 | lr 5.25e-03 | grad_norm 0.138 | 119316 tok/s | dt 43.94s | ETA 0:54:12
436
+ 2026-05-07 07:41:57,567 | INFO | step 14320/15259 | epoch 1 | loss 3.4281 | ppl 30.82 | lr 5.20e-03 | grad_norm 0.121 | 152489 tok/s | dt 34.38s | ETA 0:53:36
437
+ 2026-05-07 07:42:32,070 | INFO | step 14330/15259 | epoch 1 | loss 3.4154 | ppl 30.43 | lr 5.16e-03 | grad_norm 0.123 | 151956 tok/s | dt 34.50s | ETA 0:53:00
438
+ 2026-05-07 07:43:06,532 | INFO | step 14340/15259 | epoch 1 | loss 3.4669 | ppl 32.04 | lr 5.12e-03 | grad_norm 0.153 | 152135 tok/s | dt 34.46s | ETA 0:52:32
439
+ 2026-05-07 07:43:40,864 | INFO | step 14350/15259 | epoch 1 | loss 3.4400 | ppl 31.19 | lr 5.07e-03 | grad_norm 0.127 | 152711 tok/s | dt 34.33s | ETA 0:52:02
440
+ 2026-05-07 07:44:15,377 | INFO | step 14360/15259 | epoch 1 | loss 3.4511 | ppl 31.54 | lr 5.03e-03 | grad_norm 0.155 | 151908 tok/s | dt 34.51s | ETA 0:51:35
441
+ 2026-05-07 07:44:50,032 | INFO | step 14370/15259 | epoch 1 | loss 3.4377 | ppl 31.12 | lr 4.98e-03 | grad_norm 0.125 | 151289 tok/s | dt 34.65s | ETA 0:51:06
442
+ 2026-05-07 07:45:24,605 | INFO | step 14380/15259 | epoch 1 | loss 3.4346 | ppl 31.02 | lr 4.94e-03 | grad_norm 0.122 | 151648 tok/s | dt 34.57s | ETA 0:50:33
443
+ 2026-05-07 07:45:58,983 | INFO | step 14390/15259 | epoch 1 | loss 3.4117 | ppl 30.32 | lr 4.90e-03 | grad_norm 0.127 | 152506 tok/s | dt 34.38s | ETA 0:49:57
444
+ 2026-05-07 07:46:33,522 | INFO | step 14400/15259 | epoch 1 | loss 3.3932 | ppl 29.76 | lr 4.85e-03 | grad_norm 0.139 | 151797 tok/s | dt 34.54s | ETA 0:49:26
445
+ 2026-05-07 07:46:33,703 | WARNING | Step 14400: val_loss has not improved for 10 consecutive evals (best=3.3677, current=3.6085).
446
+ 2026-05-07 07:46:33,703 | INFO | step 14400 | val_loss 3.6085 | val_ppl 36.91
447
+ 2026-05-07 07:47:22,529 | INFO | step 14410/15259 | epoch 1 | loss 3.4224 | ppl 30.64 | lr 4.81e-03 | grad_norm 0.120 | 106982 tok/s | dt 49.01s | ETA 0:48:52
448
+ 2026-05-07 07:47:56,912 | INFO | step 14420/15259 | epoch 1 | loss 3.4418 | ppl 31.24 | lr 4.77e-03 | grad_norm 0.167 | 152486 tok/s | dt 34.38s | ETA 0:48:13
449
+ 2026-05-07 07:48:31,387 | INFO | step 14430/15259 | epoch 1 | loss 3.4356 | ppl 31.05 | lr 4.72e-03 | grad_norm 0.118 | 152077 tok/s | dt 34.48s | ETA 0:47:37
450
+ 2026-05-07 07:49:05,885 | INFO | step 14440/15259 | epoch 1 | loss 3.4387 | ppl 31.15 | lr 4.68e-03 | grad_norm 0.123 | 151973 tok/s | dt 34.50s | ETA 0:47:05
451
+ 2026-05-07 07:49:40,222 | INFO | step 14450/15259 | epoch 1 | loss 3.3903 | ppl 29.67 | lr 4.64e-03 | grad_norm 0.118 | 152690 tok/s | dt 34.34s | ETA 0:46:27
452
+ 2026-05-07 07:50:14,730 | INFO | step 14460/15259 | epoch 1 | loss 3.4178 | ppl 30.50 | lr 4.59e-03 | grad_norm 0.116 | 151932 tok/s | dt 34.51s | ETA 0:45:51
453
+ 2026-05-07 07:50:49,033 | INFO | step 14470/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 4.55e-03 | grad_norm 0.124 | 152842 tok/s | dt 34.30s | ETA 0:45:15
454
+ 2026-05-07 07:51:23,539 | INFO | step 14480/15259 | epoch 1 | loss 3.4575 | ppl 31.74 | lr 4.50e-03 | grad_norm 0.115 | 151941 tok/s | dt 34.51s | ETA 0:44:42
455
+ 2026-05-07 07:51:58,262 | INFO | step 14490/15259 | epoch 1 | loss 3.4221 | ppl 30.63 | lr 4.46e-03 | grad_norm 0.115 | 150989 tok/s | dt 34.72s | ETA 0:44:11
456
+ 2026-05-07 07:52:32,586 | INFO | step 14500/15259 | epoch 1 | loss 3.4096 | ppl 30.25 | lr 4.42e-03 | grad_norm 0.127 | 152749 tok/s | dt 34.32s | ETA 0:43:36
457
+ 2026-05-07 07:52:32,770 | WARNING | Step 14500: val_loss has not improved for 11 consecutive evals (best=3.3677, current=3.5552).
458
+ 2026-05-07 07:52:32,770 | INFO | step 14500 | val_loss 3.5552 | val_ppl 35.00
459
+ 2026-05-07 07:53:18,260 | INFO | step 14510/15259 | epoch 1 | loss 3.4169 | ppl 30.48 | lr 4.37e-03 | grad_norm 0.118 | 114789 tok/s | dt 45.67s | ETA 0:43:06
460
+ 2026-05-07 07:53:52,847 | INFO | step 14520/15259 | epoch 1 | loss 3.4140 | ppl 30.39 | lr 4.33e-03 | grad_norm 0.138 | 151585 tok/s | dt 34.59s | ETA 0:42:36
461
+ 2026-05-07 07:54:27,631 | INFO | step 14530/15259 | epoch 1 | loss 3.4367 | ppl 31.08 | lr 4.29e-03 | grad_norm 0.138 | 150728 tok/s | dt 34.78s | ETA 0:42:05
462
+ 2026-05-07 07:55:02,315 | INFO | step 14540/15259 | epoch 1 | loss 3.4139 | ppl 30.38 | lr 4.24e-03 | grad_norm 0.125 | 151163 tok/s | dt 34.68s | ETA 0:41:30
463
+ 2026-05-07 07:55:36,839 | INFO | step 14550/15259 | epoch 1 | loss 3.3966 | ppl 29.86 | lr 4.20e-03 | grad_norm 0.118 | 151862 tok/s | dt 34.52s | ETA 0:40:58
464
+ 2026-05-07 07:56:11,199 | INFO | step 14560/15259 | epoch 1 | loss 3.4175 | ppl 30.49 | lr 4.16e-03 | grad_norm 0.124 | 152585 tok/s | dt 34.36s | ETA 0:40:17
465
+ 2026-05-07 07:56:45,540 | INFO | step 14570/15259 | epoch 1 | loss 3.4059 | ppl 30.14 | lr 4.11e-03 | grad_norm 0.118 | 152674 tok/s | dt 34.34s | ETA 0:39:39
466
+ 2026-05-07 07:57:19,886 | INFO | step 14580/15259 | epoch 1 | loss 3.3942 | ppl 29.79 | lr 4.07e-03 | grad_norm 0.151 | 152645 tok/s | dt 34.35s | ETA 0:38:59
467
+ 2026-05-07 07:57:54,285 | INFO | step 14590/15259 | epoch 1 | loss 3.3677 | ppl 29.01 | lr 4.03e-03 | grad_norm 0.122 | 152415 tok/s | dt 34.40s | ETA 0:38:20
468
+ 2026-05-07 07:58:28,590 | INFO | step 14600/15259 | epoch 1 | loss 3.4266 | ppl 30.77 | lr 3.98e-03 | grad_norm 0.120 | 152832 tok/s | dt 34.30s | ETA 0:37:43
469
+ 2026-05-07 07:58:28,772 | INFO | step 14600 | val_loss 3.3107 | val_ppl 27.41 ** New best validation loss! **
470
+ 2026-05-07 07:58:43,858 | WARNING | New best checkpoint at step 14600 | val_loss=3.3107 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
471
+ 2026-05-07 07:59:18,224 | INFO | step 14610/15259 | epoch 1 | loss 3.4512 | ppl 31.54 | lr 3.94e-03 | grad_norm 0.112 | 105630 tok/s | dt 49.63s | ETA 0:37:09
472
+ 2026-05-07 07:59:52,454 | INFO | step 14620/15259 | epoch 1 | loss 3.4053 | ppl 30.12 | lr 3.89e-03 | grad_norm 0.112 | 153166 tok/s | dt 34.23s | ETA 0:36:33
473
+ 2026-05-07 08:00:26,899 | INFO | step 14630/15259 | epoch 1 | loss 3.4238 | ppl 30.68 | lr 3.85e-03 | grad_norm 0.125 | 152211 tok/s | dt 34.44s | ETA 0:36:00
474
+ 2026-05-07 08:01:01,101 | INFO | step 14640/15259 | epoch 1 | loss 3.4174 | ppl 30.49 | lr 3.81e-03 | grad_norm 0.134 | 153292 tok/s | dt 34.20s | ETA 0:35:23
475
+ 2026-05-07 08:01:35,540 | INFO | step 14650/15259 | epoch 1 | loss 3.3915 | ppl 29.71 | lr 3.76e-03 | grad_norm 0.113 | 152236 tok/s | dt 34.44s | ETA 0:34:51
476
+ 2026-05-07 08:02:10,008 | INFO | step 14660/15259 | epoch 1 | loss 3.3931 | ppl 29.76 | lr 3.72e-03 | grad_norm 0.113 | 152106 tok/s | dt 34.47s | ETA 0:34:17
477
+ 2026-05-07 08:02:44,033 | INFO | step 14670/15259 | epoch 1 | loss 3.4063 | ppl 30.15 | lr 3.68e-03 | grad_norm 0.113 | 154091 tok/s | dt 34.02s | ETA 0:33:41
478
+ 2026-05-07 08:03:18,437 | INFO | step 14680/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 3.63e-03 | grad_norm 0.115 | 152393 tok/s | dt 34.40s | ETA 0:33:06
479
+ 2026-05-07 08:03:53,233 | INFO | step 14690/15259 | epoch 1 | loss 3.3709 | ppl 29.10 | lr 3.59e-03 | grad_norm 0.112 | 150673 tok/s | dt 34.80s | ETA 0:32:38
480
+ 2026-05-07 08:04:27,923 | INFO | step 14700/15259 | epoch 1 | loss 3.4137 | ppl 30.38 | lr 3.55e-03 | grad_norm 0.119 | 151135 tok/s | dt 34.69s | ETA 0:32:07
481
+ 2026-05-07 08:04:28,107 | INFO | step 14700 | val_loss 3.3193 | val_ppl 27.64
482
+ 2026-05-07 08:05:13,037 | INFO | step 14710/15259 | epoch 1 | loss 3.4029 | ppl 30.05 | lr 3.50e-03 | grad_norm 0.108 | 116214 tok/s | dt 45.11s | ETA 0:31:33
483
+ 2026-05-07 08:05:47,407 | INFO | step 14720/15259 | epoch 1 | loss 3.4334 | ppl 30.98 | lr 3.46e-03 | grad_norm 0.110 | 152544 tok/s | dt 34.37s | ETA 0:31:02
484
+ 2026-05-07 08:06:21,500 | INFO | step 14730/15259 | epoch 1 | loss 3.4365 | ppl 31.08 | lr 3.42e-03 | grad_norm 0.111 | 153781 tok/s | dt 34.09s | ETA 0:30:24
485
+ 2026-05-07 08:06:55,496 | INFO | step 14740/15259 | epoch 1 | loss 3.4652 | ppl 31.98 | lr 3.37e-03 | grad_norm 0.101 | 154222 tok/s | dt 34.00s | ETA 0:29:41
486
+ 2026-05-07 08:07:29,765 | INFO | step 14750/15259 | epoch 1 | loss 3.4456 | ppl 31.36 | lr 3.33e-03 | grad_norm 0.121 | 152991 tok/s | dt 34.27s | ETA 0:29:03
487
+ 2026-05-07 08:08:03,717 | INFO | step 14760/15259 | epoch 1 | loss 3.4012 | ppl 30.00 | lr 3.28e-03 | grad_norm 0.112 | 154420 tok/s | dt 33.95s | ETA 0:28:23
488
+ 2026-05-07 08:08:37,815 | INFO | step 14770/15259 | epoch 1 | loss 3.3806 | ppl 29.39 | lr 3.24e-03 | grad_norm 0.114 | 153758 tok/s | dt 34.10s | ETA 0:27:46
489
+ 2026-05-07 08:09:12,192 | INFO | step 14780/15259 | epoch 1 | loss 3.4052 | ppl 30.12 | lr 3.20e-03 | grad_norm 0.102 | 152513 tok/s | dt 34.38s | ETA 0:27:15
490
+ 2026-05-07 08:09:46,153 | INFO | step 14790/15259 | epoch 1 | loss 3.4183 | ppl 30.52 | lr 3.15e-03 | grad_norm 0.107 | 154377 tok/s | dt 33.96s | ETA 0:26:40
491
+ 2026-05-07 08:10:20,319 | INFO | step 14800/15259 | epoch 1 | loss 3.4210 | ppl 30.60 | lr 3.11e-03 | grad_norm 0.110 | 153453 tok/s | dt 34.17s | ETA 0:26:05
492
+ 2026-05-07 08:10:20,501 | INFO | step 14800 | val_loss 3.3248 | val_ppl 27.79
493
+ 2026-05-07 08:11:04,731 | INFO | step 14810/15259 | epoch 1 | loss 3.4488 | ppl 31.46 | lr 3.07e-03 | grad_norm 0.123 | 118052 tok/s | dt 44.41s | ETA 0:25:33
494
+ 2026-05-07 08:11:38,799 | INFO | step 14820/15259 | epoch 1 | loss 3.4084 | ppl 30.22 | lr 3.02e-03 | grad_norm 0.109 | 153894 tok/s | dt 34.07s | ETA 0:24:59
495
+ 2026-05-07 08:12:12,904 | INFO | step 14830/15259 | epoch 1 | loss 3.4229 | ppl 30.66 | lr 2.98e-03 | grad_norm 0.111 | 153730 tok/s | dt 34.10s | ETA 0:24:22
496
+ 2026-05-07 08:12:46,708 | INFO | step 14840/15259 | epoch 1 | loss 3.4145 | ppl 30.40 | lr 2.94e-03 | grad_norm 0.106 | 155096 tok/s | dt 33.80s | ETA 0:23:47
497
+ 2026-05-07 08:13:20,613 | INFO | step 14850/15259 | epoch 1 | loss 3.3855 | ppl 29.53 | lr 2.89e-03 | grad_norm 0.112 | 154632 tok/s | dt 33.91s | ETA 0:23:11
498
+ 2026-05-07 08:13:54,721 | INFO | step 14860/15259 | epoch 1 | loss 3.3912 | ppl 29.70 | lr 2.85e-03 | grad_norm 0.116 | 153715 tok/s | dt 34.11s | ETA 0:22:36
499
+ 2026-05-07 08:14:29,145 | INFO | step 14870/15259 | epoch 1 | loss 3.4288 | ppl 30.84 | lr 2.80e-03 | grad_norm 0.106 | 152306 tok/s | dt 34.42s | ETA 0:22:05
500
+ 2026-05-07 08:15:03,049 | INFO | step 14880/15259 | epoch 1 | loss 3.4119 | ppl 30.32 | lr 2.76e-03 | grad_norm 0.111 | 154637 tok/s | dt 33.90s | ETA 0:21:29
501
+ 2026-05-07 08:15:37,097 | INFO | step 14890/15259 | epoch 1 | loss 3.4453 | ppl 31.35 | lr 2.72e-03 | grad_norm 0.111 | 153984 tok/s | dt 34.05s | ETA 0:20:57
502
+ 2026-05-07 08:16:11,114 | INFO | step 14900/15259 | epoch 1 | loss 3.3889 | ppl 29.63 | lr 2.67e-03 | grad_norm 0.104 | 154124 tok/s | dt 34.02s | ETA 0:20:24
503
+ 2026-05-07 08:16:11,295 | INFO | step 14900 | val_loss 3.4178 | val_ppl 30.50
504
+ 2026-05-07 08:16:54,196 | INFO | step 14910/15259 | epoch 1 | loss 3.3938 | ppl 29.78 | lr 2.63e-03 | grad_norm 0.097 | 121696 tok/s | dt 43.08s | ETA 0:19:50
505
+ 2026-05-07 08:17:28,091 | INFO | step 14920/15259 | epoch 1 | loss 3.3948 | ppl 29.81 | lr 2.59e-03 | grad_norm 0.109 | 154682 tok/s | dt 33.89s | ETA 0:19:13
506
+ 2026-05-07 08:18:01,918 | INFO | step 14930/15259 | epoch 1 | loss 3.3871 | ppl 29.58 | lr 2.54e-03 | grad_norm 0.106 | 154991 tok/s | dt 33.83s | ETA 0:18:38
507
+ 2026-05-07 08:18:35,817 | INFO | step 14940/15259 | epoch 1 | loss 3.4039 | ppl 30.08 | lr 2.50e-03 | grad_norm 0.101 | 154661 tok/s | dt 33.90s | ETA 0:18:03
508
+ 2026-05-07 08:19:09,692 | INFO | step 14950/15259 | epoch 1 | loss 3.3922 | ppl 29.73 | lr 2.46e-03 | grad_norm 0.110 | 154772 tok/s | dt 33.87s | ETA 0:17:28
509
+ 2026-05-07 08:19:43,480 | INFO | step 14960/15259 | epoch 1 | loss 3.3891 | ppl 29.64 | lr 2.41e-03 | grad_norm 0.095 | 155168 tok/s | dt 33.79s | ETA 0:16:52
510
+ 2026-05-07 08:20:17,607 | INFO | step 14970/15259 | epoch 1 | loss 3.3477 | ppl 28.44 | lr 2.37e-03 | grad_norm 0.111 | 153630 tok/s | dt 34.13s | ETA 0:16:19
511
+ 2026-05-07 08:20:51,571 | INFO | step 14980/15259 | epoch 1 | loss 3.3929 | ppl 29.75 | lr 2.33e-03 | grad_norm 0.106 | 154363 tok/s | dt 33.96s | ETA 0:15:46
512
+ 2026-05-07 08:21:25,579 | INFO | step 14990/15259 | epoch 1 | loss 3.3760 | ppl 29.25 | lr 2.28e-03 | grad_norm 0.097 | 154167 tok/s | dt 34.01s | ETA 0:15:13
513
+ 2026-05-07 08:21:59,464 | INFO | step 15000/15259 | epoch 1 | loss 3.4321 | ppl 30.94 | lr 2.24e-03 | grad_norm 0.105 | 154725 tok/s | dt 33.89s | ETA 0:14:39
514
+ 2026-05-07 08:21:59,647 | INFO | step 15000 | val_loss 3.4002 | val_ppl 29.97
515
+ 2026-05-07 08:22:44,068 | INFO | step 15010/15259 | epoch 1 | loss 3.4614 | ppl 31.86 | lr 2.19e-03 | grad_norm 0.109 | 117543 tok/s | dt 44.60s | ETA 0:14:08
516
+ 2026-05-07 08:23:18,058 | INFO | step 15020/15259 | epoch 1 | loss 3.4082 | ppl 30.21 | lr 2.15e-03 | grad_norm 0.109 | 154250 tok/s | dt 33.99s | ETA 0:13:33
517
+ 2026-05-07 08:23:52,139 | INFO | step 15030/15259 | epoch 1 | loss 3.4596 | ppl 31.81 | lr 2.11e-03 | grad_norm 0.105 | 153836 tok/s | dt 34.08s | ETA 0:12:59
518
+ 2026-05-07 08:24:26,353 | INFO | step 15040/15259 | epoch 1 | loss 3.3540 | ppl 28.62 | lr 2.06e-03 | grad_norm 0.101 | 153237 tok/s | dt 34.21s | ETA 0:12:26
519
+ 2026-05-07 08:25:00,515 | INFO | step 15050/15259 | epoch 1 | loss 3.3611 | ppl 28.82 | lr 2.02e-03 | grad_norm 0.099 | 153471 tok/s | dt 34.16s | ETA 0:11:53
520
+ 2026-05-07 08:25:34,697 | INFO | step 15060/15259 | epoch 1 | loss 3.3791 | ppl 29.34 | lr 1.98e-03 | grad_norm 0.102 | 153378 tok/s | dt 34.18s | ETA 0:11:19
521
+ 2026-05-07 08:26:08,880 | INFO | step 15070/15259 | epoch 1 | loss 3.3400 | ppl 28.22 | lr 1.93e-03 | grad_norm 0.101 | 153378 tok/s | dt 34.18s | ETA 0:10:45
522
+ 2026-05-07 08:26:43,065 | INFO | step 15080/15259 | epoch 1 | loss 3.4014 | ppl 30.01 | lr 1.89e-03 | grad_norm 0.107 | 153371 tok/s | dt 34.18s | ETA 0:10:11
523
+ 2026-05-07 08:27:17,167 | INFO | step 15090/15259 | epoch 1 | loss 3.3326 | ppl 28.01 | lr 1.85e-03 | grad_norm 0.113 | 153738 tok/s | dt 34.10s | ETA 0:09:37
524
+ 2026-05-07 08:27:51,205 | INFO | step 15100/15259 | epoch 1 | loss 3.3760 | ppl 29.25 | lr 1.80e-03 | grad_norm 0.104 | 154034 tok/s | dt 34.04s | ETA 0:09:02
525
+ 2026-05-07 08:27:51,392 | WARNING | Step 15100: val_loss has not improved for 5 consecutive evals (best=3.3107, current=3.3793).
526
+ 2026-05-07 08:27:51,392 | INFO | step 15100 | val_loss 3.3793 | val_ppl 29.35
527
+ 2026-05-07 08:28:39,758 | INFO | step 15110/15259 | epoch 1 | loss 3.4342 | ppl 31.01 | lr 1.76e-03 | grad_norm 0.100 | 107982 tok/s | dt 48.55s | ETA 0:08:28
528
+ 2026-05-07 08:29:13,780 | INFO | step 15120/15259 | epoch 1 | loss 3.4107 | ppl 30.29 | lr 1.71e-03 | grad_norm 0.104 | 154103 tok/s | dt 34.02s | ETA 0:07:54
529
+ 2026-05-07 08:29:47,896 | INFO | step 15130/15259 | epoch 1 | loss 3.3634 | ppl 28.89 | lr 1.67e-03 | grad_norm 0.103 | 153676 tok/s | dt 34.12s | ETA 0:07:19
530
+ 2026-05-07 08:30:22,102 | INFO | step 15140/15259 | epoch 1 | loss 3.3809 | ppl 29.40 | lr 1.63e-03 | grad_norm 0.102 | 153277 tok/s | dt 34.21s | ETA 0:06:45
531
+ 2026-05-07 08:30:56,108 | INFO | step 15150/15259 | epoch 1 | loss 3.3896 | ppl 29.65 | lr 1.58e-03 | grad_norm 0.100 | 154173 tok/s | dt 34.01s | ETA 0:06:11
532
+ 2026-05-07 08:31:30,244 | INFO | step 15160/15259 | epoch 1 | loss 3.3846 | ppl 29.51 | lr 1.54e-03 | grad_norm 0.110 | 153587 tok/s | dt 34.14s | ETA 0:05:37
533
+ 2026-05-07 08:32:04,154 | INFO | step 15170/15259 | epoch 1 | loss 3.4470 | ppl 31.41 | lr 1.50e-03 | grad_norm 0.101 | 154615 tok/s | dt 33.91s | ETA 0:05:03
534
+ 2026-05-07 08:32:38,060 | INFO | step 15180/15259 | epoch 1 | loss 3.3958 | ppl 29.84 | lr 1.45e-03 | grad_norm 0.101 | 154628 tok/s | dt 33.91s | ETA 0:04:28
535
+ 2026-05-07 08:33:11,940 | INFO | step 15190/15259 | epoch 1 | loss 3.3498 | ppl 28.50 | lr 1.41e-03 | grad_norm 0.100 | 154749 tok/s | dt 33.88s | ETA 0:03:54
536
+ 2026-05-07 08:33:45,916 | INFO | step 15200/15259 | epoch 1 | loss 3.4027 | ppl 30.05 | lr 1.37e-03 | grad_norm 0.105 | 154312 tok/s | dt 33.98s | ETA 0:03:20
537
+ 2026-05-07 08:33:46,115 | INFO | step 15200 | val_loss 3.2405 | val_ppl 25.55 ** New best validation loss! **
538
+ 2026-05-07 08:34:01,475 | WARNING | New best checkpoint at step 15200 | val_loss=3.2405 | saved to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/checkpoints/best_ckpt.pt
539
+ 2026-05-07 08:34:35,761 | INFO | step 15210/15259 | epoch 1 | loss 3.3548 | ppl 28.64 | lr 1.32e-03 | grad_norm 0.106 | 105182 tok/s | dt 49.85s | ETA 0:02:46
540
+ 2026-05-07 08:35:09,998 | INFO | step 15220/15259 | epoch 1 | loss 3.3820 | ppl 29.43 | lr 1.28e-03 | grad_norm 0.100 | 153139 tok/s | dt 34.24s | ETA 0:02:12
541
+ 2026-05-07 08:35:44,072 | INFO | step 15230/15259 | epoch 1 | loss 3.3916 | ppl 29.71 | lr 1.24e-03 | grad_norm 0.102 | 153865 tok/s | dt 34.07s | ETA 0:01:38
542
+ 2026-05-07 08:36:17,977 | INFO | step 15240/15259 | epoch 1 | loss 3.3928 | ppl 29.75 | lr 1.19e-03 | grad_norm 0.101 | 154637 tok/s | dt 33.90s | ETA 0:01:04
543
+ 2026-05-07 08:36:51,958 | INFO | step 15250/15259 | epoch 1 | loss 3.3689 | ppl 29.05 | lr 1.15e-03 | grad_norm 0.095 | 154289 tok/s | dt 33.98s | ETA 0:00:30
544
+ 2026-05-07 08:37:32,797 | CRITICAL | Pretraining complete -- run: final_c2_muon_bs512_lr12_seed3_mix3to1 | best val loss: 3.2405 | total time: 2:03:06 | avg 153871 tok/s
545
+ 2026-05-07 08:37:32,797 | INFO | Pretraining complete. Best val loss: 3.2405
546
+ 2026-05-07 08:37:34,088 | INFO | Saved metrics plot to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/metrics.png
547
+ 2026-05-07 08:37:34,088 | INFO | Saved results doc to artifacts/final_c2_muon/final_c2_muon_bs512_lr12_seed3_mix3to1/results.md