huiting tang committed
Commit 44113fa · verified · 1 Parent(s): ba9bb9a

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -37,3 +37,4 @@ final_baseline_modern_6l448/metrics.png filter=lfs diff=lfs merge=lfs -text
  final_c1_14l320_standard/metrics.png filter=lfs diff=lfs merge=lfs -text
  final_baseline_modern_6l448/.ipynb_checkpoints/metrics-checkpoint.png filter=lfs diff=lfs merge=lfs -text
  final/final_c2_18l320_standard/metrics.png filter=lfs diff=lfs merge=lfs -text
+ metrics.png filter=lfs diff=lfs merge=lfs -text
architecture.txt ADDED
@@ -0,0 +1,334 @@
+ === Raw Model ===
+ GPT(
+   (token_embedding): FactorizedTokenEmbedding(
+     (E): Embedding(50304, 128)
+     (P_in): Linear(in_features=128, out_features=384, bias=False)
+     (P_out): Linear(in_features=384, out_features=128, bias=False)
+   )
+   (transformer): ModuleDict(
+     (drop): Dropout(p=0.0, inplace=False)
+     (h): ModuleList(
+       (0-19): 20 x Block(
+         (ln_1): RMSNorm()
+         (attn): CausalSelfAttention(
+           (rotary): RotaryEmbedding()
+           (q_proj): Linear(in_features=384, out_features=384, bias=False)
+           (k_proj): Linear(in_features=384, out_features=128, bias=False)
+           (v_proj): Linear(in_features=384, out_features=128, bias=False)
+           (c_proj): Linear(in_features=384, out_features=384, bias=False)
+           (resid_dropout): Dropout(p=0.0, inplace=False)
+         )
+         (ln_2): RMSNorm()
+         (mlp): MLP(
+           (c_fc): Linear(in_features=384, out_features=2048, bias=False)
+           (c_proj): Linear(in_features=1024, out_features=384, bias=False)
+           (dropout): Dropout(p=0.0, inplace=False)
+         )
+       )
+     )
+     (ln_f): RMSNorm()
+   )
+ )
+
+ === Forward Summary (torchinfo, uncompiled model) ===
+ ====================================================================================================
+ Layer (type:depth-idx) Output Shape Param #
+ ====================================================================================================
+ GPT [1, 1, 50304] --
+ ├─FactorizedTokenEmbedding: 1-3 -- (recursive)
+ │ └─Embedding: 2-1 [1, 1024, 128] 6,438,912
+ │ └─Linear: 2-2 [1, 1024, 384] 49,152
+ ├─ModuleDict: 1-2 -- --
+ │ └─Dropout: 2-3 [1, 1024, 384] --
+ │ └─ModuleList: 2-4 -- --
+ │ │ └─Block: 3-1 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-1 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-2 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-1 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-2 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-3 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-4 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-5 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-6 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-3 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-4 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-7 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-8 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-9 [1, 1024, 384] --
+ │ │ └─Block: 3-2 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-5 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-6 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-10 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-11 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-12 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-13 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-14 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-15 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-7 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-8 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-16 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-17 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-18 [1, 1024, 384] --
+ │ │ └─Block: 3-3 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-9 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-10 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-19 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-20 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-21 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-22 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-23 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-24 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-11 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-12 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-25 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-26 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-27 [1, 1024, 384] --
+ │ │ └─Block: 3-4 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-13 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-14 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-28 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-29 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-30 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-31 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-32 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-33 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-15 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-16 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-34 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-35 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-36 [1, 1024, 384] --
+ │ │ └─Block: 3-5 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-17 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-18 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-37 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-38 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-39 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-40 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-41 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-42 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-19 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-20 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-43 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-44 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-45 [1, 1024, 384] --
+ │ │ └─Block: 3-6 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-21 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-22 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-46 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-47 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-48 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-49 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-50 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-51 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-23 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-24 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-52 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-53 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-54 [1, 1024, 384] --
+ │ │ └─Block: 3-7 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-25 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-26 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-55 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-56 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-57 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-58 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-59 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-60 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-27 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-28 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-61 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-62 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-63 [1, 1024, 384] --
+ │ │ └─Block: 3-8 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-29 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-30 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-64 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-65 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-66 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-67 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-68 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-69 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-31 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-32 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-70 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-71 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-72 [1, 1024, 384] --
+ │ │ └─Block: 3-9 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-33 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-34 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-73 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-74 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-75 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-76 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-77 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-78 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-35 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-36 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-79 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-80 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-81 [1, 1024, 384] --
+ │ │ └─Block: 3-10 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-37 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-38 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-82 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-83 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-84 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-85 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-86 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-87 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-39 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-40 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-88 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-89 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-90 [1, 1024, 384] --
+ │ │ └─Block: 3-11 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-41 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-42 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-91 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-92 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-93 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-94 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-95 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-96 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-43 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-44 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-97 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-98 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-99 [1, 1024, 384] --
+ │ │ └─Block: 3-12 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-45 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-46 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-100 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-101 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-102 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-103 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-104 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-105 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-47 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-48 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-106 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-107 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-108 [1, 1024, 384] --
+ │ │ └─Block: 3-13 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-49 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-50 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-109 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-110 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-111 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-112 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-113 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-114 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-51 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-52 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-115 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-116 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-117 [1, 1024, 384] --
+ │ │ └─Block: 3-14 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-53 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-54 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-118 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-119 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-120 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-121 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-122 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-123 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-55 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-56 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-124 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-125 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-126 [1, 1024, 384] --
+ │ │ └─Block: 3-15 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-57 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-58 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-127 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-128 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-129 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-130 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-131 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-132 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-59 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-60 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-133 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-134 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-135 [1, 1024, 384] --
+ │ │ └─Block: 3-16 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-61 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-62 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-136 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-137 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-138 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-139 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-140 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-141 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-63 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-64 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-142 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-143 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-144 [1, 1024, 384] --
+ │ │ └─Block: 3-17 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-65 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-66 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-145 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-146 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-147 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-148 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-149 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-150 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-67 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-68 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-151 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-152 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-153 [1, 1024, 384] --
+ │ │ └─Block: 3-18 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-69 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-70 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-154 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-155 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-156 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-157 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-158 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-159 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-71 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-72 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-160 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-161 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-162 [1, 1024, 384] --
+ │ │ └─Block: 3-19 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-73 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-74 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-163 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-164 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-165 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-166 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-167 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-168 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-75 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-76 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-169 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-170 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-171 [1, 1024, 384] --
+ │ │ └─Block: 3-20 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-77 [1, 1024, 384] 384
+ │ │ │ └─CausalSelfAttention: 4-78 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-172 [1, 1024, 384] 147,456
+ │ │ │ │ └─Linear: 5-173 [1, 1024, 128] 49,152
+ │ │ │ │ └─Linear: 5-174 [1, 1024, 128] 49,152
+ │ │ │ │ └─RotaryEmbedding: 5-175 [1, 1, 1024, 64] --
+ │ │ │ │ └─Linear: 5-176 [1, 1024, 384] 147,456
+ │ │ │ │ └─Dropout: 5-177 [1, 1024, 384] --
+ │ │ │ └─RMSNorm: 4-79 [1, 1024, 384] 384
+ │ │ │ └─MLP: 4-80 [1, 1024, 384] --
+ │ │ │ │ └─Linear: 5-178 [1, 1024, 2048] 786,432
+ │ │ │ │ └─Linear: 5-179 [1, 1024, 384] 393,216
+ │ │ │ │ └─Dropout: 5-180 [1, 1024, 384] --
+ │ └─RMSNorm: 2-5 [1, 1024, 384] 384
+ ├─FactorizedTokenEmbedding: 1-3 -- (recursive)
+ │ └─Linear: 2-6 [1, 1, 128] 49,152
+ ====================================================================================================
+
+ === Parameter Counts (unique tensors) ===
+ Total params: 38,010,240
+ Trainable params: 38,010,240
+ Weight tying (wte = lm_head): True
+ Embedding mode: factorized tied token embedding
+ Note: module-level torchinfo totals may double-count the tied LM head; use the unique counts above.
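
The unique-parameter total in architecture.txt can be reproduced from the layer shapes listed in the module dump. The sketch below is plain arithmetic, not code from this repository: the helper `linear_params` and all variable names are illustrative, and it assumes each RMSNorm holds a single weight vector of width `n_embd`, which is consistent with the 384 parameters per norm in the torchinfo table.

```python
# Illustrative arithmetic only; names are hypothetical, not taken from the repo.
def linear_params(in_features: int, out_features: int) -> int:
    """Weight count of a bias-free Linear layer."""
    return in_features * out_features


vocab_size, embedding_dim, n_embd, n_layers = 50304, 128, 384, 20
kv_dim = 128        # k_proj / v_proj output width in the module dump
mlp_fc_out = 2048   # c_fc output width
mlp_proj_in = 1024  # c_proj input width

# Factorized token embedding: E plus the two projections P_in and P_out.
embedding = vocab_size * embedding_dim
p_in = linear_params(embedding_dim, n_embd)
p_out = linear_params(n_embd, embedding_dim)

# One transformer block: q/k/v/output projections, the two MLP layers, two norms.
attention = (
    linear_params(n_embd, n_embd)          # q_proj
    + 2 * linear_params(n_embd, kv_dim)    # k_proj and v_proj
    + linear_params(n_embd, n_embd)        # c_proj
)
mlp = linear_params(n_embd, mlp_fc_out) + linear_params(mlp_proj_in, n_embd)
norms = 2 * n_embd                         # assumption: one weight vector per RMSNorm
block = attention + mlp + norms

# 20 blocks, the final norm (ln_f), and the embedding pieces counted once.
total = embedding + p_in + p_out + n_layers * block + n_embd
print(f"{block:,} per block, {total:,} total")
# prints "1,573,632 per block, 38,010,240 total"
```

The total matches the reported 38,010,240 only when the tied `E` matrix is counted once, which is what the note about double-counting in the torchinfo totals refers to.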
config_snapshot.json ADDED
@@ -0,0 +1,116 @@
+ {
+   "run": {
+     "name": "final_c4_20l384_factorized",
+     "artifacts_root": "artifacts/final_c4",
+     "resume": true,
+     "deterministic": false
+   },
+   "distributed": {
+     "enabled": true,
+     "backend": "nccl"
+   },
+   "preprocessing": {
+     "data_dir": "data",
+     "processed_dir": "data/processed_OWT",
+     "log_dir": "logs/preprocessing",
+     "train_split": 0.9,
+     "dataset_name": "openwebtext",
+     "dataset_config_name": null,
+     "dataset_split": "train",
+     "dataset_text_column": "text",
+     "dataset_repo_id": "huiting123/processedOWT",
+     "num_proc": 4,
+     "tokenization_num_proc": 0,
+     "tokenization_batch_size": 1000,
+     "tokenization_chunk_size": 100000,
+     "shard_write_batch_size": 5000,
+     "seed": 42,
+     "subset_size": 0,
+     "raw_data_path": null,
+     "test_data_path": null,
+     "skip_language_filter": false,
+     "skip_repetition_filter": false,
+     "skip_quality_filter": false,
+     "min_words": 100,
+     "max_words": 10000,
+     "max_non_ascii": 0.3,
+     "min_line_uniqueness": 0.7,
+     "min_sentence_uniqueness": 0.8,
+     "max_train_tokens": 0
+   },
+   "model": {
+     "vocab_size": 50304,
+     "n_layers": 20,
+     "n_heads": 6,
+     "n_kv_heads": 2,
+     "n_embd": 384,
+     "embedding_dim": 128,
+     "tie_embeddings": true,
+     "context_len": 1024,
+     "dropout": 0.0,
+     "bias": false,
+     "norm_type": "rmsnorm",
+     "norm_eps": 1e-05,
+     "positional_embedding": "rope",
+     "rope_theta": 10000.0,
+     "rope_fraction": 1.0,
+     "mlp_type": "swiglu",
+     "mlp_hidden_mult": 4.0,
+     "mlp_hidden_dim": 1024,
+     "qk_norm": false,
+     "block_style": "sequential"
+   },
+   "training": {
+     "seed": 0,
+     "learning_rate": 0.0012,
+     "min_lr": 0.00012,
+     "weight_decay": 0.03,
+     "beta1": 0.9,
+     "beta2": 0.95,
+     "grad_clip": 1.0,
+     "max_iters": 11586,
+     "warmup_steps": 116,
+     "lr_schedule": "wsd",
+     "wsd_stable_frac": 0.85,
+     "batch_size": 4,
+     "gradient_accumulation_steps": 16,
+     "dtype": "float16",
+     "device": "cuda",
+     "eval_step_interval": 500,
+     "eval_batches": 20,
+     "log_interval": 10,
+     "max_checkpoints": 5
+   },
+   "inference": {
+     "checkpoint": null,
+     "prompt": "",
+     "max_tokens": 100,
+     "temperature": 1.0,
+     "seed": 0,
+     "device": "auto",
+     "leaderboard": false
+   },
+   "post_training": {
+     "base_checkpoint": null,
+     "learning_rate": 1e-05,
+     "max_iters": 1000,
+     "checkpoint_dir": "checkpoints/post",
+     "log_dir": "logs/post"
+   },
+   "evaluation": {
+     "checkpoint": null,
+     "batch_size": 4,
+     "device": "auto",
+     "log_dir": "logs/evaluation"
+   },
+   "notifications": {
+     "enabled": false,
+     "smtp_host": "smtp.gmail.com",
+     "smtp_port": 587,
+     "smtp_user": "",
+     "to_addresses": [],
+     "cooldown_minutes": 5,
+     "periodic_status_hours": 4.0,
+     "disk_min_gb": 5.0
+   }
+ }
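
The `training` block in config_snapshot.json implies a per-process token budget per optimizer step and a warmup-stable-decay (`wsd`) learning-rate curve. The sketch below illustrates both under stated assumptions; it is not the repository's scheduler. The function `wsd_lr` is hypothetical, the linear warmup and linear decay shapes are assumed, and `wsd_stable_frac` is assumed to mark the fraction of `max_iters` after which decay begins. The number of processes is not recorded in the config, so the token count is per process only.

```python
# Values copied from the "training" and "model" sections of config_snapshot.json.
BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 16
CONTEXT_LEN = 1024

# Tokens consumed per optimizer step by ONE process (world size is not in the config).
tokens_per_step_per_process = BATCH_SIZE * GRAD_ACCUM_STEPS * CONTEXT_LEN


def wsd_lr(
    step: int,
    max_lr: float = 0.0012,
    min_lr: float = 0.00012,
    warmup_steps: int = 116,
    max_iters: int = 11586,
    stable_frac: float = 0.85,
) -> float:
    """Hypothetical warmup-stable-decay schedule (assumed shapes, see lead-in)."""
    decay_start = int(stable_frac * max_iters)
    if step < warmup_steps:
        # Assumed linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return max_lr
    # Assumed linear decay from max_lr to min_lr over the remaining steps.
    progress = min(1.0, (step - decay_start) / max(1, max_iters - decay_start))
    return max_lr + (min_lr - max_lr) * progress


print(tokens_per_step_per_process)  # 65536
for step in (0, 116, 5000, 11586):
    print(step, wsd_lr(step))
```

Under these assumptions the decay phase would begin at step 9848, so every evaluation up to step 9500 would fall in the warmup or stable phase.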
eval_metrics.jsonl ADDED
@@ -0,0 +1,23 @@
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 500, "epoch": 0, "val_loss": 6.271504926681518, "val_ppl": 529.273296333478, "is_best": true, "timestamp": "2026-05-04T17:27:01.886823"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 1000, "epoch": 0, "val_loss": 5.651372718811035, "val_ppl": 284.68198604277603, "is_best": true, "timestamp": "2026-05-04T17:31:09.125462"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 1500, "epoch": 0, "val_loss": 4.958203363418579, "val_ppl": 142.33783666870428, "is_best": true, "timestamp": "2026-05-04T17:35:16.000868"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 2000, "epoch": 0, "val_loss": 4.731179404258728, "val_ppl": 113.42926244512242, "is_best": true, "timestamp": "2026-05-04T17:39:23.507721"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 2500, "epoch": 0, "val_loss": 4.649809455871582, "val_ppl": 104.5650594206581, "is_best": true, "timestamp": "2026-05-04T17:43:30.119838"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 3000, "epoch": 0, "val_loss": 4.432038378715515, "val_ppl": 84.10267540960363, "is_best": true, "timestamp": "2026-05-04T17:47:36.176655"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 3500, "epoch": 0, "val_loss": 4.374273097515106, "val_ppl": 79.38211551790492, "is_best": true, "timestamp": "2026-05-04T17:51:41.894337"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 4000, "epoch": 0, "val_loss": 4.343928050994873, "val_ppl": 77.00944302285464, "is_best": true, "timestamp": "2026-05-04T17:55:48.866524"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 4500, "epoch": 0, "val_loss": 4.245097875595093, "val_ppl": 69.76258786456656, "is_best": true, "timestamp": "2026-05-04T17:59:54.744666"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 5000, "epoch": 0, "val_loss": 4.193927860260009, "val_ppl": 66.28262922741254, "is_best": true, "timestamp": "2026-05-04T18:04:01.685011"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 5500, "epoch": 0, "val_loss": 4.268809962272644, "val_ppl": 71.4365727985318, "is_best": false, "timestamp": "2026-05-04T18:08:10.634131"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 6000, "epoch": 0, "val_loss": 4.158704721927643, "val_ppl": 63.988585886299255, "is_best": true, "timestamp": "2026-05-04T18:12:17.883599"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 6500, "epoch": 0, "val_loss": 4.159478271007538, "val_ppl": 64.03810334765954, "is_best": false, "timestamp": "2026-05-04T18:16:24.868373"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 7000, "epoch": 0, "val_loss": 4.037256824970245, "val_ppl": 56.670671814858714, "is_best": true, "timestamp": "2026-05-04T18:20:30.545643"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 7500, "epoch": 0, "val_loss": 4.1698023796081545, "val_ppl": 64.70266427800867, "is_best": false, "timestamp": "2026-05-04T18:24:38.429921"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 8000, "epoch": 0, "val_loss": 4.195084941387177, "val_ppl": 66.35936799467855, "is_best": false, "timestamp": "2026-05-04T18:28:44.671449"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 8500, "epoch": 0, "val_loss": 3.9605602622032166, "val_ppl": 52.486724040660924, "is_best": true, "timestamp": "2026-05-04T18:32:49.603235"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 9000, "epoch": 0, "val_loss": 3.9616266012191774, "val_ppl": 52.542722533708215, "is_best": false, "timestamp": "2026-05-04T18:36:56.436454"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 9500, "epoch": 0, "val_loss": 3.972209358215332, "val_ppl": 53.10172205919401, "is_best": false, "timestamp": "2026-05-04T18:41:02.290341"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 10000, "epoch": 0, "val_loss": 3.9732977986335754, "val_ppl": 53.15955158604953, "is_best": false, "timestamp": "2026-05-04T18:45:08.582269"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 10500, "epoch": 0, "val_loss": 3.9948221683502196, "val_ppl": 54.316180628900916, "is_best": false, "timestamp": "2026-05-04T18:49:14.546826"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 11000, "epoch": 0, "val_loss": 3.9498892426490784, "val_ppl": 51.92961492972043, "is_best": true, "timestamp": "2026-05-04T18:53:19.469593"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "eval", "step": 11500, "epoch": 0, "val_loss": 4.01113086938858, "val_ppl": 55.209269747273446, "is_best": false, "timestamp": "2026-05-04T18:57:26.063063"}
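
The records in eval_metrics.jsonl appear to follow two simple relationships: `val_ppl` equals `exp(val_loss)`, and `is_best` is true whenever `val_loss` is lower than every earlier evaluation. Both are inferred from the logged values, not from the training code. The sketch below checks them on four abbreviated records copied from the file (only `step`, `val_loss`, `val_ppl` and `is_best` are kept); the variable names are illustrative.

```python
import json
import math

# Abbreviated copies of four eval records; the other logged fields are dropped.
SAMPLE_JSONL = """\
{"step": 500, "val_loss": 6.271504926681518, "val_ppl": 529.273296333478, "is_best": true}
{"step": 8500, "val_loss": 3.9605602622032166, "val_ppl": 52.486724040660924, "is_best": true}
{"step": 11000, "val_loss": 3.9498892426490784, "val_ppl": 51.92961492972043, "is_best": true}
{"step": 11500, "val_loss": 4.01113086938858, "val_ppl": 55.209269747273446, "is_best": false}
"""

records = [json.loads(line) for line in SAMPLE_JSONL.splitlines() if line.strip()]

# Check 1: perplexity is the exponential of the validation loss.
ppl_matches = all(
    math.isclose(math.exp(r["val_loss"]), r["val_ppl"], rel_tol=1e-6)
    for r in records
)

# Check 2: recompute the running-minimum flag and compare it with is_best.
best_so_far = math.inf
recomputed_flags = []
for r in records:
    recomputed_flags.append(r["val_loss"] < best_so_far)
    best_so_far = min(best_so_far, r["val_loss"])

best_record = min(records, key=lambda r: r["val_loss"])
print(ppl_matches, recomputed_flags, best_record["step"])
```

On the full file the lowest validation loss is logged at step 11000 (3.9499, perplexity about 51.9), and the final evaluation at step 11500 is higher, so `best_ckpt.pt` and the last periodic checkpoint would hold different weights.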
events.jsonl ADDED
@@ -0,0 +1,42 @@
1
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "model_summary", "total_params": 38010240, "trainable_params": 38010240, "weight_tied_lm_head": true, "timestamp": "2026-05-04T17:21:42.251782"}
2
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "config", "model": {"vocab_size": 50304, "n_layers": 20, "n_heads": 6, "n_kv_heads": 2, "n_embd": 384, "embedding_dim": 128, "tie_embeddings": true, "context_len": 1024, "dropout": 0.0, "bias": false, "norm_type": "rmsnorm", "norm_eps": 1e-05, "positional_embedding": "rope", "rope_theta": 10000.0, "rope_fraction": 1.0, "mlp_type": "swiglu", "mlp_hidden_mult": 4.0, "mlp_hidden_dim": 1024, "qk_norm": false, "block_style": "sequential"}, "training": {"seed": 0, "learning_rate": 0.0012, "min_lr": 0.00012, "weight_decay": 0.03, "beta1": 0.9, "beta2": 0.95, "grad_clip": 1.0, "max_iters": 11586, "warmup_steps": 116, "lr_schedule": "wsd", "wsd_stable_frac": 0.85, "batch_size": 4, "gradient_accumulation_steps": 16, "dtype": "float16", "device": "cuda", "eval_step_interval": 500, "eval_batches": 20, "log_interval": 10, "max_checkpoints": 5}, "distributed": {"enabled": true, "backend": "nccl"}, "timestamp": "2026-05-04T17:21:42.252083"}
3
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0000500.pt", "timestamp": "2026-05-04T17:27:02.419684"}
4
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:27:02.935937"}
5
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 1000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0001000.pt", "timestamp": "2026-05-04T17:31:09.635408"}
6
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 1000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:31:10.482963"}
7
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 1500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0001500.pt", "timestamp": "2026-05-04T17:35:16.506092"}
8
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 1500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:35:17.472426"}
9
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 2000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0002000.pt", "timestamp": "2026-05-04T17:39:24.020989"}
10
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 2000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:39:24.865058"}
11
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 2500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0002500.pt", "timestamp": "2026-05-04T17:43:30.630334"}
12
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 2500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:43:31.448643"}
13
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 3000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0003000.pt", "timestamp": "2026-05-04T17:47:36.737006"}
14
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 3000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:47:37.549993"}
15
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 3500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0003500.pt", "timestamp": "2026-05-04T17:51:42.450427"}
16
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 3500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:51:43.224403"}
17
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 4000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0004000.pt", "timestamp": "2026-05-04T17:55:49.429802"}
18
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 4000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:55:50.291191"}
19
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 4500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0004500.pt", "timestamp": "2026-05-04T17:59:55.308516"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 4500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T17:59:56.152649"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 5000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0005000.pt", "timestamp": "2026-05-04T18:04:02.252289"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 5000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:04:03.107403"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 5500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0005500.pt", "timestamp": "2026-05-04T18:08:11.201926"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 6000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0006000.pt", "timestamp": "2026-05-04T18:12:18.445731"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 6000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:12:19.286583"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 6500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0006500.pt", "timestamp": "2026-05-04T18:16:25.432096"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 7000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0007000.pt", "timestamp": "2026-05-04T18:20:31.109071"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 7000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:20:31.952552"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 7500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0007500.pt", "timestamp": "2026-05-04T18:24:38.993199"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 8000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0008000.pt", "timestamp": "2026-05-04T18:28:45.232019"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 8500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0008500.pt", "timestamp": "2026-05-04T18:32:50.175086"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 8500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:32:51.037758"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 9000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0009000.pt", "timestamp": "2026-05-04T18:36:56.998151"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 9500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0009500.pt", "timestamp": "2026-05-04T18:41:02.851932"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 10000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0010000.pt", "timestamp": "2026-05-04T18:45:09.144053"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 10500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0010500.pt", "timestamp": "2026-05-04T18:49:15.107050"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011000.pt", "timestamp": "2026-05-04T18:53:20.032228"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:53:20.850126"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011500.pt", "timestamp": "2026-05-04T18:57:26.629012"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "final_checkpoint_saved", "step": 11586, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011586.pt", "best_val_loss_so_far": 3.9498892426490784, "timestamp": "2026-05-04T18:58:09.448498"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "metrics_plot_saved", "path": "artifacts/final_c4/final_c4_20l384_factorized/metrics.png", "timestamp": "2026-05-04T18:58:10.725749"}
+ {"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "results_doc_saved", "path": "artifacts/final_c4/final_c4_20l384_factorized/results.md", "timestamp": "2026-05-04T18:58:10.725896"}
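Note on reading the events above: each record is a single JSON object per line, so the steps at which `best_ckpt.pt` was overwritten can be recovered with a few lines of Python. The sketch below is only an illustration, not part of the commit. The helper name `best_checkpoint_steps` is hypothetical, and the three sample records are copied from the events listed above; reading the real `events.jsonl` from disk is left to the caller because its local path is an assumption.

```python
import json

# Three records copied verbatim from the events shown above.
SAMPLE = """\
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011000.pt", "timestamp": "2026-05-04T18:53:20.032228"}
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "best_checkpoint_saved", "step": 11000, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/best_ckpt.pt", "timestamp": "2026-05-04T18:53:20.850126"}
{"run_name": "final_c4_20l384_factorized", "stage": "pretraining", "event": "checkpoint_saved", "step": 11500, "path": "artifacts/final_c4/final_c4_20l384_factorized/checkpoints/ckpt_step0011500.pt", "timestamp": "2026-05-04T18:57:26.629012"}
"""


def best_checkpoint_steps(lines):
    """Return the steps of all 'best_checkpoint_saved' events, in file order.

    Hypothetical helper: `lines` is any iterable of JSONL strings.
    """
    steps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("event") == "best_checkpoint_saved":
            steps.append(record["step"])
    return steps


if __name__ == "__main__":
    print(best_checkpoint_steps(SAMPLE.splitlines()))
```

Applied to the full file, the last step returned would be the one whose weights remain in `best_ckpt.pt`; in the events listed above, the last `best_checkpoint_saved` record is at step 11000.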
logs/pretraining_20260504_172141.log ADDED
The diff for this file is too large to render. See raw diff
 
metrics.png ADDED

Git LFS Details

  • SHA256: f7543a2e58fadbf8452a63b5d509a8a4ac11b317f25d6e715f230d881593debf
  • Pointer size: 131 Bytes
  • Size of remote file: 328 kB
results.md ADDED
@@ -0,0 +1,24 @@
+ # Results: final_c4_20l384_factorized
+
+ Automatically generated after pretraining.
+
+ ## Summary
+ - Model: `20L / 6H / 384d`
+ - Total parameters: `38010240`
+ - Last logged train step: `11580`
+ - Best validation loss: `3.9499`
+ - Best validation perplexity: `51.93`
+ - Last validation step: `11500`
+ - Learning rate: `0.0012`
+ - Effective tokens/update: `65536`
+
+ ## Files
+ - [Config snapshot](config_snapshot.json)
+ - [Train metrics](train_metrics.jsonl)
+ - [Eval metrics](eval_metrics.jsonl)
+ - [Events](events.jsonl)
+ - [Metrics plot](metrics.png)
+
+ ## Metrics Plot
+
+ ![Metrics plot](metrics.png)
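
Note on the summary figures in results.md: the reported perplexity is consistent with the exponential of the best validation loss, which can be checked with the short sketch below. It is not part of the commit. It assumes the loss is a mean per-token cross-entropy in nats, and the token estimate further assumes that the final checkpoint step (11586) equals the number of optimizer updates and that every update consumed the full 65536 tokens; neither assumption is confirmed by the files shown here.

```python
import math

# Values copied from events.jsonl and results.md in this commit.
best_val_loss = 3.9498892426490784   # "best_val_loss_so_far" in the final event
final_step = 11586                   # step of the final checkpoint
tokens_per_update = 65536            # "Effective tokens/update"

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats
# (an assumption about how the loss was computed).
perplexity = math.exp(best_val_loss)

# Rough upper estimate of tokens seen during pretraining, assuming one
# full-size update per step.
total_tokens = final_step * tokens_per_update

if __name__ == "__main__":
    print(f"perplexity   ~ {perplexity:.2f}")
    print(f"total tokens ~ {total_tokens:,}")
```

Under these assumptions the run processed roughly 0.76 billion tokens, about 20 tokens per parameter for the 38010240-parameter model.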
train_metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff