AbstractPhil committed on
Commit 435ad67 · verified · 1 Parent(s): b5552f9

Create bank_training_try1_500k_output.txt

training_metrics/bank_training_try1_500k_output.txt ADDED
@@ -0,0 +1,297 @@
+ =================================================================
+ ALIGNMENT BANK: 5-Expert CaptionBERT-8192
+ =================================================================
+ Device: cuda
+ Experts: 5
+ Anchors: 512
+ Bank dim: 128
+
+ =================================================================
+ PHASE 0: EXPERT EMBEDDINGS
+ =================================================================
+ Loading 500,000 captions...
+ Got 500,000 captions
+
+ Extracting: bert (google-bert/bert-base-uncased, max_len=512)...
+ Loading weights: 100% 199/199 [00:00<00:00, 4325.21it/s, Materializing param=pooler.dense.weight]
+ BertModel LOAD REPORT from: google-bert/bert-base-uncased
+ Key                                        | Status
+ -------------------------------------------+-----------
+ cls.predictions.transform.dense.weight     | UNEXPECTED
+ cls.predictions.transform.LayerNorm.weight | UNEXPECTED
+ cls.seq_relationship.weight                | UNEXPECTED
+ cls.seq_relationship.bias                  | UNEXPECTED
+ cls.predictions.transform.dense.bias       | UNEXPECTED
+ cls.predictions.transform.LayerNorm.bias   | UNEXPECTED
+ cls.predictions.bias                       | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ 109,482,240 params
+ bert: 100%|██████████| 3907/3907 [09:46<00:00, 6.66it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: modern (answerdotai/ModernBERT-base, max_len=8192)...
+ Loading weights: 100% 134/134 [00:00<00:00, 3173.52it/s, Materializing param=layers.21.mlp_norm.weight]
+ ModernBertModel LOAD REPORT from: answerdotai/ModernBERT-base
+ Key               | Status
+ ------------------+-----------
+ decoder.bias      | UNEXPECTED
+ head.dense.weight | UNEXPECTED
+ head.norm.weight  | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ 149,014,272 params
+ modern: 100%|██████████| 3907/3907 [15:54<00:00, 4.09it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: roberta (FacebookAI/roberta-base, max_len=512)...
+ config.json: 100% 481/481 [00:00<00:00, 146kB/s]
+ model.safetensors: 100% 499M/499M [00:01<00:00, 557MB/s]
+ Loading weights: 100% 197/197 [00:00<00:00, 4340.38it/s, Materializing param=encoder.layer.11.output.dense.weight]
+ RobertaModel LOAD REPORT from: FacebookAI/roberta-base
+ Key                             | Status
+ --------------------------------+-----------
+ lm_head.layer_norm.bias         | UNEXPECTED
+ roberta.embeddings.position_ids | UNEXPECTED
+ lm_head.dense.bias              | UNEXPECTED
+ lm_head.bias                    | UNEXPECTED
+ lm_head.dense.weight            | UNEXPECTED
+ lm_head.layer_norm.weight       | UNEXPECTED
+ pooler.dense.weight             | MISSING
+ pooler.dense.bias               | MISSING
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ - MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
+ tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 8.38kB/s]
+ vocab.json: 899k/? [00:00<00:00, 35.8MB/s]
+ merges.txt: 456k/? [00:00<00:00, 43.1MB/s]
+ tokenizer.json: 1.36M/? [00:00<00:00, 149MB/s]
+ 124,645,632 params
+ roberta: 100%|██████████| 3907/3907 [09:41<00:00, 6.71it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: albert (albert/albert-base-v2, max_len=512)...
+ config.json: 100% 684/684 [00:00<00:00, 205kB/s]
+ model.safetensors: 100% 47.4M/47.4M [00:00<00:00, 236MB/s]
+ Loading weights: 100% 25/25 [00:00<00:00, 2864.57it/s, Materializing param=pooler.weight]
+ AlbertModel LOAD REPORT from: albert/albert-base-v2
+ Key                          | Status
+ -----------------------------+-----------
+ predictions.dense.weight     | UNEXPECTED
+ predictions.bias             | UNEXPECTED
+ predictions.decoder.bias     | UNEXPECTED
+ predictions.LayerNorm.weight | UNEXPECTED
+ predictions.LayerNorm.bias   | UNEXPECTED
+ predictions.dense.bias       | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 9.29kB/s]
+ spiece.model: 100% 760k/760k [00:00<00:00, 17.3MB/s]
+ tokenizer.json: 1.31M/? [00:00<00:00, 18.5MB/s]
+ 11,683,584 params
+ albert: 100%|██████████| 3907/3907 [13:32<00:00, 4.81it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: distil (distilbert/distilbert-base-uncased, max_len=512)...
+ config.json: 100% 483/483 [00:00<00:00, 112kB/s]
+ model.safetensors: 100% 268M/268M [00:01<00:00, 292MB/s]
+ Loading weights: 100% 100/100 [00:00<00:00, 3628.42it/s, Materializing param=transformer.layer.5.sa_layer_norm.weight]
+ DistilBertModel LOAD REPORT from: distilbert/distilbert-base-uncased
+ Key                     | Status
+ ------------------------+-----------
+ vocab_transform.bias    | UNEXPECTED
+ vocab_layer_norm.weight | UNEXPECTED
+ vocab_transform.weight  | UNEXPECTED
+ vocab_projector.bias    | UNEXPECTED
+ vocab_layer_norm.bias   | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ tokenizer_config.json: 100% 48.0/48.0 [00:00<00:00, 15.0kB/s]
+ vocab.txt: 232k/? [00:00<00:00, 12.3MB/s]
+ tokenizer.json: 466k/? [00:00<00:00, 59.4MB/s]
+ 66,362,880 params
+ distil: 100%|██████████| 3907/3907 [05:23<00:00, 12.09it/s]
+ Saved: torch.Size([500000, 768])
+ Using 500,000 samples
+
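For reference, each expert above emits one 768-d vector per caption (`Saved: torch.Size([500000, 768])`). The pooling used is not shown in this log; a common choice is attention-mask-weighted mean pooling over last-layer states. A minimal sketch on a toy batch (no model download; `masked_mean_pool` is a hypothetical helper, not the repo's actual code):

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Mean-pool last-layer states over real (unmasked) tokens.

    hidden: (B, T, D) encoder hidden states
    mask:   (B, T) attention mask, 1 = real token, 0 = padding
    """
    m = mask[..., None].astype(float)                    # (B, T, 1)
    return (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1e-6)

# toy stand-in for one batch of encoder outputs
hidden = np.random.default_rng(0).standard_normal((2, 4, 768))
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
print(masked_mean_pool(hidden, mask).shape)  # (2, 768)
```

In the real pipeline, `hidden` and `mask` would come from each Hugging Face encoder's forward pass, batched over all 500,000 captions.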
+ =================================================================
+ PHASE 1: GENERALIZED PROCRUSTES ALIGNMENT
+ =================================================================
+ GPA iter 1: delta=1.99174462
+ GPA iter 3: delta=0.00037972
+ GPA iter 6: delta=0.00006116
+ GPA iter 9: delta=0.00002506
+ GPA iter 12: delta=0.00001342
+ GPA iter 15: delta=0.00000849
+ bert   : cos_after=0.7664 cos_to_mean=0.9879
+ modern : cos_after=0.7166 cos_to_mean=0.9829
+ roberta: cos_after=0.7435 cos_to_mean=0.9884
+ albert : cos_after=0.7150 cos_to_mean=0.9863
+ distil : cos_after=0.7864 cos_to_mean=0.9909
+ cos(consensus, bert): 0.9880
+ cos(consensus, modern): 0.9831
+ cos(consensus, roberta): 0.9885
+ cos(consensus, albert): 0.9864
+ cos(consensus, distil): 0.9909
+ Equidistance range: 0.0079
+
+ Measuring consensus statistics...
+ CV: 0.2543
+ Mean cos: 0.0035
+ Eff dim: 50.2
+
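The GPA iterations logged above (delta shrinking from ~1.99 to ~1e-5) follow the standard generalized Procrustes recipe: rotate each expert's embedding matrix onto the current consensus mean, recompute the mean, and repeat until the mean stops moving. A minimal numpy sketch of the rotation-only variant, assuming Frobenius-normalized views (the script's exact normalization and stopping rule are not shown in the log):

```python
import numpy as np

def gpa(views, iters=20, tol=1e-8):
    """Generalized Procrustes alignment of a list of (N, D) embedding matrices."""
    X = [v / np.linalg.norm(v) for v in views]   # scale-normalize each view (Frobenius)
    M = np.mean(X, axis=0)                        # initial consensus
    for _ in range(iters):
        aligned = []
        for V in X:
            # orthogonal Procrustes: best rotation of V onto M is R = U @ Vt,
            # where V.T @ M = U @ S @ Vt
            U, _, Vt = np.linalg.svd(V.T @ M)
            aligned.append(V @ (U @ Vt))
        X = aligned
        M_new = np.mean(X, axis=0)
        delta = np.linalg.norm(M_new - M)         # movement of the consensus
        M = M_new
        if delta < tol:
            break
    return M, X

# demo: rotated copies of one point cloud align (nearly) exactly
rng = np.random.default_rng(0)
base = rng.standard_normal((40, 6))
views = [base @ np.linalg.qr(rng.standard_normal((6, 6)))[0] for _ in range(3)]
M, aligned = gpa(views)
print(max(np.linalg.norm(aligned[0] - a) for a in aligned))
```

The per-expert `cos_to_mean` numbers above would then be cosine similarities between each aligned view and the consensus `M`.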
+ =================================================================
+ PHASE 2: ENCODE FROZEN STUDENT
+ =================================================================
+ A new version of the following files was downloaded from https://huggingface.co/AbstractPhil/geolip-captionbert-8192:
+ - modeling_caption_bert.py
+ Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
+ Loading weights: 100% 82/82 [00:00<00:00, 3652.45it/s, Materializing param=token_emb.weight]
+ Student: 25,958,016 params (frozen)
+ Encoding 500,000 captions...
+ Encoding: 100%|██████████| 1954/1954 [04:43<00:00, 6.88it/s]
+ Student embeddings: torch.Size([500000, 768])
+ Train: 495,000  Val: 5,000
+
+ =================================================================
+ PHASE 3: TRAIN ALIGNMENT BANK
+ =================================================================
+ Expert 0 (bert): loaded, cos_after=0.7664
+ Expert 1 (modern): loaded, cos_after=0.7166
+ Expert 2 (roberta): loaded, cos_after=0.7435
+ Expert 3 (albert): loaded, cos_after=0.7150
+ Expert 4 (distil): loaded, cos_after=0.7864
+ Anchors: 512 from consensus
+ Targets: CV=0.2543
+ Calibrated (n=5000):
+   cross_cos: 0.0449 ± 0.0316
+   disagree_ratio: median=0.000000
+   expert_cos: 1.0000 ± 0.0000
+   cross pairs: 10
+ Bank: 6,466,944 params
+
+ E 1: 46s loss=0.5256 v_loss=0.4934
+   Geometry: b_cv=0.2628 e_cv=0.0827 spread=0.01876 a_max=0.553
+   Experts: cos=0.600±0.003 agr=0.000014 ortho=0.000138
+   Disagree: x_cos=0.0385±0.0444 ratio=0.004107 preserve=0.000487
+ E 2: 45s loss=0.5010 v_loss=0.5092 exp=0.544 b_cv=0.2581 x_cos=0.0398
+ E 3: 45s loss=0.4991 v_loss=0.4963 exp=0.478 b_cv=0.2576 x_cos=0.0466
+ E 4: 45s loss=0.4979 v_loss=0.5021 exp=0.439 b_cv=0.2557 x_cos=0.0425
+
+ E 5: 45s loss=0.4973 v_loss=0.4921
+   Geometry: b_cv=0.2560 e_cv=0.0827 spread=0.01172 a_max=0.559
+   Experts: cos=0.457±0.002 agr=0.000009 ortho=0.000052
+   Disagree: x_cos=0.0407±0.0298 ratio=0.003423 preserve=0.000122
+ E 6: 45s loss=0.4970 v_loss=0.4914 exp=0.436 b_cv=0.2553 x_cos=0.0394
+ E 7: 45s loss=0.4961 v_loss=0.4875 exp=0.430 b_cv=0.2550 x_cos=0.0456
+ E 8: 45s loss=0.4960 v_loss=0.4845 exp=0.425 b_cv=0.2542 x_cos=0.0419
+ E 9: 45s loss=0.4959 v_loss=0.4917 exp=0.426 b_cv=0.2543 x_cos=0.0479
+
+ E10: 45s loss=0.4954 v_loss=0.4903
+   Geometry: b_cv=0.2542 e_cv=0.0822 spread=0.01146 a_max=0.561
+   Experts: cos=0.429±0.002 agr=0.000004 ortho=0.000030
+   Disagree: x_cos=0.0481±0.0329 ratio=0.003676 preserve=0.000075
+ E11: 45s loss=0.4952 v_loss=0.4855 exp=0.421 b_cv=0.2539 x_cos=0.0430
+ E12: 45s loss=0.4952 v_loss=0.4864 exp=0.455 b_cv=0.2536 x_cos=0.0457
+ E13: 45s loss=0.4947 v_loss=0.5101 exp=0.448 b_cv=0.2530 x_cos=0.0421
+ E14: 45s loss=0.4948 v_loss=0.5136 exp=0.433 b_cv=0.2523 x_cos=0.0437
+
+ E15: 45s loss=0.4941 v_loss=0.4953
+   Geometry: b_cv=0.2538 e_cv=0.0831 spread=0.01126 a_max=0.562
+   Experts: cos=0.450±0.002 agr=0.000001 ortho=0.000017
+   Disagree: x_cos=0.0402±0.0268 ratio=0.002627 preserve=0.000027
+ E16: 45s loss=0.4941 v_loss=0.4954 exp=0.433 b_cv=0.2535 x_cos=0.0407
+ E17: 45s loss=0.4945 v_loss=0.5051 exp=0.439 b_cv=0.2527 x_cos=0.0409
+ E18: 45s loss=0.4943 v_loss=0.4953 exp=0.427 b_cv=0.2529 x_cos=0.0450
+ E19: 45s loss=0.4944 v_loss=0.4889 exp=0.438 b_cv=0.2531 x_cos=0.0424
+
+ E20: 45s loss=0.4937 v_loss=0.4941
+   Geometry: b_cv=0.2519 e_cv=0.0827 spread=0.01126 a_max=0.562
+   Experts: cos=0.422±0.002 agr=0.000000 ortho=0.000012
+   Disagree: x_cos=0.0462±0.0330 ratio=0.002113 preserve=0.000014
+ E21: 45s loss=0.4936 v_loss=0.4904 exp=0.436 b_cv=0.2512 x_cos=0.0459
+ E22: 45s loss=0.4934 v_loss=0.4927 exp=0.438 b_cv=0.2527 x_cos=0.0448
+ E23: 45s loss=0.4936 v_loss=0.4840 exp=0.450 b_cv=0.2517 x_cos=0.0451
+ E24: 45s loss=0.4932 v_loss=0.4951 exp=0.457 b_cv=0.2508 x_cos=0.0475
+
+ E25: 45s loss=0.4929 v_loss=0.4924
+   Geometry: b_cv=0.2504 e_cv=0.0828 spread=0.01121 a_max=0.563
+   Experts: cos=0.441±0.001 agr=0.000000 ortho=0.000008
+   Disagree: x_cos=0.0431±0.0307 ratio=0.000861 preserve=0.000004
+ E26: 45s loss=0.4927 v_loss=0.4838 exp=0.444 b_cv=0.2518 x_cos=0.0446
+ E27: 45s loss=0.4927 v_loss=0.4901 exp=0.441 b_cv=0.2513 x_cos=0.0440
+ E28: 45s loss=0.4929 v_loss=0.5015 exp=0.452 b_cv=0.2512 x_cos=0.0447
+ E29: 45s loss=0.4925 v_loss=0.5042 exp=0.449 b_cv=0.2511 x_cos=0.0451
+
+ E30: 45s loss=0.4929 v_loss=0.4914
+   Geometry: b_cv=0.2518 e_cv=0.0823 spread=0.01116 a_max=0.563
+   Experts: cos=0.451±0.001 agr=0.000000 ortho=0.000007
+   Disagree: x_cos=0.0447±0.0312 ratio=0.000587 preserve=0.000001
+
+ =================================================================
+ PHASE 4: GEOMETRIC VERIFICATION
+ =================================================================
+ Passthrough: 1.000000
+ Emb CV: 0.0787 (consensus: 0.2543)
+ Geo context CV: 0.2584
+ Geo eff_dim: 18.9 / 128
+ Expert cos: 0.444 ± 0.001
+ Anchor max cos: 0.563
+ Cross-expert: 0.0446 ± 0.0314
+ Disagree ratio: 0.000668
+
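The `eff_dim` and `CV` figures reported throughout (Eff dim: 50.2, Geo eff_dim: 18.9 / 128, CV: 0.2543) are not defined in the log. Common choices they plausibly correspond to are the participation ratio of the covariance spectrum and the coefficient of variation of per-row norms; a sketch under that assumption (helper names are hypothetical):

```python
import numpy as np

def effective_dim(X):
    """Participation ratio of the covariance spectrum: (sum λ)^2 / sum λ^2.

    Equals the ambient dimension for isotropic data and ~1 for rank-1 data.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    lam = np.clip(np.linalg.eigvalsh(np.cov(Xc.T)), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

def norm_cv(X):
    """Coefficient of variation (std / mean) of per-row L2 norms."""
    n = np.linalg.norm(X, axis=1)
    return n.std() / n.mean()

X = np.random.default_rng(0).standard_normal((1000, 128))
print(effective_dim(X))  # near the ambient dim for isotropic data
print(norm_cv(X))        # small: Gaussian norms concentrate in high dim
```

On this reading, "Geo eff_dim: 18.9 / 128" says the bank's 128-d geometric context concentrates in roughly 19 effective directions.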
+ =================================================================
+ PHASE 5: CLASSIFIER STABILITY TEST
+ =================================================================
+
+ Mode            Dim   Train  Val    Gap
+ --------------------------------------------------
+ raw_768         1536  0.532  0.328  0.204
+ raw+diff        3072  0.474  0.354  0.120
+ bank_enriched   1792  0.638  0.453  0.185
+ bank+diff       3584  0.589  0.512  0.077
+ geo_explicit       6  0.345  0.339  0.005
+
+ =================================================================
+ SUMMARY
+ =================================================================
+ Consensus CV: 0.2543
+ Consensus eff_dim: 50.2
+ Equidistance: 0.0079
+ Bank params: 6,466,944
+ Bank geo eff_dim: 18.9
+ Bank geo CV: 0.2584
+ Best val loss: 0.4838
+
+ Files: alignment_bank_best.pt, alignment_bank_final.pt
+
+ =================================================================
+ DONE
+ =================================================================