AbstractPhil committed on
Commit 435ad67 · verified · 1 Parent(s): b5552f9

Create bank_training_try1_500k_output.txt

training_metrics/bank_training_try1_500k_output.txt ADDED
@@ -0,0 +1,297 @@
+ =================================================================
+ ALIGNMENT BANK: 5-Expert CaptionBERT-8192
+ =================================================================
+ Device: cuda
+ Experts: 5
+ Anchors: 512
+ Bank dim: 128
+
+ =================================================================
+ PHASE 0: EXPERT EMBEDDINGS
+ =================================================================
+ Loading 500,000 captions...
+ Got 500,000 captions
+
+ Extracting: bert (google-bert/bert-base-uncased, max_len=512)...
+ Loading weights: 100% 199/199 [00:00<00:00, 4325.21it/s, Materializing param=pooler.dense.weight]
+ BertModel LOAD REPORT from: google-bert/bert-base-uncased
+ Key                                        | Status
+ -------------------------------------------+-----------
+ cls.predictions.transform.dense.weight     | UNEXPECTED
+ cls.predictions.transform.LayerNorm.weight | UNEXPECTED
+ cls.seq_relationship.weight                | UNEXPECTED
+ cls.seq_relationship.bias                  | UNEXPECTED
+ cls.predictions.transform.dense.bias       | UNEXPECTED
+ cls.predictions.transform.LayerNorm.bias   | UNEXPECTED
+ cls.predictions.bias                       | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ 109,482,240 params
+ bert: 100%|██████████| 3907/3907 [09:46<00:00, 6.66it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: modern (answerdotai/ModernBERT-base, max_len=8192)...
+ Loading weights: 100% 134/134 [00:00<00:00, 3173.52it/s, Materializing param=layers.21.mlp_norm.weight]
+ ModernBertModel LOAD REPORT from: answerdotai/ModernBERT-base
+ Key               | Status
+ ------------------+-----------
+ decoder.bias      | UNEXPECTED
+ head.dense.weight | UNEXPECTED
+ head.norm.weight  | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ 149,014,272 params
+ modern: 100%|██████████| 3907/3907 [15:54<00:00, 4.09it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: roberta (FacebookAI/roberta-base, max_len=512)...
+ config.json: 100% 481/481 [00:00<00:00, 146kB/s]
+ model.safetensors: 100% 499M/499M [00:01<00:00, 557MB/s]
+ Loading weights: 100% 197/197 [00:00<00:00, 4340.38it/s, Materializing param=encoder.layer.11.output.dense.weight]
+ RobertaModel LOAD REPORT from: FacebookAI/roberta-base
+ Key                             | Status
+ --------------------------------+-----------
+ lm_head.layer_norm.bias         | UNEXPECTED
+ roberta.embeddings.position_ids | UNEXPECTED
+ lm_head.dense.bias              | UNEXPECTED
+ lm_head.bias                    | UNEXPECTED
+ lm_head.dense.weight            | UNEXPECTED
+ lm_head.layer_norm.weight       | UNEXPECTED
+ pooler.dense.weight             | MISSING
+ pooler.dense.bias               | MISSING
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ - MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
+ tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 8.38kB/s]
+ vocab.json: 899k/? [00:00<00:00, 35.8MB/s]
+ merges.txt: 456k/? [00:00<00:00, 43.1MB/s]
+ tokenizer.json: 1.36M/? [00:00<00:00, 149MB/s]
+ 124,645,632 params
+ roberta: 100%|██████████| 3907/3907 [09:41<00:00, 6.71it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: albert (albert/albert-base-v2, max_len=512)...
+ config.json: 100% 684/684 [00:00<00:00, 205kB/s]
+ model.safetensors: 100% 47.4M/47.4M [00:00<00:00, 236MB/s]
+ Loading weights: 100% 25/25 [00:00<00:00, 2864.57it/s, Materializing param=pooler.weight]
+ AlbertModel LOAD REPORT from: albert/albert-base-v2
+ Key                          | Status
+ -----------------------------+-----------
+ predictions.dense.weight     | UNEXPECTED
+ predictions.bias             | UNEXPECTED
+ predictions.decoder.bias     | UNEXPECTED
+ predictions.LayerNorm.weight | UNEXPECTED
+ predictions.LayerNorm.bias   | UNEXPECTED
+ predictions.dense.bias       | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ tokenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 9.29kB/s]
+ spiece.model: 100% 760k/760k [00:00<00:00, 17.3MB/s]
+ tokenizer.json: 1.31M/? [00:00<00:00, 18.5MB/s]
+ 11,683,584 params
+ albert: 100%|██████████| 3907/3907 [13:32<00:00, 4.81it/s]
+ Saved: torch.Size([500000, 768])
+
+ Extracting: distil (distilbert/distilbert-base-uncased, max_len=512)...
+ config.json: 100% 483/483 [00:00<00:00, 112kB/s]
+ model.safetensors: 100% 268M/268M [00:01<00:00, 292MB/s]
+ Loading weights: 100% 100/100 [00:00<00:00, 3628.42it/s, Materializing param=transformer.layer.5.sa_layer_norm.weight]
+ DistilBertModel LOAD REPORT from: distilbert/distilbert-base-uncased
+ Key                     | Status
+ ------------------------+-----------
+ vocab_transform.bias    | UNEXPECTED
+ vocab_layer_norm.weight | UNEXPECTED
+ vocab_transform.weight  | UNEXPECTED
+ vocab_projector.bias    | UNEXPECTED
+ vocab_layer_norm.bias   | UNEXPECTED
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+ tokenizer_config.json: 100% 48.0/48.0 [00:00<00:00, 15.0kB/s]
+ vocab.txt: 232k/? [00:00<00:00, 12.3MB/s]
+ tokenizer.json: 466k/? [00:00<00:00, 59.4MB/s]
+ 66,362,880 params
+ distil: 100%|██████████| 3907/3907 [05:23<00:00, 12.09it/s]
+ Saved: torch.Size([500000, 768])
+ Using 500,000 samples
+
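For reference, each expert above emits one 768-d vector per caption (`Saved: torch.Size([500000, 768])`). The pooling used is not shown in this log; a common choice is attention-mask-weighted mean pooling over last-layer states. A minimal sketch on a toy batch (no model download; `masked_mean_pool` is a hypothetical helper, not the repo's actual code):

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Mean-pool last-layer states over real (unmasked) tokens.

    hidden: (B, T, D) encoder hidden states
    mask:   (B, T) attention mask, 1 = real token, 0 = padding
    """
    m = mask[..., None].astype(float)                    # (B, T, 1)
    return (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1e-6)

# toy stand-in for one batch of encoder outputs
hidden = np.random.default_rng(0).standard_normal((2, 4, 768))
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
print(masked_mean_pool(hidden, mask).shape)  # (2, 768)
```

In the real pipeline, `hidden` and `mask` would come from each Hugging Face encoder's forward pass, batched over all 500,000 captions.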
+ =================================================================
+ PHASE 1: GENERALIZED PROCRUSTES ALIGNMENT
+ =================================================================
+ GPA iter 1: delta=1.99174462
+ GPA iter 3: delta=0.00037972
+ GPA iter 6: delta=0.00006116
+ GPA iter 9: delta=0.00002506
+ GPA iter 12: delta=0.00001342
+ GPA iter 15: delta=0.00000849
+ bert   : cos_after=0.7664 cos_to_mean=0.9879
+ modern : cos_after=0.7166 cos_to_mean=0.9829
+ roberta: cos_after=0.7435 cos_to_mean=0.9884
+ albert : cos_after=0.7150 cos_to_mean=0.9863
+ distil : cos_after=0.7864 cos_to_mean=0.9909
+ cos(consensus, bert): 0.9880
+ cos(consensus, modern): 0.9831
+ cos(consensus, roberta): 0.9885
+ cos(consensus, albert): 0.9864
+ cos(consensus, distil): 0.9909
+ Equidistance range: 0.0079
+
+ Measuring consensus statistics...
+ CV: 0.2543
+ Mean cos: 0.0035
+ Eff dim: 50.2
+
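The GPA iterations logged above (delta shrinking from ~1.99 to ~1e-5) follow the standard generalized Procrustes recipe: rotate each expert's embedding matrix onto the current consensus mean, recompute the mean, and repeat until the mean stops moving. A minimal numpy sketch of the rotation-only variant, assuming Frobenius-normalized views (the script's exact normalization and stopping rule are not shown in the log):

```python
import numpy as np

def gpa(views, iters=20, tol=1e-8):
    """Generalized Procrustes alignment of a list of (N, D) embedding matrices."""
    X = [v / np.linalg.norm(v) for v in views]   # scale-normalize each view (Frobenius)
    M = np.mean(X, axis=0)                        # initial consensus
    for _ in range(iters):
        aligned = []
        for V in X:
            # orthogonal Procrustes: best rotation of V onto M is R = U @ Vt,
            # where V.T @ M = U @ S @ Vt
            U, _, Vt = np.linalg.svd(V.T @ M)
            aligned.append(V @ (U @ Vt))
        X = aligned
        M_new = np.mean(X, axis=0)
        delta = np.linalg.norm(M_new - M)         # movement of the consensus
        M = M_new
        if delta < tol:
            break
    return M, X

# demo: rotated copies of one point cloud align (nearly) exactly
rng = np.random.default_rng(0)
base = rng.standard_normal((40, 6))
views = [base @ np.linalg.qr(rng.standard_normal((6, 6)))[0] for _ in range(3)]
M, aligned = gpa(views)
print(max(np.linalg.norm(aligned[0] - a) for a in aligned))
```

The per-expert `cos_to_mean` numbers above would then be cosine similarities between each aligned view and the consensus `M`.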
+ =================================================================
+ PHASE 2: ENCODE FROZEN STUDENT
+ =================================================================
+ A new version of the following files was downloaded from https://huggingface.co/AbstractPhil/geolip-captionbert-8192:
+ - modeling_caption_bert.py
+ Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
+ Loading weights: 100% 82/82 [00:00<00:00, 3652.45it/s, Materializing param=token_emb.weight]
+ Student: 25,958,016 params (frozen)
+ Encoding 500,000 captions...
+ Encoding: 100%|██████████| 1954/1954 [04:43<00:00, 6.88it/s]
+ Student embeddings: torch.Size([500000, 768])
+ Train: 495,000  Val: 5,000
+
+ =================================================================
+ PHASE 3: TRAIN ALIGNMENT BANK
+ =================================================================
+ Expert 0 (bert): loaded, cos_after=0.7664
+ Expert 1 (modern): loaded, cos_after=0.7166
+ Expert 2 (roberta): loaded, cos_after=0.7435
+ Expert 3 (albert): loaded, cos_after=0.7150
+ Expert 4 (distil): loaded, cos_after=0.7864
+ Anchors: 512 from consensus
+ Targets: CV=0.2543
+ Calibrated (n=5000):
+   cross_cos: 0.0449 ± 0.0316
+   disagree_ratio: median=0.000000
+   expert_cos: 1.0000 ± 0.0000
+   cross pairs: 10
+ Bank: 6,466,944 params
+
+ E 1: 46s loss=0.5256 v_loss=0.4934
+   Geometry: b_cv=0.2628 e_cv=0.0827 spread=0.01876 a_max=0.553
+   Experts: cos=0.600±0.003 agr=0.000014 ortho=0.000138
+   Disagree: x_cos=0.0385±0.0444 ratio=0.004107 preserve=0.000487
+ E 2: 45s loss=0.5010 v_loss=0.5092 exp=0.544 b_cv=0.2581 x_cos=0.0398
+ E 3: 45s loss=0.4991 v_loss=0.4963 exp=0.478 b_cv=0.2576 x_cos=0.0466
+ E 4: 45s loss=0.4979 v_loss=0.5021 exp=0.439 b_cv=0.2557 x_cos=0.0425
+
+ E 5: 45s loss=0.4973 v_loss=0.4921
+   Geometry: b_cv=0.2560 e_cv=0.0827 spread=0.01172 a_max=0.559
+   Experts: cos=0.457±0.002 agr=0.000009 ortho=0.000052
+   Disagree: x_cos=0.0407±0.0298 ratio=0.003423 preserve=0.000122
+ E 6: 45s loss=0.4970 v_loss=0.4914 exp=0.436 b_cv=0.2553 x_cos=0.0394
+ E 7: 45s loss=0.4961 v_loss=0.4875 exp=0.430 b_cv=0.2550 x_cos=0.0456
+ E 8: 45s loss=0.4960 v_loss=0.4845 exp=0.425 b_cv=0.2542 x_cos=0.0419
+ E 9: 45s loss=0.4959 v_loss=0.4917 exp=0.426 b_cv=0.2543 x_cos=0.0479
+
+ E10: 45s loss=0.4954 v_loss=0.4903
+   Geometry: b_cv=0.2542 e_cv=0.0822 spread=0.01146 a_max=0.561
+   Experts: cos=0.429±0.002 agr=0.000004 ortho=0.000030
+   Disagree: x_cos=0.0481±0.0329 ratio=0.003676 preserve=0.000075
+ E11: 45s loss=0.4952 v_loss=0.4855 exp=0.421 b_cv=0.2539 x_cos=0.0430
+ E12: 45s loss=0.4952 v_loss=0.4864 exp=0.455 b_cv=0.2536 x_cos=0.0457
+ E13: 45s loss=0.4947 v_loss=0.5101 exp=0.448 b_cv=0.2530 x_cos=0.0421
+ E14: 45s loss=0.4948 v_loss=0.5136 exp=0.433 b_cv=0.2523 x_cos=0.0437
+
+ E15: 45s loss=0.4941 v_loss=0.4953
+   Geometry: b_cv=0.2538 e_cv=0.0831 spread=0.01126 a_max=0.562
+   Experts: cos=0.450±0.002 agr=0.000001 ortho=0.000017
+   Disagree: x_cos=0.0402±0.0268 ratio=0.002627 preserve=0.000027
+ E16: 45s loss=0.4941 v_loss=0.4954 exp=0.433 b_cv=0.2535 x_cos=0.0407
+ E17: 45s loss=0.4945 v_loss=0.5051 exp=0.439 b_cv=0.2527 x_cos=0.0409
+ E18: 45s loss=0.4943 v_loss=0.4953 exp=0.427 b_cv=0.2529 x_cos=0.0450
+ E19: 45s loss=0.4944 v_loss=0.4889 exp=0.438 b_cv=0.2531 x_cos=0.0424
+
+ E20: 45s loss=0.4937 v_loss=0.4941
+   Geometry: b_cv=0.2519 e_cv=0.0827 spread=0.01126 a_max=0.562
+   Experts: cos=0.422±0.002 agr=0.000000 ortho=0.000012
+   Disagree: x_cos=0.0462±0.0330 ratio=0.002113 preserve=0.000014
+ E21: 45s loss=0.4936 v_loss=0.4904 exp=0.436 b_cv=0.2512 x_cos=0.0459
+ E22: 45s loss=0.4934 v_loss=0.4927 exp=0.438 b_cv=0.2527 x_cos=0.0448
+ E23: 45s loss=0.4936 v_loss=0.4840 exp=0.450 b_cv=0.2517 x_cos=0.0451
+ E24: 45s loss=0.4932 v_loss=0.4951 exp=0.457 b_cv=0.2508 x_cos=0.0475
+
+ E25: 45s loss=0.4929 v_loss=0.4924
+   Geometry: b_cv=0.2504 e_cv=0.0828 spread=0.01121 a_max=0.563
+   Experts: cos=0.441±0.001 agr=0.000000 ortho=0.000008
+   Disagree: x_cos=0.0431±0.0307 ratio=0.000861 preserve=0.000004
+ E26: 45s loss=0.4927 v_loss=0.4838 exp=0.444 b_cv=0.2518 x_cos=0.0446
+ E27: 45s loss=0.4927 v_loss=0.4901 exp=0.441 b_cv=0.2513 x_cos=0.0440
+ E28: 45s loss=0.4929 v_loss=0.5015 exp=0.452 b_cv=0.2512 x_cos=0.0447
+ E29: 45s loss=0.4925 v_loss=0.5042 exp=0.449 b_cv=0.2511 x_cos=0.0451
+
+ E30: 45s loss=0.4929 v_loss=0.4914
+   Geometry: b_cv=0.2518 e_cv=0.0823 spread=0.01116 a_max=0.563
+   Experts: cos=0.451±0.001 agr=0.000000 ortho=0.000007
+   Disagree: x_cos=0.0447±0.0312 ratio=0.000587 preserve=0.000001
+
+ =================================================================
+ PHASE 4: GEOMETRIC VERIFICATION
+ =================================================================
+ Passthrough: 1.000000
+ Emb CV: 0.0787 (consensus: 0.2543)
+ Geo context CV: 0.2584
+ Geo eff_dim: 18.9 / 128
+ Expert cos: 0.444 ± 0.001
+ Anchor max cos: 0.563
+ Cross-expert: 0.0446 ± 0.0314
+ Disagree ratio: 0.000668
+
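The `eff_dim` and `CV` figures reported throughout (Eff dim: 50.2, Geo eff_dim: 18.9 / 128, CV: 0.2543) are not defined in the log. Common choices they plausibly correspond to are the participation ratio of the covariance spectrum and the coefficient of variation of per-row norms; a sketch under that assumption (helper names are hypothetical):

```python
import numpy as np

def effective_dim(X):
    """Participation ratio of the covariance spectrum: (sum λ)^2 / sum λ^2.

    Equals the ambient dimension for isotropic data and ~1 for rank-1 data.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    lam = np.clip(np.linalg.eigvalsh(np.cov(Xc.T)), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

def norm_cv(X):
    """Coefficient of variation (std / mean) of per-row L2 norms."""
    n = np.linalg.norm(X, axis=1)
    return n.std() / n.mean()

X = np.random.default_rng(0).standard_normal((1000, 128))
print(effective_dim(X))  # near the ambient dim for isotropic data
print(norm_cv(X))        # small: Gaussian norms concentrate in high dim
```

On this reading, "Geo eff_dim: 18.9 / 128" says the bank's 128-d geometric context concentrates in roughly 19 effective directions.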
+ =================================================================
+ PHASE 5: CLASSIFIER STABILITY TEST
+ =================================================================
+
+ Mode            Dim   Train  Val    Gap
+ --------------------------------------------------
+ raw_768         1536  0.532  0.328  0.204
+ raw+diff        3072  0.474  0.354  0.120
+ bank_enriched   1792  0.638  0.453  0.185
+ bank+diff       3584  0.589  0.512  0.077
+ geo_explicit       6  0.345  0.339  0.005
+
+ =================================================================
+ SUMMARY
+ =================================================================
+ Consensus CV: 0.2543
+ Consensus eff_dim: 50.2
+ Equidistance: 0.0079
+ Bank params: 6,466,944
+ Bank geo eff_dim: 18.9
+ Bank geo CV: 0.2584
+ Best val loss: 0.4838
+
+ Files: alignment_bank_best.pt, alignment_bank_final.pt
+
+ =================================================================
+ DONE
+ =================================================================