geolip-vit-base-x3/analysis/advanced_geometric_analysis_output.txt
=================================================================
BASE TIER DEEP MODEL ANALYSIS
=================================================================
Device: cuda
=================================================================
LOADING MODELS
=================================================================
Loading CLIP ViT-L/14...
Loading weights: 100% 391/391 [00:00<00:00, 1169.09it/s, Materializing param=vision_model.pre_layrnorm.weight]
CLIPVisionModel LOAD REPORT from: openai/clip-vit-large-patch14
Key                                                          | Status
-------------------------------------------------------------+-----------
text_model.encoder.layers.{0...11}.self_attn.k_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc1.bias              | UNEXPECTED
text_projection.weight                                       | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.out_proj.bias   | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.out_proj.weight | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm1.bias          | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.k_proj.weight   | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.q_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc1.weight            | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm2.weight        | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc2.bias              | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm1.weight        | UNEXPECTED
text_model.final_layer_norm.weight                           | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm2.bias          | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc2.weight            | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.v_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.v_proj.weight   | UNEXPECTED
logit_scale                                                  | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.q_proj.weight   | UNEXPECTED
text_model.embeddings.position_embedding.weight              | UNEXPECTED
text_model.final_layer_norm.bias                             | UNEXPECTED
vision_model.embeddings.position_ids                         | UNEXPECTED
text_model.embeddings.position_ids                           | UNEXPECTED
visual_projection.weight                                     | UNEXPECTED
text_model.embeddings.token_embedding.weight                 | UNEXPECTED
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
Loaded: 303,179,776 params
Loading DINOv2 ViT-B/14...
Loading weights: 100% 223/223 [00:00<00:00, 1085.46it/s, Materializing param=layernorm.weight]
Loaded: 86,580,480 params
Loading SigLIP ViT-B/16-384...
Loading weights: 100% 208/208 [00:00<00:00, 983.51it/s, Materializing param=vision_model.post_layernorm.weight]
SiglipVisionModel LOAD REPORT from: google/siglip-base-patch16-384
Key                                                          | Status
-------------------------------------------------------------+-----------
text_model.encoder.layers.{0...11}.self_attn.k_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc1.bias              | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.out_proj.bias   | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.out_proj.weight | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm1.bias          | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.k_proj.weight   | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.q_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc1.weight            | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm2.weight        | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc2.bias              | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm1.weight        | UNEXPECTED
text_model.final_layer_norm.weight                           | UNEXPECTED
text_model.encoder.layers.{0...11}.layer_norm2.bias          | UNEXPECTED
text_model.encoder.layers.{0...11}.mlp.fc2.weight            | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.v_proj.bias     | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.v_proj.weight   | UNEXPECTED
logit_scale                                                  | UNEXPECTED
text_model.encoder.layers.{0...11}.self_attn.q_proj.weight   | UNEXPECTED
text_model.embeddings.position_embedding.weight              | UNEXPECTED
text_model.final_layer_norm.bias                             | UNEXPECTED
text_model.head.bias                                         | UNEXPECTED
text_model.head.weight                                       | UNEXPECTED
text_model.embeddings.token_embedding.weight                 | UNEXPECTED
logit_bias                                                   | UNEXPECTED
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
Loaded: 93,176,064 params
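Every UNEXPECTED key in the two reports above is a text-tower, projection, or scale weight from the dual-encoder checkpoint; a vision-only model class simply discards them (the position_ids entries are buffers that newer transformers releases no longer register, so they too go unmatched). The filtering can be sketched as a prefix split; partition_keys is a hypothetical helper for illustration, not the loader's actual code, and it ignores the buffer edge case:

```python
# Sketch of how a vision-only loader classifies checkpoint keys: anything
# outside the vision tower is reported as UNEXPECTED. Hypothetical helper;
# the sample keys below are taken from the load reports above.

def partition_keys(checkpoint_keys, expected_prefix="vision_model."):
    """Split checkpoint keys into (loaded, unexpected) for a vision-only model."""
    loaded = [k for k in checkpoint_keys if k.startswith(expected_prefix)]
    unexpected = [k for k in checkpoint_keys if not k.startswith(expected_prefix)]
    return loaded, unexpected

keys = [
    "vision_model.embeddings.patch_embedding.weight",
    "text_model.embeddings.token_embedding.weight",
    "text_projection.weight",
    "logit_scale",
]
loaded, unexpected = partition_keys(keys)
print(unexpected)  # the three non-vision keys
```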
=================================================================
SCAN 1: ARCHITECTURE COMPARISON
=================================================================
clip_l14:
hidden_size : 1,024
intermediate_size : 4,096
num_layers : 24
num_heads : 16
patch_size : 14
image_size : 224
total_params : 303,179,776
head_dim : 64
dinov2_b14:
hidden_size : 768
num_layers : 12
num_heads : 12
patch_size : 14
image_size : 518
total_params : 86,580,480
head_dim : 64
siglip_b16:
hidden_size : 768
intermediate_size : 3,072
num_layers : 12
num_heads : 12
patch_size : 16
image_size : 384
total_params : 93,176,064
head_dim : 64
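The total_params figures above follow directly from the configs. A minimal arithmetic sketch for the CLIP ViT-L/14 vision tower, assuming the standard HF CLIP layout (class token, bias-free conv patch embedding, learned position embeddings, pre/post LayerNorm, and per layer q/k/v/out projections plus a two-layer MLP, all with biases):

```python
# Recompute the CLIP ViT-L/14 vision-tower parameter count from SCAN 1's config.
def clip_vision_params(d=1024, mlp=4096, layers=24, patch=14, image=224):
    n_pos = (image // patch) ** 2 + 1            # patches + class token
    emb = d + d * 3 * patch * patch + n_pos * d  # cls + conv kernel + positions
    attn = 4 * (d * d + d)                       # q, k, v, out (weight + bias)
    ffn = (mlp * d + mlp) + (d * mlp + d)        # fc1 + fc2 (weight + bias)
    norms = 4 * d                                # layer_norm1/2 (weight + bias)
    pre_post = 2 * (2 * d)                       # pre_layrnorm + post_layernorm
    return emb + layers * (attn + ffn + norms) + pre_post

print(clip_vision_params())  # → 303179776, matching the load report
```

DINOv2 and SigLIP totals follow the same arithmetic once their extras are added in (mask token and per-layer layer scales for DINOv2; patch-embedding bias and the attention-pooling head for SigLIP).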
=================================================================
SCAN 2: PARAMETER INVENTORY
=================================================================
clip_l14:
embeddings : 866,304 ( 3 tensors)
embeddings.class_embedding: [1024]
patch_embedding.weight: [1024, 3, 14, 14]
position_embedding.weight: [257, 1024]
encoder_other : 100,761,600 (192 tensors)
k_proj.weight: [1024, 1024]
k_proj.bias: [1024]
v_proj.weight: [1024, 1024]
final_norm : 2,048 ( 2 tensors)
post_layernorm.weight: [1024]
post_layernorm.bias: [1024]
layernorm : 98,304 (96 tensors)
layer_norm1.weight: [1024]
layer_norm1.bias: [1024]
layer_norm2.weight: [1024]
mlp : 201,449,472 (96 tensors)
fc1.weight: [4096, 1024]
fc1.bias: [4096]
fc2.weight: [1024, 4096]
other : 2,048 ( 2 tensors)
pre_layrnorm.weight: [1024]
pre_layrnorm.bias: [1024]
dinov2_b14:
attn_other : 14,174,208 (48 tensors)
key.weight: [768, 768]
key.bias: [768]
value.weight: [768, 768]
attn_out : 7,087,104 (24 tensors)
dense.weight: [768, 768]
dense.bias: [768]
dense.weight: [768, 768]
attn_qkv : 7,087,104 (24 tensors)
query.weight: [768, 768]
query.bias: [768]
query.weight: [768, 768]
embeddings : 1,506,048 ( 5 tensors)
embeddings.cls_token: [1, 1, 768]
embeddings.mask_token: [1, 768]
embeddings.position_embeddings: [1, 1370, 768]
encoder_other : 18,432 (24 tensors)
layer_scale1.lambda1: [768]
layer_scale2.lambda1: [768]
layer_scale1.lambda1: [768]
final_norm : 1,536 ( 2 tensors)
layernorm.weight: [768]
layernorm.bias: [768]
layernorm : 36,864 (48 tensors)
norm1.weight: [768]
norm1.bias: [768]
norm2.weight: [768]
mlp : 56,669,184 (48 tensors)
fc1.weight: [3072, 768]
fc1.bias: [3072]
fc2.weight: [768, 3072]
siglip_b16:
embeddings : 1,032,960 ( 3 tensors)
patch_embedding.weight: [768, 3, 16, 16]
patch_embedding.bias: [768]
position_embedding.weight: [576, 768]
encoder_other : 28,348,416 (96 tensors)
k_proj.weight: [768, 768]
k_proj.bias: [768]
v_proj.weight: [768, 768]
final_norm : 3,072 ( 4 tensors)
post_layernorm.weight: [768]
post_layernorm.bias: [768]
layernorm.weight: [768]
head : 7,085,568 ( 9 tensors)
head.probe: [1, 1, 768]
attention.in_proj_weight: [2304, 768]
attention.in_proj_bias: [2304]
layernorm : 36,864 (48 tensors)
layer_norm1.weight: [768]
layer_norm1.bias: [768]
layer_norm2.weight: [768]
mlp : 56,669,184 (48 tensors)
fc1.weight: [3072, 768]
fc1.bias: [3072]
fc2.weight: [768, 3072]
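The inventory buckets above come from routing each state-dict name through name-matching rules and totalling element counts per bucket. The script's exact rules are not shown, so the ones below are illustrative guesses, applied to a toy state dict of name -> shape:

```python
from math import prod

# Group parameter names into SCAN 2-style buckets by substring matching.
# RULES is an illustrative guess at the script's categories; first match wins,
# anything unmatched falls into encoder_other.
RULES = [
    ("embeddings", "embeddings"),
    ("mlp", ".mlp."),
    ("layernorm", "layer_norm"),
    ("final_norm", "post_layernorm"),
]

def inventory(shapes):
    """Return bucket -> total parameter count for a dict of name -> shape."""
    totals = {}
    for name, shape in shapes.items():
        bucket = next((b for b, pat in RULES if pat in name), "encoder_other")
        totals[bucket] = totals.get(bucket, 0) + prod(shape)
    return totals

shapes = {
    "vision_model.embeddings.position_embedding.weight": (576, 768),
    "vision_model.encoder.layers.0.mlp.fc1.weight": (3072, 768),
    "vision_model.encoder.layers.0.layer_norm1.weight": (768,),
    "vision_model.encoder.layers.0.self_attn.k_proj.weight": (768, 768),
    "vision_model.post_layernorm.weight": (768,),
}
print(inventory(shapes))
```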
=================================================================
SCAN 3: WEIGHT STATISTICS
=================================================================
clip_l14 — key weight matrices:
param shape norm std sv_max eff_rank
---------------------------------------------------------------------------------------------------------
vision_model.embeddings.patch_embedding.weight [1024, 3, 14, 14] 13.0285 0.01679 N/A N/A
vision_model.embeddings.position_embedding.weight [257, 1024] 11.4993 0.01826 10.6920 22.3
ion_model.encoder.layers.0.self_attn.k_proj.weight [1024, 1024] 10.3266 0.01008 4.1652 148.5
ion_model.encoder.layers.0.self_attn.v_proj.weight [1024, 1024] 10.1910 0.00995 2.0620 428.9
ion_model.encoder.layers.0.self_attn.q_proj.weight [1024, 1024] 10.4647 0.01022 4.4222 129.4
n_model.encoder.layers.0.self_attn.out_proj.weight [1024, 1024] 12.6910 0.01239 1.4843 557.1
vision_model.encoder.layers.0.mlp.fc1.weight [4096, 1024] 22.4378 0.01096 11.0751 281.7
vision_model.encoder.layers.0.mlp.fc2.weight [1024, 4096] 17.9603 0.00877 5.2350 455.7
ion_model.encoder.layers.1.self_attn.k_proj.weight [1024, 1024] 13.8288 0.01350 3.8866 179.2
ion_model.encoder.layers.1.self_attn.v_proj.weight [1024, 1024] 12.8884 0.01259 1.5682 562.6
ion_model.encoder.layers.1.self_attn.q_proj.weight [1024, 1024] 13.2903 0.01298 3.9366 173.5
n_model.encoder.layers.1.self_attn.out_proj.weight [1024, 1024] 12.8612 0.01256 1.4611 597.0
vision_model.encoder.layers.1.mlp.fc1.weight [4096, 1024] 27.3329 0.01335 9.2804 505.0
vision_model.encoder.layers.1.mlp.fc2.weight [1024, 4096] 21.3601 0.01043 3.7564 687.7
ion_model.encoder.layers.2.self_attn.k_proj.weight [1024, 1024] 15.0059 0.01465 3.4512 223.1
ion_model.encoder.layers.2.self_attn.v_proj.weight [1024, 1024] 13.2355 0.01293 1.4198 639.6
ion_model.encoder.layers.2.self_attn.q_proj.weight [1024, 1024] 14.7787 0.01443 3.2682 226.4
n_model.encoder.layers.2.self_attn.out_proj.weight [1024, 1024] 12.4446 0.01215 1.2628 669.7
vision_model.encoder.layers.2.mlp.fc1.weight [4096, 1024] 26.4932 0.01293 9.4832 565.7
vision_model.encoder.layers.2.mlp.fc2.weight [1024, 4096] 21.3475 0.01043 3.8495 707.4
ion_model.encoder.layers.3.self_attn.k_proj.weight [1024, 1024] 15.6926 0.01532 3.1154 270.4
ion_model.encoder.layers.3.self_attn.v_proj.weight [1024, 1024] 14.4002 0.01406 1.4850 639.5
ion_model.encoder.layers.3.self_attn.q_proj.weight [1024, 1024] 15.6837 0.01532 3.0972 281.4
n_model.encoder.layers.3.self_attn.out_proj.weight [1024, 1024] 12.9667 0.01266 1.3714 665.4
vision_model.encoder.layers.3.mlp.fc1.weight [4096, 1024] 28.0314 0.01369 7.8429 635.9
vision_model.encoder.layers.3.mlp.fc2.weight [1024, 4096] 22.7624 0.01112 3.4051 768.6
ion_model.encoder.layers.4.self_attn.k_proj.weight [1024, 1024] 15.8693 0.01550 2.8455 288.1
ion_model.encoder.layers.4.self_attn.v_proj.weight [1024, 1024] 13.7542 0.01343 1.1932 674.9
ion_model.encoder.layers.4.self_attn.q_proj.weight [1024, 1024] 15.8018 0.01543 2.8635 295.8
n_model.encoder.layers.4.self_attn.out_proj.weight [1024, 1024] 12.9162 0.01261 1.2183 694.6
vision_model.encoder.layers.4.mlp.fc1.weight [4096, 1024] 29.7806 0.01454 6.4180 705.5
vision_model.encoder.layers.4.mlp.fc2.weight [1024, 4096] 24.5424 0.01198 3.2353 774.9
ion_model.encoder.layers.5.self_attn.k_proj.weight [1024, 1024] 15.2627 0.01491 2.2813 369.0
ion_model.encoder.layers.5.self_attn.v_proj.weight [1024, 1024] 14.1761 0.01384 1.6230 629.3
ion_model.encoder.layers.5.self_attn.q_proj.weight [1024, 1024] 15.2910 0.01493 2.2908 383.0
n_model.encoder.layers.5.self_attn.out_proj.weight [1024, 1024] 13.4318 0.01312 1.4986 673.0
vision_model.encoder.layers.5.mlp.fc1.weight [4096, 1024] 30.9118 0.01510 4.5842 771.7
vision_model.encoder.layers.5.mlp.fc2.weight [1024, 4096] 25.5090 0.01246 2.8728 803.2
ion_model.encoder.layers.6.self_attn.k_proj.weight [1024, 1024] 16.2386 0.01586 2.3373 434.2
ion_model.encoder.layers.6.self_attn.v_proj.weight [1024, 1024] 15.0351 0.01468 1.5573 616.6
ion_model.encoder.layers.6.self_attn.q_proj.weight [1024, 1024] 16.7687 0.01638 2.4588 443.6
n_model.encoder.layers.6.self_attn.out_proj.weight [1024, 1024] 14.1036 0.01377 1.3362 659.2
vision_model.encoder.layers.6.mlp.fc1.weight [4096, 1024] 30.5849 0.01494 4.0565 804.8
vision_model.encoder.layers.6.mlp.fc2.weight [1024, 4096] 25.7987 0.01260 2.7815 821.5
ion_model.encoder.layers.7.self_attn.k_proj.weight [1024, 1024] 16.1936 0.01581 1.8904 512.6
ion_model.encoder.layers.7.self_attn.v_proj.weight [1024, 1024] 14.7217 0.01438 1.3952 636.8
ion_model.encoder.layers.7.self_attn.q_proj.weight [1024, 1024] 16.2836 0.01590 2.2271 507.9
n_model.encoder.layers.7.self_attn.out_proj.weight [1024, 1024] 14.2476 0.01391 1.3998 648.3
vision_model.encoder.layers.7.mlp.fc1.weight [4096, 1024] 30.9270 0.01510 4.1601 832.6
vision_model.encoder.layers.7.mlp.fc2.weight [1024, 4096] 26.5679 0.01297 2.1988 854.7
ion_model.encoder.layers.8.self_attn.k_proj.weight [1024, 1024] 16.4682 0.01608 1.8729 542.4
ion_model.encoder.layers.8.self_attn.v_proj.weight [1024, 1024] 14.8673 0.01452 1.3464 646.9
ion_model.encoder.layers.8.self_attn.q_proj.weight [1024, 1024] 16.7130 0.01632 1.9112 526.1
n_model.encoder.layers.8.self_attn.out_proj.weight [1024, 1024] 14.3406 0.01400 1.1506 672.7
vision_model.encoder.layers.8.mlp.fc1.weight [4096, 1024] 31.1442 0.01521 4.1953 857.3
vision_model.encoder.layers.8.mlp.fc2.weight [1024, 4096] 27.3500 0.01336 2.1856 882.1
ion_model.encoder.layers.9.self_attn.k_proj.weight [1024, 1024] 16.2309 0.01585 1.8300 564.9
ion_model.encoder.layers.9.self_attn.v_proj.weight [1024, 1024] 15.1476 0.01479 1.4036 643.1
ion_model.encoder.layers.9.self_attn.q_proj.weight [1024, 1024] 16.5288 0.01614 1.9110 561.6
n_model.encoder.layers.9.self_attn.out_proj.weight [1024, 1024] 14.5335 0.01419 1.1607 688.9
vision_model.encoder.layers.9.mlp.fc1.weight [4096, 1024] 31.3509 0.01531 4.6304 857.9
vision_model.encoder.layers.9.mlp.fc2.weight [1024, 4096] 27.2579 0.01331 2.2316 888.1
on_model.encoder.layers.10.self_attn.k_proj.weight [1024, 1024] 16.7808 0.01639 1.8287 597.6
on_model.encoder.layers.10.self_attn.v_proj.weight [1024, 1024] 14.7162 0.01437 1.3223 666.2
on_model.encoder.layers.10.self_attn.q_proj.weight [1024, 1024] 17.0940 0.01669 2.0012 585.2
_model.encoder.layers.10.self_attn.out_proj.weight [1024, 1024] 14.2325 0.01390 1.1821 698.7
vision_model.encoder.layers.10.mlp.fc1.weight [4096, 1024] 31.7419 0.01550 5.2608 865.0
vision_model.encoder.layers.10.mlp.fc2.weight [1024, 4096] 28.0947 0.01372 2.1578 896.3
on_model.encoder.layers.11.self_attn.k_proj.weight [1024, 1024] 16.9787 0.01658 1.8848 603.9
on_model.encoder.layers.11.self_attn.v_proj.weight [1024, 1024] 14.6692 0.01433 1.2199 676.8
on_model.encoder.layers.11.self_attn.q_proj.weight [1024, 1024] 17.2452 0.01684 2.0200 591.2
_model.encoder.layers.11.self_attn.out_proj.weight [1024, 1024] 14.1674 0.01384 1.0924 702.1
vision_model.encoder.layers.11.mlp.fc1.weight [4096, 1024] 31.8994 0.01558 5.6641 868.9
vision_model.encoder.layers.11.mlp.fc2.weight [1024, 4096] 28.3883 0.01386 2.1991 900.8
on_model.encoder.layers.12.self_attn.k_proj.weight [1024, 1024] 17.4435 0.01703 1.8913 627.2
on_model.encoder.layers.12.self_attn.v_proj.weight [1024, 1024] 14.2397 0.01391 1.2073 697.7
on_model.encoder.layers.12.self_attn.q_proj.weight [1024, 1024] 17.4122 0.01700 1.8893 615.5
_model.encoder.layers.12.self_attn.out_proj.weight [1024, 1024] 13.7667 0.01344 1.1256 717.4
vision_model.encoder.layers.12.mlp.fc1.weight [4096, 1024] 32.2181 0.01573 5.9889 861.6
vision_model.encoder.layers.12.mlp.fc2.weight [1024, 4096] 28.2699 0.01381 2.1308 909.6
on_model.encoder.layers.13.self_attn.k_proj.weight [1024, 1024] 17.0104 0.01661 2.0448 626.3
on_model.encoder.layers.13.self_attn.v_proj.weight [1024, 1024] 14.7396 0.01439 1.2518 678.9
on_model.encoder.layers.13.self_attn.q_proj.weight [1024, 1024] 17.0819 0.01668 1.9264 615.2
_model.encoder.layers.13.self_attn.out_proj.weight [1024, 1024] 14.1694 0.01384 1.1199 714.2
vision_model.encoder.layers.13.mlp.fc1.weight [4096, 1024] 31.6421 0.01545 6.0148 879.0
vision_model.encoder.layers.13.mlp.fc2.weight [1024, 4096] 29.1049 0.01421 2.0250 922.2
on_model.encoder.layers.14.self_attn.k_proj.weight [1024, 1024] 17.5037 0.01709 2.0815 633.5
on_model.encoder.layers.14.self_attn.v_proj.weight [1024, 1024] 14.3347 0.01400 1.1217 688.9
on_model.encoder.layers.14.self_attn.q_proj.weight [1024, 1024] 17.6811 0.01727 2.0083 609.9
_model.encoder.layers.14.self_attn.out_proj.weight [1024, 1024] 13.9998 0.01367 1.2365 724.1
vision_model.encoder.layers.14.mlp.fc1.weight [4096, 1024] 31.5815 0.01542 6.2445 881.5
vision_model.encoder.layers.14.mlp.fc2.weight [1024, 4096] 30.2788 0.01479 2.1365 926.9
on_model.encoder.layers.15.self_attn.k_proj.weight [1024, 1024] 17.6716 0.01726 1.9023 657.7
on_model.encoder.layers.15.self_attn.v_proj.weight [1024, 1024] 14.2458 0.01391 1.0955 709.5
on_model.encoder.layers.15.self_attn.q_proj.weight [1024, 1024] 17.6763 0.01726 1.8757 639.0
_model.encoder.layers.15.self_attn.out_proj.weight [1024, 1024] 14.0235 0.01369 1.1093 722.6
vision_model.encoder.layers.15.mlp.fc1.weight [4096, 1024] 31.4302 0.01535 6.1131 884.8
vision_model.encoder.layers.15.mlp.fc2.weight [1024, 4096] 30.4680 0.01488 2.0513 931.6
on_model.encoder.layers.16.self_attn.k_proj.weight [1024, 1024] 17.3406 0.01693 1.8225 670.2
on_model.encoder.layers.16.self_attn.v_proj.weight [1024, 1024] 14.7329 0.01439 1.1737 708.3
on_model.encoder.layers.16.self_attn.q_proj.weight [1024, 1024] 17.3623 0.01696 1.8747 649.2
_model.encoder.layers.16.self_attn.out_proj.weight [1024, 1024] 14.5208 0.01418 1.1226 731.1
vision_model.encoder.layers.16.mlp.fc1.weight [4096, 1024] 30.9220 0.01510 5.8892 892.7
vision_model.encoder.layers.16.mlp.fc2.weight [1024, 4096] 32.0740 0.01566 1.9386 938.4
on_model.encoder.layers.17.self_attn.k_proj.weight [1024, 1024] 17.5623 0.01715 1.7163 681.4
on_model.encoder.layers.17.self_attn.v_proj.weight [1024, 1024] 14.7063 0.01436 1.1468 716.8
on_model.encoder.layers.17.self_attn.q_proj.weight [1024, 1024] 17.5951 0.01718 1.7824 656.7
_model.encoder.layers.17.self_attn.out_proj.weight [1024, 1024] 14.5419 0.01420 1.1008 723.9
vision_model.encoder.layers.17.mlp.fc1.weight [4096, 1024] 30.7107 0.01500 5.6539 898.6
vision_model.encoder.layers.17.mlp.fc2.weight [1024, 4096] 32.6687 0.01595 1.9743 936.2
on_model.encoder.layers.18.self_attn.k_proj.weight [1024, 1024] 17.2542 0.01685 1.6047 694.5
on_model.encoder.layers.18.self_attn.v_proj.weight [1024, 1024] 15.1093 0.01476 1.1787 722.1
on_model.encoder.layers.18.self_attn.q_proj.weight [1024, 1024] 17.4201 0.01701 2.0498 665.3
_model.encoder.layers.18.self_attn.out_proj.weight [1024, 1024] 14.9225 0.01457 1.1230 724.2
vision_model.encoder.layers.18.mlp.fc1.weight [4096, 1024] 30.7589 0.01502 5.5648 903.0
vision_model.encoder.layers.18.mlp.fc2.weight [1024, 4096] 32.9469 0.01609 2.0394 930.9
on_model.encoder.layers.19.self_attn.k_proj.weight [1024, 1024] 17.3751 0.01697 1.6815 694.4
on_model.encoder.layers.19.self_attn.v_proj.weight [1024, 1024] 15.4226 0.01506 1.2927 720.0
on_model.encoder.layers.19.self_attn.q_proj.weight [1024, 1024] 17.4778 0.01707 1.8012 668.5
_model.encoder.layers.19.self_attn.out_proj.weight [1024, 1024] 15.2020 0.01485 1.1774 716.2
vision_model.encoder.layers.19.mlp.fc1.weight [4096, 1024] 31.0316 0.01515 5.3302 907.1
vision_model.encoder.layers.19.mlp.fc2.weight [1024, 4096] 33.5712 0.01639 2.2840 924.0
on_model.encoder.layers.20.self_attn.k_proj.weight [1024, 1024] 16.9316 0.01653 1.6881 696.4
on_model.encoder.layers.20.self_attn.v_proj.weight [1024, 1024] 15.8929 0.01552 1.4046 717.9
on_model.encoder.layers.20.self_attn.q_proj.weight [1024, 1024] 17.0880 0.01669 1.8350 668.7
_model.encoder.layers.20.self_attn.out_proj.weight [1024, 1024] 15.7753 0.01541 1.3378 703.6
vision_model.encoder.layers.20.mlp.fc1.weight [4096, 1024] 31.6061 0.01543 5.5200 909.0
vision_model.encoder.layers.20.mlp.fc2.weight [1024, 4096] 36.0967 0.01763 2.9445 910.8
on_model.encoder.layers.21.self_attn.k_proj.weight [1024, 1024] 16.5182 0.01613 1.7138 697.5
on_model.encoder.layers.21.self_attn.v_proj.weight [1024, 1024] 16.3243 0.01594 1.6063 722.9
on_model.encoder.layers.21.self_attn.q_proj.weight [1024, 1024] 17.6180 0.01721 4.2360 632.9
_model.encoder.layers.21.self_attn.out_proj.weight [1024, 1024] 16.1730 0.01579 1.4016 700.0
vision_model.encoder.layers.21.mlp.fc1.weight [4096, 1024] 31.9009 0.01558 5.4005 913.5
vision_model.encoder.layers.21.mlp.fc2.weight [1024, 4096] 37.0141 0.01808 3.0204 903.5
on_model.encoder.layers.22.self_attn.k_proj.weight [1024, 1024] 15.4876 0.01512 7.1209 551.2
on_model.encoder.layers.22.self_attn.v_proj.weight [1024, 1024] 15.7824 0.01541 1.2318 717.9
on_model.encoder.layers.22.self_attn.q_proj.weight [1024, 1024] 33.3101 0.03253 17.2954 256.4
_model.encoder.layers.22.self_attn.out_proj.weight [1024, 1024] 15.4298 0.01507 1.4900 709.2
vision_model.encoder.layers.22.mlp.fc1.weight [4096, 1024] 31.3137 0.01529 5.0053 916.6
vision_model.encoder.layers.22.mlp.fc2.weight [1024, 4096] 35.9646 0.01756 2.6698 891.2
on_model.encoder.layers.23.self_attn.k_proj.weight [1024, 1024] 14.7596 0.01441 1.4858 677.5
on_model.encoder.layers.23.self_attn.v_proj.weight [1024, 1024] 18.9487 0.01850 1.2860 701.6
on_model.encoder.layers.23.self_attn.q_proj.weight [1024, 1024] 14.9754 0.01462 1.5721 669.1
_model.encoder.layers.23.self_attn.out_proj.weight [1024, 1024] 18.2947 0.01787 1.3815 680.1
vision_model.encoder.layers.23.mlp.fc1.weight [4096, 1024] 31.6551 0.01546 4.1330 893.4
vision_model.encoder.layers.23.mlp.fc2.weight [1024, 4096] 30.3341 0.01481 3.8653 633.1
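The eff_rank column is presumably the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular-value distribution); the script's exact definition is not shown, so treat this as an assumption. On a flat spectrum it returns the full rank and it drops toward 1 as mass concentrates in a few singular values, which is why the anomalous layer-22 q_proj above (sv_max 17.30) reads eff_rank 256.4 against ~600+ for its neighbors. A sketch operating on a precomputed list of singular values:

```python
from math import exp, log

def effective_rank(svals):
    """Entropy-based effective rank of a singular-value spectrum (assumed
    definition of the table's eff_rank column). sv_max is just max(svals)."""
    total = sum(svals)
    probs = [s / total for s in svals if s > 0]
    entropy = -sum(p * log(p) for p in probs)
    return exp(entropy)

# A flat spectrum of 10 equal singular values has effective rank 10;
# concentrating mass in one value drives the measure toward 1.
print(round(effective_rank([1.0] * 10), 3))      # → 10.0
print(round(effective_rank([10.0] + [0.1] * 9), 3))
```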
dinov2_b14 — key weight matrices:
param shape norm std sv_max eff_rank
---------------------------------------------------------------------------------------------------------
embeddings.position_embeddings [1, 1370, 768] 9.6808 0.00944 N/A N/A
embeddings.patch_embeddings.projection.weight [768, 3, 14, 14] 5.0943 0.00758 N/A N/A
encoder.layer.0.attention.attention.query.weight [768, 768] 15.0653 0.01962 8.2803 124.1
encoder.layer.0.attention.attention.key.weight [768, 768] 14.5449 0.01894 7.0240 146.6
encoder.layer.0.attention.attention.value.weight [768, 768] 10.8999 0.01419 2.6902 312.2
encoder.layer.0.attention.output.dense.weight [768, 768] 9.6377 0.01255 1.4304 442.7
encoder.layer.0.mlp.fc1.weight [3072, 768] 26.1807 0.01704 6.9235 362.1
encoder.layer.0.mlp.fc2.weight [768, 3072] 22.1737 0.01444 2.6781 528.8
encoder.layer.1.attention.attention.query.weight [768, 768] 15.5110 0.02020 3.0665 294.9
encoder.layer.1.attention.attention.key.weight [768, 768] 15.7740 0.02054 3.6581 290.2
encoder.layer.1.attention.attention.value.weight [768, 768] 12.0716 0.01572 1.4580 485.9
encoder.layer.1.attention.output.dense.weight [768, 768] 11.2017 0.01459 1.3664 491.9
encoder.layer.1.mlp.fc1.weight [3072, 768] 23.5162 0.01531 3.9140 569.8
encoder.layer.1.mlp.fc2.weight [768, 3072] 20.7634 0.01352 2.1780 621.9
encoder.layer.2.attention.attention.query.weight [768, 768] 13.6969 0.01783 2.0466 409.0
encoder.layer.2.attention.attention.key.weight [768, 768] 13.8587 0.01805 2.1433 409.6
encoder.layer.2.attention.attention.value.weight [768, 768] 11.3287 0.01475 1.1252 608.8
encoder.layer.2.attention.output.dense.weight [768, 768] 10.8751 0.01416 1.1264 614.7
encoder.layer.2.mlp.fc1.weight [3072, 768] 22.1368 0.01441 2.6436 620.0
encoder.layer.2.mlp.fc2.weight [768, 3072] 19.6161 0.01277 1.8037 617.4
encoder.layer.3.attention.attention.query.weight [768, 768] 17.3785 0.02263 10.1944 274.2
encoder.layer.3.attention.attention.key.weight [768, 768] 14.0313 0.01827 2.6146 362.7
encoder.layer.3.attention.attention.value.weight [768, 768] 11.7833 0.01534 1.5936 436.7
encoder.layer.3.attention.output.dense.weight [768, 768] 10.9542 0.01426 1.1789 484.3
encoder.layer.3.mlp.fc1.weight [3072, 768] 22.3524 0.01455 2.4559 632.9
encoder.layer.3.mlp.fc2.weight [768, 3072] 18.9526 0.01234 1.7526 620.3
encoder.layer.4.attention.attention.query.weight [768, 768] 13.6447 0.01777 1.4802 440.8
encoder.layer.4.attention.attention.key.weight [768, 768] 13.5677 0.01767 1.8610 427.6
encoder.layer.4.attention.attention.value.weight [768, 768] 11.5948 0.01510 1.3135 471.1
encoder.layer.4.attention.output.dense.weight [768, 768] 10.8005 0.01406 0.9923 515.5
encoder.layer.4.mlp.fc1.weight [3072, 768] 22.5944 0.01471 2.1506 646.5
encoder.layer.4.mlp.fc2.weight [768, 3072] 18.8099 0.01225 1.7677 625.3
encoder.layer.5.attention.attention.query.weight [768, 768] 13.7255 0.01787 1.6372 413.2
encoder.layer.5.attention.attention.key.weight [768, 768] 13.3573 0.01739 1.6523 441.7
encoder.layer.5.attention.attention.value.weight [768, 768] 11.7572 0.01531 1.2932 484.0
encoder.layer.5.attention.output.dense.weight [768, 768] 10.9291 0.01423 1.0229 510.2
encoder.layer.5.mlp.fc1.weight [3072, 768] 23.4727 0.01528 2.3335 654.6
encoder.layer.5.mlp.fc2.weight [768, 3072] 18.7835 0.01223 1.9061 642.5
encoder.layer.6.attention.attention.query.weight [768, 768] 14.0862 0.01834 1.5297 461.8
encoder.layer.6.attention.attention.key.weight [768, 768] 13.8269 0.01800 1.9102 465.2
encoder.layer.6.attention.attention.value.weight [768, 768] 11.5155 0.01499 1.3197 476.7
encoder.layer.6.attention.output.dense.weight [768, 768] 10.8719 0.01416 1.1720 512.2
encoder.layer.6.mlp.fc1.weight [3072, 768] 23.9825 0.01561 2.5247 661.2
encoder.layer.6.mlp.fc2.weight [768, 3072] 18.9719 0.01235 1.7798 648.4
encoder.layer.7.attention.attention.query.weight [768, 768] 14.4370 0.01880 1.6765 436.5
encoder.layer.7.attention.attention.key.weight [768, 768] 13.9417 0.01815 2.3330 451.6
encoder.layer.7.attention.attention.value.weight [768, 768] 11.7075 0.01524 1.3177 466.2
encoder.layer.7.attention.output.dense.weight [768, 768] 11.0254 0.01436 1.2727 504.1
encoder.layer.7.mlp.fc1.weight [3072, 768] 23.5657 0.01534 2.2714 656.7
encoder.layer.7.mlp.fc2.weight [768, 3072] 18.6998 0.01217 1.7242 656.8
encoder.layer.8.attention.attention.query.weight [768, 768] 14.4875 0.01886 1.6457 457.0
encoder.layer.8.attention.attention.key.weight [768, 768] 14.0343 0.01827 1.9319 464.9
encoder.layer.8.attention.attention.value.weight [768, 768] 11.7632 0.01532 1.2228 483.8
encoder.layer.8.attention.output.dense.weight [768, 768] 11.1382 0.01450 1.7787 515.8
encoder.layer.8.mlp.fc1.weight [3072, 768] 24.9183 0.01622 6.6240 625.9
encoder.layer.8.mlp.fc2.weight [768, 3072] 19.3520 0.01260 1.6640 675.1
encoder.layer.9.attention.attention.query.weight [768, 768] 14.1629 0.01844 1.7359 464.0
encoder.layer.9.attention.attention.key.weight [768, 768] 13.9257 0.01813 1.9286 475.4
encoder.layer.9.attention.attention.value.weight [768, 768] 12.0954 0.01575 1.2698 494.0
encoder.layer.9.attention.output.dense.weight [768, 768] 11.6187 0.01513 1.4691 523.1
encoder.layer.9.mlp.fc1.weight [3072, 768] 24.2356 0.01578 2.8893 679.9
encoder.layer.9.mlp.fc2.weight [768, 3072] 20.2806 0.01320 1.7777 687.3
encoder.layer.10.attention.attention.query.weight [768, 768] 14.1126 0.01838 1.7836 478.3
encoder.layer.10.attention.attention.key.weight [768, 768] 13.7915 0.01796 1.8475 493.9
encoder.layer.10.attention.attention.value.weight [768, 768] 12.5603 0.01635 1.2524 510.0
encoder.layer.10.attention.output.dense.weight [768, 768] 12.1455 0.01581 1.5136 547.3
encoder.layer.10.mlp.fc1.weight [3072, 768] 24.6123 0.01602 4.2760 689.5
encoder.layer.10.mlp.fc2.weight [768, 3072] 21.8431 0.01422 2.4292 672.4
encoder.layer.11.attention.attention.query.weight [768, 768] 13.8379 0.01802 1.8647 457.7
encoder.layer.11.attention.attention.key.weight [768, 768] 13.7512 0.01791 2.4709 482.9
encoder.layer.11.attention.attention.value.weight [768, 768] 13.2831 0.01730 1.5197 562.9
encoder.layer.11.attention.output.dense.weight [768, 768] 13.0485 0.01699 2.6470 605.2
encoder.layer.11.mlp.fc1.weight [3072, 768] 24.5670 0.01599 3.6963 678.9
encoder.layer.11.mlp.fc2.weight [768, 3072] 22.6176 0.01473 2.3999 711.8
siglip_b16 — key weight matrices:
param shape norm std sv_max eff_rank
---------------------------------------------------------------------------------------------------------
vision_model.embeddings.patch_embedding.weight [768, 3, 16, 16] 13.1067 0.01707 N/A N/A
vision_model.embeddings.position_embedding.weight [576, 768] 93.0842 0.13996 67.7360 20.9
ion_model.encoder.layers.0.self_attn.k_proj.weight [768, 768] 25.3506 0.03301 6.7521 216.8
ion_model.encoder.layers.0.self_attn.v_proj.weight [768, 768] 11.1804 0.01456 1.4769 475.2
ion_model.encoder.layers.0.self_attn.q_proj.weight [768, 768] 25.4386 0.03312 9.0594 216.9
n_model.encoder.layers.0.self_attn.out_proj.weight [768, 768] 12.0512 0.01569 3.9438 442.9
vision_model.encoder.layers.0.mlp.fc1.weight [3072, 768] 32.2698 0.02101 6.9756 495.1
vision_model.encoder.layers.0.mlp.fc2.weight [768, 3072] 30.5055 0.01986 4.9364 513.4
ion_model.encoder.layers.1.self_attn.k_proj.weight [768, 768] 21.2316 0.02765 4.2540 266.5
ion_model.encoder.layers.1.self_attn.v_proj.weight [768, 768] 14.2354 0.01854 1.8996 469.1
ion_model.encoder.layers.1.self_attn.q_proj.weight [768, 768] 21.5655 0.02808 3.8437 281.0
n_model.encoder.layers.1.self_attn.out_proj.weight [768, 768] 13.4008 0.01745 1.8366 490.3
vision_model.encoder.layers.1.mlp.fc1.weight [3072, 768] 33.1464 0.02158 3.4009 600.7
vision_model.encoder.layers.1.mlp.fc2.weight [768, 3072] 27.6425 0.01800 2.5865 625.9
ion_model.encoder.layers.2.self_attn.k_proj.weight [768, 768] 19.0127 0.02476 2.1765 406.8
ion_model.encoder.layers.2.self_attn.v_proj.weight [768, 768] 14.1224 0.01839 1.5604 507.2
ion_model.encoder.layers.2.self_attn.q_proj.weight [768, 768] 19.0933 0.02486 2.1423 408.5
n_model.encoder.layers.2.self_attn.out_proj.weight [768, 768] 13.2740 0.01728 1.5355 501.0
vision_model.encoder.layers.2.mlp.fc1.weight [3072, 768] 34.2425 0.02229 3.3269 659.9
vision_model.encoder.layers.2.mlp.fc2.weight [768, 3072] 27.3498 0.01781 2.2268 657.0
ion_model.encoder.layers.3.self_attn.k_proj.weight [768, 768] 16.7138 0.02176 2.1664 414.9
ion_model.encoder.layers.3.self_attn.v_proj.weight [768, 768] 15.6367 0.02036 1.6104 490.8
ion_model.encoder.layers.3.self_attn.q_proj.weight [768, 768] 17.9006 0.02331 2.0858 420.8
n_model.encoder.layers.3.self_attn.out_proj.weight [768, 768] 14.5315 0.01892 1.3975 522.0
vision_model.encoder.layers.3.mlp.fc1.weight [3072, 768] 34.7201 0.02260 4.1337 662.5
vision_model.encoder.layers.3.mlp.fc2.weight [768, 3072] 27.9364 0.01819 2.6334 671.7
ion_model.encoder.layers.4.self_attn.k_proj.weight [768, 768] 16.6270 0.02165 1.8333 469.6
ion_model.encoder.layers.4.self_attn.v_proj.weight [768, 768] 15.5543 0.02025 1.6095 487.9
ion_model.encoder.layers.4.self_attn.q_proj.weight [768, 768] 17.3636 0.02261 1.9774 446.8
n_model.encoder.layers.4.self_attn.out_proj.weight [768, 768] 14.7915 0.01926 1.5039 516.9
vision_model.encoder.layers.4.mlp.fc1.weight [3072, 768] 34.3811 0.02229 5.8771 661.6
vision_model.encoder.layers.4.mlp.fc2.weight [768, 3072] 29.3207 0.01909 2.5450 680.0
ion_model.encoder.layers.5.self_attn.k_proj.weight [768, 768] 16.8992 0.02200 2.0586 469.6
ion_model.encoder.layers.5.self_attn.v_proj.weight [768, 768] 15.2796 0.01990 1.5355 505.4
ion_model.encoder.layers.5.self_attn.q_proj.weight [768, 768] 17.6242 0.02295 1.9778 463.6
n_model.encoder.layers.5.self_attn.out_proj.weight [768, 768] 14.5603 0.01896 1.3099 538.5
vision_model.encoder.layers.5.mlp.fc1.weight [3072, 768] 33.7324 0.02196 4.8606 667.3
vision_model.encoder.layers.5.mlp.fc2.weight [768, 3072] 30.0129 0.01954 3.0083 689.7
ion_model.encoder.layers.6.self_attn.k_proj.weight [768, 768] 17.1745 0.02236 1.9155 473.6
ion_model.encoder.layers.6.self_attn.v_proj.weight [768, 768] 15.3063 0.01993 1.4017 510.6
ion_model.encoder.layers.6.self_attn.q_proj.weight [768, 768] 17.6908 0.02304 2.1205 457.5
n_model.encoder.layers.6.self_attn.out_proj.weight [768, 768] 14.6406 0.01906 1.2968 541.3
vision_model.encoder.layers.6.mlp.fc1.weight [3072, 768] 32.7128 0.02129 5.2047 674.9
vision_model.encoder.layers.6.mlp.fc2.weight [768, 3072] 32.1632 0.02094 2.6609 695.0
ion_model.encoder.layers.7.self_attn.k_proj.weight [768, 768] 17.4925 0.02278 1.6792 501.7
ion_model.encoder.layers.7.self_attn.v_proj.weight [768, 768] 15.3744 0.02002 1.2941 525.3
ion_model.encoder.layers.7.self_attn.q_proj.weight [768, 768] 17.5108 0.02280 1.8597 492.8
n_model.encoder.layers.7.self_attn.out_proj.weight [768, 768] 14.7239 0.01917 1.2469 553.5
vision_model.encoder.layers.7.mlp.fc1.weight [3072, 768] 32.4363 0.02112 5.1176 676.7
vision_model.encoder.layers.7.mlp.fc2.weight [768, 3072] 33.3117 0.02169 2.5620 703.2
vision_model.encoder.layers.8.self_attn.k_proj.weight [768, 768] 17.2571 0.02247 1.5921 514.9
vision_model.encoder.layers.8.self_attn.v_proj.weight [768, 768] 15.3517 0.01999 1.2797 528.9
vision_model.encoder.layers.8.self_attn.q_proj.weight [768, 768] 17.2066 0.02240 1.8249 500.0
vision_model.encoder.layers.8.self_attn.out_proj.weight [768, 768] 14.9553 0.01947 1.1161 558.9
vision_model.encoder.layers.8.mlp.fc1.weight [3072, 768] 32.0244 0.02085 5.3609 677.1
vision_model.encoder.layers.8.mlp.fc2.weight [768, 3072] 35.6540 0.02321 2.7284 707.4
vision_model.encoder.layers.9.self_attn.k_proj.weight [768, 768] 16.6423 0.02167 1.4625 526.7
vision_model.encoder.layers.9.self_attn.v_proj.weight [768, 768] 15.9838 0.02081 1.3547 539.6
vision_model.encoder.layers.9.self_attn.q_proj.weight [768, 768] 16.5648 0.02157 1.6295 509.9
vision_model.encoder.layers.9.self_attn.out_proj.weight [768, 768] 15.7026 0.02045 1.4963 555.7
vision_model.encoder.layers.9.mlp.fc1.weight [3072, 768] 33.1471 0.02157 7.5875 672.6
vision_model.encoder.layers.9.mlp.fc2.weight [768, 3072] 35.0755 0.02284 2.9263 703.1
vision_model.encoder.layers.10.self_attn.k_proj.weight [768, 768] 15.8790 0.02068 1.3777 536.5
vision_model.encoder.layers.10.self_attn.v_proj.weight [768, 768] 17.2492 0.02246 1.6574 542.5
vision_model.encoder.layers.10.self_attn.q_proj.weight [768, 768] 15.7094 0.02046 1.5116 520.5
vision_model.encoder.layers.10.self_attn.out_proj.weight [768, 768] 16.9406 0.02206 1.7366 538.1
vision_model.encoder.layers.10.mlp.fc1.weight [3072, 768] 35.5715 0.02315 9.1395 677.0
vision_model.encoder.layers.10.mlp.fc2.weight [768, 3072] 37.5724 0.02446 3.6449 694.6
vision_model.encoder.layers.11.self_attn.k_proj.weight [768, 768] 16.1542 0.02103 1.7355 529.3
vision_model.encoder.layers.11.self_attn.v_proj.weight [768, 768] 18.5693 0.02418 2.1755 525.9
vision_model.encoder.layers.11.self_attn.q_proj.weight [768, 768] 15.6457 0.02037 1.7939 517.4
vision_model.encoder.layers.11.self_attn.out_proj.weight [768, 768] 18.4060 0.02397 2.5140 515.2
vision_model.encoder.layers.11.mlp.fc1.weight [3072, 768] 35.7601 0.02328 8.1755 683.3
vision_model.encoder.layers.11.mlp.fc2.weight [768, 3072] 38.3973 0.02500 5.6203 628.3
vision_model.head.attention.in_proj_weight [2304, 768] 19.4668 0.01463 3.2690 654.7
vision_model.head.attention.out_proj.weight [768, 768] 16.4625 0.02144 1.2549 673.8
vision_model.head.mlp.fc1.weight [3072, 768] 50.3037 0.03239 12.1273 598.4
vision_model.head.mlp.fc2.weight [768, 3072] 32.4487 0.02113 4.7967 605.0
=================================================================
SCAN 4: PATCH EMBEDDING WEIGHTS
=================================================================
clip_l14: vision_model.embeddings.patch_embedding.weight
Shape: [1024, 3, 14, 14]
= 1024 filters × 3 channels × 14×14 kernel
Spectral: sv_max=1.6538 sv_min=0.003914 eff_rank=240.4/588
Norm: 13.0285 Mean: -0.000047 Std: 0.016790
Filter norms: mean=0.3947 std=0.0998 min=0.0126 max=0.6173
dinov2_b14: embeddings.patch_embeddings.projection.weight
Shape: [768, 3, 14, 14]
= 768 filters × 3 channels × 14×14 kernel
Spectral: sv_max=0.5876 sv_min=0.001076 eff_rank=238.0/588
Norm: 5.0943 Mean: 0.000002 Std: 0.007581
Filter norms: mean=0.1724 std=0.0639 min=0.0037 max=0.2703
siglip_b16: vision_model.embeddings.patch_embedding.weight
Shape: [768, 3, 16, 16]
= 768 filters × 3 channels × 16×16 kernel
Spectral: sv_max=1.4635 sv_min=0.000005 eff_rank=306.1/768
Norm: 13.1067 Mean: -0.000114 Std: 0.017066
Filter norms: mean=0.4322 std=0.1923 min=0.0375 max=0.6987
Patch embedding Procrustes alignment:
clip_l14 × dinov2_b14: raw_cos=0.0005 (d_min=768, d_feat=588)
clip_l14 × siglip_b16: raw_cos=0.0002 (d_min=768, d_feat=588)
dinov2_b14 × siglip_b16: raw_cos=-0.0013 (d_min=768, d_feat=588)
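The near-zero raw_cos values above are what independently trained filter banks should produce. A minimal sketch of such a comparison, assuming raw_cos is the cosine between the two flattened [d_min, d_feat] filter matrices (the script's exact definition is not shown in this log):

```python
import numpy as np

def raw_cos(A, B):
    # Cosine between two weight matrices flattened to single vectors.
    # Hypothetical reconstruction of the "raw_cos" column above.
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
A = rng.standard_normal((768, 588))  # e.g. 768 filters, 3*14*14 = 588 features
B = rng.standard_normal((768, 588))
near_zero = raw_cos(A, B)   # independent banks: cosine near 0
identical = raw_cos(A, A)   # self-comparison: exactly 1
```

With ~450k entries per matrix, the expected magnitude of the cosine between independent Gaussian banks is on the order of 1e-3, matching the table.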
=================================================================
SCAN 5: ATTENTION HEAD GEOMETRY
=================================================================
clip_l14 (24 layers):
layer Q_norm K_norm V_norm QK_cos QV_cos KV_cos
0 10.327 10.191 10.465 -0.0050 0.4956 0.0016
1 13.829 12.888 13.290 0.0010 0.3957 0.0059
2 15.006 13.236 14.779 -0.0013 0.4683 0.0019
...
12 17.444 14.240 17.412 0.0023 0.2518 0.0017
22 15.488 15.782 33.310 -0.0000 0.1002 0.0010
23 14.760 18.949 14.975 -0.0015 0.2212 -0.0026
dinov2_b14 (12 layers):
layer Q_norm K_norm V_norm QK_cos QV_cos KV_cos
0 15.065 14.545 10.900 0.5069 0.0064 -0.0028
1 15.511 15.774 12.072 0.4234 0.0012 0.0031
2 13.697 13.859 11.329 0.5255 -0.0000 0.0004
...
6 14.086 13.827 11.516 0.1433 -0.0004 -0.0013
10 14.113 13.791 12.560 0.1317 -0.0011 -0.0007
11 13.838 13.751 13.283 0.2065 -0.0000 0.0021
siglip_b16 (12 layers):
layer Q_norm K_norm V_norm QK_cos QV_cos KV_cos
0 25.351 11.180 25.439 0.0025 0.7388 0.0022
1 21.232 14.235 21.566 -0.0022 0.4919 -0.0048
2 19.013 14.122 19.093 0.0000 0.2704 0.0009
...
6 17.175 15.306 17.691 -0.0027 0.2154 0.0000
10 15.879 17.249 15.709 0.0006 0.1501 -0.0001
11 16.154 18.569 15.646 0.0010 0.1470 -0.0006
=================================================================
SCAN 6: CROSS-MODEL WEIGHT ALIGNMENT
=================================================================
Cross-model Q weight cosine at equivalent depth fractions:
depth clip×dino clip×siglip dino×siglip
clip_l14: 24 layers
dinov2_b14: 12 layers
siglip_b16: 12 layers
0% 0.0010 -0.0014 -0.0005
25% -0.0006 0.0012 -0.0000
50% -0.0014 -0.0007 -0.0009
75% 0.0004 -0.0010 -0.0004
100% -0.0004 -0.0006 -0.0006
=================================================================
SCAN 7: MLP WEIGHT SPECTRUM
=================================================================
clip_l14 MLPs (48 weight matrices):
mlp.fc1.weight [4096, 1024] eff_rank= 281.7/1024 sv_max=11.075 sv_10=1.9901
mlp.fc2.weight [1024, 4096] eff_rank= 455.7/1024 sv_max=5.235 sv_10=2.0529
mlp.fc1.weight [4096, 1024] eff_rank= 505.0/1024 sv_max=9.280 sv_10=2.4024
mlp.fc2.weight [1024, 4096] eff_rank= 687.7/1024 sv_max=3.756 sv_10=1.5893
mlp.fc1.weight [4096, 1024] eff_rank= 565.7/1024 sv_max=9.483 sv_10=2.2923
mlp.fc2.weight [1024, 4096] eff_rank= 707.4/1024 sv_max=3.850 sv_10=1.7550
... (42 more)
dinov2_b14 MLPs (24 weight matrices):
mlp.fc1.weight [3072, 768] eff_rank= 362.1/768 sv_max=6.923 sv_10=2.3532
mlp.fc2.weight [768, 3072] eff_rank= 528.8/768 sv_max=2.678 sv_10=1.7801
mlp.fc1.weight [3072, 768] eff_rank= 569.8/768 sv_max=3.914 sv_10=1.8138
mlp.fc2.weight [768, 3072] eff_rank= 621.9/768 sv_max=2.178 sv_10=1.4481
mlp.fc1.weight [3072, 768] eff_rank= 620.0/768 sv_max=2.644 sv_10=1.6269
mlp.fc2.weight [768, 3072] eff_rank= 617.4/768 sv_max=1.804 sv_10=1.5185
... (18 more)
siglip_b16 MLPs (26 weight matrices):
mlp.fc1.weight [3072, 768] eff_rank= 495.1/768 sv_max=6.976 sv_10=2.8588
mlp.fc2.weight [768, 3072] eff_rank= 513.4/768 sv_max=4.936 sv_10=2.9769
mlp.fc1.weight [3072, 768] eff_rank= 600.7/768 sv_max=3.401 sv_10=2.4675
mlp.fc2.weight [768, 3072] eff_rank= 625.9/768 sv_max=2.586 sv_10=1.9950
mlp.fc1.weight [3072, 768] eff_rank= 659.9/768 sv_max=3.327 sv_10=2.3830
mlp.fc2.weight [768, 3072] eff_rank= 657.0/768 sv_max=2.227 sv_10=1.9165
... (20 more)
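The eff_rank column can be sketched as follows, assuming the entropy-based effective rank (Roy and Vetterli); the log does not state which estimator the script actually uses:

```python
import numpy as np

def eff_rank(W):
    # Entropy-based effective rank: exp of the Shannon entropy of the
    # normalized singular-value distribution. Assumed definition only.
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-np.sum(p * np.log(p + 1e-12))))

rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))
r = eff_rank(W)  # a dense i.i.d. Gaussian matrix sits near full rank
```

The value is bounded by min(rows, cols), so the "/768" and "/1024" denominators above are the maximum attainable.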
=================================================================
SCAN 8: POSITION EMBEDDINGS
=================================================================
clip_l14: vision_model.embeddings.position_embedding.weight
Shape: [257, 1024]
Norm: 11.4993 Mean: -0.013006 Std: 0.018257
Self-sim: diag_mean=1.0000 off_diag_mean=0.8616
Adjacent pos cos: mean=0.9259
Spectral: eff_rank=22.3/257 sv1%=86.5%
dinov2_b14: embeddings.position_embeddings
Shape: [1, 1370, 768]
Norm: 9.6808 Mean: -0.000044 Std: 0.009438
Self-sim: diag_mean=1.0000 off_diag_mean=0.0899
Adjacent pos cos: mean=0.9136
Spectral: eff_rank=47.1/768 sv1%=12.2%
siglip_b16: vision_model.embeddings.position_embedding.weight
Shape: [576, 768]
Norm: 93.0842 Mean: -0.000226 Std: 0.139955
Self-sim: diag_mean=1.0000 off_diag_mean=0.0637
Adjacent pos cos: mean=0.8524
Spectral: eff_rank=20.9/576 sv1%=53.0%
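The self-similarity and adjacent-position numbers above can be reproduced with a sketch like this, assuming off_diag_mean is the off-diagonal mean of the cosine-similarity matrix of L2-normalized rows and "Adjacent pos cos" averages consecutive positions:

```python
import numpy as np

def pos_embed_stats(P):
    # P: [num_positions, dim] position-embedding table.
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    S = Pn @ Pn.T                                   # cosine self-similarity
    n = len(S)
    off_diag_mean = (S.sum() - np.trace(S)) / (n * n - n)
    adjacent_cos = float(np.mean(np.diag(S, k=1)))  # position i vs i+1
    return float(off_diag_mean), adjacent_cos

# A slowly rotating 2-D embedding: neighboring positions nearly parallel.
t = np.linspace(0.0, np.pi / 2, 50)
P = np.stack([np.cos(t), np.sin(t)], axis=1)
off, adj = pos_embed_stats(P)
```

A high adjacent cosine with a low off-diagonal mean (as for dinov2_b14 and siglip_b16) indicates locally smooth but globally spread positions; CLIP's 0.86 off-diagonal mean indicates a large shared component across all positions.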
=================================================================
SCAN 9: LAYERNORM WEIGHT/BIAS PATTERNS
=================================================================
clip_l14 (50 LayerNorms):
vision_model.pre_layrnorm w: mean=0.4305 std=0.4963 b: mean=-0.00095 std=0.0838
0.layer_norm1 w: mean=0.3185 std=0.3345 b: mean=-0.00761 std=0.1239
0.layer_norm2 w: mean=0.8141 std=0.5963 b: mean=0.00049 std=0.1555
1.layer_norm1 w: mean=0.4587 std=0.2367 b: mean=-0.00064 std=0.1305
FINAL: vision_model.post_layernorm.weight
weight: mean=1.0044 std=0.0635 min=0.6363 max=1.8433
bias: mean=0.15815 std=0.1796
dinov2_b14 (25 LayerNorms):
0.norm1 w: mean=0.6640 std=0.6603 b: mean=0.00617 std=0.2407
0.norm2 w: mean=1.4833 std=0.8402 b: mean=0.03759 std=0.3599
1.norm1 w: mean=1.2084 std=0.7074 b: mean=0.02352 std=0.3012
1.norm2 w: mean=1.3366 std=0.3585 b: mean=0.01015 std=0.3414
FINAL: layernorm.weight
weight: mean=2.0223 std=1.2662 min=-0.0481 max=15.5085
bias: mean=0.00439 std=0.4535
siglip_b16 (26 LayerNorms):
0.layer_norm1 w: mean=0.4845 std=0.2820 b: mean=0.00176 std=0.1890
0.layer_norm2 w: mean=1.6044 std=1.1956 b: mean=-0.00058 std=0.3523
1.layer_norm1 w: mean=0.7414 std=0.2573 b: mean=0.02384 std=0.2548
1.layer_norm2 w: mean=1.0720 std=0.2015 b: mean=0.02144 std=0.2104
FINAL: vision_model.head.layernorm.weight
weight: mean=0.9334 std=0.1508 min=0.4374 max=2.4978
bias: mean=0.14603 std=0.3553
=================================================================
SCAN 10: PENTACHORON CV ON WEIGHT GEOMETRY
=================================================================
Patch embedding filter CV (rows = output filters):
clip_l14 filters=1024 CV=0.0484
dinov2_b14 filters=768 CV=0.0444
siglip_b16 filters=768 CV=0.0398
QKV weight row CV per layer:
model layer Q_cv K_cv V_cv QK_diff
clip_l14 0 0.2023 0.0364 0.2803 0.1659
clip_l14 1 0.1546 0.0273 0.1394 0.1273
clip_l14 ...
clip_l14 12 0.0290 0.0236 0.0318 0.0054
clip_l14 22 0.1283 0.0206 0.1969 0.1077
clip_l14 23 0.0259 0.0203 0.0248 0.0056
dinov2_b14 0 0.2148 0.1172 0.0515 0.0977
dinov2_b14 1 0.0682 0.0656 0.0254 0.0026
dinov2_b14 ...
dinov2_b14 6 0.0371 0.0330 0.0329 0.0042
dinov2_b14 10 0.0360 0.0314 0.0273 0.0045
dinov2_b14 11 0.0357 0.0358 0.0222 0.0001
siglip_b16 0 0.1282 0.0318 0.1980 0.0964
siglip_b16 1 0.0631 0.0328 0.0528 0.0303
siglip_b16 ...
siglip_b16 6 0.0325 0.0298 0.0366 0.0027
siglip_b16 10 0.0267 0.0235 0.0292 0.0032
siglip_b16 11 0.0252 0.0284 0.0264 0.0032
MLP weight row CV (first and last layers):
clip_l14 first_mlp CV=1.2233 last_mlp CV=0.0276
dinov2_b14 first_mlp CV=0.0830 last_mlp CV=0.0126
siglip_b16 first_mlp CV=0.0635 last_mlp CV=0.0248
Position embedding CV:
clip_l14 positions=257 CV=0.3435
dinov2_b14 positions=1370 CV=0.3179
siglip_b16 positions=576 CV=0.3001
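The CV columns in this scan can be sketched as a coefficient of variation of per-row L2 norms (std/mean), with rows taken as output filters, QKV rows, or positions depending on the sub-table; this is an assumed definition, not stated in the log:

```python
import numpy as np

def row_norm_cv(W):
    # Coefficient of variation (std/mean) of per-row L2 norms.
    norms = np.linalg.norm(np.asarray(W, dtype=np.float64), axis=1)
    return float(norms.std() / norms.mean())

uniform = row_norm_cv(np.ones((768, 3 * 14 * 14)))  # identical rows -> 0
rng = np.random.default_rng(0)
gaussian = row_norm_cv(rng.standard_normal((768, 588)))
```

For i.i.d. Gaussian rows of dimension d, the norm CV concentrates near 1/sqrt(2d) (about 0.03 at d=588), so the ~0.04 filter CVs above are close to the random baseline while the ~0.3 position-embedding CVs are far from it.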
=================================================================
SCAN 11: CROSS-MODEL CV BAND COMPARISON
=================================================================
Q weight CV distribution per model:
clip_l14 Q: mean=0.0586 std=0.0535 range=[0.0233, 0.2490]
K: mean=0.0247 std=0.0039 range=[0.0189, 0.0380]
V: mean=0.0621 std=0.0616 range=[0.0238, 0.2787]
In CV band [0.18-0.25]: Q=1/24 K=0/24 V=0/24
dinov2_b14 Q: mean=0.0582 std=0.0464 range=[0.0324, 0.2018]
K: mean=0.0507 std=0.0262 range=[0.0323, 0.1322]
V: mean=0.0322 std=0.0121 range=[0.0209, 0.0703]
In CV band [0.18-0.25]: Q=1/12 K=0/12 V=0/12
siglip_b16 Q: mean=0.0437 std=0.0324 range=[0.0226, 0.1422]
K: mean=0.0272 std=0.0026 range=[0.0235, 0.0325]
V: mean=0.0507 std=0.0480 range=[0.0294, 0.2063]
In CV band [0.18-0.25]: Q=0/12 K=0/12 V=1/12
Cross-model concatenated Q weight CV (same-depth rows mixed):
clip_l14 × dinov2_b mean=0.0541 std=0.0425 range=[0.0259, 0.1762]
clip_l14 × siglip_b mean=0.0483 std=0.0352 range=[0.0224, 0.1451]
dinov2_b × siglip_b mean=0.0392 std=0.0241 range=[0.0259, 0.1150]
clip_l14 × dinov2_b × siglip_b mean=0.0419 std=0.0236 range=[0.0220, 0.1077]
=================================================================
WEIGHT ANALYSIS COMPLETE — STARTING ACTIVATION ANALYSIS
=================================================================
=================================================================
SCAN 12: PER-LAYER ACTIVATION EXTRACTION
=================================================================
Streaming images from rafaelpadilla/coco2017...
Resolving data files: 100% 39/39 [00:00<00:00, 5093.50it/s]
Loading weights: 100% 391/391 [00:00<00:00, 1154.38it/s, Materializing param=vision_model.pre_layrnorm.weight]
CLIPVisionModel LOAD REPORT from: openai/clip-vit-large-patch14
Key | Status | |
-------------------------------------------------------------+------------+--+-
text_model.encoder.layers.{0...11}.self_attn.k_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc1.bias | UNEXPECTED | |
text_projection.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.out_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.out_proj.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm1.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.k_proj.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.q_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc1.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm2.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc2.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm1.weight | UNEXPECTED | |
text_model.final_layer_norm.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm2.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc2.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.v_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.v_proj.weight | UNEXPECTED | |
logit_scale | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.q_proj.weight | UNEXPECTED | |
text_model.embeddings.position_embedding.weight | UNEXPECTED | |
text_model.final_layer_norm.bias | UNEXPECTED | |
vision_model.embeddings.position_ids | UNEXPECTED | |
text_model.embeddings.position_ids | UNEXPECTED | |
visual_projection.weight | UNEXPECTED | |
text_model.embeddings.token_embedding.weight | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
Loading weights: 100% 223/223 [00:00<00:00, 1104.50it/s, Materializing param=layernorm.weight]
Loading weights: 100% 208/208 [00:00<00:00, 1008.25it/s, Materializing param=vision_model.post_layernorm.weight]
SiglipVisionModel LOAD REPORT from: google/siglip-base-patch16-384
Key | Status | |
-------------------------------------------------------------+------------+--+-
text_model.encoder.layers.{0...11}.self_attn.k_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc1.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.out_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.out_proj.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm1.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.k_proj.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.q_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc1.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm2.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc2.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm1.weight | UNEXPECTED | |
text_model.final_layer_norm.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.layer_norm2.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.mlp.fc2.weight | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.v_proj.bias | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.v_proj.weight | UNEXPECTED | |
logit_scale | UNEXPECTED | |
text_model.encoder.layers.{0...11}.self_attn.q_proj.weight | UNEXPECTED | |
text_model.embeddings.position_embedding.weight | UNEXPECTED | |
text_model.final_layer_norm.bias | UNEXPECTED | |
text_model.head.bias | UNEXPECTED | |
text_model.head.weight | UNEXPECTED | |
text_model.embeddings.token_embedding.weight | UNEXPECTED | |
logit_bias | UNEXPECTED | |
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
Captured 256 images (streamed)
clip_l14: 25 layers, d=1024, N=256
dinov2_b14: 13 layers, d=768, N=256
siglip_b16: 13 layers, d=768, N=256
=================================================================
SCAN 13: WITHIN-MODEL DEPTH PROGRESSION
=================================================================
Layer-to-layer Procrustes within each model (layer N vs layer N+1):
clip_l14 (25 layers):
L→L+1 pre_cos post_cos sv_min sv_max
0→1 0.0032 0.0032 0.0000 0.0000
1→2 0.7617 0.9290 0.0000 1.9914
2→3 0.7570 0.9819 0.0000 1.4528
...
12→13 0.7829 1.0000 0.0000 1.0040
22→23 0.8049 1.0000 0.0000 1.0040
23→24 0.8713 1.0000 0.0000 1.0040
dinov2_b14 (13 layers):
L→L+1 pre_cos post_cos sv_min sv_max
0→1 -0.0045 0.0336 0.0000 8.5919
1→2 0.5373 0.7513 0.0000 6.8159
2→3 0.7910 0.9438 0.0000 4.8996
...
6→7 0.4866 0.9833 0.0000 1.1664
10→11 0.5688 1.0000 0.0000 1.0041
11→12 0.2910 1.0000 0.0000 1.0041
siglip_b16 (13 layers):
L→L+1 pre_cos post_cos sv_min sv_max
0→1 0.4324 0.9515 0.0000 3.0734
1→2 0.6613 0.9999 0.0000 1.0977
2→3 0.7007 1.0000 0.0000 1.0043
...
6→7 0.7311 1.0000 0.0000 1.0041
10→11 0.7719 1.0000 0.0000 1.0040
11→12 0.6503 1.0000 0.0000 1.0040
=================================================================
SCAN 14: CROSS-MODEL PROCRUSTES (per depth fraction)
=================================================================
Layers: clip=25 dino=13 siglip=13
frac clip×dino clip×dino clip×sig clip×sig dino×sig dino×sig
pre POST pre POST pre POST
-------------------------------------------------------------------
0% 0.0003 1.0000 -0.0033 0.0565 0.0005 0.0565
10% -0.0121 0.4873 -0.0077 0.9434 0.0138 0.4668
20% 0.0147 0.6777 -0.0027 0.9736 -0.0066 0.6628
30% -0.0185 0.6158 -0.0049 0.9891 -0.0032 0.6149
40% -0.0052 0.7267 -0.0078 0.9906 -0.0028 0.7246
50% 0.0126 0.8709 -0.0007 0.9910 -0.0107 0.8671
60% -0.0027 0.9491 -0.0013 0.9926 -0.0062 0.9442
70% 0.0008 0.9575 0.0098 0.9932 -0.0041 0.9522
80% 0.0023 0.9746 0.0226 0.9963 -0.0069 0.9716
90% 0.0080 0.9878 0.0069 0.9954 0.0077 0.9836
100% -0.0060 0.9996 -0.0069 0.9990 -0.0001 0.9986
Final output (pooled, L2-normed) Procrustes:
clip_l14 × dinov2_b14: pre=0.0011 POST=1.0000 sv_range=[0.0000, 1.0039]
clip_l14 × siglip_b16: pre=0.0043 POST=0.9997 sv_range=[0.0000, 1.0908]
dinov2_b14 × siglip_b16: pre=-0.0051 POST=0.9997 sv_range=[0.0000, 1.0908]
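The pre/POST pattern above (near-zero before alignment, near-one after) is the signature of representations that match up to rotation. A sketch of such a comparison, assuming "pre" is the flattened cosine between the two [N, d] activation matrices and "POST" the same cosine after fitting the optimal orthogonal map (the classic orthogonal Procrustes solution via SVD); dimension padding/truncation for mismatched d is omitted:

```python
import numpy as np

def procrustes_pre_post(X, Y):
    # X, Y: [N, d] activation matrices for the same N inputs.
    def cos(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pre = cos(X, Y)
    U, _, Vt = np.linalg.svd(X.T @ Y)   # optimal rotation R = U @ Vt
    post = cos(X @ (U @ Vt), Y)
    return pre, post

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
pre, post = procrustes_pre_post(X, X @ Q)  # Y is a pure rotation of X
```

Here POST recovers 1.0 exactly because Y really is a rotation of X; the 0.95-0.99 POST values above indicate close-to-rotational but not exact correspondence.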
=================================================================
SCAN 15: ACTIVATION CV PER LAYER
=================================================================
model layer CV norm_μ norm_σ eff_dim
-------------------------------------------------------
clip_l14 0 0.0000 14.500 0.0000 1.0
clip_l14 1 0.0251 13.234 0.0430 27.2
clip_l14 ...
clip_l14 4 0.3692 12.597 0.0534 51.5
clip_l14 8 0.3492 11.341 0.1047 88.2
clip_l14 12 0.2947 7.901 0.2503 124.8
clip_l14 16 0.3086 6.690 0.2409 131.7
clip_l14 20 0.2089 12.478 0.4620 145.1
clip_l14 23 0.1700 17.310 0.8507 154.2
clip_l14 24 0.1157 16.085 1.0605 155.4
dinov2_b14 0 0.0000 0.799 0.0000 0.8
dinov2_b14 1 0.5707 2.022 0.0211 23.2
dinov2_b14 ...
dinov2_b14 4 0.3945 2.703 0.0420 69.8
dinov2_b14 6 0.4672 3.437 0.1381 66.1
dinov2_b14 8 0.3323 9.531 0.2926 91.4
dinov2_b14 11 0.4401 24.276 2.4562 103.8
dinov2_b14 12 0.1060 39.796 5.1513 158.0
siglip_b16 0 0.9458 4.681 1.2630 55.1
siglip_b16 1 0.5489 10.083 1.5282 88.1
siglip_b16 ...
siglip_b16 4 0.2990 9.593 0.5404 117.7
siglip_b16 6 0.2661 8.269 0.6763 119.6
siglip_b16 8 0.2456 9.616 1.1296 125.7
siglip_b16 11 0.2992 21.631 3.8725 136.2
siglip_b16 12 0.4064 37.213 6.2822 132.1
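The per-layer columns above can be sketched as follows, under the assumptions that CV is std/mean of per-image feature norms and eff_dim is the participation ratio of the feature-covariance eigenvalues; the script's exact estimators are not shown in the log:

```python
import numpy as np

def activation_stats(X):
    # X: [N, d] pooled activations for N images.
    norms = np.linalg.norm(X, axis=1)
    cv = float(norms.std() / norms.mean())
    lam = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    eff_dim = float(lam.sum() ** 2 / np.sum(lam ** 2))  # participation ratio
    return cv, float(norms.mean()), float(norms.std()), eff_dim

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
cv, mu, sigma, eff_dim = activation_stats(X)
```

Isotropic Gaussian features give a low norm CV and an eff_dim near the ambient dimension; the eff_dim values above (roughly 100-160 out of 768 or 1024) show the activations concentrating on a much lower-dimensional subspace.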
=================================================================
SCAN 16: PER-IMAGE AGREEMENT ANALYSIS
=================================================================
clip_l14 × dinov2_b14:
Raw per-image cos: mean=-0.0030 std=0.0346 min=-0.0936 max=0.0807
Distribution: [('0.0-0.1', 123)]
clip_l14 × siglip_b16:
Raw per-image cos: mean=0.0329 std=0.0312 min=-0.0503 max=0.1347
Distribution: [('0.0-0.1', 220), ('0.1-0.2', 3)]
dinov2_b14 × siglip_b16:
Raw per-image cos: mean=0.0030 std=0.0355 min=-0.0853 max=0.0960
Distribution: [('0.0-0.1', 145)]
=================================================================
FULL ANALYSIS COMPLETE
=================================================================