lemousehunter committed on
Commit 4354eaa · verified · 1 Parent(s): be675bc

Upload analysis/findings/gap_analysis_key_findings.md with huggingface_hub

analysis/findings/gap_analysis_key_findings.md ADDED
@@ -0,0 +1,37 @@
+ # GAP Analysis Key Findings
+
+ ## Analysis 1: Structural Proof
+ - Pre-GAP: (batch, 1008, 512) - 1008 spatial positions, 512 channels
+ - Post-GAP: (batch, 512) - a single vector per sample; the spatial layout is discarded
+ - Verification: manual average matches GAP output (error = 1.45e-04)
+ - All 16 byte heads receive the IDENTICAL 512-dim vector
+
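The structural argument in Analysis 1 can be reproduced on a toy example. The sketch below is not the original analysis code: it uses pure Python, shrinks the shapes (12 positions, 4 channels instead of 1008 and 512), and the names `global_average_pool`, `sample`, and `head_inputs` are hypothetical. It shows that pooling keeps only the per-channel mean, so reordering the spatial positions changes nothing the heads can see.

```python
import random


def global_average_pool(features):
    """Average a (positions, channels) nested list over positions -> (channels,)."""
    n_positions = len(features)
    n_channels = len(features[0])
    return [
        sum(features[p][c] for p in range(n_positions)) / n_positions
        for c in range(n_channels)
    ]


random.seed(0)

# Toy stand-in for one sample of the (batch, 1008, 512) tensor; sizes are shrunk.
n_positions, n_channels, n_heads = 12, 4, 16
sample = [
    [random.gauss(0.0, 1.0) for _ in range(n_channels)]
    for _ in range(n_positions)
]

pooled = global_average_pool(sample)

# Every head is handed the same pooled vector, so no head sees position.
head_inputs = [pooled for _ in range(n_heads)]

# Shuffling the spatial positions leaves the pooled vector unchanged
# (up to floating-point rounding from the different summation order).
shuffled = sample[:]
random.shuffle(shuffled)
pooled_shuffled = global_average_pool(shuffled)
max_diff = max(abs(a - b) for a, b in zip(pooled, pooled_shuffled))
```

A small nonzero `max_diff` here comes only from summation order, which is the same kind of effect that would explain a small verification error between a manual average and a framework's pooling layer.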
+ ## Analysis 2: Spatial Activation Analysis
+ - Mean spatial variance per channel: 14.30 (high variance = spatial info exists)
+ - Energy retained after GAP: 47.8% (52.2% of information energy lost)
+ - Normalized spatial entropy: 0.9953 (nearly uniform = info spread across all positions)
+ - This means GAP averages over highly variable spatial features, destroying structure
+
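The findings file does not say how "energy retained" or "normalized spatial entropy" were computed, so the following is only one plausible reading, written as a pure-Python sketch with hypothetical function names. It assumes energy means mean squared activation, so that per channel E[x^2] = mean^2 + variance and the retained fraction is the share carried by the mean. It also assumes the entropy is the Shannon entropy of each position's share of total energy, divided by log(N).

```python
import math
import random


def energy_retained(features):
    """Fraction of mean squared activation carried by the per-channel mean.

    Assumed definition: sum_c mean_c^2 / sum_c E[x_c^2]. The remainder is the
    spatial variance's share, which is what averaging discards.
    """
    n = len(features)
    n_channels = len(features[0])
    total = 0.0
    kept = 0.0
    for c in range(n_channels):
        column = [features[p][c] for p in range(n)]
        mean = sum(column) / n
        total += sum(x * x for x in column) / n
        kept += mean * mean
    return kept / total


def normalized_spatial_entropy(features):
    """Shannon entropy of per-position energy shares, divided by log(N)."""
    energies = [sum(x * x for x in row) for row in features]
    total = sum(energies)
    probs = [e / total for e in energies if e > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(features))


random.seed(0)
# Toy feature map: 64 positions, 8 channels, offset so the mean is nonzero.
toy = [[1.0 + random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
retained = energy_retained(toy)
entropy = normalized_spatial_entropy(toy)
```

Under these assumptions, a spatially constant feature map retains all of its energy, and an entropy near 1 means no small set of positions dominates the energy.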
+ ## Analysis 3: Weight Analysis (CRITICAL FINDING)
+ ### Weight norms reveal two distinct groups:
+ - Bytes 0,1 (SUCCEEDED): W_norm = 65.03, 61.06 (W_std ~0.17)
+ - Bytes 2-15 (FAILED): W_norm = 32.16-32.45 (W_std ~0.084)
+ - **Bytes 0,1 have ~2x the weight magnitude of failed bytes**
+ - This suggests bytes 0,1 learned stronger discriminative features
+ - Failed bytes have nearly IDENTICAL weight norms (range 32.16-32.45, W_std ~0.084)
+
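As an illustration of the two statistics above, here is a minimal pure-Python sketch. It assumes that `W_norm` is the Frobenius norm of a head's weight matrix and `W_std` is the standard deviation of its entries; the source does not define either. The head shape (32 x 16) and the random weights are hypothetical stand-ins, drawn with the reported standard deviations only to mimic the two groups.

```python
import math
import random


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(w * w for row in matrix for w in row))


def weight_std(matrix):
    """Population standard deviation of all entries."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((w - mean) ** 2 for w in values) / len(values))


random.seed(0)
rows, cols = 32, 16  # hypothetical head shape, far smaller than a real head

# Synthetic weights with the reported spreads (~0.17 and ~0.084).
strong = [[random.gauss(0.0, 0.17) for _ in range(cols)] for _ in range(rows)]
weak = [[random.gauss(0.0, 0.084) for _ in range(cols)] for _ in range(rows)]

norm_ratio = frobenius_norm(strong) / frobenius_norm(weak)
std_ratio = weight_std(strong) / weight_std(weak)

# For near-zero-mean weights, norm is approximately std * sqrt(entry count).
predicted_weak_norm = weight_std(weak) * math.sqrt(rows * cols)
```

Because the norm is approximately `std * sqrt(n)` when the weights are close to zero-mean, the ~2x gap in `W_norm` and the ~2x gap in `W_std` are probably one observation seen twice, not two independent pieces of evidence.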
+ ### Weight cosine similarity:
+ - Byte 0 vs Byte 1: 0.1125 (low - they learned different features)
+ - Byte 0 vs failed: mean 0.1233 (low)
+ - Failed vs failed: mean 0.1875 (HIGHER - failed bytes are more similar to each other than to byte 0, though still low in absolute terms)
+ - Failed bytes cluster loosely: near-identical weight norms, but only weakly aligned weight directions
+
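The comparison above can be sketched as a mean pairwise cosine similarity over flattened head weights. This is a pure-Python illustration, not the original analysis code; the three tiny vectors and the `heads` dictionary are made up for the example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical flattened head weights, for illustration only.
heads = {
    "byte_0": [1.0, 0.0, 0.0],
    "byte_2": [0.0, 1.0, 1.0],
    "byte_3": [0.0, 1.0, 0.0],
}

sim_0_vs_2 = cosine_similarity(heads["byte_0"], heads["byte_2"])
failed_mean = mean_pairwise_cosine([heads["byte_2"], heads["byte_3"]])

# Cosine similarity ignores scale: doubling a vector leaves it unchanged.
scale_check = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Since cosine similarity is scale-invariant, it measures direction only. Heads can therefore share almost the same norm while pointing in quite different directions, which is what similar norms together with a mean similarity of 0.1875 would indicate.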
+ ### Interpretation:
+ Bytes 0 and 1 managed to learn strong features (2x weight magnitude) from the GAP-averaged
+ representation. The remaining 14 bytes converged to weak solutions of near-identical scale
+ (weight norms within 1% of each other, W_std within 0.5% of each other).
+
+ This supports the claim that GAP created an information bottleneck: only 2 of 16 bytes
+ could extract useful discriminative features from the averaged representation. The other
+ 14 bytes collapsed to similarly weak classifiers, plausibly because the spatial information
+ they needed was destroyed by averaging.