Upload analysis/findings/gap_analysis_key_findings.md with huggingface_hub

analysis/findings/gap_analysis_key_findings.md +37 -0

	@@ -0,0 +1,37 @@

+# GAP Analysis Key Findings
+## Analysis 1: Structural Proof
+- Pre-GAP: (batch, 1008, 512) - 1008 spatial positions, 512 channels
+- Post-GAP: (batch, 512) - single vector, all spatial info destroyed
+- Verification: manual average matches GAP output (error = 1.45e-04)
+- All 16 byte heads receive the IDENTICAL 512-dim vector
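The structural check above can be reproduced on synthetic data. The sketch below is a minimal illustration, not the original analysis script: the random feature map, the batch size of 4, and the helper name `global_average_pool` are assumptions introduced here. Only the shapes (1008 positions, 512 channels, 16 heads) come from the findings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pre-GAP feature map: (batch, positions, channels).
batch, positions, channels, num_heads = 4, 1008, 512, 16
features = rng.standard_normal((batch, positions, channels)).astype(np.float32)


def global_average_pool(x):
    """Average over the spatial axis: (batch, positions, channels) -> (batch, channels)."""
    return x.mean(axis=1)


pooled = global_average_pool(features)

# Manual average: accumulate position by position, then divide by the count.
manual = np.zeros((batch, channels), dtype=np.float32)
for p in range(positions):
    manual += features[:, p, :]
manual /= positions
max_abs_error = float(np.abs(pooled - manual).max())

# Every head is handed the same pooled array.
head_inputs = [pooled for _ in range(num_heads)]
heads_share_input = all(np.array_equal(head_inputs[0], h) for h in head_inputs)

# Shuffling the positions leaves the pooled vector unchanged (up to float rounding),
# which is what "spatial info destroyed" means in practice.
perm = rng.permutation(positions)
shuffle_diff = float(np.abs(pooled - global_average_pool(features[:, perm, :])).max())

print(pooled.shape, max_abs_error, heads_share_input, shuffle_diff)
```

The permutation check is the most direct way to see the loss: any reordering of the 1008 positions produces the same 512-dim vector, so no head can recover where a feature occurred.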
+## Analysis 2: Spatial Activation Analysis
+- Mean spatial variance per channel: 14.30 (high variance = spatial info exists)
+- Energy retained after GAP: 47.8% (52.2% of the activation energy is lost)
+- Normalized spatial entropy: 0.9953 (nearly uniform = info spread across all positions)
+- This means GAP averages over highly variable spatial features, destroying structure
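The findings do not state how the three statistics were computed, so the sketch below uses one plausible set of definitions, all of which are assumptions. Spatial variance is taken per channel over positions. Energy retained is the squared norm of the pooled vector divided by the mean squared norm per position. Normalized entropy is the Shannon entropy of the per-position energy distribution divided by `log(positions)`. The function name `spatial_stats` and the synthetic input are also hypothetical.

```python
import numpy as np


def spatial_stats(x, eps=1e-12):
    """Return (mean spatial variance, energy retained, normalized spatial entropy).

    x has shape (batch, positions, channels). The definitions are assumptions.
    """
    positions = x.shape[1]

    # Variance over positions for each (sample, channel), averaged over both.
    mean_spatial_variance = float(x.var(axis=1).mean())

    # Energy of the pooled vector vs. average energy of a single position.
    pooled = x.mean(axis=1)
    energy_after = (pooled ** 2).sum(axis=1)
    energy_before = (x ** 2).sum(axis=2).mean(axis=1)
    energy_retained = float((energy_after / (energy_before + eps)).mean())

    # Entropy of how energy is distributed across positions, scaled to [0, 1].
    per_position = (x ** 2).sum(axis=2)
    p = per_position / (per_position.sum(axis=1, keepdims=True) + eps)
    entropy = -(p * np.log(p + eps)).sum(axis=1)
    normalized_entropy = float((entropy / np.log(positions)).mean())

    return mean_spatial_variance, energy_retained, normalized_entropy


rng = np.random.default_rng(0)
# Synthetic map: a constant offset of 1.0 plus unit-variance spatial noise.
features = 1.0 + rng.standard_normal((2, 1008, 512))
variance, retained, entropy = spatial_stats(features)
print(variance, retained, entropy)
```

Under these definitions the energy lost by pooling is exactly the spatial variance term, since the mean of `x**2` over positions equals the squared mean plus the variance. A map that is constant across positions retains all of its energy, while one whose per-channel spatial variance equals its squared mean retains about half.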
+## Analysis 3: Weight Analysis (CRITICAL FINDING)
+### Weight norms reveal two distinct groups:
+- Bytes 0,1 (SUCCEEDED): W_norm = 65.03, 61.06 (W_std ~0.17)
+- Bytes 2-15 (FAILED): W_norm = 32.16-32.45 (W_std ~0.084)
+- **Bytes 0,1 have ~2x the weight magnitude of failed bytes**
+- This suggests bytes 0,1 learned stronger discriminative features
+- Failed bytes have nearly IDENTICAL weight statistics (norm range 32.16-32.45, std ~0.084)
+### Weight cosine similarity:
+- Byte 0 vs Byte 1: 0.1125 (low - they learned different features)
+- Byte 0 vs failed: mean 0.1233 (low)
+- Failed vs failed: mean 0.1875 (HIGHER - failed bytes are more similar to each other, though still low in absolute terms)
+- Failed bytes are somewhat closer to one another (0.1875) than bytes 0 and 1 are to each other (0.1125)
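The weight statistics above (norm, standard deviation, pairwise cosine similarity) can be computed as in the following sketch. It runs on random stand-in weights, not the trained heads. The head shape of (256, 512), the Frobenius norm as `W_norm`, and cosine similarity over flattened weight matrices are all assumptions, since the findings do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heads: 16 linear classifiers, each (num_classes, channels).
num_heads, num_classes, channels = 16, 256, 512
# Stand-in scales chosen to mimic the reported W_std of the two groups.
scales = np.where(np.arange(num_heads) < 2, 0.17, 0.084)
weights = [rng.standard_normal((num_classes, channels)) * s for s in scales]

# Per-head Frobenius norm and standard deviation.
norms = np.array([np.linalg.norm(w) for w in weights])
stds = np.array([w.std() for w in weights])

# Pairwise cosine similarity between flattened weight matrices.
flat = np.stack([w.ravel() for w in weights])
unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
sim = unit @ unit.T

# Mean similarity among the 14 "failed" heads, excluding the diagonal.
failed = sim[2:, 2:]
upper = np.triu_indices(failed.shape[0], k=1)
failed_vs_failed = float(failed[upper].mean())

print(norms.round(2), stds.round(3), sim[0, 1], failed_vs_failed)
```

For weights with near-zero mean, the norm is approximately the standard deviation times the square root of the number of entries, so `W_norm` and `W_std` carry largely the same information. Cosine similarity is the statistic that actually compares weight directions across heads.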
+### Interpretation:
+Bytes 0 and 1 learned strong features (~2x weight magnitude) from the GAP-averaged
+representation. The remaining 14 bytes converged to weak solutions with near-identical
+weight statistics (weight norms within 1% of each other, W_std within 0.5% of each other).
+This is consistent with GAP creating an information bottleneck: only 2 of 16 bytes
+could extract useful discriminative features from the averaged representation. The other
+14 bytes settled on similarly weak classifiers, plausibly because the spatial information
+they needed was destroyed by averaging.