# GAP Analysis Key Findings

## Analysis 1: Structural Proof
- Pre-GAP (before global average pooling): (batch, 1008, 512) - 1008 spatial positions, 512 channels
- Post-GAP: (batch, 512) - a single vector per sample; all spatial information is averaged away
- Verification: manual average matches GAP output (error = 1.45e-04)
- All 16 byte heads receive the IDENTICAL 512-dim vector
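
The structural argument above can be reproduced on toy data. The following is a minimal sketch, not the analysis script itself: the helper name `global_average_pool` and the toy sizes are assumptions, and nested lists stand in for the real (1008, 512) tensor. It shows that pooling collapses the position axis and that shuffling positions does not change the pooled vector, which is why no head downstream of GAP can recover position.

```python
import random


def global_average_pool(feature_map):
    """Average a (positions, channels) nested list over positions.

    Returns a flat list with one mean per channel.
    """
    n_positions = len(feature_map)
    return [sum(channel) / n_positions for channel in zip(*feature_map)]


random.seed(0)

# Toy sizes (assumption); the analysis reports 1008 positions and 512 channels.
n_positions, n_channels = 6, 4
feature_map = [
    [random.gauss(0.0, 1.0) for _ in range(n_channels)]
    for _ in range(n_positions)
]

pooled = global_average_pool(feature_map)

# Shuffle the spatial positions and pool again. The result differs only by
# floating-point rounding, so the pooled vector carries no position information.
shuffled = feature_map[:]
random.shuffle(shuffled)
pooled_shuffled = global_average_pool(shuffled)
max_abs_diff = max(abs(a - b) for a, b in zip(pooled, pooled_shuffled))

print("pooled length:", len(pooled))
print("max |pooled - pooled_shuffled|:", max_abs_diff)
```

The same comparison (manual mean versus the pooling layer's output) is what the verification bullet describes; with a real model the two would be compared on the actual pre-GAP activations.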

## Analysis 2: Spatial Activation Analysis
- Mean spatial variance per channel: 14.30 (high variance = activations differ strongly across positions)
- Energy retained after GAP: 47.8% (52.2% of the activation energy is lost)
- Normalized spatial entropy: 0.9953 (nearly uniform = information is spread across all positions)
- Together, these show that GAP averages over highly variable spatial features, destroying their structure
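
The findings do not state the formulas behind these three numbers, so the sketch below uses one plausible set of definitions, all of which are assumptions: variance is taken over positions per channel, "energy" is the sum of squared activations, and entropy is computed over each position's share of the total energy. The function name `spatial_stats` is hypothetical.

```python
import math


def spatial_stats(feature_map):
    """Compute three spatial statistics for a (positions, channels) nested list.

    Assumed definitions (not confirmed by the source):
      - mean_variance: per-channel variance over positions, averaged over channels
      - energy_retained: energy of the pooled vector (repeated at every position)
        divided by the energy of the original feature map
      - normalized_entropy: entropy of each position's share of total energy,
        divided by log(number of positions)
    """
    n_positions = len(feature_map)
    channels = list(zip(*feature_map))

    means = [sum(c) / n_positions for c in channels]
    variances = [
        sum((v - m) ** 2 for v in c) / n_positions
        for c, m in zip(channels, means)
    ]
    mean_variance = sum(variances) / len(variances)

    total_energy = sum(v * v for row in feature_map for v in row)
    pooled_energy = n_positions * sum(m * m for m in means)
    energy_retained = pooled_energy / total_energy

    position_energy = [sum(v * v for v in row) for row in feature_map]
    shares = [e / total_energy for e in position_energy]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    normalized_entropy = entropy / math.log(n_positions)

    return mean_variance, energy_retained, normalized_entropy


# Two positions, two channels: channel 0 varies across positions, channel 1 does not.
toy_map = [[1.0, 2.0], [3.0, 2.0]]
mean_variance, energy_retained, normalized_entropy = spatial_stats(toy_map)
print(mean_variance, energy_retained, normalized_entropy)
```

Under these definitions the retained fraction equals the summed squared channel means divided by the summed squared means plus the summed variances, so a high spatial variance directly lowers the energy that survives pooling.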

## Analysis 3: Weight Analysis (CRITICAL FINDING)
### Weight norms reveal two distinct groups:
- Bytes 0,1 (SUCCEEDED): W_norm = 65.03, 61.06 (W_std ~0.17)
- Bytes 2-15 (FAILED): W_norm = 32.16-32.45 (W_std ~0.084)
- **Bytes 0,1 have ~2x the weight magnitude of failed bytes**
- This suggests bytes 0,1 learned stronger discriminative features
- Failed bytes have nearly IDENTICAL weight norms (range 32.16-32.45, W_std ~0.084)
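
The "~2x" and "nearly identical" claims can be checked with simple arithmetic on the reported figures. This is a minimal sketch; it assumes `W_norm` is the Frobenius norm of a head's weight matrix and `W_std` is the population standard deviation of its entries, neither of which the findings define. The helper names are hypothetical.

```python
import math
import statistics


def frobenius_norm(weights):
    """Square root of the sum of squared entries of a 2-D nested list."""
    return math.sqrt(sum(w * w for row in weights for w in row))


def weight_std(weights):
    """Population standard deviation over all entries of a 2-D nested list."""
    return statistics.pstdev([w for row in weights for w in row])


# Tiny matrix to exercise the helpers (illustrative values only).
toy_weights = [[3.0, 4.0]]
toy_norm = frobenius_norm(toy_weights)
toy_std = weight_std(toy_weights)

# Figures reported in the findings above.
successful_norms = [65.03, 61.06]
failed_norm_min, failed_norm_max = 32.16, 32.45

# Smallest and largest possible ratio of a successful norm to a failed norm.
ratio_low = min(successful_norms) / failed_norm_max
ratio_high = max(successful_norms) / failed_norm_min

# Width of the failed-byte norm range relative to its lower end.
failed_relative_spread = (failed_norm_max - failed_norm_min) / failed_norm_min

print(f"norm ratio range: {ratio_low:.2f} to {ratio_high:.2f}")
print(f"failed-byte norm spread: {failed_relative_spread:.2%}")
```

For a matrix with n entries, the squared Frobenius norm equals n times (std squared plus mean squared), so for roughly zero-mean weights the norm and the standard deviation scale together. That is consistent with `W_std` also being about twice as large for bytes 0 and 1.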

### Weight cosine similarity:
- Byte 0 vs Byte 1: 0.1125 (low - they learned different features)
- Byte 0 vs failed: mean 0.1233 (low)
- Failed vs failed: mean 0.1875 (HIGHER - failed bytes are more similar to each other)
- Failed bytes cluster together more than they align with byte 0, although a mean similarity of 0.1875 is still low in absolute terms
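
The comparisons above can be sketched as cosine similarity between flattened head weights, with a mean taken over all pairs inside a group. This is an illustration under assumptions: the findings do not say how the weights were flattened or averaged, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy "flattened head weights" (illustrative values only).
head_a = [1.0, 0.0]
head_b = [0.0, 1.0]
head_c = [1.0, 0.0]

print(cosine_similarity(head_a, head_b))               # orthogonal heads
print(mean_pairwise_cosine([head_a, head_b, head_c]))  # mean over 3 pairs
```

Cosine similarity ignores magnitude, so it complements the norm comparison: two heads can have nearly equal norms while pointing in quite different directions.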

### Interpretation:
Bytes 0 and 1 managed to learn strong features (2x weight magnitude) from the GAP-averaged
representation. The remaining 14 bytes converged to weak solutions of near-identical magnitude
(weight norms within 1% of each other, W_std within 0.5% of each other).

This points to GAP as an information bottleneck: only 2 of 16 bytes
could extract useful discriminative features from the averaged representation. The other
14 bytes collapsed to similarly weak classifiers, most plausibly because the spatial information
they needed was destroyed by averaging.