add mechanism section
comparison.md +27 -0
@@ -31,6 +31,33 @@ Text-conditioned LeJEPA: baseline vs two conditioning architectures (FiLM, cross
- `wrong_text`: ρ = -0.886 (n=6)
- A strong negative ρ is the *expected* sign for LeJEPA: lower SIGReg ⇒ closer to iso-Gaussian in projection space ⇒ higher linear probe. Conditioning preserves this whenever |ρ| stays close to baseline's.

## Mechanism: why conditioning hurt at this scale

The conditioned predictors reach a **10–20× lower** invariance loss than the baseline
(`inv` at epoch 29: baseline 0.138, film 0.013, xattn 0.007), yet their linear-probe
accuracy is **strictly lower**. Two complementary observations explain this:

1. **Text shortcuts the invariance.** Both views of an image share the same prompt, so
   the predictor can satisfy `L_inv` by routing information through the text channel,
   which is identical across views, rather than by learning view-invariant image features.
   The backbone has no reason to match views, because the predictor does it for free.
   Low `L_inv` therefore stops implying a well-learned backbone.
2. **The projection space is text-controlled.** t-SNE of the same 500 val images under
   three prompts (`"the object"`, `"the background"`, `"the texture"`) shows a
   **silhouette-by-prompt of +0.51 to +0.60** and a **negative silhouette-by-image
   (−0.33 to −0.49)** for every conditioned variant. Projections cluster by *prompt*, not
   by *image*. The predictor is effectively a text-indexed codebook; image content is secondary.

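The prompt-versus-image comparison in item 2 can be sketched as below. This is a minimal, dependency-free illustration and not the script that produced the numbers above: the `silhouette` helper, the toy 2-D points, and both label lists are assumptions made for the example. It scores the same points twice, once grouped by prompt and once grouped by image.

```python
import math
from collections import defaultdict


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (pure Python)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members of a cluster, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = mean_dist(i, clusters[label])  # cohesion within the point's own cluster
        b = min(mean_dist(i, members)      # distance to the nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): 3 images x 3 prompts, where the prompt moves a point
# far along x and the image only nudges it along y.
points = [(10.0 * p, float(m)) for p in range(3) for m in range(3)]
prompt_labels = [p for p in range(3) for m in range(3)]
image_labels = [m for p in range(3) for m in range(3)]

by_prompt = silhouette(points, prompt_labels)  # positive: tight prompt clusters
by_image = silhouette(points, image_labels)    # negative: images are scattered
```

A positive score under prompt labels together with a negative score under image labels is the signature described above: each point sits closer to other images under the same prompt than to the same image under other prompts.
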
The `wrong_text` control closes the loop: when we swap each class's prompt for another
class's prompt (fixed permutation), accuracy is **essentially unchanged** (23.03 vs 23.13). Since
the text content is now semantically uncoupled from the image, the residual gain above
random must be a *regularization* effect from conditioning on any constant-per-image
signal, not from text semantics.

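One way to build such a control is sketched below. It assumes the "fixed permutation" is a seeded permutation in which no class keeps its own prompt; the function name `wrong_text_prompts`, the seed, and the example class names and prompts are all hypothetical and not taken from the experiment code.

```python
import random


def wrong_text_prompts(prompts, seed=0):
    """Map each class to a different class's prompt via a fixed, seeded permutation."""
    classes = sorted(prompts)
    if len(classes) < 2:
        raise ValueError("need at least two classes to swap prompts")
    order = classes[:]
    random.Random(seed).shuffle(order)  # seeded, so the mapping is reproducible
    # Cyclic shift along the shuffled order: every class receives the prompt
    # of the next class in the cycle, never its own.
    return {cls: prompts[order[(i + 1) % len(order)]] for i, cls in enumerate(order)}


# Hypothetical prompts, for illustration only.
prompts = {
    "cat": "a photo of a cat",
    "dog": "a photo of a dog",
    "car": "a photo of a car",
    "tree": "a photo of a tree",
}
swapped = wrong_text_prompts(prompts)
```

Because the mapping is a permutation, every prompt is still used exactly once, so the control changes only which class a prompt is paired with, not the set of prompts the model sees.
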
The SIGReg↔accuracy correlation survives conditioning (|ρ| ≈ 0.89 or higher for every variant), so
the two-term LeJEPA loss remains a valid model-selection proxy even when the predictor
is conditioned; it just selects among *worse* solutions in the conditioned case.

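For reference, the rank correlation behind these ρ values can be computed as in the sketch below. It assumes ρ is Spearman's rank correlation between per-checkpoint SIGReg loss and linear-probe accuracy; the helper names and the six data points are synthetic placeholders, not the values from these runs.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Synthetic placeholders (n = 6): lower SIGReg mostly goes with higher accuracy.
sigreg = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
accuracy = [25.0, 26.0, 24.0, 23.0, 21.0, 22.0]
rho = spearman_rho(sigreg, accuracy)
```

With n = 6 and no tied ranks, ρ = 1 − 6Σd²/210, where d is the per-item rank difference. The only attainable values near −0.89 are therefore −0.886 (Σd² = 66), −0.829, and −0.943, so small changes in checkpoint ordering move ρ in visible steps.
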
## Figures