adipanda committed on
Commit 4ac79bb · verified · 1 Parent(s): db899ff

add mechanism section

Files changed (1)
  1. comparison.md +27 -0
comparison.md CHANGED
@@ -31,6 +31,33 @@ Text-conditioned LeJEPA: baseline vs two conditioning architectures (FiLM, cross
  - `wrong_text`: ρ = -0.886 (n=6)
  - A strong negative ρ is the *expected* sign for LeJEPA: lower SIGReg ⇒ closer to iso-Gaussian in projection space ⇒ higher linear-probe accuracy. Conditioning preserves this whenever |ρ| stays close to baseline's.
 
+ ## Mechanism: why conditioning hurt at this scale
+
+ The conditioned predictors reach a **10–20× lower** invariance loss than baseline
+ (`inv` at epoch 29: baseline 0.138, film 0.013, xattn 0.007), yet their linear-probe
+ accuracy is **strictly lower**. Two complementary observations explain this:
+
+ 1. **Text shortcuts the invariance.** Both views of an image share the same prompt, so
+ the predictor can satisfy `L_inv` by routing information through the text channel, which
+ is identical across views, rather than by learning view-invariant image features. The
+ backbone has no reason to match views, because the predictor does it for free. Low `L_inv`
+ therefore stops implying a well-learned backbone.
+ 2. **The projection space is text-controlled.** t-SNE of the same 500 val images under
+ three prompts (`"the object"`, `"the background"`, `"the texture"`) shows a
+ **silhouette-by-prompt of +0.51 to +0.60** and a **negative silhouette-by-image
+ (−0.33 to −0.49)** for every conditioned variant. Projections cluster by *prompt*, not
+ by *image*. The predictor is effectively a text-indexed codebook; image content is secondary.
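As a rough illustration of the silhouette comparison in observation 2, the sketch below scores one set of 2-D points under two labelings: once by prompt, once by image. It is a minimal pure-Python version run on synthetic points, not the evaluation code behind the numbers above. The function name `silhouette`, the toy cluster centers, and the per-image offsets are all assumptions made for illustration.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    Assumes at least two distinct labels and no coincident points.
    """
    n = len(points)
    scores = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # Singleton cluster: the silhouette is defined as 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[l] / counts[l] for l in sums if l != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Synthetic layout (hypothetical): the prompt picks the cluster center,
# the image only adds a small offset inside that cluster.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]           # one per prompt
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]  # one per image

points, prompt_ids, image_ids = [], [], []
for p, (cx, cy) in enumerate(centers):
    for i, (ox, oy) in enumerate(offsets):
        points.append((cx + ox, cy + oy))
        prompt_ids.append(p)
        image_ids.append(i)

by_prompt = silhouette(points, prompt_ids)
by_image = silhouette(points, image_ids)
print(f"silhouette by prompt: {by_prompt:+.2f}")
print(f"silhouette by image:  {by_image:+.2f}")
```

On this toy layout the by-prompt score is positive and the by-image score is negative, which is the same sign pattern reported for the conditioned variants. The magnitudes are not comparable to the reported ones.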
+
+ The `wrong_text` control closes the loop: when we swap each class's prompt for another
+ class's prompt (fixed permutation), accuracy is **essentially unchanged** (23.03 vs 23.13).
+ Since the text content is now semantically uncoupled from the image, the residual gain above
+ random must be a *regularization* effect from conditioning on any constant-per-image
+ signal, not from text semantics.
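One way to build such a fixed permutation is a cyclic shift over the sorted class names, which is deterministic and guarantees that no class keeps its own prompt. This is only a sketch of the idea: the helper `wrong_text_prompts`, the class names, and the prompt strings are hypothetical, and the actual run may have built its permutation differently.

```python
def wrong_text_prompts(class_prompts, shift=1):
    """Map each class to another class's prompt via a fixed cyclic shift.

    Assumes class prompts are distinct; requires at least two classes.
    """
    classes = sorted(class_prompts)  # sorted so the mapping is reproducible
    n = len(classes)
    if n < 2 or shift % n == 0:
        raise ValueError("need >= 2 classes and a shift that moves every class")
    return {
        cls: class_prompts[classes[(idx + shift) % n]]
        for idx, cls in enumerate(classes)
    }

# Hypothetical class-to-prompt table, for illustration only.
prompts = {
    "cat": "a photo of a cat",
    "dog": "a photo of a dog",
    "ship": "a photo of a ship",
}
wrong = wrong_text_prompts(prompts)
for cls in sorted(wrong):
    print(f"{cls!r} is now conditioned on {wrong[cls]!r}")
```

Because the mapping is a permutation, the set of prompts seen during training is unchanged. Only the pairing between prompt and image class is broken.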
+
+ The SIGReg↔accuracy correlation survives conditioning (|ρ| ≈ 0.89 or higher in every variant), so
+ the two-term LeJEPA loss remains a valid model-selection proxy even when the predictor
+ is conditioned; it just selects among *worse* solutions in the conditioned case.
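For reference, here is a minimal sketch of how a rank correlation of this kind can be computed. It assumes ρ denotes Spearman's rank correlation, which this section does not state explicitly. The six (SIGReg, accuracy) pairs are made-up toy values, chosen so that the result happens to round to -0.886. They are not the measurements from these runs.

```python
def ranks(values):
    """Return 1-based ranks; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out

def spearman_rho(x, y):
    """Spearman's rho via the sum of squared rank differences (tie-free case)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy values only (hypothetical): lower SIGReg mostly goes with higher accuracy.
sigreg = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
acc = [25.0, 26.0, 24.0, 23.0, 21.0, 22.0]

rho = spearman_rho(sigreg, acc)
print(f"rho = {rho:.3f} (n={len(sigreg)})")
```

With n = 6, ρ can only take a small set of discrete values, so differences in the third decimal place between variants carry little information.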
+
  ## Figures
 
  ![Training loss and online probe accuracy per epoch, per variant.](figures/loss_curves.png)