adipanda/lejepa

adipanda committed on Apr 23

Commit 4ac79bb · verified · 1 parent: db899ff

add mechanism section

Files changed (1)

comparison.md +27 -0

@@ -31,6 +31,33 @@ Text-conditioned LeJEPA: baseline vs two conditioning architectures (FiLM, cross
    - `wrong_text`: ρ = -0.886 (n=6)
    - A strong negative ρ is the *expected* sign for LeJEPA: lower SIGReg ⇒ closer to iso-Gaussian in projection space ⇒ higher linear probe. Conditioning preserves this whenever |ρ| stays close to baseline's.
+## Mechanism — why conditioning hurt at this scale
+The conditioned predictors reach a **10–20× lower** invariance loss than baseline
+(`inv` at epoch 29: baseline 0.138, film 0.013, xattn 0.007, i.e. about 10.6× and 19.7× lower),
+yet their linear-probe accuracy is **strictly lower**. Two complementary observations explain this:
+1. **Text shortcuts the invariance.** Both views of an image share the same prompt, so
+   the predictor can satisfy `L_inv` by routing information through the (identical-across-views)
+   text channel rather than by learning view-invariant image features. The backbone has no
+   reason to match views — the predictor does it for free. Low `L_inv` therefore stops
+   implying a well-learned backbone.
+2. **The projection space is text-controlled.** t-SNE of the same 500 val images under
+   three prompts (`"the object"`, `"the background"`, `"the texture"`) shows a
+   **silhouette-by-prompt of +0.51 to +0.60** and a **negative silhouette-by-image
+   (−0.33 to −0.49)** for every conditioned variant. Projections cluster by *prompt*, not
+   by *image*. The predictor is effectively a text-indexed codebook; image content is secondary.
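The silhouette comparison in observation 2 can be illustrated with a small sketch. It is not the script used for the reported numbers: the 2-D points below are synthetic, the `silhouette` helper is a plain-Python stand-in for a library routine, and the prompt centers, image count, and noise scale are all assumptions chosen only to mimic a projection space in which the prompt dominates.

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette coefficient under Euclidean distance (pure Python)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Synthetic "projections": position = prompt center + small per-image offset.
# Centers and offsets are made up for illustration (assumption).
rng = random.Random(0)
prompt_centers = {
    "the object": (0.0, 0.0),
    "the background": (10.0, 0.0),
    "the texture": (0.0, 10.0),
}
n_images = 20
points, prompt_labels, image_labels = [], [], []
for img in range(n_images):
    ox, oy = rng.gauss(0, 0.5), rng.gauss(0, 0.5)
    for prompt, (cx, cy) in prompt_centers.items():
        points.append((cx + ox, cy + oy))
        prompt_labels.append(prompt)
        image_labels.append(img)

sil_by_prompt = silhouette(points, prompt_labels)  # same points, grouped by prompt
sil_by_image = silhouette(points, image_labels)    # same points, grouped by image
print(f"silhouette by prompt: {sil_by_prompt:+.2f}")
print(f"silhouette by image:  {sil_by_image:+.2f}")
```

When the prompt sets the location and the image only perturbs it, grouping by prompt gives a clearly positive score and grouping by image gives a negative one, which is the sign pattern reported above.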
+
+The `wrong_text` control closes the loop: when we swap each class's prompt for another
+class's prompt (fixed permutation), accuracy is **essentially unchanged** (23.03 vs 23.13,
+a 0.10-point difference). Since the text is now semantically uncoupled from the image, the
+residual gain above random must be a *regularization* effect from conditioning on any
+constant-per-image signal, not from text semantics.
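One way to build such a fixed prompt permutation is sketched below. The function name `wrong_text_mapping`, the seed, and the example class names and prompts are hypothetical; the source only states that each class receives another class's prompt under a fixed permutation. The sketch shuffles the classes once with a seeded generator and hands each class the prompt of its successor in that cycle, so no class keeps its own prompt.

```python
import random


def wrong_text_mapping(prompts_by_class, seed=0):
    """Return {class: prompt of a different class}, reproducible for a given seed."""
    classes = sorted(prompts_by_class)
    if len(classes) < 2:
        raise ValueError("need at least two classes to swap prompts")
    order = classes[:]
    random.Random(seed).shuffle(order)  # seeded, so the permutation is fixed
    # Each class takes the prompt of the next class in the shuffled cycle.
    return {
        cls: prompts_by_class[order[(k + 1) % len(order)]]
        for k, cls in enumerate(order)
    }


# Hypothetical classes and prompts, for illustration only.
prompts = {
    "cat": "a photo of a cat",
    "dog": "a photo of a dog",
    "car": "a photo of a car",
    "tree": "a photo of a tree",
}
swapped = wrong_text_mapping(prompts, seed=0)
for cls in sorted(swapped):
    print(f"{cls!r} is now conditioned on {swapped[cls]!r}")
```

Because the mapping is a single cycle over distinct prompts, every prompt is still used exactly once, which keeps the conditioning signal constant per image while removing its semantic link to the class.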
+
+The SIGReg↔accuracy correlation survives conditioning (|ρ| ≥ 0.89 to two decimals for every
+variant; `wrong_text`, for example, has ρ = −0.886), so the two-term LeJEPA loss remains a valid
+model-selection proxy even when the predictor is conditioned; it just selects among *worse*
+solutions in the conditioned case.
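For readers who want to reproduce the rank correlation, the following is a minimal sketch of Spearman's ρ in plain Python. The `sigreg` and `accuracy` lists are invented toy values with n = 6, not the measurements from this experiment, and the helper names are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy checkpoints (made up): lower SIGReg tends to go with higher probe accuracy.
sigreg = [0.90, 0.70, 0.55, 0.40, 0.30, 0.20]
accuracy = [10.0, 14.0, 13.0, 18.0, 21.0, 23.0]
rho = spearman_rho(sigreg, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value here corresponds to the expected sign described above: checkpoints with lower SIGReg rank higher on linear-probe accuracy.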
 ## Figures
 ![Training loss and online probe accuracy per epoch, per variant.](figures/loss_curves.png)