fix: ARCHITECTURE.md — complete ablation table with all 13 experiments, CLI flags, dinov2/loss_fn/sigreg/vicreg ablations, footnote on no_rollout→no_jepa


Files changed (1)

mr_jepa/ARCHITECTURE.md +31 -23

@@ -29,7 +29,7 @@ The core insight: solving a multimodal question (e.g., "What is the GDP growth s
                                               │
   ┌─────────┐                         JEPA Loss:
   │Optional:│                         SmoothL1/Cosine
-  │OCR,SAM, │──────────┘              + SIGReg
   │Layout   │
   └─────────┘
 ```
@@ -151,30 +151,28 @@ The online predictor must predict these targets.
 ### 2.6 JEPA Objective
-**Prediction loss** (hybrid branch — SmoothL1 from I-JEPA, more robust than L2):
 ```
 L_JEPA = (1/K) Σ_{k=1}^{K} SmoothL1(z_pred_k, sg(z*_k))
 ```
 Only steps k=1..K are supervised (z₀ is deterministic from evidence).
-**Alternative (purist branch)**: Cosine similarity loss.
-**Anti-collapse regularization** (from LeWorldModel — SIGReg):
-```
-L_SIGReg = (1/M) Σ_{m=1}^{M} T(Z · u_m)
-```
-Where T is the Epps-Pulley normality test statistic, u_m are random unit vectors.
-This encourages latent embeddings to remain Gaussian-distributed, preventing collapse.
-**Alternative (hybrid branch)**: VICReg (variance-invariance-covariance) regularization.
 **Total loss**:
 ```
-L_total = L_JEPA + L_task + λ · L_reg + α · L_gen
 Where:
   L_task = CrossEntropy(disc_head(z_K), answer_label)    # MC scoring
   L_gen = CE(gen_head(z_K), target_answer_tokens)        # Short answer (Phase 3)
   λ = 0.1 (regularization weight)
   α = 0.1 (generative weight)
 ```
@@ -241,23 +239,33 @@ Qwen3.5-4B (or SmolLM3-3B) decoder:
 ## 4. Ablation Experiments
-### Key ablations for the paper:
-| Experiment | Modification | Expected finding |
-|------------|-------------|-----------------|
-| **Full MR-JEPA** | Baseline | Best overall |
-| **No JEPA** | Remove L_JEPA, train with task loss only | Drops on reasoning-heavy benchmarks |
-| **No Rollout** | K=0, use z₀ directly | Significant drop (proves rollout value) |
-| **No Evidence Gate** | Remove gating | Slight drop (gate helps focus) |
-| **K=1** | Shallow rollout | Worse than K=3 |
-| **K=5** | Deeper rollout | Diminishing returns |
-| **No SIGReg** | Remove anti-collapse | Training instability |
-| **Purist branch** | DINOv3-B, no enriched evidence | Lower absolute scores, but validates JEPA contribution |
 ### Cross-benchmark analysis:
 - JEPA contribution should be highest on **reasoning** benchmarks (MathVista, MMMU, ScienceQA)
 - Evidence gate contribution should be highest on **evidence-rich** benchmarks (DocVQA, ChartQA)
 - Enriched evidence (Phase 3) should matter most for **document** benchmarks
 ---

                                               │
   ┌─────────┐                         JEPA Loss:
   │Optional:│                         SmoothL1/Cosine
+  │OCR,SAM, │──────────┘              + SIGReg/VICReg
   │Layout   │
   └─────────┘
 ```
 ### 2.6 JEPA Objective
+**Prediction loss** (hybrid branch — SmoothL1, more robust than L2):
 ```
 L_JEPA = (1/K) Σ_{k=1}^{K} SmoothL1(z_pred_k, sg(z*_k))
 ```
 Only steps k=1..K are supervised (z₀ is deterministic from evidence).
+**Alternative losses** (ablation):
+- **MSE** (L2): original I-JEPA loss
+- **Cosine**: 1 - cos_sim, used in purist branch
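The three prediction-loss options can be compared with a minimal pure-Python sketch. It assumes each latent is a plain list of floats and that the targets are constants (standing in for the stop-gradient `sg`). The key `smooth_l1` and the `beta` parameter are illustrative assumptions; only `mse` and `cosine` appear as `--loss_fn` values in this document.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Mean SmoothL1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def mse(pred, target):
    """Mean squared error (L2)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity between the two vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm, eps)

# Hypothetical registry; the key names are assumptions for illustration.
LOSSES = {"smooth_l1": smooth_l1, "mse": mse, "cosine": cosine_loss}

def jepa_loss(z_pred, z_target, loss_fn="smooth_l1"):
    """Average the per-step loss over rollout steps k = 1..K."""
    fn = LOSSES[loss_fn]
    return sum(fn(p, t) for p, t in zip(z_pred, z_target)) / len(z_pred)

# Two rollout steps (K=2) with 2-dimensional latents.
z_pred = [[0.0, 2.0], [1.0, 1.0]]
z_target = [[0.5, 0.0], [1.0, 1.0]]
print(jepa_loss(z_pred, z_target, "smooth_l1"))
print(jepa_loss(z_pred, z_target, "mse"))
```

On the large error in the first step, SmoothL1 grows linearly while MSE grows quadratically, which is the robustness argument made above.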
+**Anti-collapse regularization**:
+- **SIGReg** (from LeWorldModel): Epps-Pulley normality test on random projections, encourages Gaussian-distributed latents
+- **VICReg**: variance (keep std ≥ 1) + covariance (decorrelate features) regularization
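As a rough illustration of the VICReg bullet, the sketch below computes the variance and covariance terms on a small batch in pure Python. It follows the standard VICReg formulation (hinge on the per-feature standard deviation, squared off-diagonal covariance); the function name, `gamma`, and `eps` are assumptions and may not match the repository's implementation. SIGReg is not sketched here.

```python
import math

def vicreg_terms(Z, gamma=1.0, eps=1e-4):
    """Return (variance_loss, covariance_loss) for a batch Z of N rows x D features."""
    n, d = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in Z]
    # Unbiased D x D covariance matrix of the batch.
    cov = [
        [sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]
    # Variance term: penalize features whose std falls below gamma.
    var_loss = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance term: penalize correlation between different features.
    cov_loss = sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b) / d
    return var_loss, cov_loss

collapsed = [[1.0, 1.0]] * 4                                   # every embedding identical
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]    # decorrelated, std > 1
print(vicreg_terms(collapsed))
print(vicreg_terms(spread))
```

A collapsed batch is penalized almost fully by the variance term, while a well-spread, decorrelated batch incurs no penalty from either term.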
 **Total loss**:
 ```
+L_total = w_jepa · L_JEPA + w_task · L_task + λ · L_reg + α · L_gen
 Where:
   L_task = CrossEntropy(disc_head(z_K), answer_label)    # MC scoring
   L_gen = CE(gen_head(z_K), target_answer_tokens)        # Short answer (Phase 3)
+  L_reg = SIGReg and/or VICReg
   λ = 0.1 (regularization weight)
   α = 0.1 (generative weight)
 ```
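The weighted sum can be written out as a short sketch. The defaults `w_jepa = w_task = 1.0` are assumptions (this document does not state them); `lam` and `alpha` use the 0.1 values given above.

```python
def total_loss(l_jepa, l_task, l_reg=0.0, l_gen=0.0,
               w_jepa=1.0, w_task=1.0, lam=0.1, alpha=0.1):
    """L_total = w_jepa*L_JEPA + w_task*L_task + lam*L_reg + alpha*L_gen."""
    return w_jepa * l_jepa + w_task * l_task + lam * l_reg + alpha * l_gen

# Full objective versus a task-only variant (w_jepa = 0), with made-up loss values.
full = total_loss(l_jepa=0.4, l_task=1.2, l_reg=0.5, l_gen=2.0)
task_only = total_loss(l_jepa=0.4, l_task=1.2, l_reg=0.5, l_gen=2.0, w_jepa=0.0)
print(full, task_only)
```

Setting `w_jepa` to zero is one plausible way a "task loss only" ablation could be expressed; whether the training script does it this way is not confirmed here.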
 ## 4. Ablation Experiments
+### Complete ablation matrix
+Each experiment corresponds to one CLI flag combination in `train_mrjepa.py`; `hybrid_main` is the default and needs no extra flags.
+| Experiment | CLI flag | Modification | Expected finding |
+|------------|----------|-------------|-----------------|
+| `hybrid_main` | *(default)* | Full model (DINOv3-L, K=3, SmoothL1+VICReg) | Best overall |
+| `no_jepa` | `--no_jepa` | Remove L_JEPA, task loss only | Drops on reasoning-heavy benchmarks |
+| `no_rollout` | `--no_rollout` | K=0, use z₀ directly (also disables JEPA¹) | Significant drop (proves rollout value) |
+| `no_gate` | `--no_evidence_gate` | Remove sigmoid evidence gating | Slight drop (gate helps focus) |
+| `K1` | `--K 1` | Shallow rollout | Worse than K=3 |
+| `K5` | `--K 5` | Deeper rollout | Diminishing returns |
+| `K7` | `--K 7` | Very deep rollout | Overfitting / diminishing returns |
+| `dinov2_ablation` | `--backbone dinov2` | DINOv2-L/14 instead of DINOv3-L/16 | DINOv3 > DINOv2 due to Gram anchoring + RoPE |
+| `mse_loss` | `--loss_fn mse` | MSE (L2) JEPA loss (original I-JEPA) | Slightly worse than SmoothL1 |
+| `cosine_loss` | `--loss_fn cosine` | Cosine similarity JEPA loss | Better for purist, similar for hybrid |
+| `no_sigreg` | `--no_sigreg` | Disable SIGReg anti-collapse | Training instability / representation collapse |
+| `vicreg_only` | `--no_sigreg --use_vicreg` | VICReg only (no SIGReg) | Alternative anti-collapse strategy |
+| `purist` | `--purist` | DINOv3-B, K=5, Cosine+SIGReg, no enriched evidence | Lower absolute scores, but validates JEPA contribution |
+¹ `no_rollout` also disables JEPA because with K=0 there is only z₀ — no trajectory to supervise. To test JEPA in isolation, use `--no_jepa` with K>0.
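The footnote's rule can be sketched with a hypothetical `argparse` parser. The flag names are taken from the table above, but the parser layout, the default values (`dinov3`, `smooth_l1`), and the `resolve` helper are assumptions rather than the actual contents of `train_mrjepa.py`. The `--purist` preset is left out because its exact behavior is not specified here.

```python
import argparse

def build_parser():
    """Hypothetical parser covering the ablation flags listed in the table."""
    p = argparse.ArgumentParser(description="Sketch of MR-JEPA ablation flags")
    p.add_argument("--K", type=int, default=3, help="number of rollout steps")
    p.add_argument("--backbone", choices=["dinov3", "dinov2"], default="dinov3")
    p.add_argument("--loss_fn", choices=["smooth_l1", "mse", "cosine"],
                   default="smooth_l1")
    for flag in ("no_jepa", "no_rollout", "no_evidence_gate",
                 "no_sigreg", "use_vicreg"):
        p.add_argument(f"--{flag}", action="store_true")
    return p

def resolve(args):
    """Derive the effective settings, applying the K=0 rule from the footnote."""
    K = 0 if args.no_rollout else args.K
    return {
        "K": K,
        # With K=0 there is no trajectory to supervise, so JEPA is off as well.
        "use_jepa": (not args.no_jepa) and K > 0,
        "use_gate": not args.no_evidence_gate,
        "use_sigreg": not args.no_sigreg,
        "use_vicreg": args.use_vicreg,
        "backbone": args.backbone,
        "loss_fn": args.loss_fn,
    }

parser = build_parser()
print(resolve(parser.parse_args(["--no_rollout"])))
print(resolve(parser.parse_args(["--no_sigreg", "--use_vicreg"])))
```

Deriving `use_jepa` from both `--no_jepa` and the effective K keeps the two ablations distinguishable: `--no_jepa` with K>0 isolates the JEPA loss, while `--no_rollout` removes both the rollout and the loss.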
 ### Cross-benchmark analysis:
 - JEPA contribution should be highest on **reasoning** benchmarks (MathVista, MMMU, ScienceQA)
 - Evidence gate contribution should be highest on **evidence-rich** benchmarks (DocVQA, ChartQA)
 - Enriched evidence (Phase 3) should matter most for **document** benchmarks
+- DINOv3 vs DINOv2 gap should be largest on **fine-grained visual** benchmarks (AI2D, ChartQA)
 ---