Improve model card for C-JEPA

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +46 -5
README.md CHANGED
@@ -1,8 +1,49 @@
  ---
  license: apache-2.0
- datasets:
- - user/dataset-name
- base_model: user/model-name
  paper_ids:
- - "2602.11389"
- ---
+ - '2602.11389'
+ pipeline_tag: image-feature-extraction
+ datasets:
+ - clevrer
+ - pusht
+ tags:
+ - object-centric
+ - world-models
+ - causal-inference
+ - jepa
+ - representation-learning
+ - vision
+ ---
+
+ # C-JEPA: Causal-JEPA
+
+ This repository contains the weights and code for **Causal-JEPA (C-JEPA)**, a simple and flexible object-centric world model architecture presented in the paper [Causal-JEPA: Learning World Models through Object-Level Latent Interventions](https://huggingface.co/papers/2602.11389).
+
+ * **Paper:** [Causal-JEPA: Learning World Models through Object-Level Latent Interventions](https://huggingface.co/papers/2602.11389)
+ * **Project Page:** [https://hazel-heejeong-nam.github.io/cjepa/](https://hazel-heejeong-nam.github.io/cjepa/)
+ * **Code:** [https://github.com/galilai-group/cjepa](https://github.com/galilai-group/cjepa)
+
+ ## Summary
+ World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. C-JEPA is a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential.
+
+ Empirically, C-JEPA demonstrates:
+ * **Improved Visual Reasoning:** Consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking on benchmarks like CLEVRER.
+ * **Efficient Planning:** Substantially more efficient planning in agent control tasks (e.g., Push-T), using only 1% of the total latent input features required by patch-based world models while achieving comparable performance.
+ * **Causal Inductive Bias:** A formal analysis demonstrates that object-level masking induces a causal inductive bias via latent interventions.
+
+ ## Architecture
+ ![architecture](https://hazel-heejeong-nam.github.io/cjepa/static/architecture.png)
+
+ ## Setup and Usage
+ C-JEPA relies on object-centric encoders (like VideoSAUR or SAVi) to extract representations. For detailed environment setup, dataset preparation, and training/evaluation scripts, please refer to the [official GitHub repository](https://github.com/galilai-group/cjepa). The repository also provides model checkpoints and pre-extracted slot representations for various configurations.
+
+ ## Citation
+ If you find this work useful, please consider citing:
+ ```bibtex
+ @article{nam2026causal,
+ title={Causal-JEPA: Learning World Models through Object-Level Latent Interventions},
+ author={Nam, Heejeong and Le Lidec, Quentin and Maes, Lucas and LeCun, Yann and Balestriero, Randall},
+ journal={arXiv preprint arXiv:2602.11389},
+ year={2026}
+ }
+ ```
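
For readers of the Summary in this diff, the object-level masking idea (an object's state must be inferred from the other objects) can be illustrated with a small sketch. This is not the C-JEPA implementation: the function name `mask_object_slots`, the nested-list slot layout, and the fixed `mask_token` are all assumptions made for illustration, and the model card does not say which time steps are masked or how the mask token is defined.

```python
import random


def mask_object_slots(slots, mask_token, rng=None):
    """Hide one object's slot at every time step (hypothetical helper).

    slots: list of T time steps, each a list of N slot vectors (lists of floats).
    mask_token: placeholder vector written in place of the hidden slot.
    Returns (masked_slots, masked_index, targets), where targets holds the
    original vectors of the hidden object, one per time step.
    """
    rng = rng or random.Random()
    num_objects = len(slots[0])
    # Assumption: a single object is chosen uniformly at random.
    masked_index = rng.randrange(num_objects)

    masked_slots, targets = [], []
    for step in slots:
        # The predictor would have to recover this vector from the other objects.
        targets.append(list(step[masked_index]))
        masked_slots.append([
            list(mask_token) if i == masked_index else list(vec)
            for i, vec in enumerate(step)
        ])
    return masked_slots, masked_index, targets


# Toy input: 4 time steps, 3 objects, 2-dimensional slots.
slots = [[[float(t), float(n)] for n in range(3)] for t in range(4)]
masked, idx, targets = mask_object_slots(
    slots, mask_token=[0.0, 0.0], rng=random.Random(0)
)
```

In this sketch a predictor would receive `masked` and be trained to match `targets` in latent space; because the hidden object's own slot is unavailable, the only usable signal is the other objects, which is the shortcut-prevention argument the Summary makes.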