Add model card for Visual Jigsaw

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +31 -0
README.md ADDED
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Visual Jigsaw Post-Training Improves MLLMs

Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. It is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. We provide instantiations of Visual Jigsaw across three visual modalities: images, videos, and 3D data.
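The ordering task above can be sketched in a few lines of Python: partition an image into a grid of patches, shuffle them, and ask for the permutation that restores the original layout. This is an illustrative toy, not the repository's implementation; the function names and the list-of-lists "image" are our own.

```python
import random

def partition(image, grid):
    """Split a 2D image (list of rows) into grid x grid patches,
    listed in row-major order."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]

def shuffle_patches(patches, seed=0):
    """Shuffle the patches and return (shuffled, target), where
    target[i] is the index in `shuffled` that holds original patch i,
    i.e. the permutation the model is asked to produce."""
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    target = [order.index(i) for i in range(len(patches))]
    return shuffled, target

# Toy 6x6 "image" with distinct pixel values.
img = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = partition(img, grid=3)
shuffled, target = shuffle_patches(patches, seed=1)
# Applying the target permutation recovers the original patch order.
assert [shuffled[j] for j in target] == patches
```

In Visual Jigsaw, the model emits this permutation as a natural-language answer; the same ordering formulation carries over to shuffled video frames and 3D data.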

* **Paper:** [Visual Jigsaw Post-Training Improves MLLMs](https://huggingface.co/papers/2509.25190)
* **Project Page:** https://penghao-wu.github.io/visual_jigsaw/
* **Code:** https://github.com/penghao-wu/visual_jigsaw

<p align="center">
  <img src="https://github.com/penghao-wu/visual_jigsaw/raw/main/assets/overview.png" alt="Overview of Visual Jigsaw" width="700"/>
</p>

## License
This project is released under the Apache-2.0 license.

## Citation
Please consider citing our paper if you find this project helpful for your research:

```bibtex
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}
```