Add model card for Visual Jigsaw

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +31 -0
README.md ADDED
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Visual Jigsaw Post-Training Improves MLLMs

Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. It is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. We provide instantiations of Visual Jigsaw across three visual modalities: images, videos, and 3D data.
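The ordering task above can be sketched in a few lines of Python: partition an image into a grid of patches, shuffle them, and ask for the permutation that restores the original layout. This is an illustrative toy, not the repository's implementation; the function names and the list-of-lists "image" are our own.

```python
import random

def partition(image, grid):
    """Split a 2D image (list of rows) into grid x grid patches,
    listed in row-major order."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]

def shuffle_patches(patches, seed=0):
    """Shuffle the patches and return (shuffled, target), where
    target[i] is the index in `shuffled` that holds original patch i,
    i.e. the permutation the model is asked to produce."""
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    target = [order.index(i) for i in range(len(patches))]
    return shuffled, target

# Toy 6x6 "image" with distinct pixel values.
img = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = partition(img, grid=3)
shuffled, target = shuffle_patches(patches, seed=1)
# Applying the target permutation recovers the original patch order.
assert [shuffled[j] for j in target] == patches
```

In Visual Jigsaw, the model emits this permutation as a natural-language answer; the same ordering formulation carries over to shuffled video frames and 3D data.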

* **Paper:** [Visual Jigsaw Post-Training Improves MLLMs](https://huggingface.co/papers/2509.25190)
* **Project Page:** https://penghao-wu.github.io/visual_jigsaw/
* **Code:** https://github.com/penghao-wu/visual_jigsaw

<p align="center">
  <img src="https://github.com/penghao-wu/visual_jigsaw/raw/main/assets/overview.png" alt="Overview of Visual Jigsaw" width="700"/>
</p>

## License
This project is released under the Apache-2.0 license.

## Citation
Please consider citing our paper if you find this project helpful for your research:

```bibtex
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}
```