craigwu
/

visual_jigsaw_3D_7B

+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+# Visual Jigsaw Post-Training Improves MLLMs
+This repository contains models for **Visual Jigsaw**, a framework presented in the paper [Visual Jigsaw Post-Training Improves MLLMs](https://huggingface.co/papers/2509.25190).
+Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in Multimodal Large Language Models (MLLMs). It is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. The framework has been instantiated across three visual modalities: images, videos, and 3D data.
+<p align="center">
+<img src="https://github.com/penghao-wu/visual_jigsaw/raw/main/assets/overview.png" alt="Overview of Visual Jigsaw" width="700"/>
+</p>
+*   **Project Page:** [https://penghao-wu.github.io/visual_jigsaw/](https://penghao-wu.github.io/visual_jigsaw/)
+*   **Code Repository:** [https://github.com/penghao-wu/visual_jigsaw](https://github.com/penghao-wu/visual_jigsaw)
+## Model Description
+This model, **Visual Jigsaw Image 7B**, is based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and trained with the image jigsaw task to enhance fine-grained perception in multimodal large language models.
+## Model Checkpoints
+We release the following models trained with Visual Jigsaw from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct):
+*   **[Visual Jigsaw Image 7B](https://huggingface.co/craigwu/visual_jigsaw_image_7B):** Qwen2.5-VL-7B-Instruct trained with image jigsaw
+*   **[Visual Jigsaw Video 7B](https://huggingface.co/craigwu/visual_jigsaw_video_7B):** Qwen2.5-VL-7B-Instruct trained with video jigsaw
+*   **[Visual Jigsaw 3D 7B](https://huggingface.co/craigwu/visual_jigsaw_3D_7B):** Qwen2.5-VL-7B-Instruct trained with 3D jigsaw
+## Usage
+Our models are based on Qwen2.5-VL-7B-Instruct. You can use the same code as it for inference. Please refer to the [GitHub repository](https://github.com/penghao-wu/visual_jigsaw) for detailed instructions on installation, training, and evaluation.
+## License
+This project is licensed under the Apache-2.0 license.
+## Citation
+If you find our work helpful or inspiring, please feel free to cite it:
+```bibtex
+@article{visual_jigsaw,
+  author    = {Wu, Penghao and Yushan, Zhang and Haiwen, Diao and Bo, Li and Lu, Lewei and Liu, Ziwei},
+  title     = {Visual Jigsaw Post-Training Improves MLLMs},
+  journal={arXiv preprint arXiv:2509.25190},
+  year={2025}}
+```