Add model card for Visual Jigsaw

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +50 -0
README.md ADDED
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Visual Jigsaw Post-Training Improves MLLMs

This repository contains models for **Visual Jigsaw**, a framework presented in the paper [Visual Jigsaw Post-Training Improves MLLMs](https://huggingface.co/papers/2509.25190).

Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in Multimodal Large Language Models (MLLMs). It is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. The framework is instantiated across three visual modalities: images, videos, and 3D data.
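
The ordering task described above can be sketched with a toy example. This is purely illustrative, not the paper's actual data pipeline: named patches stand in for image tiles, they are shuffled, and the ground-truth answer is the permutation that restores the original order.

```python
import random

def make_image_jigsaw(patches, seed=0):
    """Shuffle a list of patches and return (shuffled, answer).

    `answer[i]` is the original index of the i-th shuffled patch, so the
    answer is the permutation that restores the original order. Toy
    sketch of the ordering task, not the paper's data pipeline.
    """
    rng = random.Random(seed)
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order

patches = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, answer = make_image_jigsaw(patches)

# Reconstruct: place shuffled[i] back at its original position answer[i].
restored = [None] * len(patches)
for i, orig_idx in enumerate(answer):
    restored[orig_idx] = shuffled[i]
assert restored == patches
```

In training, the model sees the shuffled pieces and is rewarded for emitting the correct permutation as text.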

<p align="center">
  <img src="https://github.com/penghao-wu/visual_jigsaw/raw/main/assets/overview.png" alt="Overview of Visual Jigsaw" width="700"/>
</p>

* **Project Page:** [https://penghao-wu.github.io/visual_jigsaw/](https://penghao-wu.github.io/visual_jigsaw/)
* **Code Repository:** [https://github.com/penghao-wu/visual_jigsaw](https://github.com/penghao-wu/visual_jigsaw)

## Model Description

This model, **Visual Jigsaw Image 7B**, is based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and trained with the image jigsaw task to enhance fine-grained perception in multimodal large language models.

## Model Checkpoints

We release the following models trained with Visual Jigsaw from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct):

* **[Visual Jigsaw Image 7B](https://huggingface.co/craigwu/visual_jigsaw_image_7B):** Qwen2.5-VL-7B-Instruct trained with image jigsaw
* **[Visual Jigsaw Video 7B](https://huggingface.co/craigwu/visual_jigsaw_video_7B):** Qwen2.5-VL-7B-Instruct trained with video jigsaw
* **[Visual Jigsaw 3D 7B](https://huggingface.co/craigwu/visual_jigsaw_3D_7B):** Qwen2.5-VL-7B-Instruct trained with 3D jigsaw

## Usage

Since our models are based on Qwen2.5-VL-7B-Instruct, you can use the same inference code as the base model. Please refer to the [GitHub repository](https://github.com/penghao-wu/visual_jigsaw) for detailed instructions on installation, training, and evaluation.
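
As a minimal sketch, assuming the standard Qwen2.5-VL chat message format, a prompt for this model can be assembled as follows. The image URL is a placeholder and `build_messages` is a hypothetical helper, not part of this repository:

```python
# The checkpoint id below comes from the checkpoint list above.
MODEL_ID = "craigwu/visual_jigsaw_image_7B"

def build_messages(image_url, question):
    """Build a chat message list in the Qwen2.5-VL multimodal format:
    one user turn whose content mixes an image entry and a text entry."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("https://example.com/cat.png", "Describe this image.")
```

The resulting `messages` list is then rendered with the processor's `apply_chat_template` and passed to the model's `generate` method, exactly as shown in the Qwen2.5-VL-7B-Instruct model card.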

## License

This project is licensed under the Apache-2.0 license.

## Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}
```