Add model card for Visual Jigsaw Video 7B
#1, opened by nielsr (HF Staff)

README.md (added, +82 lines):
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- qwen
- multimodal
- visual-jigsaw
---

# Visual Jigsaw Video 7B

This repository contains the Visual Jigsaw Video 7B model, which is based on `Qwen2.5-VL-7B-Instruct` and presented in the paper [Visual Jigsaw Post-Training Improves MLLMs](https://huggingface.co/papers/2509.25190).

🌐 [Project Page](https://penghao-wu.github.io/visual_jigsaw/) | 💻 [Code on GitHub](https://github.com/penghao-wu/visual_jigsaw)

Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in Multimodal Large Language Models (MLLMs). It is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This model is an instantiation of Visual Jigsaw trained on video data, focusing on temporal reasoning.

<p align="center">
<img src="https://github.com/penghao-wu/visual_jigsaw/raw/main/assets/overview.png" alt="Overview of Visual Jigsaw" width="700"/>
</p>
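To make the ordering task concrete, here is a minimal, self-contained sketch of how a video jigsaw training example could be constructed. This is an illustration only, not the authors' actual data pipeline: the function name `make_video_jigsaw_example` and the prompt wording are invented for this sketch.

```python
import random

def make_video_jigsaw_example(num_clips=4, seed=0):
    """Build a toy video-jigsaw ordering example: shuffle clip indices
    and record the permutation that restores temporal order."""
    rng = random.Random(seed)
    order = list(range(num_clips))   # true temporal order of the clips
    shuffled = order[:]
    rng.shuffle(shuffled)            # order in which the clips are shown
    # answer[i] = position in `shuffled` of the clip that comes i-th in time
    answer = [shuffled.index(i) for i in order]
    question = (f"The video is cut into {num_clips} clips shown in shuffled "
                f"order. Output the permutation that restores temporal order.")
    return {"shuffled_clips": shuffled, "question": question,
            "target": " ".join(str(p) for p in answer)}

example = make_video_jigsaw_example()
print(example["shuffled_clips"], "->", example["target"])
```

Reading the shown clips back in the order given by `target` recovers the original temporal sequence, which is exactly the supervision signal the ordering task provides.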
## How to use (Inference)

Our models are based on `Qwen2.5-VL-7B-Instruct`, so you can run inference with the `transformers` library by following the standard Qwen2.5-VL-Instruct usage pattern.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load model and processor for Visual Jigsaw Video 7B
model_id = "craigwu/visual_jigsaw_video_7B"  # assuming this is the current model repository
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16 depending on your GPU
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# For video input, you would typically load a sequence of frames.
# This example uses a single dummy image to demonstrate the API structure;
# for actual video processing, pass your video frames instead.
dummy_image = Image.new("RGB", (500, 300), color="blue")

# Prepare chat messages in the Qwen2.5-VL-Instruct format.
# For video, use {"type": "video", "video": frames} with a list of frames.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": dummy_image},
        {"type": "text", "text": "Describe the content shown."},
    ]}
]

# Build the prompt and tensor inputs, then move them to the model's device
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = processor(text=[text_input], images=[dummy_image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
new_tokens = generated_ids[:, model_inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

print(response)
```
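Since the snippet above demonstrates the API with a single image, here is a sketch of the frame-sampling step you would need for real video input. `sample_frame_indices` is a hypothetical helper written for this card, not part of the repository, and the frame count and resolution are arbitrary.

```python
from PIL import Image

def sample_frame_indices(total_frames, num_samples=8):
    """Uniformly sample frame indices across a video (center of each segment)."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# A real pipeline would decode these frames with e.g. decord or PyAV;
# blank PIL images stand in for decoded frames here.
indices = sample_frame_indices(total_frames=120, num_samples=8)
frames = [Image.new("RGB", (224, 224)) for _ in indices]
print(indices)
```

The sampled frames can then be supplied as a `{"type": "video", "video": frames}` content entry in the chat messages and passed to the processor via its `videos=` argument in place of `images=`.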

## Citation

If you find this project helpful for your research, please consider citing our paper:

```bibtex
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}
```