---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- qwen
- multimodal
- visual-jigsaw
---
# Visual Jigsaw: Visual Jigsaw Video 7B
This repository contains the Visual Jigsaw Video 7B model, which is based on Qwen2.5-VL-7B-Instruct and presented in the paper *Visual Jigsaw Post-Training Improves MLLMs*.
🌐 Project Page | 💻 Code on GitHub
Visual Jigsaw is a generic self-supervised post-training framework designed to strengthen visual understanding in Multimodal Large Language Models (MLLMs). It is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This specific model is an instantiation of Visual Jigsaw trained with video data, focusing on temporal reasoning.
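Concretely, for video the ordering task amounts to shuffling temporal clips and asking the model to output the permutation that restores their original order. A minimal, illustrative sketch of that setup (function names are ours, not from the released code):

```python
import random

def make_jigsaw(num_clips: int, seed: int = 0):
    """Shuffle clip indices; the answer is the permutation restoring order.

    Returns (shuffled, answer), where shuffled[pos] is the original index of
    the clip shown at position pos, and answer[i] is the position in the
    shuffled sequence of the clip that originally came i-th.
    """
    rng = random.Random(seed)
    shuffled = list(range(num_clips))
    rng.shuffle(shuffled)
    answer = [shuffled.index(i) for i in range(num_clips)]
    return shuffled, answer

def score_prediction(pred: list[int], answer: list[int]) -> float:
    """Fraction of positions the predicted ordering gets right."""
    return sum(p == a for p, a in zip(pred, answer)) / len(answer)

shuffled, answer = make_jigsaw(4, seed=42)
# A model answering in natural language, e.g. "3, 1, 0, 2", would be parsed
# into a list of ints and scored against `answer`:
print(score_prediction(answer, answer))  # perfect reconstruction -> 1.0
```

In the actual post-training framework the model produces this permutation in natural language and is rewarded for correct reconstruction; the sketch only shows the shuffle/score mechanics.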
## How to use (Inference)
Our model is based on Qwen2.5-VL-7B-Instruct, so you can run inference with the `transformers` library following the standard Qwen2.5-VL usage pattern.
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load model and processor for Visual Jigsaw Video 7B
model_id = "craigwu/visual_jigsaw_video_7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16, depending on your GPU
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# For video input, you would typically load a sequence of frames.
# This example uses a single dummy image to demonstrate the API structure;
# for actual video processing, pass your sampled frames instead.
dummy_image = Image.new("RGB", (500, 300), color="blue")

# Prepare chat messages in the Qwen2.5-VL format.
# For video, use {"type": "video", "video": ...} with a list of frames.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": dummy_image},
        {"type": "text", "text": "Describe the content shown."},
    ]}
]

# Build the prompt and tensor inputs, moving them onto the model's device
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = processor(text=[text_input], images=[dummy_image], return_tensors="pt").to(model.device)

# Generate a response; trim the prompt tokens before decoding
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
trimmed_ids = generated_ids[:, model_inputs["input_ids"].shape[1]:]
response = processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0]
print(response)
```
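To run the model on an actual video, you would typically decode the file into frames, sample a fixed number of them uniformly across the clip, and pass those frames to the processor as the video input. The decoding library is up to you; below is a minimal, library-agnostic sketch of the uniform sampling step (the helper name is illustrative, not part of the released code):

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread uniformly over the video.

    Each index is taken from the middle of its segment, so the samples
    cover the whole clip rather than clustering at the start.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# e.g. pick 8 frames from a 300-frame video, decode those frames with your
# preferred library, and pass the resulting PIL images as the video content.
print(sample_frame_indices(300, 8))
```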
## Citation
If you find this project helpful for your research, please consider citing our paper:
```bibtex
@article{visual_jigsaw,
  author  = {Wu, Penghao and Zhang, Yushan and Diao, Haiwen and Li, Bo and Lu, Lewei and Liu, Ziwei},
  title   = {Visual Jigsaw Post-Training Improves MLLMs},
  journal = {arXiv preprint arXiv:2509.25190},
  year    = {2025}
}
```