	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	datasets:
	- Senqiao/VisionThink-General-Train
	- Senqiao/VisionThink-General-Val
	license: apache-2.0
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- vision-language-model
	- multimodal
	- qwen
	---

	<p align="center" width="100%">
	<img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/VisionThink.jpg" alt="VisionThink" style="width: 100%; min-width: 300px; display: block; margin: auto;">
	</p>

	# VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

	This repository contains the `VisionThink-General` model, a smart and efficient vision-language model. VisionThink introduces a new paradigm for visual token compression in Vision-Language Models (VLMs). It starts from a downsampled image and decides whether that image is sufficient to solve the problem. If it is not, the model outputs a special token to request the higher-resolution image. Unlike existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink decides case by case whether to compress tokens.

	The model was presented in the paper [VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning](https://huggingface.co/papers/2507.13348).

	The official code and more details can be found on the [VisionThink GitHub repository](https://github.com/dvlab-research/VisionThink).

	## Highlights
	<p align="center" width="80%">
	<img src="https://raw.githubusercontent.com/dvlab-research/VisionThink/main/files/Framework.jpg" alt="VisionThink Framework" style="width: 80%; min-width: 300px; display: block; margin: auto;">
	</p>

	1. VisionThink uses reinforcement learning to learn autonomously when to reduce visual tokens. Compared to traditional efficient VLM approaches, it achieves significant improvements on fine-grained benchmarks, such as those involving OCR-related tasks.

	2. VisionThink improves performance on General VQA tasks while reducing visual tokens by 50%, achieving 102% of the original model’s performance across nine benchmarks.

	3. VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens. We hope this inspires further research into Efficient Reasoning Vision Language Models.
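
As a rough illustration of why resizing reduces visual tokens, the sketch below estimates the token count before and after downsampling. It assumes the Qwen2.5-VL convention that one visual token covers roughly a 28x28 pixel area (14-pixel patches merged 2x2), ignores the processor's minimum and maximum pixel limits, and uses a hypothetical scale factor of 0.5 per side. `estimate_visual_tokens` is an illustrative helper, not part of the VisionThink code.

```python
def estimate_visual_tokens(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Approximate the number of visual tokens for an image of the given size.

    Assumption: each token covers a 28x28 pixel area, and each side is
    rounded to the nearest multiple of that size (at least one unit).
    The processor's min/max pixel limits are not modeled here.
    """
    w_units = max(1, round(width / pixels_per_token_side))
    h_units = max(1, round(height / pixels_per_token_side))
    return w_units * h_units


if __name__ == "__main__":
    full = estimate_visual_tokens(1120, 840)
    # Hypothetical downsampling: halve each side of the image.
    small = estimate_visual_tokens(1120 // 2, 840 // 2)
    print(f"full resolution: {full} tokens, downsampled: {small} tokens")
```

Under these assumptions, halving each side cuts the per-image token count to about a quarter. The average saving over a benchmark also depends on how often the full-resolution image is requested, which this sketch does not model.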

	## Installation

	The environment setup follows [Verl](https://github.com/volcengine/verl).
	```bash
	git clone https://github.com/dvlab-research/VisionThink.git
	cd VisionThink
	conda create -n visionthink python=3.11 -y
	conda activate visionthink
	# veRL
	pip3 install -e .
	# flash-attn
	pip3 install flash-attn --no-build-isolation
	```
	If you want to use Qwen3 as the judge model, also install:
	```bash
	pip install -U tensordict
	pip install transformers==4.51.0
	```
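
To confirm which versions ended up in the environment, a small check like the following may help. It is a minimal sketch that uses only the Python standard library; the package names listed are the ones installed above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as passed to pip in the commands above.
    for name, version in installed_versions(["transformers", "tensordict", "flash-attn"]).items():
        print(f"{name}: {version or 'not installed'}")
```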

	## Usage

	You can easily load and use VisionThink with the Hugging Face `transformers` library. Below is a quick example demonstrating how to load the `VisionThink-General` model and perform inference.

	```python
	from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
	from PIL import Image

	# Load model and processor
	model_id = "Senqiao/VisionThink-General" # Or "Senqiao/VisionThink-Efficient"
	processor = AutoProcessor.from_pretrained(model_id)
	# The base model is Qwen2.5-VL, so use its conditional-generation class
	# (requires a transformers release with Qwen2.5-VL support).
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype="auto",
	    device_map="auto",
	)

	# Prepare input image and text
	# Replace with your image path
	image = Image.open("./path/to/your/image.jpg").convert("RGB")
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": image},
	            {"type": "text", "text": "Describe this image in detail."},
	        ],
	    }
	]

	# Apply chat template and process inputs
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	inputs = processor(text=text, images=image, return_tensors="pt")
	inputs = inputs.to(model.device)

	# Generate response
	generated_ids = model.generate(**inputs, max_new_tokens=512)

	# Decode and print the output
	generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
	response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```
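
The example above runs a single pass on the image as given. The request-on-demand behavior described in the introduction (answer from a downsampled image first, and fall back to the full image only when the model asks for it) could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the official inference pipeline: `REQUEST_MARKER` is a hypothetical placeholder because the actual special token is not given in this card, `generate_fn` and `downsample_fn` are caller-supplied callables, and the second pass is a fresh call rather than a continuation of the first conversation. See the GitHub repository for the real implementation.

```python
from typing import Any, Callable, Tuple

# Hypothetical placeholder: the real special token is not stated in this card.
REQUEST_MARKER = "<request_high_res>"


def answer_with_optional_upscale(
    image: Any,
    question: str,
    generate_fn: Callable[[Any, str], str],
    downsample_fn: Callable[[Any], Any],
    request_marker: str = REQUEST_MARKER,
) -> Tuple[str, bool]:
    """Try the downsampled image first; use the full image only on request.

    Returns the final answer and a flag telling whether the full-resolution
    image was needed.
    """
    low_res = downsample_fn(image)
    first_pass = generate_fn(low_res, question)
    if request_marker not in first_pass:
        # The model answered from the downsampled image alone.
        return first_pass, False
    # The model asked for more detail: run again on the original image.
    return generate_fn(image, question), True
```

With PIL, `downsample_fn` could be something like `lambda im: im.resize((im.width // 2, im.height // 2))`, and `generate_fn` could wrap the processor and `model.generate` calls shown above. If the marker is a special token, note that decoding with `skip_special_tokens=True` would strip it from the text before this check.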

	## Citation
	If you find this project useful in your research, please consider citing:

	```bibtex
	@article{yang2025visionthink,
	title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
	author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
	journal={arXiv preprint arXiv:2507.13348},
	year={2025},
	}
	```

	## Acknowledgement
	- This work is built upon [Verl](https://github.com/volcengine/verl), [EasyR1](https://github.com/hiyouga/EasyR1), [Lmms-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), and [MMSearch-R1](https://github.com/EvolvingLMMs-Lab/multimodal-search-r1). We thank them for their excellent open-source contributions.

	- We also thank [Qwen](https://github.com/QwenLM/Qwen2.5-VL), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [VisionZip](https://github.com/dvlab-research/VisionZip), [FastV](https://github.com/pkunlp-icler/FastV), [SparseVLM](https://github.com/Gumpest/SparseVLMs), and others for their contributions, which have provided valuable insights.