|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# Mixture-of-Visual-Thoughts |
|
|
|
|
|
AdaVaR-3B/7B are our adaptive visual reasoning models, capable of reasoning in two thinking modes:
|
|
|
|
|
1. Text-based reasoning: expressing the reasoning process directly in natural language;
|
|
2. Grounded reasoning: aligning the reasoning process with the image via coordinates (typically object bounding boxes).
|
|
|
|
|
For a more detailed introduction, please visit:
|
|
|
|
|
- Our Github Repo: [Mixture-of-Visual-Thoughts](https://github.com/Future-Living-Lab/mixture-of-visual-thoughts) |
|
|
- Our Paper: [arXiv:2509.22746](https://arxiv.org/pdf/2509.22746)
|
|
|
|
|
## Quick Usage of AdaVaR |
|
|
Our AdaVaR-3B/7B models are built on Qwen2.5-VL-3B/7B, so you can use them the same way as Qwen2.5-VL: simply replace the system prompt and append a post prompt to the question.
|
|
```python |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
|
from constants import R1_SYSTEM_PROMPT_ADAPT_v2, POST_PROMPT_ADAPT_v2 |
|
|
import torch |
|
|
from qwen_vl_utils import process_vision_info |
|
|
|
|
|
# loading the model and processor |
|
|
model_path = "ZejunLi/AdaVaR-3B" |
|
|
device = torch.device("cuda") |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device) |
|
|
processor = AutoProcessor.from_pretrained(model_path) |
|
|
|
|
|
# construct input messages |
|
|
image = "./assets/vstar.jpg" |
|
|
query = "Is the dog on the left or right side of the bicycle? (A) right; (B) left. Please answer the question with the correct option letter, e.g., A, B, C, D." |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": R1_SYSTEM_PROMPT_ADAPT_v2}, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": image, |
|
|
}, |
|
|
{"type": "text", "text": query + " " + POST_PROMPT_ADAPT_v2}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# process model inputs |
|
|
image_inputs, _ = process_vision_info(messages) |
|
|
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
|
input_dict = {k: v.to(device) for k, v in processor(text=[prompt], images=image_inputs, padding=True, return_tensors="pt").items()}
|
|
|
|
|
# generate model responses |
|
|
output = model.generate(**input_dict, use_cache=True, do_sample=False, max_new_tokens=2048) |
|
|
output_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_dict['input_ids'], output)]
|
|
response = processor.tokenizer.batch_decode(output_trimmed)[0] |
|
|
print(response) |
|
|
``` |
|
|
Note: the sample image is provided in our GitHub repo.
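
Because AdaVaR selects its reasoning mode by first emitting a mode prefix token (see the next section), you can check which mode was chosen by inspecting the start of the decoded response. A minimal sketch, assuming the prefix token is kept in the decoded output (it may be absent if stripped as a special token):

```python
# Check which reasoning mode AdaVaR chose for this query
# (assumes the mode prefix token appears in the decoded response).
stripped = response.lstrip()
if stripped.startswith("<grounding>"):
    print("AdaVaR chose the visually-grounded reasoning mode.")
elif stripped.startswith("<text>"):
    print("AdaVaR chose the text-based reasoning mode.")
```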
|
|
|
|
|
AdaVaR adaptively chooses an appropriate mode on its own. Users can also force a specific mode by appending the corresponding mode prefix token to the generation prompt:
|
|
```python |
|
|
# visually-grounded mode |
|
|
grd_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<grounding>" |
|
|
|
|
|
# text-based mode |
|
|
txt_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<text>" |
|
|
``` |
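
To actually generate in a fixed mode, feed the prefixed prompt through the same processing and generation pipeline as in the first example. A minimal sketch reusing `model`, `processor`, `image_inputs`, and `device` from above:

```python
# Generate with the visually-grounded mode forced via the "<grounding>" prefix;
# processing and generation are identical to the adaptive example above.
grd_inputs = processor(text=[grd_query], images=image_inputs, padding=True, return_tensors="pt")
grd_inputs = {k: v.to(device) for k, v in grd_inputs.items()}
grd_output = model.generate(**grd_inputs, use_cache=True, do_sample=False, max_new_tokens=2048)
grd_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(grd_inputs['input_ids'], grd_output)]
print(processor.tokenizer.batch_decode(grd_trimmed)[0])
```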
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our code, model, or data helpful for your work, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@article{li2025mixture, |
|
|
title={Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning}, |
|
|
author={Li, Zejun and Zhao, Yingxiu and Zhang, Jiwen and Wang, Siyuan and Yao, Yang and Zhao, Runzhou and Song, Jun and Zheng, Bo and Wei, Zhongyu}, |
|
|
journal={arXiv preprint arXiv:2509.22746}, |
|
|
year={2025} |
|
|
} |
|
|
``` |