---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# Mixture-of-Visual-Thoughts

AdaVaR-3B/7B is our adaptive visual reasoning model, which can reason in two thinking modes:

1. Text-based reasoning: expresses the reasoning process directly in natural language;
2. Grounded reasoning: aligns the reasoning process with the image through coordinates (typically object bounding boxes).

For a more detailed introduction, please visit:

- Our GitHub repo: [Mixture-of-Visual-Thoughts](https://github.com/Future-Living-Lab/mixture-of-visual-thoughts)
- Our paper: https://arxiv.org/pdf/2509.22746

## Quick Usage of AdaVaR
Our AdaVaR-3B/7B models are based on Qwen2.5-VL-3B/7B, so you can use them the same way as Qwen2.5-VL: just replace the system prompt and append a post prompt to the query.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from constants import R1_SYSTEM_PROMPT_ADAPT_v2, POST_PROMPT_ADAPT_v2
import torch
from qwen_vl_utils import process_vision_info

# loading the model and processor
model_path = "ZejunLi/AdaVaR-3B"
device = torch.device("cuda")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)
processor = AutoProcessor.from_pretrained(model_path)

# construct input messages
image = "./assets/vstar.jpg"
query = "Is the dog on the left or right side of the bicycle? (A) right; (B) left. Please answer the question with the correct option letter, e.g., A, B, C, D."

messages = [
    {"role": "system", "content": R1_SYSTEM_PROMPT_ADAPT_v2},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {"type": "text", "text": query + " " + POST_PROMPT_ADAPT_v2},
        ],
    }
]

# process model inputs
image_inputs, _ = process_vision_info(messages)
query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_dict = {k:v.to(device) for k,v in processor(text=[query], images=image_inputs, padding=True, return_tensors="pt").items()}

# generate model responses
output = model.generate(**input_dict, use_cache=True, do_sample=False, max_new_tokens=2048)
output_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(input_dict['input_ids'], output)]
response = processor.tokenizer.batch_decode(output_trimmed)[0]
print(response)
```
Note: the sample image is provided in our GitHub repo.

By default, AdaVaR adaptively chooses an appropriate mode for each query. To force a specific mode, append the corresponding mode prefix token to the templated prompt:
```python
# visually-grounded mode
grd_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<grounding>"

# text-based mode
txt_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<text>"
```
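The mode selection above can be wrapped in a small helper. This is only a minimal sketch: `with_mode_prefix` and `MODE_PREFIXES` are hypothetical names introduced here for illustration and are not part of the AdaVaR codebase; only the `<grounding>` and `<text>` tokens come from the snippet above.

```python
from typing import Optional

# Hypothetical helper (not part of the AdaVaR codebase).
# Maps a mode name to the mode prefix token shown above.
MODE_PREFIXES = {
    "grounding": "<grounding>",  # visually-grounded mode
    "text": "<text>",            # text-based mode
}


def with_mode_prefix(prompt: str, mode: Optional[str] = None) -> str:
    """Append a mode prefix token to an already-templated prompt.

    `prompt` is assumed to be the string returned by
    processor.apply_chat_template(..., add_generation_prompt=True).
    With mode=None the prompt is returned unchanged, so the model
    picks the mode adaptively.
    """
    if mode is None:
        return prompt
    if mode not in MODE_PREFIXES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_PREFIXES)}")
    return prompt + MODE_PREFIXES[mode]
```

To use it with the quick-usage code, pass the returned string as `text=[...]` to the processor in place of `query`. Since the generated ids are trimmed by the prompt length, the decoded response then starts after the forced prefix token.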

## Citation

If you find our code, model, or data helpful for your work, please consider citing:

```bibtex
@article{li2025mixture,
  title={Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning},
  author={Li, Zejun and Zhao, Yingxiu and Zhang, Jiwen and Wang, Siyuan and Yao, Yang and Zhao, Runzhou and Song, Jun and Zheng, Bo and Wei, Zhongyu},
  journal={arXiv preprint arXiv:2509.22746},
  year={2025}
}
```