|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# Mixture-of-Visual-Thoughts |
|
|
|
|
|
AdaVaR-3B/7B are our adaptive visual reasoning models, capable of reasoning in two thinking modes:
|
|
|
|
|
1. Text-based reasoning: expressing the reasoning process directly in natural language;
|
|
2. Grounded reasoning: aligning the reasoning process with the image via coordinates (typically object bounding boxes).
|
|
|
|
|
For a more detailed introduction, please visit:
|
|
|
|
|
- Our Github Repo: [Mixture-of-Visual-Thoughts](https://github.com/Future-Living-Lab/mixture-of-visual-thoughts) |
|
|
- Our Paper: [arXiv:2509.22746](https://arxiv.org/pdf/2509.22746)
|
|
|
|
|
## Quick Usage of AdaVaR |
|
|
Our AdaVaR-3B/7B models are built on Qwen2.5-VL-3B/7B, so you can use them the same way as Qwen2.5-VL: simply replace the system prompt and append a post prompt to the question.
|
|
```python |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
|
from constants import R1_SYSTEM_PROMPT_ADAPT_v2, POST_PROMPT_ADAPT_v2 |
|
|
import torch |
|
|
from qwen_vl_utils import process_vision_info |
|
|
|
|
|
# loading the model and processor |
|
|
model_path = "ZejunLi/AdaVaR-3B" |
|
|
device = torch.device("cuda") |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device) |
|
|
processor = AutoProcessor.from_pretrained(model_path) |
|
|
|
|
|
# construct input messages |
|
|
image = "./assets/vstar.jpg" |
|
|
query = "Is the dog on the left or right side of the bicycle? (A) right; (B) left. Please answer the question with the correct option letter, e.g., A, B, C, D." |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": R1_SYSTEM_PROMPT_ADAPT_v2}, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": image, |
|
|
}, |
|
|
{"type": "text", "text": query + " " + POST_PROMPT_ADAPT_v2}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# process model inputs |
|
|
image_inputs, _ = process_vision_info(messages) |
|
|
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
|
input_dict = {k: v.to(device) for k, v in processor(text=[prompt], images=image_inputs, padding=True, return_tensors="pt").items()}
|
|
|
|
|
# generate model responses |
|
|
output = model.generate(**input_dict, use_cache=True, do_sample=False, max_new_tokens=2048) |
|
|
output_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_dict['input_ids'], output)]
|
|
response = processor.tokenizer.batch_decode(output_trimmed)[0] |
|
|
print(response) |
|
|
``` |
|
|
Note: the sample image is provided in our GitHub repo.
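
Because AdaVaR selects its reasoning mode by first emitting a mode prefix token (see the next section), you can check which mode was chosen by inspecting the start of the decoded response. A minimal sketch, assuming the prefix token is kept in the decoded output (it may be absent if stripped as a special token):

```python
# Check which reasoning mode AdaVaR chose for this query
# (assumes the mode prefix token appears in the decoded response).
stripped = response.lstrip()
if stripped.startswith("<grounding>"):
    print("AdaVaR chose the visually-grounded reasoning mode.")
elif stripped.startswith("<text>"):
    print("AdaVaR chose the text-based reasoning mode.")
```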
|
|
|
|
|
AdaVaR adaptively chooses an appropriate mode on its own. Users can also force a specific mode by appending the corresponding mode prefix token to the generation prompt:
|
|
```python |
|
|
# visually-grounded mode |
|
|
grd_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<grounding>" |
|
|
|
|
|
# text-based mode |
|
|
txt_query = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<text>" |
|
|
``` |
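
To actually generate in a fixed mode, feed the prefixed prompt through the same processing and generation pipeline as in the first example. A minimal sketch reusing `model`, `processor`, `image_inputs`, and `device` from above:

```python
# Generate with the visually-grounded mode forced via the "<grounding>" prefix;
# processing and generation are identical to the adaptive example above.
grd_inputs = processor(text=[grd_query], images=image_inputs, padding=True, return_tensors="pt")
grd_inputs = {k: v.to(device) for k, v in grd_inputs.items()}
grd_output = model.generate(**grd_inputs, use_cache=True, do_sample=False, max_new_tokens=2048)
grd_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(grd_inputs['input_ids'], grd_output)]
print(processor.tokenizer.batch_decode(grd_trimmed)[0])
```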
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our code, model, or data helpful for your work, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@article{li2025mixture, |
|
|
title={Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning}, |
|
|
author={Li, Zejun and Zhao, Yingxiu and Zhang, Jiwen and Wang, Siyuan and Yao, Yang and Zhao, Runzhou and Song, Jun and Zheng, Bo and Wei, Zhongyu}, |
|
|
journal={arXiv preprint arXiv:2509.22746}, |
|
|
year={2025} |
|
|
} |
|
|
``` |