---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

### Model Sources

- **Repository:** https://github.com/maifoundations/DualMindVLM
- **Paper:** https://arxiv.org/pdf/2511.16670

### Quick Start

The model is fine-tuned from Qwen2.5-VL-7B-Instruct. We provide an inference example using the Hugging Face Transformers backend. The example also requires the `qwen-vl-utils` package (`pip install qwen-vl-utils`).

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with automatic device placement.
# FlashAttention 2 requires the flash-attn package; remove the
# attn_implementation argument to fall back to the default attention.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "maifoundations/DualMindVLM",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("maifoundations/DualMindVLM")

SYSTEM_PROMPT = """You are a Vision-Language Model answering questions about images.
Follow these rules strictly:
1. Judge the length of reasoning needed.
   - Short: start with "Short Thinking:".
   - Long: start with "Long Thinking:".
2. Short Thinking: give a concise thinking process which is sufficient to answer the question, then provide the final answer.
3. Long Thinking: give a structured reasoning process of the question and the image, including question analysis, visual details description, self-verification and then provide the final answer.
4. The final answer MUST BE put in \\boxed{}."""

# Replace these with your own image and question
image_path = "path/to/your/image.jpg"
question = "What is shown in the image?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": question},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output, then strip the prompt tokens from the result
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### Citation

```bibtex
@article{lin2025dualmindvlm,
  title   = {Learning to Think Fast and Slow for Visual Language Models},
  author  = {Chenyu Lin and Cheng Chi and Jinlin Wu and Sharon Li and Kaiyang Zhou},
  journal = {arXiv preprint arXiv:2511.16670},
  year    = {2025}
}
```
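
### Parsing the Output

Per the system prompt, each response starts with a `Short Thinking:` or `Long Thinking:` prefix and puts the final answer in `\boxed{}`. Below is a minimal sketch for recovering both from the decoded text; the `extract_boxed` helper and its regex are our own illustration, not part of the DualMindVLM repository, and assume the boxed answer contains no nested braces.

```python
import re


def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in the model output, or None.

    Hypothetical helper for illustration; the regex assumes the boxed
    content itself contains no nested braces, which holds for short answers.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


# output_text is the list returned by processor.batch_decode above
response = output_text[0]

# Which reasoning mode did the model choose?
mode = "long" if response.startswith("Long Thinking:") else "short"

print(mode, extract_boxed(response))
```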