LCraiGY committed
Commit 97aeb66 · verified · 1 Parent(s): e451d68

Update README.md

Files changed (1): README.md (+66 −2)
README.md CHANGED
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

### Model Sources
- **Repository:** https://github.com/maifoundations/DualMindVLM
- **Paper:** https://arxiv.org/pdf/2511.16670

### Quick Start
The model is trained on top of Qwen2.5-VL-7B-Instruct. Below is an example of inference with the Transformers backend.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "maifoundations/DualMindVLM",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional; requires the flash-attn package
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("maifoundations/DualMindVLM")

SYSTEM_PROMPT = """You are a Vision-Language Model answering questions about images.
Follow these rules strictly:
1. Judge the length of reasoning needed.
- Short: start with "Short Thinking:".
- Long: start with "Long Thinking:".
2. Short Thinking: give a concise thinking process which is sufficient to answer the question, then provide the final answer.
3. Long Thinking: give a structured reasoning process of the question and the image, including question analysis, visual details description, self-verification and then provide the final answer.
4. The final answer MUST BE put in \\boxed{}."""

# Placeholder inputs: replace with your own image and question.
image_path = "path/to/image.jpg"
question = "What is shown in this image?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": question},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
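
The system prompt instructs the model to prefix each response with "Short Thinking:" or "Long Thinking:" and to put the final answer inside `\boxed{}`. A minimal post-processing sketch for that format follows; the `parse_dualmind_output` helper is illustrative, not part of the released code, and relies only on the output contract stated in the system prompt above.

```python
import re

def parse_dualmind_output(text: str):
    """Illustrative helper (not from the repo): split a DualMindVLM
    response into its thinking mode and the \\boxed{} final answer."""
    # Detect which reasoning mode the model chose, per the system prompt.
    if text.startswith("Short Thinking:"):
        mode = "short"
    elif text.startswith("Long Thinking:"):
        mode = "long"
    else:
        mode = "unknown"

    # Pull the final answer out of \boxed{...}; handles un-nested braces only.
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    answer = match.group(1).strip() if match else None
    return mode, answer

mode, answer = parse_dualmind_output(output_text[0])
print(f"mode={mode!r}, answer={answer!r}")
```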