JiaerX committed
Commit ca7f942 · verified · 1 Parent(s): f8e87a0

Update README.md

Files changed (1):
  1. README.md +60 -2
README.md CHANGED
@@ -16,8 +16,66 @@ tags:
 - **Paper:** https://arxiv.org/pdf/2505.14677
 - **Blog:** https://www.maifoundations.com/blog/visionary-r1/
 
-## Uses
-The model is trained based on the Qwen2.5-VL-3B-Instruct. You can follow the instructions of [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) to use the checkpoints.
+## Quick Start
+The model is built on Qwen2.5-VL-3B-Instruct. The following example shows how to run inference.
+```python
+import torch
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+
+# Load the checkpoint in bfloat16 with FlashAttention 2
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "maifoundations/Visionary-R1",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+    device_map="auto",
+)
+
+# default processor
+processor = AutoProcessor.from_pretrained("maifoundations/Visionary-R1")
+
+SYSTEM_PROMPT = (
+    '''You are tasked with analyzing an image to generate an exhaustive and detailed description. Your goal is to extract and describe all possible information from the image, including but not limited to objects, numbers, text, and the relationships between these elements. The description should be as fine and detailed as possible, capturing every nuance. After generating the detailed description, you need to analyze it and provide step-by-step detailed reasoning for the given question based on the information. Finally, provide a single word or phrase answer to the question. The description, reasoning process and answer are enclosed within <info> </info>, <think> </think> and <answer> </answer> tags, respectively, i.e., <info> image description here </info> <think> reasoning process here </think> <answer> answer here </answer>.
+    '''
+)
+
+messages = [
+    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSWC5jzin8PaEgo1oXaiITvornP8G4m0ep7Ow&s",
+            },
+            {"type": "text", "text": "Which year has the most divergent opinions about Brazil’s economy?"},
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=512)
+generated_ids_trimmed = [
+    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
 
 ## Citation
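The system prompt added in this commit constrains responses to the form `<info> … </info> <think> … </think> <answer> … </answer>`, so downstream code usually wants just the final answer rather than the full trace. Below is a minimal sketch of how the decoded `output_text` could be split into its sections; the `extract_sections` helper is illustrative only and not part of the repository.

```python
import re

def extract_sections(output_text: str) -> dict:
    """Pull the <info>, <think>, and <answer> sections out of a response.

    Returns an empty string for any section the model did not emit.
    """
    sections = {}
    for tag in ("info", "think", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", output_text, re.DOTALL)
        sections[tag] = match.group(1).strip() if match else ""
    return sections

# Hypothetical response in the prompt's required format
sample = (
    "<info> A line chart of survey results. </info> "
    "<think> The widest gap between the lines occurs in 2015. </think> "
    "<answer> 2015 </answer>"
)
print(extract_sections(sample)["answer"])  # → 2015
```

The non-greedy `(.*?)` with `re.DOTALL` keeps each match inside its own tag pair even when the `<think>` section spans multiple lines.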