Safetensors
qwen2_5_vl
Commit c8e14e4 by qunwang13 (verified) · 1 Parent(s): 881abb6

Update README.md

  1. README.md +121 -0
README.md CHANGED
@@ -17,6 +17,127 @@ For further details, please refer to the following:
  - 📚 Github: https://github.com/qunzhongwang/vr-thinker
  - 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

+ ### Quick Start
+ The script below is a sample test interface: it loads the model and asks it to compare two videos generated from the same prompt.
+
+ ~~~python
+ import warnings
+
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from qwen_vl_utils import process_vision_info
+
+ warnings.filterwarnings("ignore")
+
+ # Load the model and its processor from the Hugging Face Hub.
+ model_path = "qunwang13/vr-thinker"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_path, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_path)
+
+ # The two candidate videos to compare.
+ video_urls = [
+     "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4",  # sample video 1
+     "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4",  # sample video 2
+ ]
+
+ # The generation prompt shared by both videos, and the three evaluation dimensions.
+ prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
+ dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
+ dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
+ dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
+
+ # Frame count that the prompt reports the sampled frames were drawn from.
+ N = 150
+
+ # The dimension names already carry their abbreviations, so they are not repeated here.
+ prompt_text = f"""Task Description:
+ Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension. This involves:
+ - Iterative reasoning,
+ - Zooming in on details,
+ - Dynamically selecting frames for further analysis.
+
+ The provided frames are downsampled from these videos:
+ - Video 1: First four input frames.
+ - Video 2: Next four input frames.
+
+ The prompt is: {prompt_for_videos}
+
+ Evaluation Dimensions:
+ 1. {dim_name_1}:
+ {dim_explain_1}
+ 2. {dim_name_2}:
+ {dim_explain_2}
+ 3. {dim_name_3}:
+ {dim_explain_3}
+
+ Frames and Analysis Rules
+ - 8 sampled frames are provided, evenly downsampled from {N} frames
+ - Insufficient frames? Request more:
+ <tool_call>{{"target_frames": []}}</tool_call>
+
+ Format Requirement:
+
+ 1. Snapshot:
+ Every time you receive new visual information, summarize any information that might be useful for your final judgment within <Snapshot></Snapshot> tags.
+
+ 2. Think:
+ Place all reasoning content within <Think></Think> tags.
+
+ 3. Answer:
+ If the final answer can be determined, output the answer within <Answer></Answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <Recommend Answer></Recommend Answer> tags.
+ Here, 1 represents Video 1, 2 represents Video 2, and 0 represents Tie. The confidence levels range from high to low as 1, 2, and 3.
+
+ Examples:
+ <Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>, or
+ <Recommend Answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</Recommend Answer>
+ """
+
+ # Sample four frames from each video, matching the "first four" / "next four" wording in the prompt.
+ content_list = [{"type": "video", "video": url, "nframes": 4} for url in video_urls]
+ content_list.append({"type": "text", "text": prompt_text})
+
+ messages = [
+     {
+         "role": "user",
+         "content": content_list,
+     }
+ ]
+
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ # Follow the device chosen by device_map="auto" instead of assuming CUDA.
+ inputs = inputs.to(model.device)
+
+ # Generate, then drop the prompt tokens so that only the new text is decoded.
+ # Increase max_new_tokens if the output is cut off before the answer tag.
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ~~~
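
The prompt in the Quick Start script asks the model to finish with an `<Answer>` or `<Recommend Answer>` tag, or to request more frames through a `<tool_call>` tag. The block below is a minimal sketch of reading those tags back out of the decoded text. The helper names `parse_verdict` and `parse_frame_request` are hypothetical and not part of the vr-thinker repository, and the sketch assumes the model follows the tag format shown in the prompt examples.

```python
import json
import re

# Hypothetical helpers for illustration; not part of the vr-thinker repository.
# They assume the model output follows the tag format requested in the prompt.
_TAG_RE = re.compile(r"<(Answer|Recommend Answer)>(.*?)</\1>", re.DOTALL)
_PAIR_RE = re.compile(r"([A-Z]{2})\s*=\s*(\d)")
_TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_verdict(text):
    """Return (tag_name, scores) for the last answer tag in text, or None."""
    matches = _TAG_RE.findall(text)
    if not matches:
        return None
    kind, body = matches[-1]
    # Keys are the two-letter codes from the prompt (TA, VQ, MQ, OA, CF).
    scores = {key: int(value) for key, value in _PAIR_RE.findall(body)}
    return kind, scores


def parse_frame_request(text):
    """Return the frame indices from the last <tool_call> tag, or an empty list."""
    matches = _TOOL_RE.findall(text)
    if not matches:
        return []
    try:
        payload = json.loads(matches[-1])
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    frames = payload.get("target_frames", [])
    return list(frames) if isinstance(frames, list) else []


if __name__ == "__main__":
    sample = "<Think>Video 1 is sharper.</Think><Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>"
    print(parse_verdict(sample))
    print(parse_frame_request('<tool_call>{"target_frames": [12, 40]}</tool_call>'))
```

With the Quick Start script, this could be applied to `output_text[0]`. A `None` result from `parse_verdict` may mean the generation was truncated before the answer tag or that the model asked for more frames instead.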

  ## Citation
  ```