Safetensors
qwen2_5_vl
qunwang13 committed on
Commit a5a562a · verified · 1 Parent(s): c8e14e4

Update README.md

Files changed (1)
  1. README.md +76 -44
README.md CHANGED
@@ -57,57 +57,81 @@ prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
  dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
  dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
  dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
-
- N = 150
 
  prompt_text = \
- f"""Task Description:
- Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension. This involves:
- - Iterative reasoning,
- - Zooming in on details,
- - Dynamically selecting frames for further analysis.
-
- The provided frames are downsampled from these videos:
- - Video 1: First four input frames.
- - Video 2: Next four input frames.
-
- The prompt is: {prompt_for_videos}
-
- Evaluation Dimensions:
- 1. {dim_name_1}(TA):
- {dim_explain_1}
- 2. {dim_name_2}(VQ):
- {dim_explain_2}
- 3. {dim_name_3}(MQ):
- {dim_explain_3}
-
- Frames and Analysis Rules
- - 8 sampled frames are provided, evenly downsampled from {N} frames
- - Insufficient frames? Request more:
- <tool_call>{{"target_frames": []}}</tool_call>
-
- Format Requirement:
-
- 1. Snapshot:
- Every time you receive new visual information, summarize any information that might be useful for your final judgment within <Snapshot></Snapshot> tags.
-
- 2. Think:
- Place all reasoning content within <Think></Think> tags.
-
- 3. Answer:
- If the final answer can be determined, output the answer within <Answer></Answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <Recommend Answer></Recommend Answer> tags.
- Here, 1 represents Video 1, 2 represents Video 2, and 0 represents Tie. The confidence levels range from high to low as 1, 2, and 3.
-
- Examples:
- <Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>, or
- <Recommend Answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</Recommend Answer>
  """
 
  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
  content_list.append({"type": "text", "text": prompt_text})
 
  messages = [
  {
  "role": "user",
  "content": content_list,
@@ -117,7 +141,13 @@ messages = [
  text = processor.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
  )
- image_inputs, video_inputs = process_vision_info(messages)
 
  inputs = processor(
  text=[text],
@@ -125,11 +155,12 @@ inputs = processor(
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
  )
  inputs = inputs.to("cuda")
 
 
- generated_ids = model.generate(**inputs, max_new_tokens=128)
  generated_ids_trimmed = [
  out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
@@ -137,6 +168,7 @@ output_text = processor.batch_decode(
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )
  print(output_text)
 
  ~~~
 
  ## Citation
 
  dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
  dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
  dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
+ N = 96
 
  prompt_text = \
+ f"""**Task Description**:
+ Your task is to compare two videos generated based on the same caption by analyzing their frames in detail. This involves an iterative process of reasoning, zooming in on details, and dynamically selecting frames for further analysis. The provided frames are snapshots from these videos:
+ - The first four input frames correspond to Video 1.
+ - The next four input frames correspond to Video 2.
+
+ The caption is: {prompt_for_videos}
+
+ **Evaluation Dimensions**:
+ You need to evaluate the videos based on the following dimensions:
+ 1. **{dim_name_1}**: {dim_explain_1}
+ 2. **{dim_name_2}**: {dim_explain_2}
+ 3. **{dim_name_3}**: {dim_explain_3}
+
+ **Frames and Analysis Rules**:
+ - You are provided with 8 sampled input frames:
+ - The first four input frames are sampled from the first {N/2} actual frames of Video 1.
+ - The next four input frames are sampled from the first {N/2} actual frames of Video 2.
+ - These input frames are evenly sampled (e.g., for N = 96, frames 1, 12, 24, 36 for Video 1, and frames 49, 60, 72, 84 for Video 2).
+ - If the provided input frames are insufficient for a detailed comparison, you must request additional frames:
+ - Select up to 8 additional frames (4 from each video, ensuring strict correspondence between the two videos, i.e., each Video 2 frame index must equal the matching Video 1 frame index + {N/2}).
+ - Frame selection must be logical and based on specific transitions or critical differences observed in the analysis.
+ **Process**:
+ 1. **Round 1 Analysis**:
+ - Start by analyzing the first 8 input frames.
+ - Compare the videos based on the evaluation dimensions.
+ - If differences are subtle, identify specific key moments for further comparison and request additional frames.
+ - Use the `<tool_call>` tag to select up to 8 additional frames. Example:
+ `<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
+ - Use `<recommend answer>` to output your current inclination and confidence level.
+
+ 2. **Subsequent Rounds**:
+ - Analyze the newly provided frames.
+ - If differences remain unclear, request further frames and continue reasoning.
+ - If the new frames are repetitive or insufficient, adjust your focus to different sets of frames.
+ - Use `<recommend answer>` to output your current inclination and confidence level until a final answer is reached.
+
+ 3. **Final Output**:
+ - After completing your analysis, output exactly one of the following answers:
+ - `1` if Video 1 is better,
+ - `2` if Video 2 is better,
+ - `0` if Video 1 and Video 2 are tied.
+ - Provide a breakdown of the evaluation dimensions using this format:
+ `<final answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 </final answer>`
+ - **OA** (Overall Assessment): Represents the overall preference.
+ - **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}.
+
+ 4. **Format Requirements**:
+ - Your analysis must be explicitly structured using the following tags:
+ - `<snapshot>`: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details.
+ - `<think>`: Use this tag to describe your reasoning process, including decisions about frame selection or task approach.
+ - `<recommend answer>`: Use this tag to output your current inclination, including confidence level:
+ `<recommend answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 </recommend answer>`
+ - **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates high confidence and 1 indicates low confidence.
+ - `<final answer>`: Use this tag only for the final decision.
  """
 
+ sys_prompt = \
+ """You are a helpful assistant. \n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
+ <tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
+ {\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
+ \"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""
+
  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
  content_list.append({"type": "text", "text": prompt_text})
 
  messages = [
+ {
+ "role": "system",
+ "content": sys_prompt,
+ },
  {
  "role": "user",
  "content": content_list,
 
  text = processor.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
  )
+ image_inputs, video_inputs, video_kwargs = process_vision_info(
+ messages, return_video_kwargs=True
+ )
 
  inputs = processor(
  text=[text],
 
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
+ **video_kwargs,
  )
  inputs = inputs.to("cuda")
 
 
+ generated_ids = model.generate(**inputs, max_new_tokens=2048)
  generated_ids_trimmed = [
  out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
 
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )
  print(output_text)
+
  ~~~
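The updated prompt defines a `select_frames` tool call and a `<final answer>` score format, so the decoded output needs to be parsed before acting on it. A minimal sketch of that parsing, assuming the model follows the format above (the helper names here are ours, not part of this repo):

```python
import json
import re


def parse_select_frames(output_text):
    # Extract a select_frames tool call, e.g.
    # <tool_call>{"name": "select_frames", "arguments": {"target_frames": [12, 60]}}</tool_call>
    # Returns the requested frame indices, or None if no frames were requested.
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
    if m is None:
        return None
    call = json.loads(m.group(1))
    if call.get("name") != "select_frames":
        return None
    return call["arguments"]["target_frames"]


def parse_scores(output_text):
    # Parse "<final answer> TA = 1, MQ = 0, VQ = 2, OA = 1 </final answer>"
    # (or a <recommend answer> block) into a dict of dimension -> score.
    m = re.search(
        r"<(?:final|recommend) answer>(.*?)</(?:final|recommend) answer>",
        output_text,
        re.DOTALL | re.IGNORECASE,
    )
    if m is None:
        return None
    return {k: int(v) for k, v in re.findall(r"(TA|MQ|VQ|OA|CF)\s*=\s*(\d)", m.group(1))}
```

In a multi-round loop, frames returned by `parse_select_frames` would be fed back as a new user turn and `model.generate` called again until `parse_scores` finds a `<final answer>`.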
 
  ## Citation