Safetensors
qwen2_5_vl
qunwang13 committed on
Commit a5a562a · verified · 1 Parent(s): c8e14e4

Update README.md

Files changed (1)
  1. README.md +76 -44
README.md CHANGED
@@ -57,57 +57,81 @@ prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
  dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
  dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
  dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
-
- N = 150
 
  prompt_text = \
- f"""Task Description:
- Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension. This involves:
- - Iterative reasoning,
- - Zooming in on details,
- - Dynamically selecting frames for further analysis.
-
- The provided frames are downsampled from these videos:
- - Video 1: First four input frames.
- - Video 2: Next four input frames.
-
- The prompt is: {prompt_for_videos}
-
- Evaluation Dimensions:
- 1. {dim_name_1}(TA):
- {dim_explain_1}
- 2. {dim_name_2}(VQ):
- {dim_explain_2}
- 3. {dim_name_3}(MQ):
- {dim_explain_3}
-
- Frames and Analysis Rules
- - 8 sampled frames are provided, evenly downsampled from {N} frames
- - Insufficient frames? Request more:
- <tool_call>{{"target_frames": []}}</tool_call>
-
- Format Requirement:
-
- 1. Snapshot:
- Every time you receive new visual information, summarize any information that might be useful for your final judgment within <Snapshot></Snapshot> tags.
-
- 2. Think:
- Place all reasoning content within <Think></Think> tags.
-
- 3. Answer:
- If the final answer can be determined, output the answer within <Answer></Answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <Recommend Answer></Recommend Answer> tags.
- Here, 1 represents Video 1, 2 represents Video 2, and 0 represents Tie. The confidence levels range from high to low as 1, 2, and 3.
-
- Examples:
- <Answer>TA=1, VQ=1, MQ=0, OA=1</Answer>, or
- <Recommend Answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</Recommend Answer>
  """
 
  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
  content_list.append({"type": "text", "text": prompt_text})
 
  messages = [
  {
  "role": "user",
  "content": content_list,
@@ -117,7 +141,13 @@ messages = [
  text = processor.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
  )
- image_inputs, video_inputs = process_vision_info(messages)
 
  inputs = processor(
  text=[text],
@@ -125,11 +155,12 @@ inputs = processor(
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
  )
  inputs = inputs.to("cuda")
 
 
- generated_ids = model.generate(**inputs, max_new_tokens=128)
  generated_ids_trimmed = [
  out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
@@ -137,6 +168,7 @@ output_text = processor.batch_decode(
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )
  print(output_text)
 
  ~~~
 
  ## Citation
 
  dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
  dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
  dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
+ N = 96
 
  prompt_text = \
+ f"""**Task Description**:
+ Your task is to compare two videos generated based on the same caption by analyzing their frames in detail. This involves an iterative process of reasoning, zooming in on details, and dynamically selecting frames for further analysis. The provided frames are snapshots from these videos:
+ - The first four input frames correspond to Video 1.
+ - The next four input frames correspond to Video 2.
+
+ The caption is: {prompt_for_videos}
+
+ **Evaluation Dimensions**:
+ You need to evaluate the videos based on the following dimensions:
+ 1. **{dim_name_1}**: {dim_explain_1}
+ 2. **{dim_name_2}**: {dim_explain_2}
+ 3. **{dim_name_3}**: {dim_explain_3}
+
+ **Frames and Analysis Rules**:
+ - You are provided with 8 sampled input frames:
+ - The first four input frames are sampled from the first {N/2} actual frames of Video 1.
+ - The next four input frames are sampled from the first {N/2} actual frames of Video 2.
+ - These input frames are evenly sampled (e.g., for N = 96, frames 1, 12, 24, 36 for Video 1, and frames 49, 60, 72, 84 for Video 2).
+ - If the provided input frames are insufficient for a detailed comparison, you must request additional frames:
+ - Select up to 8 additional frames (4 from each video, ensuring strict correspondence between the two videos, i.e., each Video 2 frame index must equal the matching Video 1 frame index + {N/2}).
+ - Frame selection must be logical and based on specific transitions or critical differences observed in the analysis.
+ **Process**:
+ 1. **Round 1 Analysis**:
+ - Start by analyzing the first 8 input frames.
+ - Compare the videos based on the evaluation dimensions.
+ - If differences are subtle, identify specific key moments for further comparison and request additional frames.
+ - Use the `<tool_call>` tag to select up to 8 additional frames. Example:
+ `<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
+ - Use `<recommend answer>` to output your current inclination and confidence level.
+
+ 2. **Subsequent Rounds**:
+ - Analyze the newly provided frames.
+ - If differences remain unclear, request further frames and continue reasoning.
+ - If the new frames are repetitive or insufficient, adjust your focus to different sets of frames.
+ - Use `<recommend answer>` to output your current inclination and confidence level until a final answer is reached.
+
+ 3. **Final Output**:
+ - After completing your analysis, output exactly one of the following answers:
+ - `1` if Video 1 is better,
+ - `2` if Video 2 is better,
+ - `0` if Video 1 and Video 2 are tied.
+ - Provide a breakdown of the evaluation dimensions using this format:
+ `<final answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 </final answer>`
+ - **OA** (Overall Assessment): Represents the overall preference.
+ - **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}.
+
+ 4. **Format Requirements**:
+ - Your analysis must be explicitly structured using the following tags:
+ - `<snapshot>`: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details.
+ - `<think>`: Use this tag to describe your reasoning process, including decisions about frame selection or task approach.
+ - `<recommend answer>`: Use this tag to output your current inclination, including confidence level:
+ `<recommend answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 </recommend answer>`
+ - **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates high confidence and 1 indicates low confidence.
+ - `<final answer>`: Use this tag only for the final decision.
  """
 
+ sys_prompt = \
+ """You are a helpful assistant. \n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
+ <tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
+ {\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
+ \"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""
+
  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
  content_list.append({"type": "text", "text": prompt_text})
 
  messages = [
+ {
+ "role": "system",
+ "content": sys_prompt,
+ },
  {
  "role": "user",
  "content": content_list,
 
  text = processor.apply_chat_template(
  messages, tokenize=False, add_generation_prompt=True
  )
+ image_inputs, video_inputs, video_kwargs = process_vision_info(
+ messages, return_video_kwargs=True
+ )
 
  inputs = processor(
  text=[text],
 
  videos=video_inputs,
  padding=True,
  return_tensors="pt",
+ **video_kwargs,
  )
  inputs = inputs.to("cuda")
 
 
+ generated_ids = model.generate(**inputs, max_new_tokens=2048)
  generated_ids_trimmed = [
  out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
 
  generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )
  print(output_text)
+
  ~~~
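The updated prompt defines a `select_frames` tool call and a `<final answer>` score format, so the decoded output needs to be parsed before acting on it. A minimal sketch of that parsing, assuming the model follows the format above (the helper names here are ours, not part of this repo):

```python
import json
import re


def parse_select_frames(output_text):
    # Extract a select_frames tool call, e.g.
    # <tool_call>{"name": "select_frames", "arguments": {"target_frames": [12, 60]}}</tool_call>
    # Returns the requested frame indices, or None if no frames were requested.
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
    if m is None:
        return None
    call = json.loads(m.group(1))
    if call.get("name") != "select_frames":
        return None
    return call["arguments"]["target_frames"]


def parse_scores(output_text):
    # Parse "<final answer> TA = 1, MQ = 0, VQ = 2, OA = 1 </final answer>"
    # (or a <recommend answer> block) into a dict of dimension -> score.
    m = re.search(
        r"<(?:final|recommend) answer>(.*?)</(?:final|recommend) answer>",
        output_text,
        re.DOTALL | re.IGNORECASE,
    )
    if m is None:
        return None
    return {k: int(v) for k, v in re.findall(r"(TA|MQ|VQ|OA|CF)\s*=\s*(\d)", m.group(1))}
```

In a multi-round loop, frames returned by `parse_select_frames` would be fed back as a new user turn and `model.generate` called again until `parse_scores` finds a `<final answer>`.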
 
  ## Citation