Update README.md
README.md
CHANGED
|
@@ -59,66 +59,45 @@ dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic qual
|
|
| 59 |
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
|
| 60 |
N = 96
|
| 61 |
|
| 62 |
prompt_text = \
|
| 63 |
-
f"""
|
| 64 |
-
|
| 65 |
-
-
|
| 66 |
-
-
|
| 67 |
|
| 68 |
-
The
|
| 69 |
|
| 70 |
-
|
| 71 |
-
You need to evaluate the videos based on the following dimensions:
|
| 72 |
1. **{dim_name_1}**: {dim_explain_1}
|
| 73 |
2. **{dim_name_2}**: {dim_explain_2}
|
| 74 |
3. **{dim_name_3}**: {dim_explain_3}
|
| 75 |
|
| 76 |
-
|
| 77 |
-
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
1
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
- Use `<recommend answer>` to output your current inclination and confidence level.
|
| 92 |
-
|
| 93 |
-
2. **Subsequent Rounds**:
|
| 94 |
-
- Analyze the newly provided frames.
|
| 95 |
-
- If differences remain unclear, request further frames and continue reasoning.
|
| 96 |
-
- If the new frames are repetitive or insufficient, adjust your focus to different sets of frames.
|
| 97 |
-
.
|
| 98 |
-
- Use `<recommend answer>` to output your current inclination and confidence level until a final answer is reached.
|
| 99 |
-
|
| 100 |
-
3. **Final Output**:
|
| 101 |
-
- After completing your analysis, output exactly one of the following answers:
|
| 102 |
-
- `1` if Video 1 is better,
|
| 103 |
-
- `2` if Video 2 is better,
|
| 104 |
-
- `0` if Video 1 and Video 2 are tied.
|
| 105 |
-
- Provide a breakdown of the evaluation dimensions using this format:
|
| 106 |
-
`<final answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 </final answer>`
|
| 107 |
-
- **OA** (Overall Assessment): Represents the overall preference.
|
| 108 |
-
- **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}.
|
| 109 |
-
|
| 110 |
-
4. **Format Requirements**:
|
| 111 |
-
- Your analysis must be explicitly structured using the following tags:
|
| 112 |
-
- `<snapshot>`: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details.
|
| 113 |
-
- `<think>`: Use this tag to describe your reasoning process, including decisions about frame selection or task approach.
|
| 114 |
-
- `<recommend answer>`: Use this tag to output your current inclination, including confidence level:
|
| 115 |
-
`<recommend answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 </recommend answer>`
|
| 116 |
-
- **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates the highest confidence and 1 the lowest.
|
| 117 |
-
- `<final answer>`: Use this tag only when in the final decision.
|
| 118 |
-
"""
|
| 119 |
|
| 120 |
sys_prompt = \
|
| 121 |
-
"""You are a helpful assistant
|
| 122 |
<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
|
| 123 |
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to N.\"}}},
|
| 124 |
\"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""
|
|
|
|
| 59 |
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
|
| 60 |
N = 96
|
| 61 |
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
|
| 65 |
prompt_text = \
|
| 66 |
+
f"""Task Description: Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension.
|
| 67 |
+
This involves:
|
| 68 |
+
- Iterative reasoning,
|
| 69 |
+
- Zooming in on details,
|
| 70 |
+
- Dynamically selecting frames for further analysis.
|
| 71 |
+
|
| 72 |
+
The provided frames are downsampled from these videos:
|
| 73 |
+
- Video 1: First four input frames.
|
| 74 |
+
- Video 2: Next four input frames.
|
| 75 |
|
| 76 |
+
The prompt is: {prompt_for_videos}
|
| 77 |
|
| 78 |
+
Evaluation Dimensions:
|
|
|
|
| 79 |
1. **{dim_name_1}**: {dim_explain_1}
|
| 80 |
2. **{dim_name_2}**: {dim_explain_2}
|
| 81 |
3. **{dim_name_3}**: {dim_explain_3}
|
| 82 |
|
| 83 |
+
Frames and Analysis Rules:
|
| 84 |
+
- 8 sampled frames are provided, evenly downsampled from {N} frames.
|
| 85 |
+
- First 4 input frames sampled from {N//2} actual frames of Video 1, next 4 input frames sampled from {N//2} actual frames of Video 2.
|
| 86 |
+
- Insufficient frames? Request more using the tool.
|
| 87 |
+
|
| 88 |
+
Format Requirement:
|
| 89 |
+
1. Snapshot:
|
| 90 |
+
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <snapshot></snapshot> tags.
|
| 91 |
+
2. Think:\nPlace all reasoning content within <think></think> tags.\n\n3. Answer:\nIf the final answer can be determined, output the answer within <final answer></final answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <recommend answer></recommend answer> tags.
|
| 92 |
+
- For TA, MQ, VQ, and OA: 1 represents Video 1 is better, 2 represents Video 2 is better, and 0 represents a Tie.
|
| 93 |
+
- For CF (Confidence level): 1 (low), 2 (medium), 3 (high), 4 (very high), 5 (confirmed).
|
| 94 |
+
|
| 95 |
+
Examples:\n<recommend answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</recommend answer>
|
| 96 |
+
<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>."""
|
| 97 |
+
|
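The answer format described in the new prompt can be checked mechanically on the consumer side. The following is a minimal sketch, not part of the README: the helper name `parse_answer` and the regular expressions are assumptions made for illustration, and the allowed value ranges are taken from the prompt text (0/1/2 for TA, VQ, MQ, OA; 1 to 5 for CF).

```python
import re

# Matches the body of either answer tag used in the prompt's examples.
_TAG_RE = re.compile(r"<(final answer|recommend answer)>(.*?)</\1>", re.DOTALL)
# Matches "KEY=VALUE" pairs such as "TA=0" or "CF = 2".
_PAIR_RE = re.compile(r"\b(TA|VQ|MQ|OA|CF)\s*=\s*(\d+)")


def parse_answer(response):
    """Return (tag_name, scores) for the last answer tag in `response`, or None.

    Hypothetical helper: the README does not define a parser.
    """
    matches = _TAG_RE.findall(response)
    if not matches:
        return None
    tag, body = matches[-1]
    scores = {key: int(value) for key, value in _PAIR_RE.findall(body)}
    # TA/VQ/MQ/OA take 0 (tie), 1, or 2; CF takes 1..5 per the prompt.
    for key, value in scores.items():
        allowed = range(1, 6) if key == "CF" else range(0, 3)
        if value not in allowed:
            raise ValueError(f"{key}={value} is outside the allowed range")
    return tag, scores
```

A response ending in `<final answer>` can be treated as terminal, while a `<recommend answer>` result signals that another round of frames may still be requested.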
| 98 |
|
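The "Frames and Analysis Rules" part of the prompt says that 4 frames per video are evenly downsampled from half of the `N` total frames. The README does not show the sampling code, so the sketch below is only one plausible reading. It assumes a combined 1-based index in which Video 1 occupies the first half and Video 2 the second half; the function name `initial_frame_indices` is hypothetical.

```python
def initial_frame_indices(n_total=96, per_video=4):
    """Evenly spaced 1-based frame indices: `per_video` from each half of `n_total`.

    Assumption: frames 1..n_total/2 belong to Video 1 and the rest to Video 2.
    """
    half = n_total // 2
    step = half / per_video
    # Positions inside a single video, counted from 1.
    local = [int(i * step) + 1 for i in range(per_video)]
    # Shift the same positions by `half` to address Video 2.
    return local + [half + idx for idx in local]
```

With the README's `N = 96`, this picks every twelfth frame of each video, starting at its first frame.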
| 99 |
sys_prompt = \
|
| 100 |
+
"""You are a helpful assistant.\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
|
| 101 |
<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
|
| 102 |
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to N.\"}}},
|
| 103 |
\"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""
|
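The system prompt asks the model to wrap each `select_frames` call in `<tool_call></tool_call>` tags. A caller has to extract and validate that JSON before fetching frames. The following is a minimal sketch under stated assumptions: the helper name `parse_select_frames` is hypothetical, and the limits (at most 8 frames, indices from 1 to N) are read off the tool description in the prompt.

```python
import json
import re

_TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_select_frames(response, n_frames=96, max_frames=8):
    """Return the requested frame indices, or None if there is no tool call.

    Hypothetical helper; raises ValueError when the call breaks the schema.
    """
    match = _TOOL_CALL_RE.search(response)
    if match is None:
        return None
    call = json.loads(match.group(1).strip())
    if call.get("name") != "select_frames":
        raise ValueError(f"unexpected function name: {call.get('name')!r}")
    frames = call.get("arguments", {}).get("target_frames")
    if not isinstance(frames, list) or not frames:
        raise ValueError("target_frames must be a non-empty list")
    if len(frames) > max_frames:
        raise ValueError(f"at most {max_frames} frames may be requested")
    for idx in frames:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(idx, bool) or not isinstance(idx, int):
            raise ValueError(f"frame index {idx!r} is not an integer")
        if not 1 <= idx <= n_frames:
            raise ValueError(f"frame index {idx} is outside 1..{n_frames}")
    return frames
```

Returning `None` when no tool call is present lets the caller fall through to answer parsing, since a round contains either a frame request or a final answer.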