---
license: mit
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
## Model Summary
VR-Thinker is the first multimodal reward model built on a thinking-with-image framework.
For further details, please refer to the following:
- 📰 Paper: https://arxiv.org/pdf/2510.10518
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)
### Quick Start
We provide a sample inference script below:
```python
import warnings

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")
model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
video_urls = [
"https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4", # sample video 1
"https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4" # sample video 2
]
prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96
prompt_text = \
f"""Task Description: Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension.
This involves:
- Iterative reasoning,
- Zooming in on details,
- Dynamically selecting frames for further analysis.
The provided frames are downsampled from these videos:
- Video 1: First four input frames.
- Video 2: Next four input frames.
The prompt is: {prompt_for_videos}
Evaluation Dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}
Frames and Analysis Rules:
- 8 sampled frames are provided, evenly downsampled from {N} frames.
- First 4 input frames sampled from {N // 2} actual frames of Video 1, next 4 input frames sampled from {N // 2} actual frames of Video 2.
- Insufficient frames? Request more using the tool.
Format Requirement:
1. Snapshot:
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <snapshot></snapshot> tags.
2. Think:
Place all reasoning content within <think></think> tags.
3. Answer:
If the final answer can be determined, output the answer within <final answer></final answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <recommend answer></recommend answer> tags.
- For TA, MQ, VQ, and OA: 1 represents Video 1 is better, 2 represents Video 2 is better, and 0 represents a Tie.
- For CF (Confidence level): 1 (low), 2 (medium), 3 (high), 4 (very high), 5 (confirmed).
Examples:
<recommend answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</recommend answer>
<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>."""
sys_prompt = \
"""You are a helpful assistant.
You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "select_frames", "description": "Select frames from a video.", "parameters": {"type": "object", "properties": {"target_frames": {"type": "array", "description": "List of frame indices to select from the video (no more than 8 frames in total).", "items": {"type": "integer", "description": "Frame index from 1 to N."}}}, "required": ["target_frames"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""
content_list = [{"type": "video", "video": url, "nframes": 4} for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})
messages = [
{
"role": "system",
"content": sys_prompt,
},
{
"role": "user",
"content": content_list,
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
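Per the prompt format above, the decoded `output_text` either requests more frames via a `<tool_call>` block or emits per-dimension verdicts inside `<recommend answer>`/`<final answer>` tags. Below is a minimal sketch of how one might parse both; the helper names are ours for illustration, not part of the VR-Thinker release:

```python
import json
import re

# Hypothetical post-processing helpers (not part of the VR-Thinker release).

def parse_tool_call(text):
    """Return the requested select_frames indices, or None if no tool call."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return call.get("arguments", {}).get("target_frames")

def parse_answer(text):
    """Extract verdicts such as TA=1, VQ=1, MQ=0, OA=1 into a dict."""
    match = re.search(r"<(final|recommend) answer>(.*?)</\1 answer>", text, re.DOTALL)
    if match is None:
        return {}
    verdicts = {}
    for pair in match.group(2).split(","):
        key, _, value = pair.strip().partition("=")
        if key and value.strip().isdigit():
            verdicts[key.strip()] = int(value.strip())
    return verdicts

reply = "<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>"
print(parse_answer(reply))     # {'TA': 1, 'VQ': 1, 'MQ': 0, 'OA': 1}
print(parse_tool_call(reply))  # None (no extra frames requested)
```

If `parse_tool_call` returns frame indices, a full loop would append the selected frames to the conversation and generate again until a `<final answer>` appears.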
## Citation
```bibtex
@misc{wang2025vrthinkerboostingvideoreward,
title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
year={2025},
eprint={2510.10518},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.10518},
}
```