---
license: mit
datasets:
  - KwaiVGI/VideoGen-RewardBench
  - TIGER-Lab/GenAI-Bench
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
---

## Model Summary

VR-Thinker is the first multimodal reward model built on a Thinking-with-Image framework: rather than judging from a fixed set of inputs, it reasons iteratively over sampled video frames and can request additional frames during evaluation.

For further details, please refer to the following:

- 📰 Paper: https://arxiv.org/pdf/2510.10518
- 📚 GitHub: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Quick Start

A sample inference script is provided below:

```python
import warnings

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")

model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)


video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4", # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4"  # sample video 2
]


prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96




prompt_text = \
f"""Task Description: Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension.
This involves:
- Iterative reasoning,
- Zooming in on details,
- Dynamically selecting frames for further analysis.

The provided frames are downsampled from these videos:
- Video 1: First four input frames.
- Video 2: Next four input frames.

The prompt is: {prompt_for_videos}

Evaluation Dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}

Frames and Analysis Rules:
- 8 sampled frames are provided, evenly downsampled from {N} frames.
- First 4 input frames are sampled from the {N // 2} actual frames of Video 1; the next 4 from the {N // 2} actual frames of Video 2.
- Insufficient frames? Request more using the tool.

Format Requirement:
1. Snapshot:
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <snapshot></snapshot> tags.
2. Think:
Place all reasoning content within <think></think> tags.

3. Answer:
If the final answer can be determined, output the answer within <final answer></final answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <recommend answer></recommend answer> tags.
- For TA, MQ, VQ, and OA: 1 represents Video 1 is better, 2 represents Video 2 is better, and 0 represents a Tie.
- For CF (Confidence level): 1 (low), 2 (medium), 3 (high), 4 (very high), 5 (confirmed).

Examples:
<recommend answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</recommend answer>
<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>."""


sys_prompt = \
"""You are a helpful assistant.
You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "select_frames", "description": "Select frames from a video.", "parameters": {"type": "object", "properties":
{"target_frames": {"type": "array", "description": "List of frame indices to select from the video (no more than 8 frames in total).", "items": {"type": "integer", "description": "Frame index from 1 to N."}}},
"required": ["target_frames"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}</tool_call>"""


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {
        "role": "system",
        "content": sys_prompt,
    },
    {
        "role": "user",
        "content": content_list,
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)


generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

```
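The model's reply follows the tag format defined in the prompt above, so downstream code needs to pull the per-dimension scores (and any frame-selection tool calls) out of the raw text. Below is a minimal sketch of such a parser; the helper names `parse_verdict` and `parse_tool_call` are our own illustrations, not part of the released code:

```python
import json
import re


def parse_verdict(output_text: str) -> dict:
    """Extract dimension scores from a <final answer> or <recommend answer> tag.

    Returns e.g. {"TA": 1, "VQ": 1, "MQ": 0, "OA": 1} (plus "CF" when the model
    emits a recommendation), or {} if neither tag is present.
    """
    m = re.search(
        r"<(?:final|recommend) answer>(.*?)</(?:final|recommend) answer>",
        output_text,
        re.DOTALL,
    )
    if not m:
        return {}
    return {k: int(v) for k, v in re.findall(r"(TA|VQ|MQ|OA|CF)\s*=\s*(\d)", m.group(1))}


def parse_tool_call(output_text: str):
    """Extract the JSON payload of a <tool_call> request, or None if absent."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
    return json.loads(m.group(1)) if m else None


print(parse_verdict("<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>"))
# {'TA': 1, 'VQ': 1, 'MQ': 0, 'OA': 1}
```

In a full evaluation loop, a non-empty `parse_tool_call` result would be answered by decoding the requested frame indices from the video and appending them to `messages` for another generation round, until a `<final answer>` tag appears.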

## Citation

```
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```