---
license: mit
datasets:
  - KwaiVGI/VideoGen-RewardBench
  - TIGER-Lab/GenAI-Bench
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
---

## Model Summary

VR-Thinker is the first multimodal reward model built on a Thinking-with-Image framework: rather than judging from a fixed set of inputs, it reasons iteratively over sampled video frames and can request additional frames during evaluation.

For further details, please refer to the following:

- 📰 Paper: https://arxiv.org/pdf/2510.10518
- 📚 GitHub: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Quick Start

A sample inference script is provided below:

```python
import warnings

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")

model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)


video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4", # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4"  # sample video 2
]


prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96




prompt_text = \
f"""Task Description: Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension.
This involves:
- Iterative reasoning,
- Zooming in on details,
- Dynamically selecting frames for further analysis.

The provided frames are downsampled from these videos:
- Video 1: First four input frames.
- Video 2: Next four input frames.

The prompt is: {prompt_for_videos}

Evaluation Dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}

Frames and Analysis Rules:
- 8 sampled frames are provided, evenly downsampled from {N} frames.
- First 4 input frames are sampled from the {N // 2} actual frames of Video 1; the next 4 from the {N // 2} actual frames of Video 2.
- Insufficient frames? Request more using the tool.

Format Requirement:
1. Snapshot:
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <snapshot></snapshot> tags.
2. Think:
Place all reasoning content within <think></think> tags.

3. Answer:
If the final answer can be determined, output the answer within <final answer></final answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <recommend answer></recommend answer> tags.
- For TA, MQ, VQ, and OA: 1 represents Video 1 is better, 2 represents Video 2 is better, and 0 represents a Tie.
- For CF (Confidence level): 1 (low), 2 (medium), 3 (high), 4 (very high), 5 (confirmed).

Examples:
<recommend answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</recommend answer>
<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>."""


sys_prompt = \
"""You are a helpful assistant.
You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "select_frames", "description": "Select frames from a video.", "parameters": {"type": "object", "properties":
{"target_frames": {"type": "array", "description": "List of frame indices to select from the video (no more than 8 frames in total).", "items": {"type": "integer", "description": "Frame index from 1 to N."}}},
"required": ["target_frames"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}</tool_call>"""


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {
        "role": "system",
        "content": sys_prompt,
    },
    {
        "role": "user",
        "content": content_list,
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)


generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

```
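The model's reply follows the tag format defined in the prompt above, so downstream code needs to pull the per-dimension scores (and any frame-selection tool calls) out of the raw text. Below is a minimal sketch of such a parser; the helper names `parse_verdict` and `parse_tool_call` are our own illustrations, not part of the released code:

```python
import json
import re


def parse_verdict(output_text: str) -> dict:
    """Extract dimension scores from a <final answer> or <recommend answer> tag.

    Returns e.g. {"TA": 1, "VQ": 1, "MQ": 0, "OA": 1} (plus "CF" when the model
    emits a recommendation), or {} if neither tag is present.
    """
    m = re.search(
        r"<(?:final|recommend) answer>(.*?)</(?:final|recommend) answer>",
        output_text,
        re.DOTALL,
    )
    if not m:
        return {}
    return {k: int(v) for k, v in re.findall(r"(TA|VQ|MQ|OA|CF)\s*=\s*(\d)", m.group(1))}


def parse_tool_call(output_text: str):
    """Extract the JSON payload of a <tool_call> request, or None if absent."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", output_text, re.DOTALL)
    return json.loads(m.group(1)) if m else None


print(parse_verdict("<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>"))
# {'TA': 1, 'VQ': 1, 'MQ': 0, 'OA': 1}
```

In a full evaluation loop, a non-empty `parse_tool_call` result would be answered by decoding the requested frame indices from the video and appending them to `messages` for another generation round, until a `<final answer>` tag appears.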

## Citation

```
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```