metadata
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
datasets:
  - KwaiVGI/VideoGen-RewardBench
  - TIGER-Lab/GenAI-Bench
license: mit
pipeline_tag: video-text-to-text
library_name: transformers

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

This repository contains the VR-Thinker model, a thinking-with-image framework for multimodal reward models, introduced in the paper VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning.

Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

Model Details

VR-Thinker is the first Multimodal Reward Model utilizing a Thinking-with-Image framework.

For further details, please refer to the following:

Overview


Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:

  1. Visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details
  2. All visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning

To overcome these issues, we introduce VR-Thinker, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.
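
The memory-window idea can be illustrated with a small sketch. This is not the VR-Thinker implementation: the class name `VisualMemoryWindow`, the `max_rounds` parameter, and the eviction policy (drop the oldest round first) are assumptions made here for illustration. It shows only the general pattern of keeping the most recently selected frames in context while older ones are evicted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VisualMemoryWindow:
    """Hypothetical sketch: hold only the most recent rounds of frames in context."""

    max_rounds: int = 2  # assumed window size, not a value taken from the paper
    _rounds: deque = field(default_factory=deque)

    def add_round(self, frame_indices):
        """Record the frames fetched in one reasoning round, evicting the oldest round."""
        self._rounds.append(list(frame_indices))
        while len(self._rounds) > self.max_rounds:
            self._rounds.popleft()

    def visible_frames(self):
        """Return the frame indices still held in context, oldest round first."""
        return [idx for round_frames in self._rounds for idx in round_frames]


window = VisualMemoryWindow(max_rounds=2)
window.add_round([1, 12, 24, 36, 49, 60, 72, 84])   # initially sampled frames
window.add_round([12, 16, 20, 24, 60, 64, 68, 72])  # first select-frame request
window.add_round([30, 32, 34, 36, 78, 80, 82, 84])  # second request evicts the initial round
print(window.visible_frames())
```

In a design like this, evicted frames are no longer visible to the model, so any evidence drawn from them has to be carried forward as text.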

Qualitative Case

Training Pipeline

We activate visual reasoning via a reinforcement fine-tuning pipeline:

  1. Cold-start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting
  2. Rejection Sampling Fine-Tuning: Select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning
  3. Group Relative Policy Optimization (GRPO): Apply GRPO to strengthen reasoning
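
Stages 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the training code of this repository: the function names `keep_for_rft` and `group_relative_advantages` are hypothetical, the dimension labels are assumed to be TA, MQ, VQ, and OA, and the advantage formula is the standard GRPO group normalisation (reward minus group mean, divided by group standard deviation). Using the population standard deviation and this `eps` value is an assumption.

```python
from statistics import mean, pstdev

DIMENSIONS = ("TA", "MQ", "VQ", "OA")  # assumed labels: three dimensions plus overall


def keep_for_rft(predicted, reference):
    """Stage 2 filter: keep a trace only if every dimension and the overall label match."""
    return all(predicted.get(dim) == reference[dim] for dim in DIMENSIONS)


def group_relative_advantages(rewards, eps=1e-6):
    """Stage 3: normalise each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


reference = {"TA": 1, "MQ": 2, "VQ": 1, "OA": 1}
traces = [
    {"TA": 1, "MQ": 2, "VQ": 1, "OA": 1},  # fully correct, kept
    {"TA": 1, "MQ": 2, "VQ": 1, "OA": 2},  # overall judgment wrong, rejected
]
kept = [t for t in traces if keep_for_rft(t, reference)]
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(len(kept), advantages)
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.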

Results

Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos:

A 7B VR-Thinker achieves:

  • 80.5% on VideoGen Reward
  • 82.3% on GenAI-Bench
  • 75.6% on MJ-Bench-Video

These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

Quick Start

We provide a sample test interface here:

import warnings

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info


warnings.filterwarnings("ignore")




model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)


video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4", # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4"  # sample video 2
]


prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96

prompt_text = \
f"""**Task Description**: 
Your task is to compare two videos generated based on the same caption by analyzing their frames in detail. This involves an iterative process of reasoning, zooming in on details, and dynamically selecting frames for further analysis. The provided frames are snapshots from these videos: 
- The first four input frames correspond to Video 1. 
- The next four input frames correspond to Video 2. 

The caption is: {prompt_for_videos}

**Evaluation Dimensions**: 
You need to evaluate the videos based on the following dimensions: 
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}

**Frames and Analysis Rules**: 
- You are provided with 8 sampled input frames: 
 - The first four input frames are sampled from the first {N/2} actual frames of Video 1. 
 - The next four input frames are sampled from the first {N/2} actual frames of Video 2. 
- These input frames are evenly sampled (e.g., for N = 96, frames 1, 12, 24, 36 for Video 1, and frames 49, 60, 72, 84 for Video 2). 
- If the provided input frames are insufficient for a detailed comparison, you must request additional frames: 
 - Select up to 8 additional frames (4 from each video, ensuring strict correspondence between the two videos, i.e., the second video frames must be Video 1 frames + {N/2}). 
 - Frame selection must be logical and based on specific transitions or critical differences observed in the analysis.
**Process**: 
1. **Round 1 Analysis**: 
 - Start by analyzing the first 8 input frames. 
 - Compare the videos based on the evaluation dimensions. 
 - If differences are subtle, identify specific key moments for further comparison and request additional frames. 
 - Use the `<tool_call>` to select up to 8 additional frames. Example: 
 `<tool_call>{{\"name\": \"select_frames\", \"arguments\": {{\"target_frames\": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>` 
 - Use `<recommend answer>` to output your current inclination and confidence level. 

2. **Subsequent Rounds**: 
 - Analyze the newly provided frames. 
 - If differences remain unclear, request further frames and continue reasoning. 
 - If the new frames are repetitive or insufficient, adjust your focus to different sets of frames. 
 . 
 - Use `<recommend answer>` to output your current inclination and confidence level until a final answer is reached. 

3. **Final Output**: 
 - After completing your analysis, output exactly one of the following answers: 
 - `1` if Video 1 is better, 
 - `2` if Video 2 is better, 
 - `0` if Video 1 and Video 2 are tied. 
 - Provide a breakdown of the evaluation dimensions using this format: 
 `<final answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 </final answer>` 
 - **OA** (Overall Assessment): Represents the overall preference. 
 - **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}. 

4. **Format Requirements**: 
 - Your analysis must be explicitly structured using the following tags: 
 - `<snapshot>`: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details. 
 - `<think>`: Use this tag to describe your reasoning process, including decisions about frame selection or task approach. 
 - `<recommend answer>`: Use this tag to output your current inclination, including confidence level: 
 `<recommend answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 </recommend answer>` 
 - **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates higher confidence while 0 indicate low confidence. 
 - `<final answer>`: Use this tag only when in the final decision.
"""

sys_prompt = \
"""You are a helpful assistant. 

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
\"required\": [\"target_frames\"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {
        "role": "system",
        "content": sys_prompt,
    },
    {
        "role": "user",
        "content": content_list,
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

print(video_kwargs)


inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)


generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
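
The script above prints the raw generation. To act on it, the tags described in the prompt have to be parsed. The helpers below are a hedged sketch of one way to do that: the function names are hypothetical, the regular expressions assume the model follows the tag formats given in the prompt, and they assume the `<tool_call>` tags survive decoding. `split_by_video` applies the indexing rule from the prompt, where frames 1 to N/2 belong to Video 1 and the remainder to Video 2.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FINAL_RE = re.compile(r"<final answer>(.*?)</final answer>", re.DOTALL)
SCORE_RE = re.compile(r"\b(TA|MQ|VQ|OA)\s*=\s*([012])\b")


def parse_tool_call(text):
    """Return the requested frame indices, or None if no valid select_frames call exists."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or call.get("name") != "select_frames":
        return None
    return call.get("arguments", {}).get("target_frames")


def parse_final_answer(text):
    """Return the per-dimension verdicts as a dict, or None if no final answer tag is present."""
    match = FINAL_RE.search(text)
    if match is None:
        return None
    return {name: int(value) for name, value in SCORE_RE.findall(match.group(1))}


def split_by_video(target_frames, total_frames=96):
    """Map global indices to per-video local indices (1..N/2 is Video 1, the rest Video 2)."""
    half = total_frames // 2
    video_1 = [i for i in target_frames if 1 <= i <= half]
    video_2 = [i - half for i in target_frames if half < i <= total_frames]
    return video_1, video_2


# Hand-written sample strings in the prompt's format, not real model output.
sample_round = (
    "<think>Motion differs mid-clip.</think>"
    '<tool_call>{"name": "select_frames", "arguments": '
    '{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}</tool_call>'
)
sample_final = "<final answer> TA = 1, MQ = 2, VQ = 1, OA = 1 </final answer>"

frames = parse_tool_call(sample_round)
per_video = split_by_video(frames)
verdict = parse_final_answer(sample_final)
print(per_video, verdict)
```

With the Quick Start script, these helpers would be applied to `output_text[0]`. When `parse_tool_call` returns frame indices, a multi-round driver would fetch those frames, append them to `messages`, and call `model.generate` again. That loop is not shown here.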

Acknowledgement

This repo is based on pixel-reasoner and Open-RLHF. We thank the authors for their valuable contributions to the AIGC community.

Citation

@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning}, 
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518}, 
}