---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
---

# VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

This repository contains the **VR-Thinker** model, a thinking-with-image framework for multimodal reward models, introduced in the paper [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518).

## Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning.
Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Model Details

VR-Thinker is the first Multimodal Reward Model utilizing a Thinking-with-Image framework.

For further details, please refer to the following:

- 📰 Paper: [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518)
- 📚 GitHub: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Overview

![overview](https://github.com/qunzhongwang/vr-thinker/raw/main/figs/teaser1.png)

Recent advancements in **multimodal reward models (RMs)** have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:

1. **Visual inputs consume large context budgets**, forcing fewer frames and causing loss of fine-grained details
2. **All visual information is packed into the initial prompt**, exacerbating hallucination and forgetting during chain-of-thought reasoning

To overcome these issues, we introduce **VR-Thinker**, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.

![Qualitative Case](https://github.com/qunzhongwang/vr-thinker/raw/main/figs/teaser2.png)

## Training Pipeline

We activate visual reasoning via a reinforcement fine-tuning pipeline:

1. **Cold Start**: Train on curated visual chain-of-thought data to distill basic reasoning skills and operation formatting
2. **Rejection Sampling Fine-Tuning**: Select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning
3. **Group Relative Policy Optimization (GRPO)**: Apply GRPO to strengthen reasoning

## Results

Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos. A 7B VR-Thinker achieves:

- **80.5%** on VideoGen Reward
- **82.3%** on GenAI-Bench
- **75.6%** on MJ-Bench-Video

These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Quick Start

The following script runs a sample pairwise comparison of two videos:

~~~python
import warnings

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")

# Load the model and its processor from the Hugging Face Hub.
model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# The two videos to compare, both generated from the same caption.
video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4",  # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4",  # sample video 2
]

prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."

# Evaluation dimensions: name and short explanation for each.
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."

# Total number of frame indices across both videos (N/2 per video).
N = 96

prompt_text = f"""**Task Description**:
Your task is to compare two videos generated based on the same caption by analyzing their frames in detail. This involves an iterative process of reasoning, zooming in on details, and dynamically selecting frames for further analysis.

The provided frames are snapshots from these videos:
- The first four input frames correspond to Video 1.
- The next four input frames correspond to Video 2.

The caption is: {prompt_for_videos}

**Evaluation Dimensions**:
You need to evaluate the videos based on the following dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}

**Frames and Analysis Rules**:
- You are provided with 8 sampled input frames:
  - The first four input frames are sampled from the first {N/2} actual frames of Video 1.
  - The next four input frames are sampled from the first {N/2} actual frames of Video 2.
- These input frames are evenly sampled (e.g., for N = 96, frames 1, 12, 24, 36 for Video 1, and frames 49, 60, 72, 84 for Video 2).
- If the provided input frames are insufficient for a detailed comparison, you must request additional frames:
  - Select up to 8 additional frames (4 from each video, ensuring strict correspondence between the two videos, i.e., the second video frames must be Video 1 frames + {N/2}).
  - Frame selection must be logical and based on specific transitions or critical differences observed in the analysis.

**Process**:
1. **Round 1 Analysis**:
   - Start by analyzing the first 8 input frames.
   - Compare the videos based on the evaluation dimensions.
   - If differences are subtle, identify specific key moments for further comparison and request additional frames.
   - Use the `` to select up to 8 additional frames. Example: `{{\"name\": \"select_frames\", \"arguments\": {{\"target_frames\": [12, 16, 20, 24, 60, 64, 68, 72]}}}}`
   - Use `` to output your current inclination and confidence level.

2. **Subsequent Rounds**:
   - Analyze the newly provided frames.
   - If differences remain unclear, request further frames and continue reasoning.
   - If the new frames are repetitive or insufficient, adjust your focus to different sets of frames. .
   - Use `` to output your current inclination and confidence level until a final answer is reached.

3. **Final Output**:
   - After completing your analysis, output exactly one of the following answers:
     - `1` if Video 1 is better,
     - `2` if Video 2 is better,
     - `0` if Video 1 and Video 2 are tied.
   - Provide a breakdown of the evaluation dimensions using this format:
     ` TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 `
     - **OA** (Overall Assessment): Represents the overall preference.
     - **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}.

4. **Format Requirements**:
   - Your analysis must be explicitly structured using the following tags:
     - ``: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details.
     - ``: Use this tag to describe your reasoning process, including decisions about frame selection or task approach.
     - ``: Use this tag to output your current inclination, including confidence level:
       ` TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 `
       - **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates higher confidence while 0 indicate low confidence.
     - ``: Use this tag only when in the final decision.
"""

# System prompt declaring the select_frames tool in the Qwen tool-calling format.
sys_prompt = """You are a helpful assistant.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}}, \"required\": [\"target_frames\"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": <function-name>, \"arguments\": <args-json-object>}
</tool_call>"""

# Each video contributes 4 sampled frames, followed by the text prompt.
content_list = [{"type": "video", "video": url, "nframes": 4} for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {"role": "system", "content": sys_prompt},
    {"role": "user", "content": content_list},
]

# Render the chat template and extract the visual inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move inputs to wherever device_map="auto" placed the model.
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)

# Strip the prompt tokens so that only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
~~~

## Acknowledgement

This repo is based on [Pixel-Reasoner](https://github.com/TIGER-AI-Lab/Pixel-Reasoner) and [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF). We thank the authors for their valuable contributions to the AIGC community.
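
## Parsing the Output

The Quick Start prompt asks the model to request frames through a `select_frames` JSON call and to report its judgment as `TA = ..., MQ = ..., VQ = ..., OA = ...`. The sketch below shows one possible way to extract both from the generated text. It is not part of the released code: the helper names (`parse_select_frames`, `frames_are_paired`, `parse_scores`) are hypothetical, and it assumes the model follows the formats described in the prompt. It matches on the JSON and score patterns only, so it does not depend on any surrounding tag names.

```python
import json
import re

N = 96  # total frame index range used in the Quick Start prompt (N/2 per video)

# Matches "TA = a, MQ = b, VQ = c, OA = d" with each value in {0, 1, 2}.
SCORE_PATTERN = re.compile(
    r"TA\s*=\s*([012])\s*,\s*MQ\s*=\s*([012])\s*,\s*VQ\s*=\s*([012])\s*,\s*OA\s*=\s*([012])"
)


def parse_select_frames(text):
    """Return the frame indices of the first select_frames call, or None."""
    match = re.search(r'\{\s*"name"\s*:\s*"select_frames".*?\}\s*\}', text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call.get("arguments", {}).get("target_frames")


def frames_are_paired(frames, n=N):
    """Check the prompt's rule: each Video 2 frame equals a Video 1 frame + n/2."""
    half = n // 2
    first = sorted(f for f in frames if f <= half)
    second = sorted(f for f in frames if f > half)
    return len(first) == len(second) and all(
        b == a + half for a, b in zip(first, second)
    )


def parse_scores(text):
    """Return the last reported per-dimension scores as a dict, or None."""
    matches = SCORE_PATTERN.findall(text)
    if not matches:
        return None
    ta, mq, vq, oa = (int(v) for v in matches[-1])
    return {"TA": ta, "MQ": mq, "VQ": vq, "OA": oa}
```

With the Quick Start script, these helpers would be applied to `output_text[0]`. `parse_scores` takes the last match because intermediate rounds report scores in the same pattern before the final answer.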
## Citation

```
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```