Improve model card: Add pipeline tag, library name, abstract, and comprehensive details
#1 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,23 +1,60 @@
---
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
---

For further details, please refer to the following:
- 📰 Paper: https://
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

###

We provide a sample test interface here:

~~~python

@@ -87,7 +124,7 @@ You need to evaluate the videos based on the following dimensions:
- Compare the videos based on the evaluation dimensions.
- If differences are subtle, identify specific key moments for further comparison and request additional frames.
- Use the `<tool_call>` to select up to 8 additional frames. Example:
`<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
- Use `<recommend answer>` to output your current inclination and confidence level.

2. **Subsequent Rounds**:

@@ -118,10 +155,22 @@ You need to evaluate the videos based on the following dimensions:
"""

sys_prompt = \
"""You are a helpful assistant.
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
\"required\": [\"target_frames\"]}}}


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]

@@ -171,6 +220,9 @@ print(output_text)

~~~

## Citation
```
@misc{wang2025vrthinkerboostingvideoreward,

README.md (new version):

---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
---

# VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

This repository contains **VR-Thinker**, a video reward model built on a thinking-with-image reasoning framework, introduced in the paper [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518).

## Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Model Details

VR-Thinker is the first multimodal reward model to use a thinking-with-image framework.

For further details, please refer to the following:
- 📰 Paper: [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518)
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Overview



Recent advancements in **multimodal reward models (RMs)** have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:

1. **Visual inputs consume large context budgets**, forcing fewer frames and causing loss of fine-grained details
2. **All visual information is packed into the initial prompt**, exacerbating hallucination and forgetting during chain-of-thought reasoning

To overcome these issues, we introduce **VR-Thinker**, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.


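The configurable visual memory window is what keeps this frame-by-frame evidence gathering within context limits. As a rough illustration only (this is not the repository's implementation; the window size and data layout are assumptions), newly selected frames can simply evict the oldest ones:

~~~python
from collections import deque

class VisualMemoryWindow:
    """Keep only the most recent frame selections in the visual context (illustrative sketch)."""

    def __init__(self, max_batches: int = 2):
        # Older selections fall out automatically once the window is full.
        self.batches = deque(maxlen=max_batches)

    def add_selection(self, frame_indices):
        self.batches.append(list(frame_indices))

    def visible_frames(self):
        """Frame indices that would be re-encoded into the next reasoning turn."""
        return [idx for batch in self.batches for idx in batch]

window = VisualMemoryWindow(max_batches=2)
window.add_selection([1, 24, 48, 72])    # initial uniform sample
window.add_selection([12, 16, 20, 24])   # frames requested via a select_frames call
print(window.visible_frames())
~~~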

## Training Pipeline
We activate visual reasoning via a reinforcement fine-tuning pipeline:

1. **Cold start**: Distill basic reasoning skills and operation formatting with curated visual chain-of-thought data.
2. **Rejection Sampling Fine-Tuning (RFT)**: Select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning.
3. **Group Relative Policy Optimization (GRPO)**: Apply GRPO to further strengthen reasoning; a minimal sketch of the group-relative advantage step follows this list.

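For readers unfamiliar with GRPO, the core of step 3 is a group-relative advantage: several reasoning traces are sampled for the same prompt, and each trace is scored against its own group. A minimal sketch of that normalization (illustrative only; the reward scheme and names here are assumptions, not the actual training code):

~~~python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled trace relative to its group (GRPO-style normalization)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four traces sampled for one video comparison, with an assumed 0/1 reward
# for whether the final preference matches the human label.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
~~~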

## Results

Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos. A 7B VR-Thinker achieves:

- **80.5%** on VideoGen Reward
- **82.3%** on GenAI-Bench
- **75.6%** on MJ-Bench-Video

These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Quick Start
We provide a sample test interface here (a short sketch for handling the model's `<tool_call>` replies follows the example):

~~~python
# ... (unchanged lines omitted in this diff) ...

- Compare the videos based on the evaluation dimensions.
- If differences are subtle, identify specific key moments for further comparison and request additional frames.
- Use the `<tool_call>` to select up to 8 additional frames. Example:
`<tool_call>{{\"name\": \"select_frames\", \"arguments\": {{\"target_frames\": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
- Use `<recommend answer>` to output your current inclination and confidence level.

2. **Subsequent Rounds**:

# ... (unchanged lines omitted in this diff) ...

"""

sys_prompt = \
"""You are a helpful assistant.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
\"required\": [\"target_frames\"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]

# ... (unchanged lines omitted in this diff) ...

~~~

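When the model answers with a `<tool_call>` rather than a final judgment, the caller has to extract the requested frame indices and send the corresponding frames back in the next turn. A minimal sketch of that parsing step, based on the tool format defined in the system prompt above (the helper name is illustrative, not part of the released interface):

~~~python
import json
import re

def parse_frame_request(output_text: str):
    """Return the requested frame indices, or None if the model made no tool call."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", output_text, re.DOTALL)
    if match is None:
        return None
    call = json.loads(match.group(1))
    if call.get("name") != "select_frames":
        return None
    # The system prompt allows at most 8 frames, indexed from 1 to 96.
    return call["arguments"]["target_frames"][:8]

sample = '<tool_call>{"name": "select_frames", "arguments": {"target_frames": [12, 16, 20, 24]}}</tool_call>'
print(parse_frame_request(sample))  # [12, 16, 20, 24]
~~~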
## Acknowledgement

This repo is based on [pixel-reasoner](https://github.com/TIGER-AI-Lab/Pixel-Reasoner) and [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF). We thank the authors for their valuable contributions to the AIGC community.

## Citation
```
@misc{wang2025vrthinkerboostingvideoreward,