Improve model card: Add pipeline tag, library name, abstract, and comprehensive details
#1 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,23 +1,60 @@
---
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
---

For further details, please refer to the following:
- 📰 Paper: https://
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

###

We provide a sample test interface here:

~~~python

@@ -87,7 +124,7 @@ You need to evaluate the videos based on the following dimensions:
- Compare the videos based on the evaluation dimensions.
- If differences are subtle, identify specific key moments for further comparison and request additional frames.
- Use the `<tool_call>` to select up to 8 additional frames. Example:
`<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
- Use `<recommend answer>` to output your current inclination and confidence level.

2. **Subsequent Rounds**:

@@ -118,10 +155,22 @@ You need to evaluate the videos based on the following dimensions:
"""

sys_prompt = \
"""You are a helpful assistant.
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
\"required\": [\"target_frames\"]}}}


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]

@@ -171,6 +220,9 @@ print(output_text)

~~~

## Citation
```
@misc{wang2025vrthinkerboostingvideoreward,

README.md (new version):

---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- KwaiVGI/VideoGen-RewardBench
- TIGER-Lab/GenAI-Bench
license: mit
pipeline_tag: video-text-to-text
library_name: transformers
---

# VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

This repository contains **VR-Thinker**, a video reward model built on a thinking-with-image reasoning framework, introduced in the paper [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518).

## Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Model Details

VR-Thinker is the first multimodal reward model to use a thinking-with-image framework.

For further details, please refer to the following:
- 📰 Paper: [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518)
- 📚 Github: https://github.com/qunzhongwang/vr-thinker
- 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

### Overview



Recent advancements in **multimodal reward models (RMs)** have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:

1. **Visual inputs consume large context budgets**, forcing fewer frames and causing loss of fine-grained details
2. **All visual information is packed into the initial prompt**, exacerbating hallucination and forgetting during chain-of-thought reasoning

To overcome these issues, we introduce **VR-Thinker**, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.


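The configurable visual memory window is what keeps this frame-by-frame evidence gathering within context limits. As a rough illustration only (this is not the repository's implementation; the window size and data layout are assumptions), newly selected frames can simply evict the oldest ones:

~~~python
from collections import deque

class VisualMemoryWindow:
    """Keep only the most recent frame selections in the visual context (illustrative sketch)."""

    def __init__(self, max_batches: int = 2):
        # Older selections fall out automatically once the window is full.
        self.batches = deque(maxlen=max_batches)

    def add_selection(self, frame_indices):
        self.batches.append(list(frame_indices))

    def visible_frames(self):
        """Frame indices that would be re-encoded into the next reasoning turn."""
        return [idx for batch in self.batches for idx in batch]

window = VisualMemoryWindow(max_batches=2)
window.add_selection([1, 24, 48, 72])    # initial uniform sample
window.add_selection([12, 16, 20, 24])   # frames requested via a select_frames call
print(window.visible_frames())
~~~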

## Training Pipeline
We activate visual reasoning via a reinforcement fine-tuning pipeline:

1. **Cold start**: Distill basic reasoning skills and operation formatting with curated visual chain-of-thought data.
2. **Rejection Sampling Fine-Tuning (RFT)**: Select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning.
3. **Group Relative Policy Optimization (GRPO)**: Apply GRPO to further strengthen reasoning; a minimal sketch of the group-relative advantage step follows this list.

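For readers unfamiliar with GRPO, the core of step 3 is a group-relative advantage: several reasoning traces are sampled for the same prompt, and each trace is scored against its own group. A minimal sketch of that normalization (illustrative only; the reward scheme and names here are assumptions, not the actual training code):

~~~python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled trace relative to its group (GRPO-style normalization)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four traces sampled for one video comparison, with an assumed 0/1 reward
# for whether the final preference matches the human label.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
~~~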

## Results

Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos. A 7B VR-Thinker achieves:

- **80.5%** on VideoGen Reward
- **82.3%** on GenAI-Bench
- **75.6%** on MJ-Bench-Video

These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

## Quick Start
We provide a sample test interface here (a short sketch for handling the model's `<tool_call>` replies follows the example):

~~~python
# ... (unchanged lines omitted in this diff) ...

- Compare the videos based on the evaluation dimensions.
- If differences are subtle, identify specific key moments for further comparison and request additional frames.
- Use the `<tool_call>` to select up to 8 additional frames. Example:
`<tool_call>{{\"name\": \"select_frames\", \"arguments\": {{\"target_frames\": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
- Use `<recommend answer>` to output your current inclination and confidence level.

2. **Subsequent Rounds**:

# ... (unchanged lines omitted in this diff) ...

"""

sys_prompt = \
"""You are a helpful assistant.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
\"required\": [\"target_frames\"]}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]

# ... (unchanged lines omitted in this diff) ...

~~~

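When the model answers with a `<tool_call>` rather than a final judgment, the caller has to extract the requested frame indices and send the corresponding frames back in the next turn. A minimal sketch of that parsing step, based on the tool format defined in the system prompt above (the helper name is illustrative, not part of the released interface):

~~~python
import json
import re

def parse_frame_request(output_text: str):
    """Return the requested frame indices, or None if the model made no tool call."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", output_text, re.DOTALL)
    if match is None:
        return None
    call = json.loads(match.group(1))
    if call.get("name") != "select_frames":
        return None
    # The system prompt allows at most 8 frames, indexed from 1 to 96.
    return call["arguments"]["target_frames"][:8]

sample = '<tool_call>{"name": "select_frames", "arguments": {"target_frames": [12, 16, 20, 24]}}</tool_call>'
print(parse_frame_request(sample))  # [12, 16, 20, 24]
~~~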
## Acknowledgement

This repo is based on [pixel-reasoner](https://github.com/TIGER-AI-Lab/Pixel-Reasoner) and [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF). We thank the authors for their valuable contributions to the AIGC community.

## Citation
```
@misc{wang2025vrthinkerboostingvideoreward,