
Improve model card: Add pipeline tag, library name, abstract, and comprehensive details

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +63 -11
README.md CHANGED
@@ -1,23 +1,60 @@
  ---
- license: mit
  datasets:
  - KwaiVGI/VideoGen-RewardBench
  - TIGER-Lab/GenAI-Bench
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
  ---


- ## Model Summary

- VR-Thinker is the first Multimodal Reward Model utilizing Thinking-with-Image framework.

  For further details, please refer to the following:
- - 📰 Paper: https://arxiv.org/pdf/2510.10518
  - 📚 Github: https://github.com/qunzhongwang/vr-thinker
  - 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

- ### Quick Start
  We provide a sample test interface here:

  ~~~python
@@ -87,7 +124,7 @@ You need to evaluate the videos based on the following dimensions:
  - Compare the videos based on the evaluation dimensions.
  - If differences are subtle, identify specific key moments for further comparison and request additional frames.
  - Use the `<tool_call>` to select up to 8 additional frames. Example:
- `<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
  - Use `<recommend answer>` to output your current inclination and confidence level.

  2. **Subsequent Rounds**:
@@ -118,10 +155,22 @@ You need to evaluate the videos based on the following dimensions:
  """

  sys_prompt = \
- """You are a helpful assistant. \n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
- <tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
  {\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
- \"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
@@ -171,6 +220,9 @@ print(output_text)

  ~~~

  ## Citation
  ```
  @misc{wang2025vrthinkerboostingvideoreward,
 
  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - KwaiVGI/VideoGen-RewardBench
  - TIGER-Lab/GenAI-Bench
+ license: mit
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---

+ # VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

+ This repository contains the **VR-Thinker** model, a thinking-with-image framework for multimodal reward models, introduced in the paper [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518).

+ ## Abstract
+ Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
+
+ ## Model Details
+
+ VR-Thinker is the first Multimodal Reward Model utilizing a Thinking-with-Image framework.

  For further details, please refer to the following:
+ - 📰 Paper: [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://huggingface.co/papers/2510.10518)
  - 📚 Github: https://github.com/qunzhongwang/vr-thinker
  - 👋 Contact: [Qunzhong Wang](http://qunzhongwang.github.io/)

+ ### Overview
+ ![overview](https://github.com/qunzhongwang/vr-thinker/raw/main/figs/teaser1.png)
+
+ Recent advancements in **multimodal reward models (RMs)** have substantially improved post-training for visual generative models. However, current RMs face inherent limitations:
+
+ 1. **Visual inputs consume large context budgets**, forcing fewer frames and causing loss of fine-grained details
+ 2. **All visual information is packed into the initial prompt**, exacerbating hallucination and forgetting during chain-of-thought reasoning
+
+ To overcome these issues, we introduce **VR-Thinker**, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability.
+
+ ![Qualitative Case](https://github.com/qunzhongwang/vr-thinker/raw/main/figs/teaser2.png)
+
+ ## Training Pipeline
+ We activate visual reasoning via a reinforcement fine-tuning pipeline:
+
+ 1. **Cold Start**: Distill basic reasoning skills and operation formatting with curated visual chain-of-thought data
+ 2. **Rejection Sampling Fine-Tuning**: Select samples whose per-dimension and overall judgments are all correct, then fine-tune on these high-quality traces to further enhance reasoning
+ 3. **Group Relative Policy Optimization (GRPO)**: Apply GRPO to strengthen reasoning
+
+ ## Results
+ Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos:
+
+ A 7B VR-Thinker achieves:
+ - **80.5%** on VideoGen Reward
+ - **82.3%** on GenAI-Bench
+ - **75.6%** on MJ-Bench-Video
+
+ These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
+
+ ## Quick Start
  We provide a sample test interface here:

  ~~~python
 
  - Compare the videos based on the evaluation dimensions.
  - If differences are subtle, identify specific key moments for further comparison and request additional frames.
  - Use the `<tool_call>` to select up to 8 additional frames. Example:
+ `<tool_call>{{\"name\": \"select_frames\", \"arguments\": {{\"target_frames\": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
  - Use `<recommend answer>` to output your current inclination and confidence level.

  2. **Subsequent Rounds**:
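
The frame-selection call shown in the prompt above follows the limits stated in the tool schema: at most 8 frames, with indices from 1 to 96. As a minimal sketch of that contract, the helper below (`make_select_frames_call` is a hypothetical name for illustration, not part of the released interface) builds a well-formed call and rejects out-of-range requests:

```python
import json

def make_select_frames_call(target_frames):
    """Build a <tool_call> string for the select_frames tool.

    Hypothetical helper (not from the model card): enforces the limits the
    tool schema states -- at most 8 frames, indices in the range 1..96.
    """
    if not 1 <= len(target_frames) <= 8:
        raise ValueError("select between 1 and 8 frames")
    if any(not 1 <= f <= 96 for f in target_frames):
        raise ValueError("frame indices must be in the range 1..96")
    payload = {"name": "select_frames", "arguments": {"target_frames": target_frames}}
    return "<tool_call>" + json.dumps(payload) + "</tool_call>"

# Reproduces the example call from the prompt text.
print(make_select_frames_call([12, 16, 20, 24, 60, 64, 68, 72]))
```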
 
  """

  sys_prompt = \
+ """You are a helpful assistant.
+
+ # Tools
+
+ You may call one or more functions to assist with the user query.
+
+ You are provided with function signatures within <tools></tools> XML tags:
+ <tools>
+ {\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
  {\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to 96.\"}}},
+ \"required\": [\"target_frames\"]}}}
+ </tools>
+
+ For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
+ <tool_call>
+ {\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


  content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
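
On the decoding side, the system prompt above obliges the model to emit its function call as JSON inside `<tool_call></tool_call>` tags, which implies the surrounding harness must extract and parse that span. A minimal sketch of such a parser (an assumption about harness-side code, not shown in the model card):

```python
import json
import re

def parse_tool_call(text):
    """Extract the JSON payload of the first <tool_call>...</tool_call> span.

    Hypothetical harness-side parser (not from the model card); returns
    None when the model output contains no tool call.
    """
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Example model output: free-form reasoning followed by a frame request.
output = ('The videos differ around the jump. '
          '<tool_call>{"name": "select_frames", '
          '"arguments": {"target_frames": [40, 44, 48, 52]}}</tool_call>')
call = parse_tool_call(output)
print(call["name"], call["arguments"]["target_frames"])
```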
 

  ~~~

+ ## Acknowledgement
+ This repo is based on [pixel-reasoner](https://github.com/TIGER-AI-Lab/Pixel-Reasoner) and [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF). We thank the authors for their valuable contributions to the AIGC community.
+
  ## Citation
  ```
  @misc{wang2025vrthinkerboostingvideoreward,