Improve model card for VideoRFT with metadata and comprehensive content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +216 -6
README.md CHANGED
@@ -1,13 +1,223 @@
  ---
- license: apache-2.0
+ base_model:
+ - QiWang98/VideoRFT-SFT
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - QiWang98/VideoRFT-Data
  language:
  - en
+ license: apache-2.0
  metrics:
  - accuracy
- base_model:
- - QiWang98/VideoRFT-SFT
- - Qwen/Qwen2.5-VL-7B-Instruct
- pipeline_tag: visual-question-answering
- ---
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ ---
+
+ # 🎥 VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
+
+ This repository contains the `VideoRFT` model, presented in the paper [VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning](https://huggingface.co/papers/2505.12434).
+
+ <p align="center">
+ 📖 <a href="https://arxiv.org/abs/2505.12434">ArXiv</a>
+ &nbsp;&nbsp;│&nbsp;&nbsp; 📀 <a href="https://huggingface.co/datasets/QiWang98/VideoRFT-Data">CoT Dataset</a>
+ &nbsp;&nbsp;│&nbsp;&nbsp; 📀 <a href="https://huggingface.co/datasets/QiWang98/VideoRFT-Data">RL Dataset</a>
+ &nbsp;&nbsp;│&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/QiWang98/VideoRFT">Models</a>
+ </p>
+
+ ## 📰 News
+ - [2025/09/19] Our paper has been **accepted to NeurIPS 2025** 🎉!
+ - [2025/06/01] We released our 3B models ([🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B) and [🤗VideoRFT-3B](https://huggingface.co/QiWang98/VideoRFT-3B)) on Hugging Face.
+ - [2025/05/25] We released our 7B models ([🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) and [🤗VideoRFT-7B](https://huggingface.co/QiWang98/VideoRFT)) on Hugging Face.
+ - [2025/05/20] We released our datasets ([📀CoT Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀RL Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data)) on Hugging Face.
+ - [2025/05/18] Our paper was released on [ArXiv](https://arxiv.org/abs/2505.12434), and we have open-sourced our code on [GitHub](https://github.com/QiWang98/VideoRFT)!
+
+ ## 🔎 Overview
+
+ Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in large language models (LLMs), and has recently been extended to multimodal LLMs (MLLMs). Nevertheless, reasoning about videos, a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose **VideoRFT**, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. **VideoRFT** follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a vision-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that **VideoRFT** achieves state-of-the-art performance on six video reasoning benchmarks.
+
+ <div align="center">
+ <img src="https://github.com/QiWang98/VideoRFT/raw/main/images/overview.png" />
+ </div>
+
+ ## ✨ Methodology
+
+ To overcome the scarcity of video CoTs, we develop a scalable, cognition-inspired pipeline for constructing high-quality video CoT datasets.
+
+ <div align="center">
+ <img src="https://github.com/QiWang98/VideoRFT/raw/main/images/pipeline.png" width="95%" />
+ </div>
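+
+ At a glance, the curation flow described above proceeds in two stages. The sketch below is purely illustrative: the model handles, prompts, and helpers such as `describe_video_structurally` are hypothetical placeholders, not the released pipeline code.
+
+ ```python
+ # Illustrative two-stage CoT curation, following the description above.
+ # All helper functions and model handles are hypothetical placeholders.
+ def curate_cot(video_path: str, question: str, reasoning_llm, vlm) -> str:
+     # Stage 1: build a rich, structured, literal textual representation of the
+     # video (captions, objects, actions, timestamps) and let a reasoning LLM
+     # draft a preliminary chain of thought from text alone.
+     structured_desc = describe_video_structurally(video_path)  # hypothetical helper
+     draft_cot = reasoning_llm.generate(
+         f"Video description:\n{structured_desc}\n\n"
+         f"Question: {question}\n"
+         "Think step by step and write a chain of thought."
+     )
+     # Stage 2: a vision-language model revises the draft while conditioned on the
+     # actual video, removing visually inconsistent or hallucinated statements.
+     revised_cot = vlm.generate(
+         video=video_path,
+         prompt=f"Revise this reasoning so it stays faithful to the video:\n{draft_cot}",
+     )
+     return revised_cot
+ ```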
+
+ To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes alignment between textual reasoning and visual evidence.
+
+ <div align="center">
+ <img src="https://github.com/QiWang98/VideoRFT/raw/main/images/grpo.png" width="95%" />
+ </div>
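+
+ As a toy illustration only: one way to realize such a reward is a cosine similarity between an embedding of the generated reasoning trace and an embedding of the video. The paper's exact formulation may differ, and the embedders below are unspecified placeholders.
+
+ ```python
+ import torch.nn.functional as F
+
+ # Toy semantic-consistency reward: cosine similarity between a text embedding of
+ # the reasoning trace and a (pooled) video embedding. Simplified stand-in only.
+ def semantic_consistency_reward(text_embedder, video_embedder, reasoning, video) -> float:
+     t = text_embedder(reasoning)   # shape (d,) text embedding
+     v = video_embedder(video)      # shape (d,) video embedding
+     return F.cosine_similarity(t.unsqueeze(0), v.unsqueeze(0)).item()
+
+ # In GRPO-style training, this score would be added to the rule-based accuracy and
+ # format rewards of each sampled completion before computing group-wise advantages.
+ ```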
+
+ ## 📀 Datasets
+
+ Based on the above pipeline, we construct two large-scale datasets, i.e., [📀VideoRFT-CoT-102K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀VideoRFT-RL-310K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data).
+ <div align="center">
+ <img src="https://github.com/QiWang98/VideoRFT/raw/main/images/dataset.png" width="50%" />
+ </div>
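+
+ Both subsets live in the same Hub dataset repository and can be loaded with the 🤗 `datasets` library. The snippet below is a minimal sketch; check the dataset card for the actual configuration and split names.
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the VideoRFT data from the Hub; inspect the returned splits/configs to see
+ # how the CoT (SFT) and RL subsets are organized.
+ data = load_dataset("QiWang98/VideoRFT-Data")
+ print(data)
+ ```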
+
+ ## 🛠️ Setup
+
+ ### Requirements
+ * `Python >= 3.11`
+ * `PyTorch >= 2.5.1`
+ * `transformers == 4.51.3`
+ * `vLLM == 0.7.3`
+ * `trl == 0.16.0`
+
+ ### Installation
+ ```bash
+ git clone https://github.com/QiWang98/VideoRFT
+ cd VideoRFT
+
+ # Create and activate environment
+ conda create -n VideoRFT python=3.11
+ conda activate VideoRFT
+ bash setup.sh
+
+ # Install decord for improved video processing
+ cd src/qwen-vl-utils
+ pip install -e .[decord]
+ ```
+
+ ## 🚀 Training
+
+ ### Supervised Fine-Tuning (SFT)
+ We begin with supervised fine-tuning on the VideoRFT-CoT dataset for one epoch:
+
+ ```bash
+ bash ./src/scripts/run_sft_video.sh
+ ```
+
+ This step can be skipped by directly using our pretrained SFT models, available at [🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) or [🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B).
+
+ ### Reinforcement Learning (RL)
+
+ Next, perform reinforcement learning using the VideoRFT-RL dataset:
+
+ ```bash
+ bash ./src/scripts/run_grpo_video.sh
+ ```
+
+ To enable faster training via vLLM acceleration:
+
+ ```bash
+ bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
+ ```
+
+ > **Note:** During training, we adopt the following settings for efficiency:
+
+ * **VIDEO PIXELS**: 128 × 28 × 28
+ * **FPS FRAMES**: 16
+
+ All frame-related configurations can be adjusted in `src/qwen-vl-utils`.
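+
+ For intuition, Qwen2.5-VL maps each 28 × 28 pixel region (after 2 × 2 patch merging) to one visual token, so the caps above bound the visual-token budget per video. The arithmetic below is derived from the settings listed here, not read from the training code.
+
+ ```python
+ # Rough visual-token budget implied by the training settings above.
+ max_pixels_per_frame = 128 * 28 * 28                   # VIDEO PIXELS cap = 100,352 pixels
+ tokens_per_frame = max_pixels_per_frame // (28 * 28)   # = 128 visual tokens per frame
+ max_frames = 16                                         # FPS FRAMES cap
+ print(tokens_per_frame * max_frames)                    # up to ~2,048 visual tokens per video
+ ```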
+
+ ## 📈 Inference & Evaluation
+
+ > During inference, we increase the maximum frame resolution and the number of frames to boost performance:
+
+ * **VIDEO PIXELS**: 256 × 28 × 28
+ * **FPS FRAMES**: 32
+
+ You can configure these parameters in `src/qwen-vl-utils`.
+
+ > We evaluate all models under a unified decoding configuration, following the official Qwen2.5-VL demo:
+
+ * `top_p = 0.001`
+ * `temperature = 0.01`
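+
+ In a Transformers-based setup, this near-greedy configuration corresponds roughly to the generation arguments below. This is a sketch; the repository's evaluation scripts may set these values elsewhere, and the `max_new_tokens` budget is an assumption.
+
+ ```python
+ # Near-greedy decoding for evaluation: sampling is effectively disabled by the
+ # very low temperature and top_p.
+ gen_kwargs = dict(
+     do_sample=True,
+     temperature=0.01,
+     top_p=0.001,
+     max_new_tokens=1024,  # assumed budget; adjust per benchmark
+ )
+ # output_ids = model.generate(**inputs, **gen_kwargs)
+ ```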
+
+ ### Evaluation Procedure
+
+ 1. Download the preprocessed evaluation JSONs from [🤗 eval](https://huggingface.co/datasets/Video-R1/Video-R1-eval).
+
+ 2. Download the video data from the official site of each benchmark and organize it as specified in the JSON files.
+
+ 3. Run the evaluation across all benchmarks:
+
+ ```bash
+ bash ./src/eval_bench.sh
+ ```
144
+
145
+ ## 🚀 Quick Inference Code
146
+
147
+ ```python
148
+ import numpy as np
149
+ import torch
150
+ from longvu.builder import load_pretrained_model
151
+ from longvu.constants import (
152
+ DEFAULT_IMAGE_TOKEN,
153
+ IMAGE_TOKEN_INDEX,
154
+ )
155
+ from longvu.conversation import conv_templates, SeparatorStyle
156
+ from longvu.mm_datautils import (
157
+ KeywordsStoppingCriteria,
158
+ process_images,
159
+ tokenizer_image_token,
160
+ )
161
+ from decord import cpu, VideoReader
162
+
163
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
164
+ "./checkpoints/longvu_qwen", None, "cambrian_qwen",
165
+ )
166
+
167
+ model.eval()
168
+ video_path = "./examples/video1.mp4"
169
+ qs = "Describe this video in detail"
170
+
171
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
172
+ fps = float(vr.get_avg_fps())
173
+ frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
174
+ video = []
175
+ for frame_index in frame_indices:
176
+ img = vr[frame_index].asnumpy()
177
+ video.append(img)
178
+ video = np.stack(video)
179
+ image_sizes = [video[0].shape[:2]]
180
+ video = process_images(video, image_processor, model.config)
181
+ video = [item.unsqueeze(0) for item in video]
182
+
183
+ qs = DEFAULT_IMAGE_TOKEN + "
184
+ " + qs
185
+ conv = conv_templates["qwen"].copy()
186
+ conv.append_message(conv.roles[0], qs)
187
+ conv.append_message(conv.roles[1], None)
188
+ prompt = conv.get_prompt()
189
+
190
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
191
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
192
+ keywords = [stop_str]
193
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
194
+ with torch.inference_mode():
195
+ output_ids = model.generate(
196
+ input_ids,
197
+ images=video,
198
+ image_sizes=image_sizes,
199
+ do_sample=False,
200
+ temperature=0.2,
201
+ max_new_tokens=128,
202
+ use_cache=True,
203
+ stopping_criteria=[stopping_criteria],
204
+ )
205
+ pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
206
+ ```
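+
+ Because VideoRFT is fine-tuned from `Qwen/Qwen2.5-VL-7B-Instruct` and this card lists `library_name: transformers`, a plain Transformers loading path should also work. The sketch below follows the standard Qwen2.5-VL usage pattern and is not the repository's evaluation code; the video path, question, frame rate, and token budget are placeholders, and the pixel cap mirrors the inference setting above.
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ model_id = "QiWang98/VideoRFT"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "video", "video": "file:///path/to/video.mp4",
+          "max_pixels": 256 * 28 * 28, "fps": 1.0},  # pixel cap mirrors the inference settings above
+         {"type": "text", "text": "Describe this video in detail."},
+     ],
+ }]
+
+ # Build the chat prompt and preprocess the video frames.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
+                    padding=True, return_tensors="pt").to(model.device)
+
+ with torch.inference_mode():
+     out = model.generate(**inputs, do_sample=True, temperature=0.01, top_p=0.001,
+                          max_new_tokens=512)
+ out = out[:, inputs.input_ids.shape[1]:]  # strip the prompt tokens
+ print(processor.batch_decode(out, skip_special_tokens=True)[0])
+ ```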
+
+ ## 🙏 Acknowledgements
+
+ We gratefully acknowledge the contributions of the open-source community, particularly [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), and [R1-V](https://github.com/Deep-Agent/R1-V).
+
+ ## 📚 Citations
+
+ If you find this work helpful, please consider citing:
+
+ ```bibtex
+ @article{VideoRFT,
+   title={VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning},
+   author={Wang, Qi and Yu, Yanrui and Yuan, Ye and Mao, Rui and Zhou, Tianfei},
+   journal={arXiv preprint arXiv:2505.12434},
+   year={2025}
+ }
+ ```