VideoRFT-SFT / README.md
nielsr's picture
nielsr HF Staff
Improve model card: update pipeline tag, add library name, paper details & content
6dae325 verified
|
raw
history blame
7.3 kB
metadata
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
datasets:
  - QiWang98/VideoRFT-Data
language:
  - en
license: apache-2.0
metrics:
  - accuracy
pipeline_tag: video-text-to-text
library_name: transformers

๐ŸŽฅ $\text{VideoRFT}$: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

๐Ÿ“‘ Paper: VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning โญ๏ธ Code: https://github.com/QiWang98/VideoRFT ๐Ÿ“€ CoT Dataset: https://huggingface.co/datasets/QiWang98/VideoRFT-Data ๐Ÿ“€ RL Dataset: https://huggingface.co/datasets/QiWang98/VideoRFT-Data ๐Ÿค— Models: https://huggingface.co/QiWang98/VideoRFT

๐Ÿ“ฐ News

๐Ÿ”Ž Overview

Reinforcement fine-tuning (RFT) has shown great promise in achieving humanlevel reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose $\textbf{VideoRFT}$, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. $\textbf{VideoRFT}$ follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets $-$ VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that $\textbf{VideoRFT}$ achieves state-of-the-art performance on six video reasoning benchmarks.

โœจ Methodology

To overcome the scarcity of video CoTs, we develop a scalable, cognitively inspired pipeline for high-quality video CoT dataset construction.

To further strength the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning with visual evidence.

๐Ÿ“€ Datasets

Based on above pipeline, we construct two large-scale datasets, i.e., ๐Ÿ“€VideoRFT-CoT-102K and ๐Ÿ“€VideoRFT-RL-310K.

๐Ÿ› ๏ธ Set up

Requirements

  • Python >= 3.11
  • Pytorch >= 2.5.1
  • transformers == 4.51.3
  • vLLM == 0.7.3
  • trl == 0.16.0

Installation

git clone https://github.com/QiWang98/VideoRFT
cd VideoRFT

# Create and activate environment
conda create -n VideoRFT python=3.11 
conda activate VideoRFT
bash setup.sh

# Install decord for improved video processing
cd src/qwen-vl-utils
pip install -e .[decord]

๐Ÿš€ Training

Supervised Fine-Tuning (SFT)

We begin with supervised fine-tuning on the VideoRFT-CoT dataset for one epoch:

bash ./src/scripts/run_sft_video.sh

This step can be skipped by directly using our pretrained SFT models, available at ๐Ÿค—VideoRFT-SFT-7B or ๐Ÿค—VideoRFT-SFT-3B.

Reinforcement Learning (RL)

Next, perform reinforcement learning using the VideoRFT-RL dataset:

bash ./src/scripts/run_grpo_video.sh

To enable faster training via vLLM acceleration:

bash ./src/scripts/run_grpo_vllm_qwen25vl.sh

Note: During training, we adopt the following settings for efficiency:

  • VIDEO PIXELS: 128 ร— 28 ร— 28
  • FPS FRAMES: 16

All frame-related configurations can be adjusted in src/qwen-vl-utils.

๐Ÿ“ˆ Inference & Evaluation

During inference, we increase the maximum frame resolution and length to boost performance:

  • VIDEO PIXELS: 256 ร— 28 ร— 28
  • FPS FRAMES: 32

You can configure these parameters in src/qwen-vl-utils.

We evaluate all models under a unified decoding configuration following the official Qwen2.5-VL demo:

  • top_p = 0.001
  • temperature = 0.01

Evaluation Procedure

  1. Download preprocessed evaluation JSONs from: [๐Ÿค— eval]

  2. Download the video data from the official sites of each benchmark and organize them as specified in the JSON files.

  3. Run the evaluation across all benchmarks:

bash ./src/eval_bench.sh

๐Ÿ™ Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly DeepSeek-R1, Open-R1, and R1-V.

๐Ÿ“š Citations

If you find this work helpful, please consider citing:

@article{VideoRFT,
  title={VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning},
  author={Wang, Qi and Yu, Yanrui and Yuan, Ye and Mao, Rui and Zhou, Tianfei},
  journal={arXiv preprint arXiv:2505.12434},
  year={2025}
}