---
base_model:
- QiWang98/VideoRFT-SFT-3B
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- QiWang98/VideoRFT-Data
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: video-text-to-text
library_name: transformers
---

# 🎥 VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

This repository contains the **VideoRFT** model, presented in the paper [VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning](https://huggingface.co/papers/2505.12434).

<p align="center">
📖 <a href="https://huggingface.co/papers/2505.12434">Paper</a>
│ 💻 <a href="https://github.com/QiWang98/VideoRFT">Code</a>
│ 📀 <a href="https://huggingface.co/datasets/QiWang98/VideoRFT-Data">CoT Dataset</a>
│ 📀 <a href="https://huggingface.co/datasets/QiWang98/VideoRFT-Data">RL Dataset</a>
│ 🤗 <a href="https://huggingface.co/QiWang98/VideoRFT">Models</a>
</p>

## 📰 News
- [2025/09/19] Our paper has been **accepted to NeurIPS 2025** 🎉!
- [2025/06/01] We released our 3B models ([🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B) and [🤗VideoRFT-3B](https://huggingface.co/QiWang98/VideoRFT-3B)) on Hugging Face.
- [2025/05/25] We released our 7B models ([🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) and [🤗VideoRFT-7B](https://huggingface.co/QiWang98/VideoRFT)) on Hugging Face.
- [2025/05/20] We released our datasets ([📀CoT Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀RL Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data)) on Hugging Face.
- [2025/05/18] Our paper was released on [arXiv](https://arxiv.org/abs/2505.12434), and we open-sourced our code on [GitHub](https://github.com/QiWang98/VideoRFT)!

## 🔎 Overview

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a fully automatic CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a visual-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.

<div align="center">
<img src="https://raw.githubusercontent.com/QiWang98/VideoRFT/main/images/overview.png" />
</div>

## ✨ Methodology

To overcome the scarcity of video CoTs, we develop a scalable, cognitively inspired pipeline for constructing high-quality video CoT datasets.

<div align="center">
<img src="https://raw.githubusercontent.com/QiWang98/VideoRFT/main/images/pipeline.png" width="95%" />
</div>
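
For intuition, the control flow of this two-stage curation can be sketched as below. Every helper here is a hypothetical placeholder to illustrate the logic, not the released implementation.

```python
# Schematic sketch of the two-stage CoT curation pipeline. All helpers are
# hypothetical stand-ins for the actual LLM/VLM calls.

def extract_video_representation(video_path: str) -> str:
    """Hypothetical: rich, structured, literal text description of the video."""
    return "scene: kitchen | events: [opens fridge, pours milk] | objects: ..."

def reasoning_llm(prompt: str) -> str:
    """Hypothetical: text-only reasoning LLM drafts a preliminary CoT."""
    return "<think>The fridge is opened first, so ...</think>"

def vlm_revise(video_path: str, question: str, draft_cot: str) -> str:
    """Hypothetical: a VLM conditioned on the real video rewrites the draft
    for visual consistency, pruning hallucinated details."""
    return draft_cot

def curate_cot(video_path: str, question: str) -> str:
    # Stage 1: the reasoning LLM sees only a textual representation (no pixels).
    representation = extract_video_representation(video_path)
    draft = reasoning_llm(f"{representation}\n\nQuestion: {question}\nThink step by step.")
    # Stage 2: revise the draft against the actual video frames.
    return vlm_revise(video_path, question, draft)
```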

To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence.

<div align="center">
<img src="https://raw.githubusercontent.com/QiWang98/VideoRFT/main/images/grpo.png" width="95%" />
</div>
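
As a rough illustration of what a semantic-consistency reward can look like, the sketch below scores the cosine similarity between an embedding of the generated reasoning and an embedding of a textual reference for the visual evidence. It uses `sentence-transformers` as a stand-in encoder and is not the paper's exact formulation.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Stand-in encoder; the actual reward encoder may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_consistency_reward(reasoning_text: str, visual_reference: str) -> float:
    """Cosine similarity between the generated reasoning and a text description
    of the visual evidence, mapped to [0, 1]."""
    emb = encoder.encode([reasoning_text, visual_reference], convert_to_tensor=True)
    sim = F.cosine_similarity(emb[0:1], emb[1:2]).item()
    return (sim + 1.0) / 2.0  # map [-1, 1] -> [0, 1]
```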

## 📀 Datasets

Based on the above pipeline, we construct two large-scale datasets: [📀VideoRFT-CoT-102K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀VideoRFT-RL-310K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data).
<div align="center">
<img src="https://raw.githubusercontent.com/QiWang98/VideoRFT/main/images/dataset.png" width="50%" />
</div>
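
To preview the data, the standard 🤗 `datasets` API should suffice; the exact configuration and split layout inside the repository is an assumption here, so consult the dataset card if loading fails.

```python
from datasets import load_dataset

# CoT (SFT) and RL data are published in the same Hub repository.
# The default config/split layout is an assumption; check the dataset card.
data = load_dataset("QiWang98/VideoRFT-Data")
print(data)
```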

## 🛠️ Setup

### Requirements
* `Python >= 3.11`
* `PyTorch >= 2.5.1`
* `transformers == 4.51.3`
* `vLLM == 0.7.3`
* `trl == 0.16.0`

### Installation
```bash
git clone https://github.com/QiWang98/VideoRFT
cd VideoRFT

# Create and activate environment
conda create -n VideoRFT python=3.11
conda activate VideoRFT
bash setup.sh

# Install decord for improved video processing
cd src/qwen-vl-utils
pip install -e ".[decord]"
```

## 🚀 Training

### Supervised Fine-Tuning (SFT)
We begin with supervised fine-tuning on the VideoRFT-CoT dataset for one epoch:

```bash
bash ./src/scripts/run_sft_video.sh
```

This step can be skipped by directly using our pretrained SFT models, available at [🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) or [🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B).

### Reinforcement Learning (RL)

Next, perform reinforcement learning on the VideoRFT-RL dataset:

```bash
bash ./src/scripts/run_grpo_video.sh
```

To enable faster training via vLLM acceleration:

```bash
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
```

> **Note:** During training, we adopt the following settings for efficiency:

* **VIDEO PIXELS**: 128 × 28 × 28
* **FPS FRAMES**: 16

All frame-related configurations can be adjusted in `src/qwen-vl-utils`.
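
If you use the upstream `qwen-vl-utils` interface directly, equivalent caps can also be set per request in the message dict, as sketched below with the training-time values. The key names follow upstream `qwen-vl-utils`; this repository's fork may expose them as global defaults instead.

```python
from qwen_vl_utils import process_vision_info

# Per-request frame settings via the upstream qwen-vl-utils message schema.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./examples/video1.mp4",
                "max_pixels": 128 * 28 * 28,  # training-time resolution cap
                "nframes": 16,                # number of sampled frames
            },
            {"type": "text", "text": "What happens in this video?"},
        ],
    }
]

image_inputs, video_inputs = process_vision_info(messages)
```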

## 📈 Inference & Evaluation

> During inference, we increase the maximum frame resolution and the number of sampled frames to boost performance:

* **VIDEO PIXELS**: 256 × 28 × 28
* **FPS FRAMES**: 32

You can configure these parameters in `src/qwen-vl-utils`.

> We evaluate all models under a unified decoding configuration, following the official Qwen2.5-VL demo:

* `top_p = 0.001`
* `temperature = 0.01`
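
For reference, this configuration can be expressed as a `transformers` `GenerationConfig`; `max_new_tokens` here is an illustrative assumption, not a value fixed by the paper.

```python
from transformers import GenerationConfig

# Unified decoding configuration for all benchmarks (near-greedy sampling).
eval_generation_config = GenerationConfig(
    do_sample=True,
    top_p=0.001,
    temperature=0.01,
    max_new_tokens=1024,  # illustrative assumption; the eval scripts set their own limit
)
```

Pass it to generation via `model.generate(..., generation_config=eval_generation_config)`.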

### Evaluation Procedure

1. Download the preprocessed evaluation JSONs from [🤗 Video-R1-eval](https://huggingface.co/datasets/Video-R1/Video-R1-eval).

2. Download the video data from the official site of each benchmark and organize it as specified in the JSON files.

3. Run the evaluation across all benchmarks:

```bash
bash ./src/eval_bench.sh
```

## Quick Inference Code

Below is a minimal inference sketch using the standard Qwen2.5-VL interface from `transformers` and `qwen-vl-utils`, on which VideoRFT is built. The model and video paths are placeholders, and decoding follows the unified configuration above.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "QiWang98/VideoRFT"  # or "QiWang98/VideoRFT-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./examples/video1.mp4",
                "max_pixels": 256 * 28 * 28,  # inference-time resolution cap
                "nframes": 32,                # number of sampled frames
            },
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

# Build the chat prompt and decode the video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Unified decoding configuration (see above).
with torch.inference_mode():
    output_ids = model.generate(
        **inputs, do_sample=True, top_p=0.001, temperature=0.01, max_new_tokens=1024
    )

# Strip the prompt tokens before decoding the answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

## 🙏 Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), and [R1-V](https://github.com/Deep-Agent/R1-V).

## 📚 Citations

If you find this work helpful, please consider citing:

```bibtex
@article{VideoRFT,
  title={VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning},
  author={Wang, Qi and Yu, Yanrui and Yuan, Ye and Mao, Rui and Zhou, Tianfei},
  journal={arXiv preprint arXiv:2505.12434},
  year={2025}
}
```