---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- Video-R1/Video-R1-data
language:
- en
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# Video-R1: Reinforcing Video Reasoning in MLLMs

This repository contains `Video-R1/Qwen2.5-VL-7B-COT-SFT`, the SFT (Supervised Fine-Tuning) cold-start model trained on the Video-R1-CoT-165k dataset. This intermediate checkpoint serves as the base model for subsequent RL (Reinforcement Learning) training on the Video-R1-260k dataset, which produces the final Video-R1 models.

For more details, please refer to the paper: [Video-R1: Reinforcing Video Reasoning in MLLMs](https://huggingface.co/papers/2503.21776).

The full code and additional resources are available in the [GitHub repository](https://github.com/tulerfeng/Video-R1).

## About Video-R1

Video-R1 represents the first systematic exploration of the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs), inspired by the success of DeepSeek-R1. The project addresses key challenges in video reasoning, particularly the lack of temporal modeling and the scarcity of high-quality video-reasoning data.

To tackle these issues, Video-R1 proposes the T-GRPO algorithm, an extension of GRPO that explicitly encourages models to leverage temporal information in videos for reasoning. It also strategically incorporates high-quality image-reasoning data into the training process. The model was trained on two newly constructed datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
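
To make this concrete, the sketch below illustrates the temporal-bonus logic at the heart of T-GRPO: responses are rolled out with temporally ordered frames and with shuffled frames, and correct answers earn an extra reward only when ordered frames outperform shuffled ones. The `sample_answers` and `is_correct` callables are hypothetical stand-ins for the real rollout and verification machinery, and `alpha` is an illustrative bonus weight, not the paper's exact value; see the paper and repository for the actual algorithm.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List

def t_grpo_advantages(
    frames: list,
    sample_answers: Callable[[list, int], List[str]],  # hypothetical rollout fn
    is_correct: Callable[[str], bool],                 # hypothetical verifier
    n_rollouts: int = 8,
    alpha: float = 0.3,                                # illustrative bonus weight
) -> List[float]:
    # Roll out once with temporally ordered frames, once with shuffled frames
    ordered = sample_answers(frames, n_rollouts)
    shuffled = sample_answers(random.sample(frames, len(frames)), n_rollouts)

    # Does temporal order actually help on this example?
    p_ordered = mean(float(is_correct(a)) for a in ordered)
    p_shuffled = mean(float(is_correct(a)) for a in shuffled)
    temporal_helps = p_ordered > p_shuffled

    # Base reward for correctness; the temporal bonus goes only to correct
    # answers, and only when ordered frames beat shuffled frames
    rewards = [
        float(is_correct(a)) + (alpha if temporal_helps and is_correct(a) else 0.0)
        for a in ordered
    ]

    # GRPO-style group-normalized advantages
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```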

Experimental results demonstrate that Video-R1 achieves significant improvements on various video reasoning benchmarks, including VideoMMMU, VSI-Bench, MVBench, and TempCompass. Notably, Video-R1-7B has shown competitive performance, even surpassing proprietary models like GPT-4o on certain video spatial reasoning tasks.

## Sample Usage

We provide a simple generation example for this SFT cold-start model, following the standard Qwen2.5-VL inference recipe: it uses the `transformers` library together with the `qwen-vl-utils` helper package (`pip install qwen-vl-utils`) for video preprocessing.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the SFT cold-start checkpoint and its processor
model_id = "Video-R1/Qwen2.5-VL-7B-COT-SFT"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-style request with one video and a text prompt.
# Replace the path with your actual video file.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./examples/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

# Render the chat template and extract the vision inputs from the messages
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
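
Because this checkpoint was cold-started on chain-of-thought data, prompts that explicitly request structured reasoning tend to elicit it. The exact SFT prompt template is provided in the GitHub repository; the snippet below merely illustrates a common `<think>`/`<answer>` convention and should be treated as an assumption, not the verified training template.

```python
# Hypothetical reasoning-style prompt; the exact SFT template ships with the
# Video-R1 GitHub repository, so treat this wording as illustrative only.
question = "How many times does the ball bounce before it is caught?"
prompt = (
    f"{question}\n"
    "Please reason step by step inside <think> </think> tags, "
    "then give your final answer inside <answer> </answer> tags."
)
# Use `prompt` as the "text" content in the `messages` above.
```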

## Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{feng2025video,
  title={Video-R1: Reinforcing Video Reasoning in MLLMs},
  author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2503.21776},
  year={2025}
}
```