| --- |
| pipeline_tag: video-text-to-text |
| library_name: transformers |
| tags: |
| - video-understanding |
| - long-video |
| - reasoning |
| - r1 |
| - multimodal |
| --- |
| |
| # LongVideo-R1-Qwen3 |
|
|
| This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding. |
|
|
| This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026. |
|
|
| ## Model Description |
|
|
| LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of an exhaustive search across all frames, the agent uses a reasoning module to navigate video context, leveraging high-level visual cues to infer the most informative video clips. |
|
|
| - **Backbone**: Fine-tuned from **Qwen-3-8B**. |
| - **Training Paradigm**: Two-stage approach involving Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories followed by Reinforcement Learning (RL). |
| - **Architecture**: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query. |
|
|
| ## Links |
| - **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913) |
| - **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1) |
| - **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data) |
|
|
| ## Usage |
|
|
| LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning. |
|
|
| ### 1. Deploy the reasoning model |
| ```bash |
| # Deploy the reasoning model |
| MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3" |
| PORT=25600 |
| |
| vllm serve $MODEL_PATH \ |
| --tensor-parallel-size 1 \ |
| --max-model-len 32768 \ |
| --gpu-memory-utilization 0.85 \ |
| --host 127.0.0.1 \ |
| --port $PORT \ |
| --served-model-name longvideor1 |
| ``` |
|
|
| ### 2. Run Inference (CLI Demo) |
| Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`: |
|
|
| ```bash |
| python cli.py \ |
| --video_path /path/to/video.mp4 \ |
| --question "What is the man doing in this video?" \ |
| --reasoning_base_url http://127.0.0.1:25600/v1 \ |
| --caption_base_url http://127.0.0.1:9081/v1 \ |
| --videoqa_base_url http://127.0.0.1:9081/v1 |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{qiu2026longvideo, |
| title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding}, |
| author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang}, |
| journal={arXiv preprint arXiv:2602.20913}, |
| year={2026} |
| } |
| ``` |