Improve model card and add metadata
#1
by nielsr HF Staff - opened
README.md
CHANGED
---
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-understanding
- long-video
- reasoning
- r1
- multimodal
---

# LongVideo-R1-Qwen3

This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.

This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.

## Model Description

LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of exhaustively searching across all frames, the agent uses a reasoning module to navigate the video context, leveraging high-level visual cues to infer the most informative video clips.

- **Backbone**: Fine-tuned from **Qwen3-8B**.
- **Training Paradigm**: A two-stage approach: Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories, followed by Reinforcement Learning (RL).
- **Architecture**: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query (see the sketch below).
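
To make the traversal concrete, here is a minimal, hypothetical sketch of that coarse-to-fine loop. The helper names (`summarize`, `try_answer`, `navigate`) and their stub logic are purely illustrative and are not the repository's actual API; in LongVideo-R1 these steps are tool calls answered by caption / video-QA models, and the reasoning LLM decides which clip to inspect next and when to stop.

```python
# Illustrative sketch only: stubs stand in for the caption / video-QA tools
# and for the reasoning LLM's clip selection. Not the repository's API.

def summarize(clip_id: int) -> str:
    """Stub for a high-level caption of one coarse clip."""
    return f"high-level summary of clip {clip_id}"

def try_answer(question: str, inspected: list[int]) -> str | None:
    """Stub for answering from the clips inspected so far (None = not enough evidence)."""
    return f"answer to {question!r} from clips {inspected}" if len(inspected) >= 2 else None

def navigate(question: str, num_clips: int = 8, max_rounds: int = 4) -> str:
    candidates = list(range(num_clips))
    inspected: list[int] = []
    for _ in range(max_rounds):
        # Reasoning step: inspect high-level summaries and pick the clip that
        # looks most informative for the question (here: trivially the first).
        summaries = {c: summarize(c) for c in candidates}
        best = min(summaries)  # placeholder for the LLM's choice
        inspected.append(best)
        candidates.remove(best)
        # Halt as soon as the inspected clips provide enough evidence.
        answer = try_answer(question, inspected)
        if answer is not None:
            return answer
    return "no confident answer within the inspection budget"

if __name__ == "__main__":
    print(navigate("What is the man doing in this video?"))
```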

## Links

- **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
- **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
- **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)

## Usage

LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.

### 1. Deploy the reasoning model
```bash
# Deploy the reasoning model
MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
PORT=25600

vllm serve $MODEL_PATH \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --host 127.0.0.1 \
    --port $PORT \
    --served-model-name longvideor1
```
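
After the server starts, you can sanity-check the endpoint through vLLM's OpenAI-compatible API. This is a minimal sketch, assuming the `openai` Python package is installed; it only verifies that the reasoning model responds, while the full tool-using, multi-round pipeline is driven by `cli.py` in step 2.

```python
# Quick connectivity check against the vLLM server started above.
# Assumes `pip install openai` and the server from step 1 is running.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:25600/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="longvideor1",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, are you ready to analyze a video?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```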

### 2. Run Inference (CLI Demo)

Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:

```bash
python cli.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing in this video?" \
    --reasoning_base_url http://127.0.0.1:25600/v1 \
    --caption_base_url http://127.0.0.1:9081/v1 \
    --videoqa_base_url http://127.0.0.1:9081/v1
```

## Citation

```bibtex
@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}
```