---
library_name: transformers
pipeline_tag: video-text-to-text
---
# LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1 is an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation. It addresses the challenge of understanding long videos with low computational budgets by avoiding exhaustive searches. At its core, it uses a reasoning module to infer which video clips are most informative for answering a query.
The model is fine-tuned from Qwen-3-8B using a two-stage training paradigm:
- Supervised Fine-Tuning (SFT): Using high-quality chain-of-thought-with-tool trajectories.
- Reinforcement Learning (RL): Employing a specifically designed reward function to maximize selective and efficient clip navigation.
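To make the RL objective concrete, here is a minimal sketch of an efficiency-aware navigation reward. This is a hypothetical illustration only (the paper's actual reward function is not reproduced here); the function name, weight, and scoring scheme are assumptions chosen to show the idea of trading answer correctness against the number of clips inspected.

```python
def navigation_reward(correct: bool, clips_viewed: int, total_clips: int,
                      efficiency_weight: float = 0.5) -> float:
    """Hypothetical reward: correctness bonus minus a penalty for clips viewed.

    This is an illustrative stand-in, not the reward used by LongVideo-R1.
    """
    accuracy_term = 1.0 if correct else 0.0
    # Fraction of the video the agent had to inspect (lower is better).
    cost_term = clips_viewed / total_clips
    return accuracy_term - efficiency_weight * cost_term

# A correct answer reached after viewing 2 of 100 clips scores higher
# than the same correct answer reached after viewing 80 of 100.
assert navigation_reward(True, 2, 100) > navigation_reward(True, 80, 100)
```

Any reward of this shape pushes the policy toward selective navigation: exhaustive viewing is penalized even when the final answer is correct.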
## Usage
For online testing with tool use and multi-round reasoning, use the `cli.py` script provided in the official repository. Note that the reasoning, caption, and `video_qa` models must each be deployed in vLLM serve mode.
```bash
python cli.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing in this video?" \
    --reasoning_base_url http://127.0.0.1:25600/v1 \
    --caption_base_url http://127.0.0.1:9081/v1 \
    --videoqa_base_url http://127.0.0.1:9081/v1
```
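The base URLs above assume OpenAI-compatible endpoints already running. A minimal deployment sketch with `vllm serve` is shown below; the model names are placeholders (substitute the actual reasoning and caption/video-QA checkpoints you use), and the ports simply match the CLI example.

```shell
# Hypothetical deployment sketch: serve the models as OpenAI-compatible
# endpoints with vLLM. Model names below are placeholders, not official
# checkpoint names; ports match the cli.py invocation above.
vllm serve <reasoning-model> --port 25600 &
vllm serve <caption-and-videoqa-model> --port 9081 &
```

Here the caption and video-QA roles share one endpoint (port 9081), as in the CLI example; they can also be served separately on distinct ports.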
## Citation
If you find LongVideo-R1 useful for your research, please cite:
```bibtex
@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}
```