ChurchillQAQ
/

LongVideo-R1-Qwen3

Model card Files Files and versions

LongVideo-R1-Qwen3 / README.md

nielsr's picture

nielsr HF Staff

Improve model card and add metadata

8c1fa46 verified 3 months ago

|

2.7 kB

	---
	pipeline_tag: video-text-to-text
	library_name: transformers
	tags:
	- video-understanding
	- long-video
	- reasoning
	- r1
	- multimodal
	---

	# LongVideo-R1-Qwen3

	This repository contains the weights for LongVideo-R1-Qwen3, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.

	This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.

	## Model Description

	LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of an exhaustive search across all frames, the agent uses a reasoning module to navigate video context, leveraging high-level visual cues to infer the most informative video clips.

	- Backbone: Fine-tuned from Qwen-3-8B.
	- Training Paradigm: Two-stage approach involving Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories followed by Reinforcement Learning (RL).
	- Architecture: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query.

	## Links
	- Paper: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
	- Code: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
	- Data: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)

	## Usage

	LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.

	### 1. Deploy the reasoning model
	```bash
	# Deploy the reasoning model
	MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
	PORT=25600

	vllm serve $MODEL_PATH \
	--tensor-parallel-size 1 \
	--max-model-len 32768 \
	--gpu-memory-utilization 0.85 \
	--host 127.0.0.1 \
	--port $PORT \
	--served-model-name longvideor1
	```

	### 2. Run Inference (CLI Demo)
	Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:

	```bash
	python cli.py \
	--video_path /path/to/video.mp4 \
	--question "What is the man doing in this video?" \
	--reasoning_base_url http://127.0.0.1:25600/v1 \
	--caption_base_url http://127.0.0.1:9081/v1 \
	--videoqa_base_url http://127.0.0.1:9081/v1
	```

	## Citation

	```bibtex
	@article{qiu2026longvideo,
	title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
	author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
	journal={arXiv preprint arXiv:2602.20913},
	year={2026}
	}
	```