Improve model card and add metadata
#1
by nielsr HF Staff - opened
README.md
CHANGED
---
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-understanding
- long-video
- reasoning
- r1
- multimodal
---

# LongVideo-R1-Qwen3

This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.

This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.

## Model Description

LongVideo-R1 addresses the challenge of understanding long videos under low computational budgets. Instead of exhaustively searching across all frames, the agent uses a reasoning module to navigate the video context, leveraging high-level visual cues to infer the most informative video clips.

- **Backbone**: Fine-tuned from **Qwen3-8B**.
- **Training Paradigm**: A two-stage approach: Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories, followed by Reinforcement Learning (RL).
- **Architecture**: The agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting once it has sufficient knowledge to answer the query (see the sketch below).
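
To make the traversal concrete, here is a minimal, hypothetical sketch of that coarse-to-fine loop. The helper names (`summarize`, `try_answer`, `navigate`) and their stub logic are purely illustrative and are not the repository's actual API; in LongVideo-R1 these steps are tool calls answered by caption / video-QA models, and the reasoning LLM decides which clip to inspect next and when to stop.

```python
# Illustrative sketch only: stubs stand in for the caption / video-QA tools
# and for the reasoning LLM's clip selection. Not the repository's API.

def summarize(clip_id: int) -> str:
    """Stub for a high-level caption of one coarse clip."""
    return f"high-level summary of clip {clip_id}"

def try_answer(question: str, inspected: list[int]) -> str | None:
    """Stub for answering from the clips inspected so far (None = not enough evidence)."""
    return f"answer to {question!r} from clips {inspected}" if len(inspected) >= 2 else None

def navigate(question: str, num_clips: int = 8, max_rounds: int = 4) -> str:
    candidates = list(range(num_clips))
    inspected: list[int] = []
    for _ in range(max_rounds):
        # Reasoning step: inspect high-level summaries and pick the clip that
        # looks most informative for the question (here: trivially the first).
        summaries = {c: summarize(c) for c in candidates}
        best = min(summaries)  # placeholder for the LLM's choice
        inspected.append(best)
        candidates.remove(best)
        # Halt as soon as the inspected clips provide enough evidence.
        answer = try_answer(question, inspected)
        if answer is not None:
            return answer
    return "no confident answer within the inspection budget"

if __name__ == "__main__":
    print(navigate("What is the man doing in this video?"))
```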

## Links

- **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
- **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
- **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)

## Usage

LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.

### 1. Deploy the reasoning model
```bash
# Deploy the reasoning model
MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
PORT=25600

vllm serve $MODEL_PATH \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --host 127.0.0.1 \
    --port $PORT \
    --served-model-name longvideor1
```
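
After the server starts, you can sanity-check the endpoint through vLLM's OpenAI-compatible API. This is a minimal sketch, assuming the `openai` Python package is installed; it only verifies that the reasoning model responds, while the full tool-using, multi-round pipeline is driven by `cli.py` in step 2.

```python
# Quick connectivity check against the vLLM server started above.
# Assumes `pip install openai` and the server from step 1 is running.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:25600/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="longvideor1",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, are you ready to analyze a video?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```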

### 2. Run Inference (CLI Demo)

Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:

```bash
python cli.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing in this video?" \
    --reasoning_base_url http://127.0.0.1:25600/v1 \
    --caption_base_url http://127.0.0.1:9081/v1 \
    --videoqa_base_url http://127.0.0.1:9081/v1
```

## Citation

```bibtex
@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}
```