Improve model card with metadata and usage information
Hi, I'm Niels from the Hugging Face community science team. I've opened this PR to improve your model card by adding relevant metadata and linking the repository to your research paper and code. These changes will help users discover and use your model more effectively.
README.md
---
library_name: transformers
pipeline_tag: video-text-to-text
---

# LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

[📖 Paper](https://arxiv.org/abs/2602.20913) | [💻 Code](https://github.com/qiujihao19/LongVideo-R1)

LongVideo-R1 is an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video-context navigation. It addresses the challenge of understanding long videos on a low computational budget by avoiding exhaustive search: at its core, a reasoning module infers which video clips are most informative for answering a query.
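To make the idea concrete, here is a deliberately simplified toy sketch of active clip navigation (this is NOT the LongVideo-R1 implementation; the `is_relevant` and `answer_from` callbacks stand in for the model's reasoning and tool calls): the agent probes clips one at a time and stops as soon as it finds what it needs, rather than processing the whole video.

```python
def navigate(clips, is_relevant, answer_from, max_steps=5):
    """Toy stand-in for a reasoning-driven navigator: inspect clips
    until one looks relevant, then answer from it."""
    inspected = []
    for step, clip in enumerate(clips):
        if step >= max_steps:
            break
        inspected.append(clip)
        if is_relevant(clip):
            # Early exit: no need to look at the remaining clips.
            return answer_from(clip), inspected
    return None, inspected

# Toy example: 10 clips, the evidence lives in clip 3.
clips = list(range(10))
answer, seen = navigate(
    clips,
    is_relevant=lambda c: c == 3,
    answer_from=lambda c: f"found in clip {c}",
)
```

The point of the sketch is only the control flow: the cost scales with the number of clips actually inspected, not with the length of the video.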

The model was fine-tuned from Qwen-3-8B using a two-stage paradigm:

1. **Supervised Fine-Tuning (SFT)** on high-quality chain-of-thought-with-tool trajectories.
2. **Reinforcement Learning (RL)** with a specifically designed reward function that encourages selective, efficient clip navigation.
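The paper's exact reward is not reproduced here; as a purely illustrative sketch, a reward that trades off answer correctness against the fraction of clips inspected (weights `accuracy_weight` and `efficiency_weight` are hypothetical) could look like:

```python
def navigation_reward(correct: bool, clips_used: int, total_clips: int,
                      accuracy_weight: float = 1.0,
                      efficiency_weight: float = 0.5) -> float:
    """Illustrative reward (not the LongVideo-R1 formulation):
    bonus for a correct answer, penalty proportional to the
    share of the video that had to be inspected."""
    accuracy_term = accuracy_weight if correct else 0.0
    inspection_cost = efficiency_weight * (clips_used / total_clips)
    return accuracy_term - inspection_cost

# A correct answer found after viewing few clips scores higher
# than one that required scanning most of the video.
cheap = navigation_reward(correct=True, clips_used=4, total_clips=100)
expensive = navigation_reward(correct=True, clips_used=80, total_clips=100)
```

Under such a shaping, the policy is pushed toward answering correctly while touching as few clips as possible, which matches the "selective and efficient clip navigation" objective described above.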

## Usage

For online testing with tool use and multi-round reasoning, use the `cli.py` script provided in the official repository. Note that the reasoning model, caption model, and video-QA model should be deployed with vLLM in serve mode.

```bash
python cli.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing in this video?" \
    --reasoning_base_url http://127.0.0.1:25600/v1 \
    --caption_base_url http://127.0.0.1:9081/v1 \
    --videoqa_base_url http://127.0.0.1:9081/v1
```
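The three `*_base_url` endpoints above assume the corresponding models are already serving. A deployment sketch using vLLM's OpenAI-compatible server might look like the following (the model paths are placeholders; in the CLI example above, the caption and video-QA endpoints share one server on port 9081):

```shell
# Hypothetical deployment sketch: substitute the actual checkpoints.
# Reasoning model (the LongVideo-R1 checkpoint) on port 25600:
vllm serve /path/to/LongVideo-R1 --port 25600 &

# Caption / video-QA model on port 9081:
vllm serve /path/to/caption-and-videoqa-model --port 9081 &
```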

## Citation

If you find LongVideo-R1 useful for your research, please cite:

```bibtex
@article{qiu2026longvideo,
  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2602.20913},
  year={2026}
}
```