Improve model card and add metadata

Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for LongVideo-R1-Qwen3. I've added YAML metadata including the `pipeline_tag` and `library_name` to help users discover the model, and structured the README to include information about the paper, the GitHub repository, and usage instructions derived from the official implementation.
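
For anyone reviewing the metadata locally, the front matter can be printed on its own with a small shell helper. This is only a sketch: `front_matter` is a hypothetical name made up for illustration, and it assumes the card is saved as `README.md` in the current directory and opens with a `---`-delimited YAML block.

```bash
# Hypothetical helper (not part of the repository): print the YAML front
# matter of a model card, i.e. the lines between the opening and closing `---`.
front_matter() {
  awk '
    NR == 1 && $0 !~ /^---[[:space:]]*$/ { exit }     # no front matter: print nothing
    /^---[[:space:]]*$/ { if (++n == 2) exit; next }  # skip delimiters, stop at the second
    { print }
  ' "$1"
}

# Assumes README.md is in the current directory; skipped if it is absent.
if [ -f README.md ]; then
  front_matter README.md
fi
```

For the card in this PR, the helper should print the `pipeline_tag`, `library_name`, and `tags` entries and nothing from the Markdown body.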

Files changed (1)

README.md +71 -1

@@ -1 +1,71 @@
-https://arxiv.org/abs/2602.20913

+---
+pipeline_tag: video-text-to-text
+library_name: transformers
+tags:
+- video-understanding
+- long-video
+- reasoning
+- r1
+- multimodal
+---
+# LongVideo-R1-Qwen3
+This repository contains the weights for **LongVideo-R1-Qwen3**, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient long video understanding.
+This model was introduced in the paper [LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding](https://arxiv.org/abs/2602.20913), accepted at CVPR 2026.
+## Model Description
+LongVideo-R1 addresses the challenge of understanding long videos under a low computational budget. Instead of exhaustively searching all frames, the agent uses a reasoning module to navigate the video context, leveraging high-level visual cues to infer which clips are most informative.
+- **Backbone**: Fine-tuned from **Qwen3-8B**.
+- **Training Paradigm**: Two stages: Supervised Fine-Tuning (SFT) on 33K high-quality chain-of-thought-with-tool trajectories, followed by Reinforcement Learning (RL).
+- **Inference Strategy**: The agent starts from top-level visual summaries and iteratively narrows its focus, halting once it has gathered enough information to answer the query.
+## Links
+- **Paper**: [arXiv:2602.20913](https://arxiv.org/abs/2602.20913)
+- **Code**: [GitHub - qiujihao19/LongVideo-R1](https://github.com/qiujihao19/LongVideo-R1)
+- **Data**: [LongVideo-R1-Data](https://huggingface.co/datasets/ChurchillQAQ/LongVideo-R1-Data)
+## Usage
+LongVideo-R1 can be deployed using `vLLM` for online testing, supporting tool use and multi-round reasoning.
+### 1. Deploy the reasoning model
+```bash
+# Serve the reasoning model through vLLM's OpenAI-compatible API
+MODEL_PATH="ChurchillQAQ/LongVideo-R1-Qwen3"
+PORT=25600
+vllm serve "$MODEL_PATH" \
+  --tensor-parallel-size 1 \
+  --max-model-len 32768 \
+  --gpu-memory-utilization 0.85 \
+  --host 127.0.0.1 \
+  --port "$PORT" \
+  --served-model-name longvideor1
+```
+### 2. Run inference (CLI demo)
+Once the model is served (alongside the required caption and video-QA models as described in the [GitHub README](https://github.com/qiujihao19/LongVideo-R1)), you can use `cli.py`:
+```bash
+python cli.py \
+  --video_path /path/to/video.mp4 \
+  --question "What is the man doing in this video?" \
+  --reasoning_base_url http://127.0.0.1:25600/v1 \
+  --caption_base_url http://127.0.0.1:9081/v1 \
+  --videoqa_base_url http://127.0.0.1:9081/v1
+```
+## Citation
+```bibtex
+@article{qiu2026longvideo,
+  title={LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding},
+  author={Qiu, Jihao and Xie, Lingxi and Huo, Xinyue and Tian, Qi and Ye, Qixiang},
+  journal={arXiv preprint arXiv:2602.20913},
+  year={2026}
+}
+```