Add model card for SPARROW
#1
by nielsr HF Staff - opened
README.md
ADDED
@@ -0,0 +1,39 @@
+---
+pipeline_tag: video-text-to-text
+---
+
+# SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
+
+**SPARROW** is a pixel-grounded video Multimodal Large Language Model (MLLM) designed to achieve high spatial precision and temporal referential consistency. It addresses common challenges in video grounding, such as spatial drift and identity switching, by introducing two key innovations:
+1. **Target-Specific Tracked Features (TSF):** Injects temporally aligned referent cues during training to maintain stable object grounding.
+2. **Dual-Prompt Design:** Decodes both box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding.
+
+- **Paper:** [SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs](https://huggingface.co/papers/2603.12382)
+- **Project Page:** [https://risys-lab.github.io/SPARROW](https://risys-lab.github.io/SPARROW)
+- **Code:** [GitHub - RISys-Lab/SPARROW](https://github.com/RISys-Lab/SPARROW)
+
+## Quick Run
+
+After setting up the environment and downloading the checkpoints as described in the [official repository](https://github.com/RISys-Lab/SPARROW), you can run inference using the following command:
+
+```bash
+python chat.py \
+--llava_version_or_path checkpoints/sparrow-finetune \
+--input_path /path/to/input.mp4 \
+--prompt_text "Please segment the horse jumping." \
+--vis_save_path vis_output/chat_output \
+--proposal_debug_modes both
+```
+
+## Citation
+
+If you find SPARROW useful in your research, please consider citing the paper:
+
+```bibtex
+@inproceedings{alansari2026sparrow,
+title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
+author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
+booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+year={2026}
+}
+```