Add model card for SPARROW

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +53 -0
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ pipeline_tag: video-text-to-text
+ tags:
+ - video-grounding
+ - pixel-grounding
+ - mllm
+ - video-understanding
+ ---
+
+ # SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
+
+ SPARROW is a pixel-grounded video Multimodal Large Language Model (MLLM) that unifies spatial accuracy and temporal stability. It addresses challenges like spatial drift and identity switches in video object segmentation by introducing Target-Specific Tracked Features (TSF) and a dual-prompt design that decodes both box ([BOX]) and segmentation ([SEG]) tokens.
+
+ - **Paper:** [SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs](https://huggingface.co/papers/2603.12382)
+ - **Project Page:** [https://risys-lab.github.io/SPARROW](https://risys-lab.github.io/SPARROW)
+ - **Repository:** [https://github.com/RISys-Lab/SPARROW](https://github.com/RISys-Lab/SPARROW)
+
+ ## Introduction
+
+ SPARROW is designed to improve both spatial precision and temporal referential consistency in pixel-grounded video MLLMs. Its dual-prompt initialization strategy improves segmentation precision and stability during early frames, and it mitigates drift by keeping object grounding consistent over time.
+
+ ## Quick Run
+
+ After setting up the environment and downloading the checkpoints as described in the [official repository](https://github.com/RISys-Lab/SPARROW), you can run inference on a video using the following command:
+
+ ```bash
+ python chat.py \
+   --llava_version_or_path checkpoints/sparrow-finetune \
+   --input_path /path/to/input.mp4 \
+   --prompt_text "Please segment the horse jumping." \
+   --vis_save_path vis_output/chat_output \
+   --proposal_debug_modes both
+ ```
+
+ Arguments:
+ - `--llava_version_or_path`: Path to the SPARROW checkpoint.
+ - `--input_path`: Path to the input image or video.
+ - `--prompt_text`: Text prompt describing the object or region to segment.
+ - `--vis_save_path`: Directory where visualization outputs are saved.
+ - `--proposal_debug_modes`: Debug visualization mode (`both`, `proposal`, or `none`).
+
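For running the same prompt over several videos, the documented flags can be assembled programmatically. The following is a minimal sketch, not part of the SPARROW repository: `build_chat_command` and `build_batch` are hypothetical helper names, the defaults are copied from the example command above, and writing each video's output to its own subdirectory is an illustrative choice. It assumes the script is run from the directory that contains `chat.py`.

```python
import shlex
from pathlib import Path


def build_chat_command(
    input_path,
    prompt_text,
    checkpoint="checkpoints/sparrow-finetune",
    vis_save_path="vis_output/chat_output",
    proposal_debug_modes="both",
):
    """Return the argv list for one chat.py run (hypothetical helper).

    Only the flags listed in the model card are used.
    """
    return [
        "python", "chat.py",
        "--llava_version_or_path", str(checkpoint),
        "--input_path", str(input_path),
        "--prompt_text", prompt_text,
        "--vis_save_path", str(vis_save_path),
        "--proposal_debug_modes", proposal_debug_modes,
    ]


def build_batch(videos, prompt_text, out_root="vis_output"):
    """Build one command per video, each with its own output directory."""
    return [
        build_chat_command(
            video,
            prompt_text,
            # Assumption: one subdirectory per video, named after the file stem.
            vis_save_path=Path(out_root) / Path(video).stem,
        )
        for video in videos
    ]


if __name__ == "__main__":
    videos = ["/path/to/input.mp4"]
    for cmd in build_batch(videos, "Please segment the horse jumping."):
        # Print a copy-pasteable shell command; to execute it instead,
        # pass the list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids quoting problems when the prompt contains spaces or punctuation.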
+ ## Citation
+
+ If you find SPARROW useful in your research, please consider citing:
+
+ ```bibtex
+ @inproceedings{alansari2026sparrow,
+   title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
+   author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
+   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+   year={2026}
+ }
+ ```