RISys-Lab/sparrow-det-pretrain

Add model card for SPARROW

#1

by nielsr HF Staff - opened Mar 16

base: refs/heads/main ← from: refs/pr/1

README.md +53 -0

	@@ -0,0 +1,53 @@

+---
+pipeline_tag: video-text-to-text
+tags:
+- video-grounding
+- pixel-grounding
+- mllm
+- video-understanding
+---
+# SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
+SPARROW is a pixel-grounded video Multimodal Large Language Model (MLLM) that unifies spatial accuracy and temporal stability. It addresses challenges like spatial drift and identity switches in video object segmentation by introducing Target-Specific Tracked Features (TSF) and a dual-prompt design that decodes both box ([BOX]) and segmentation ([SEG]) tokens.
+- **Paper:** [SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs](https://huggingface.co/papers/2603.12382)
+- **Project Page:** [https://risys-lab.github.io/SPARROW](https://risys-lab.github.io/SPARROW)
+- **Repository:** [https://github.com/RISys-Lab/SPARROW](https://github.com/RISys-Lab/SPARROW)
+## Introduction
+SPARROW learns spatial precision and temporal referential consistency in pixel-grounded video MLLMs. A dual-prompt initialization strategy improves segmentation precision and stability in early frames, and consistent object grounding over time mitigates drift.
+## Quick Run
+After setting up the environment and downloading the checkpoints as described in the [official repository](https://github.com/RISys-Lab/SPARROW), you can run inference on a video using the following command:
+```bash
+python chat.py \
+  --llava_version_or_path checkpoints/sparrow-finetune \
+  --input_path /path/to/input.mp4 \
+  --prompt_text "Please segment the horse jumping." \
+  --vis_save_path vis_output/chat_output \
+  --proposal_debug_modes both
+```
+Arguments:
+- `--llava_version_or_path`: Path to the SPARROW checkpoint.
+- `--input_path`: Path to the input image or video.
+- `--prompt_text`: Text prompt describing which object or region to segment.
+- `--vis_save_path`: Directory where visualization outputs will be saved.
+- `--proposal_debug_modes`: Debug visualization mode (`both`, `proposal`, or `none`).
+## Citation
+If you find SPARROW useful in your research, please consider citing:
+```bibtex
+@inproceedings{alansari2026sparrow,
+  title={SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs},
+  author={Alansari, Mohamad and Suryanto, Naufal and Velayudhan, Divya and Javed, Sajid and Werghi, Naoufel and Naseer, Muzammal},
+  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+  year={2026}
+}
+```
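
The Quick Run command in the README above handles one input at a time. A minimal sketch of how it might be wrapped to cover a directory of videos is shown below. It reuses only the flags and the checkpoint path listed in the README. The function names `sparrow_run_one` and `sparrow_run_dir`, the `DRY_RUN` switch, and the per-video output layout are assumptions made for illustration and are not part of the SPARROW repository.

```bash
# Sketch only: wraps the README's chat.py invocation for batch use.
# sparrow_run_one, sparrow_run_dir and DRY_RUN are hypothetical names.

# Print (default) or execute the Quick Run command for a single video.
# The function body is a subshell, so its variables stay local.
sparrow_run_one() (
  video=$1; prompt=$2; out_root=$3
  name=$(basename "$video" .mp4)        # clips/horse.mp4 -> horse
  # Hold the argument vector in the positional parameters.
  set -- python chat.py \
    --llava_version_or_path checkpoints/sparrow-finetune \
    --input_path "$video" \
    --prompt_text "$prompt" \
    --vis_save_path "$out_root/$name" \
    --proposal_debug_modes both
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Display only: arguments are joined with spaces, not re-quoted.
    printf '%s\n' "$*"
  else
    "$@"
  fi
)

# Apply sparrow_run_one to every .mp4 file in a directory.
sparrow_run_dir() (
  input_dir=$1; prompt=$2; out_root=$3
  for video in "$input_dir"/*.mp4; do
    [ -e "$video" ] || continue         # glob matched nothing
    sparrow_run_one "$video" "$prompt" "$out_root"
  done
)
```

With the defaults, `sparrow_run_dir clips "Please segment the horse jumping." vis_output/batch` only prints the commands it would run; setting `DRY_RUN=0` executes them from the repository root. Each video starts a separate Python process, so the checkpoint is loaded once per video, which is simple but likely slow for large batches.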