Update model card with detailed information and pipeline tag

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +163 -41
README.md CHANGED
@@ -1,71 +1,193 @@
  ---
- license: apache-2.0
  library_name: transformers
- pipeline_tag: video-text-to-text
  ---

  # 🎥 VideoMolmo: Spatio-Temporal Grounding meets Pointing

  This repository contains the model for spatio-temporal grounding, as described in [VideoMolmo: Spatio-Temporal Grounding Meets Pointing](https://huggingface.co/papers/2506.05336).

- [**WebPage**] -https://mbzuai-oryx.github.io/VideoMolmo/

- [**Paper**] –https://arxiv.org/abs/2506.05336

- [**Dataset & Benchmark (VPoS-Bench)**] –https://huggingface.co/datasets/ghazishazan/VPoS/tree/main

- Code: https://github.com/mbzuai-oryx/VideoMolmo

  ---

- ## 🧠 Overview

- **VideoMolmo** is a **large multimodal model (LMM)** designed for **fine-grained spatio-temporal localization** in videos, conditioned on natural language descriptions. Built on top of the [Molmo](https://huggingface.co/allenai/Molmo-7B-D-0924) architecture, VideoMolmo introduces **temporal attention** and a **temporal mask fusion pipeline**, enabling precise and consistent object pointing across video frames.

  ---

- ## 🔍 Key Features

- - 🧭 **Spatio-Temporal Pointing**: Localize target points across video frames with natural-language queries.
- - 🧠 **Temporal Attention Module**: Ensures temporal consistency by conditioning each frame on its preceding context.
- - 🔄 **Temporal Mask Fusion with SAM2**: Leverages bidirectional point propagation and local video structure for coherent mask generation.
- - 🧪 **Two-Stage Decomposition**:
- - Stage 1: LLM generates fine-grained pointing coordinates.
- - Stage 2: Sequential mask fusion module constructs temporally consistent segmentations.
- - 🌍 **VPoS-Bench Benchmark**: A diverse, out-of-distribution evaluation suite spanning 5 real-world domains.

  ---

- ## 📦 Model Architecture

- VideoMolmo consists of:
- - A **temporal attention module** to encode video dynamics.
- - An **LLM decoder** to generate pointing instructions.
- - A **temporal mask fusion pipeline**, built on **SAM2**, to convert predicted points into coherent masks.

  ---

- ## 🗃️ Dataset

- We introduce a new dataset containing:
- - **72,000** video-caption pairs.
- - **100,000+** annotated object points.
- - Diverse scenarios across domains such as cell tracking, autonomous driving, egocentric vision and robotics.

  ---

- ## 🧪 Evaluation Benchmarks
-
- We evaluate VideoMolmo on:
- - **VPoS-Bench**: Out-of-distribution test suite covering:
- - 🧬 Cell Tracking
- - 🧍 Egocentric Vision
- - 🚗 Autonomous Driving
- - 🖱️ Video-GUI Interaction
- - 🤖 Robotics
- - **Refer-YouTube-VOS**
- - **Reasoning-VOS**
- - **MeViS**
- - **Refer-DAVIS17**
-
- ---
 
  ---
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-segmentation
  ---

  # 🎥 VideoMolmo: Spatio-Temporal Grounding meets Pointing

  This repository contains the model for spatio-temporal grounding, as described in [VideoMolmo: Spatio-Temporal Grounding Meets Pointing](https://huggingface.co/papers/2506.05336).

+ <div align="left" style="margin:24px 0;">
+   <img src="https://user-images.githubusercontent.com/74038190/212284115-f47cd8ff-2ffb-4b04-b5bf-4d1c14c0247f.gif"
+        width="100%" height="4"/>
+ </div>
+
+ <p align="center">
+   <a href="https://mbzuai-oryx.github.io/VideoMolmo/"><img src="https://img.shields.io/badge/Project-Website-87CEEB?style=flat-square" alt="Website"></a>
+   <a href="https://arxiv.org/abs/2506.05336"><img src="https://img.shields.io/badge/arXiv-Paper-brightgreen?style=flat-square" alt="arXiv"></a>
+   <a href="https://huggingface.co/datasets/ghazishazan/VPoS"><img src="https://img.shields.io/badge/🤗_Dataset-Access-green" alt="dataset"></a>
+   <a href="https://huggingface.co/ghazishazan/VideoMolmo"><img src="https://img.shields.io/badge/HuggingFace-Model-F9D371" alt="model"></a>
+   <a href="https://colab.research.google.com/drive/1gqg5kBP9MYkdarEry7QS5rJFQYOG7DiF?usp=sharing"><img src="https://img.shields.io/badge/Run-Colab-orange?style=flat-square&logo=google-colab" alt="Colab"></a>
+ </p>
+
+ <p align="center">
+   <a href="https://github.com/khufia"><b>Ghazi Shazan Ahmad</b></a><sup>*</sup>,
+   <a href="https://scholar.google.com/citations?user=JcWO9OUAAAAJ&hl=en"><b>Ahmed Heakl</b></a><sup>*</sup>,
+   <a href="https://hananshafi.github.io/"><b>Hanan Gani</b></a>,
+   <a href="https://amshaker.github.io/"><b>Abdelrahman Shaker</b></a>,
+   <a href="https://zhiqiangshen.com/"><b>Zhiqiang Shen</b></a>,<br>
+   <a href="https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en"><b>Fahad Shahbaz Khan</b></a>,
+   <a href="https://salman-h-khan.github.io/"><b>Salman Khan</b></a>
+ </p>
+
+ <p align="center">
+   <b>MBZUAI</b> · <b>Linköping University</b> · <b>ANU</b>
+ </p>
+
+ <p align="center"><sup>*</sup>Equal Technical Contributions</p>
+
+ ---
+
+ ## 🆕 Latest Updates

+ - 📒 **June 2025**: [Colab Notebook](https://colab.research.google.com/drive/1gqg5kBP9MYkdarEry7QS5rJFQYOG7DiF?usp=sharing) with our bidirectional inference method is released!
+ - 📒 **May 2025**: Paper and inference code are released!

+
+ ## 📊 Overview
+
+ <p align="center">
+   <img src="assets/videomolmo_teaser.png" width="70%" alt="VideoMolmo Architectural Overview">
+ </p>
+
+ **VideoMolmo** is a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module that uses an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. In addition, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates and then relying on a sequential mask-fusion module to produce coherent segmentations, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability.
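+
+ The sketch below illustrates this two-step decomposition; the helper names (`predict_points`, `propagate_masks`) and the simple union-style fusion are illustrative placeholders rather than the released API (see `infer.py` in the code repository for the actual pipeline).
+
+ ```python
+ # Illustrative sketch of the two-step decomposition (not the official API):
+ # stage 1 predicts per-frame points, stage 2 propagates each point with a
+ # SAM2-style video predictor and fuses the resulting masks per frame.
+ from typing import List, Tuple
+ import numpy as np
+
+ def predict_points(frames: List[np.ndarray], prompt: str) -> List[Tuple[float, float]]:
+     """Stage 1 (placeholder): the LMM returns one (x, y) point per frame,
+     conditioned on the prompt and on preceding frames via temporal attention."""
+     h, w = frames[0].shape[:2]
+     return [(w / 2, h / 2) for _ in frames]  # dummy points
+
+ def propagate_masks(frames: List[np.ndarray], frame_idx: int, point: Tuple[float, float]) -> List[np.ndarray]:
+     """Stage 2 (placeholder): a SAM2-style predictor turns one point prompt
+     into masks for every frame by propagating forward and backward in time."""
+     return [np.zeros(f.shape[:2], dtype=bool) for f in frames]  # dummy masks
+
+ def segment_video(frames: List[np.ndarray], prompt: str):
+     points = predict_points(frames, prompt)                                     # stage 1
+     per_point = [propagate_masks(frames, i, p) for i, p in enumerate(points)]   # stage 2
+     # Fuse the propagated masks frame-by-frame (here: a simple union) into one
+     # temporally consistent mask per frame.
+     fused = [np.any([m[t] for m in per_point], axis=0) for t in range(len(frames))]
+     return points, fused
+
+ frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
+ points, masks = segment_video(frames, "point to the person in red shirt")
+ print(len(points), masks[0].shape)
+ ```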

  ---
+ ## 🏆 Highlights
+ Key contributions of **VideoMolmo**:
+ 1. We introduce VideoMolmo, an LMM that accepts natural-language queries and produces point-level predictions for target objects across entire video sequences, ensuring temporal consistency.
+ 2. We further introduce a temporal module that leverages past temporal context, and propose a novel temporal mask fusion pipeline for enhanced temporal coherence.
+ 3. To achieve fine-grained spatio-temporal pointing, we introduce a comprehensive dataset of 72k video-caption pairs and 100k object points.
+ 4. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also assess our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks.

  ---
+ ## 🧠 Architecture
+
+ <p align="center">
+   <img src="assets/videomolmo_main_diagram.jpg" alt="VideoMolmo Architecture">
+ </p>
+
+ **VideoMolmo** consists of four end-to-end trainable components: (1) a visual encoder, (2) a temporal module, (3) a visual projector, and (4) a decoder-only large language model (LLM); plus a post-processing module.
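+
+ As a rough mental model, the sketch below wires these components together for a single frame; the dimensions, the use of `nn.MultiheadAttention` for the temporal module, and the linear projector are assumptions for illustration, not the released implementation.
+
+ ```python
+ # Schematic wiring: visual-encoder tokens -> temporal module (cross-attention
+ # over preceding frames) -> projector -> LLM input space. Sizes are illustrative.
+ import torch
+ import torch.nn as nn
+
+ class TemporalModule(nn.Module):
+     """Conditions the current frame's visual tokens on tokens from preceding frames."""
+     def __init__(self, dim: int = 1024, heads: int = 8):
+         super().__init__()
+         self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
+         self.norm = nn.LayerNorm(dim)
+
+     def forward(self, curr_tokens: torch.Tensor, past_tokens: torch.Tensor) -> torch.Tensor:
+         attended, _ = self.attn(curr_tokens, past_tokens, past_tokens)  # query = current frame
+         return self.norm(curr_tokens + attended)                        # residual connection
+
+ dim = 1024
+ curr = torch.randn(1, 576, dim)       # visual-encoder tokens for the current frame
+ past = torch.randn(1, 2 * 576, dim)   # tokens from two preceding frames
+ temporal = TemporalModule(dim)
+ projector = nn.Linear(dim, 4096)      # maps visual tokens into the LLM embedding space
+ llm_inputs = projector(temporal(curr, past))
+ print(llm_inputs.shape)               # torch.Size([1, 576, 4096])
+ ```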

  ---
+ ## 🏗️ Benchmark and Annotation Pipeline
+
+ <p align="center">
+   <img src="assets/videomolmo_annotation_pipeline.png" alt="Annotation Pipeline">
+ </p>
+
+ We propose a semi-automatic annotation pipeline for creating a grounded conversation generation (GCG) dataset for videos.

  ---

+ ## 📈 Results
+
+ > |1| **VideoMolmo** demonstrates robust generalization and fine-grained spatio-temporal grounding across diverse out-of-distribution scenarios from our proposed benchmark, for instance, correctly pointing to traffic lights (2nd row) in challenging driving scenes despite never encountering such scenarios during training.
+ <p align="center">
+   <img src="assets/benchmark_diagram.png" width="70%" alt="Benchmark results">
+ </p>
+
+ > |2| Quantitative results showing VideoMolmo with an average improvement of 4.1% over the SoTA (VideoGLaMM) and 4.8% over our baseline (Molmo+SAM2).
+ <p align="center">
+   <img src="assets/videomolmo_quantitative_results.png" alt="Benchmark results">
+ </p>
+
+ ---

+ ## 🔧 Running VideoMolmo
+
+ ### Environment setup
+
+ (1) Setup environment and PyTorch
+ ```bash
+ git clone https://github.com/mbzuai-oryx/VideoMolmo
+ cd VideoMolmo/VideoMolmo
+ conda create -n .videomolmo python=3.10 -y
+ conda activate .videomolmo
+ pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
+ ```
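+
+ A quick optional sanity check that the install picked up a CUDA-enabled PyTorch build:
+
+ ```python
+ # Verify the PyTorch install and CUDA visibility before moving on.
+ import torch
+ print(torch.__version__)          # expected: 2.5.1+cu121
+ print(torch.cuda.is_available())  # True if a compatible GPU and driver are present
+ ```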
+
+ (2) Setup Molmo
+ ```bash
+ git clone https://github.com/allenai/molmo.git
+ cd molmo && pip install -e .[all] && cd ..  # install molmo requirements
+ pip install -r requirements.txt
+ ```
+
+ (3) Setup SAM2
+ ```bash
+ python setup.py build_ext --inplace  # build sam2
+ mkdir -p sam2_checkpoints
+ wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt -O sam2_checkpoints/sam2.1_hiera_large.pt
+ ```
+
+ ### 🔄 Inference
+
+ To run inference on the provided sample video:
+
+ ```bash
+ python infer.py \
+   --video_path ../examples/video_sample1 \
+   --prompt "point to the person in red shirt" \
+   --save_path "results"
+ ```
+
+ Your video should be a folder containing all of its frames. Sample structure:
+ ```
+ video_sample1/
+ ├── frame_0001.jpg
+ ├── frame_0002.jpg
+ ├── frame_0003.jpg
+ └── ...
+ ```
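+
+ If you start from a single video file instead of a frame folder, one way to produce this layout (not part of the repository) is with ffmpeg; the file names below are illustrative:
+
+ ```bash
+ # Extract numbered JPEG frames from an .mp4 into the expected folder layout.
+ mkdir -p video_sample1
+ ffmpeg -i input.mp4 -qscale:v 2 video_sample1/frame_%04d.jpg
+ ```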
+
+ The output includes segmentation masks for each frame and a JSON Lines file (`points.jsonl`) containing the predicted point coordinates.
+ ```
+ results/
+ ├── video_sample1/
+ │   ├── frame_0001.jpg
+ │   ├── frame_0002.jpg
+ │   ├── frame_0003.jpg
+ │   ├── points.jsonl
+ │   └── ...
+ └── ...
+ ```
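+
+ A minimal way to inspect the predictions, assuming `points.jsonl` holds one JSON object per line (the exact keys depend on the release):
+
+ ```python
+ # Parse points.jsonl line by line and print each record.
+ import json
+
+ with open("results/video_sample1/points.jsonl") as f:
+     for line in f:
+         print(json.loads(line))
+ ```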
+ ### Training and Evaluation 🚀
+
+ To be released soon! Stay tuned for updates.
+
+ ## Todos
+
+ - [ ] Release training and evaluation scripts.
+ - [ ] Add support for additional datasets.
+ - [ ] Release dataset creation pipeline.
+
+ ## Citation 📜
+
+ ```bibtex
+ @misc{ahmad2025videomolmospatiotemporalgroundingmeets,
+   title={VideoMolmo: Spatio-Temporal Grounding Meets Pointing},
+   author={Ghazi Shazan Ahmad and Ahmed Heakl and Hanan Gani and Abdelrahman Shaker and Zhiqiang Shen and Ranjay Krishna and Fahad Shahbaz Khan and Salman Khan},
+   year={2025},
+   eprint={2506.05336},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2506.05336},
+ }
+ ```

  ---

+ [<img src="assets/MBZUAI_logo.png" width="360" height="90">](https://mbzuai.ac.ae)