---
library_name: transformers
license: apache-2.0
pipeline_tag: image-segmentation
---

# πŸŽ₯ VideoMolmo: Spatio-Temporal Grounding meets Pointing

This repository contains the model for spatio-temporal grounding, as described in [VideoMolmo: Spatio-Temporal Grounding Meets Pointing](https://huggingface.co/papers/2506.05336).

<div align="left" style="margin:24px 0;">
  <img src="https://user-images.githubusercontent.com/74038190/212284115-f47cd8ff-2ffb-4b04-b5bf-4d1c14c0247f.gif"
       width="100%" height="4"/>
</div>

<p align="center">
  <a href="https://mbzuai-oryx.github.io/VideoMolmo/"><img src="https://img.shields.io/badge/Project-Website-87CEEB?style=flat-square" alt="Website"></a>
  <a href="https://arxiv.org/abs/2506.05336"><img src="https://img.shields.io/badge/arXiv-Paper-brightgreen?style=flat-square" alt="arXiv"></a>
  <a href="https://huggingface.co/datasets/ghazishazan/VPoS"><img src="https://img.shields.io/badge/πŸ€—_Dataset-Access-green" alt="dataset"></a>
  <a href="https://huggingface.co/ghazishazan/VideoMolmo"><img src="https://img.shields.io/badge/HuggingFace-Model-F9D371" alt="model"></a>
  <a href="https://colab.research.google.com/drive/1gqg5kBP9MYkdarEry7QS5rJFQYOG7DiF?usp=sharing"><img src="https://img.shields.io/badge/Run-Colab-orange?style=flat-square&logo=google-colab" alt="Colab"></a>

</p>

<p align="center">
  <a href="https://github.com/khufia"><b>Ghazi Shazan Ahmad</b></a><sup>*</sup>, 
  <a href="https://scholar.google.com/citations?user=JcWO9OUAAAAJ&hl=en"><b>Ahmed Heakl</b></a><sup>*</sup>, 
  <a href="https://hananshafi.github.io/"><b>Hanan Gani</b></a>, 
  <a href="https://amshaker.github.io/"><b>Abdelrahman Shaker</b></a>,
   <a href="https://zhiqiangshen.com/"><b>Zhiqiang Shen</b></a>,<br>
  <a href="https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en"><b>Fahad Shahbaz Khan</b></a>, 
  <a href="https://salman-h-khan.github.io/"><b>Salman Khan</b></a>
</p>


<p align="center">
  <b>MBZUAI</b> Β· <b>LinkΓΆping University</b> Β· <b>ANU</b>
</p>

<p align="center"><sup>*</sup>Equal Technical Contributions</p>

---

## πŸ†• Latest Updates

- πŸ“’ **June 2025**: [Colab Notebook](https://colab.research.google.com/drive/1gqg5kBP9MYkdarEry7QS5rJFQYOG7DiF?usp=sharing) with our bidirectional inference method is released!
- πŸ“’ **May 2025**: Paper and inference code are released!



## πŸ“Š Overview

<p align="center">
  <img src="assets/videomolmo_teaser.png" width="70%" alt="VideoMolmo Architectural Overview">
</p>

**VideoMolmo** is a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module that uses an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability.

Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability.

---
## πŸ† Highlights
Key contributions of **VideoMolmo**:
1. We introduce VideoMolmo, an LMM that accepts natural-language queries and produces point-level predictions for target objects across entire video sequences, ensuring temporal consistency.

2. We further introduce a temporal module to leverage past temporal context and propose a novel temporal mask-fusion pipeline for enhanced temporal coherence.

3. To achieve fine-grained spatio-temporal pointing, we introduce a comprehensive dataset of 72k video-caption pairs and 100k object points.

4. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also assess our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks.

---
## 🧠 Architecture

<p align="center">
  <img src="assets/videomolmo_main_diagram.jpg" alt="VideoGLaMM Architecture">
</p>

**VideoMolmo** consists of four end-to-end trainable components: (1) a visual encoder, (2) a temporal module, (3) a visual projector, and (4) a decoder-only large language model (LLM), followed by a post-processing module.

---
## πŸ—οΈ Benchmark and Annotation Pipeline

<p align="center">
  <img src="assets/videomolmo_annotation_pipeline.png" alt="Annotation Pipeline">
</p>

We propose a semi-automatic annotation pipeline for creating a grounded conversation generation (GCG) dataset for videos.

---

## πŸ“ˆ Results


> |1| **VideoMolmo** demonstrates robust generalization and fine-grained spatio-temporal grounding across diverse out-of-distribution scenarios from our proposed benchmark, for instance, correctly pointing to traffic lights (2nd row) in challenging driving scenes despite never encountering such scenarios during training.
<p align="center">
  <img src="assets/benchmark_diagram.png" width="70%" alt="Benchmark results">
</p>


> |2| Quantitative results showing VideoMolmo achieving an average improvement of 4.1% over the SoTA (VideoGLaMM) and 4.8% over our baseline (Molmo+SAM2).
<p align="center">
  <img src="assets/videomolmo_quantitative_results.png" alt="Benchmark results">
</p>


---

## πŸ”§ Running VideoMolmo 

### Environment setup

(1) Setup environment and PyTorch
```bash
git clone https://github.com/mbzuai-oryx/VideoMolmo
cd VideoMolmo/VideoMolmo
conda create -n .videomolmo python=3.10 -y
conda activate .videomolmo
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
```
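
If the install succeeds, a quick optional sanity check (assuming a CUDA-capable GPU with drivers compatible with the cu121 wheels) is:
```bash
# Confirm the PyTorch version and that CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```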

(2) Setup Molmo
```bash
git clone https://github.com/allenai/molmo.git
cd molmo && pip install -e .[all] && cd .. # setup molmo requirements
pip install -r requirements.txt
```

(3) Setup SAM
```bash
python setup.py build_ext --inplace # build sam2
mkdir -p sam2_checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt -O sam2_checkpoints/sam2.1_hiera_large.pt
```
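
Optionally, confirm the checkpoint downloaded completely and deserializes; this minimal check only exercises `torch.load`, not the SAM2 model builder:
```bash
# List the checkpoint and make sure torch can deserialize it
ls -lh sam2_checkpoints/sam2.1_hiera_large.pt
python -c "import torch; torch.load('sam2_checkpoints/sam2.1_hiera_large.pt', map_location='cpu'); print('checkpoint OK')"
```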

### πŸ”„ Inference

To run inference on the provided sample video:

```bash
python infer.py \
  --video_path ../examples/video_sample1 \
  --prompt "point to the person in red shirt" \
  --save_path "results"
```

Your video should be provided as a folder containing all of its frames. Sample structure:
```
video_sample1/
β”œβ”€β”€ frame_0001.jpg
β”œβ”€β”€ frame_0002.jpg
β”œβ”€β”€ frame_0003.jpg
└── ...
```
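
If you start from a single video file instead of a frame folder, one way to produce this layout is with `ffmpeg` (not part of this repo; `my_video.mp4` and the output path below are illustrative):
```bash
# Extract JPEG frames into the folder layout expected by infer.py
mkdir -p ../examples/video_sample1
ffmpeg -i my_video.mp4 -qscale:v 2 ../examples/video_sample1/frame_%04d.jpg
```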

Output includes segmentation masks for each frame and a JSON Lines file (`points.jsonl`) containing the point coordinates.
```
results/
β”œβ”€β”€ video_sample1/
β”‚   β”œβ”€β”€ frame_0001.jpg
β”‚   β”œβ”€β”€ frame_0002.jpg
β”‚   β”œβ”€β”€ frame_0003.jpg
β”‚   β”œβ”€β”€ points.jsonl
β”‚   └── ...
└── ...
```
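
Each line of `points.jsonl` is a standalone JSON record; the exact field names depend on `infer.py`'s output format, so the sketch below simply parses and prints every record:
```bash
# Print each JSON record in the points file (field names depend on infer.py)
python - <<'PY'
import json
with open("results/video_sample1/points.jsonl") as f:
    for line in f:
        print(json.loads(line))
PY
```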
### Training and Evaluation πŸš€

To be released soon! Stay tuned for updates.


## Todos

- [ ] Release training and evaluation scripts.
- [ ] Add support for additional datasets.
- [ ] Release dataset creation pipeline.


## Citation πŸ“œ

```bibtex
@misc{ahmad2025videomolmospatiotemporalgroundingmeets,
  title={VideoMolmo: Spatio-Temporal Grounding Meets Pointing},
  author={Ghazi Shazan Ahmad and Ahmed Heakl and Hanan Gani and Abdelrahman Shaker and Zhiqiang Shen and Ranjay Krishna and Fahad Shahbaz Khan and Salman Khan},
  year={2025},
  eprint={2506.05336},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05336},
}
```

---

[<img src="assets/MBZUAI_logo.png" width="360" height="90">](https://mbzuai.ac.ae)