---
license: apache-2.0
datasets:
- allenai/Molmo2-VideoPoint
- allenai/pixmo-points
- allenai/pixmo-cap
language:
- en
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- olmo
- molmo
- molmo2
---

<img src="molmo_2_logo_RGB.png" alt="Logo for the Molmo2 Project" style="width: auto; height: 50px;">

# Molmo2-VideoPoint-4B

Molmo2 is a family of open vision-language models developed by the Allen Institute for AI (Ai2) that support image, video, and multi-image understanding and grounding.
Molmo2 models are trained on publicly available third-party datasets, as referenced in [our technical report](https://allenai.org/papers/molmo2), and on [Molmo2 data](https://huggingface.co/collections/allenai/molmo2-data),
a collection of datasets with highly curated image-text and video-text pairs.
The models achieve state-of-the-art performance among multimodal models of similar size.
You can find all models in the Molmo2 family [here](https://huggingface.co/collections/allenai/molmo2).

**Learn more** about the Molmo2 family [in our announcement blog post](https://allenai.org/blog/molmo2).

Molmo2-VideoPoint-4B is based on [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and uses [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as its vision backbone.
**Unlike the general checkpoints, Molmo2-VideoPoint-4B is fine-tuned only on the Molmo2-VideoPoint data, after pre-training on pixmo-cap, pixmo-points, and Tulu data. It is intended for video pointing and counting only.**

Ai2 is committed to open science. The Molmo2 datasets are available [here](https://huggingface.co/collections/allenai/molmo2-data).
All other artifacts used in creating Molmo2 (training code, evaluations, intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.

Quick links:
- 📂 [All Models](https://huggingface.co/collections/allenai/molmo2)
- 📃 [Paper](https://allenai.org/papers/molmo2)
- 🎥 [Blog with Videos](https://allenai.org/blog/molmo2)

## Quick Start

### Setup Conda Environment
```
conda create --name transformers4571 python=3.11
conda activate transformers4571
pip install transformers==4.57.1
pip install torch pillow einops torchvision accelerate decord2 molmo_utils
```

### Pointing Video QA

```
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from molmo_utils import process_vision_info
import re

model_id="allenai/Molmo2-VideoPoint-4B"

# load the processor
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto"
)

# load the model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto"
)

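# matches a <points .../> or <tracks .../> tag and captures its coords attribute
COORD_REGEX = re.compile(r"<(?:points|tracks).*? coords=\"([0-9\t:;, .]+)\"/?>")
# within coords: a frame id followed by that frame's list of points
FRAME_REGEX = re.compile(r"(?:^|\t|:|,|;)([0-9\.]+) ([0-9\. ]+)")
# a single point: index, x, y (x and y are 3- or 4-digit integers)
POINTS_REGEX = re.compile(r"([0-9]+) ([0-9]{3,4}) ([0-9]{3,4})")

def _points_from_num_str(text, image_w, image_h, extract_ids=False):
    """Yield (index, x, y) for each point in text, with x and y in pixels."""
    for points in POINTS_REGEX.finditer(text):
        ix, x, y = points.group(1), points.group(2), points.group(3)
        # the points format assumes coordinates are scaled by 1000
        x, y = float(x)/1000*image_w, float(y)/1000*image_h
        # skip points that fall outside the frame
        if 0 <= x <= image_w and 0 <= y <= image_h:
            yield ix, x, y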


def extract_video_points(text, image_w, image_h, extract_ids=False):
    """Extract video pointing coordinates as a flattened list of (t, x, y) triplets from model output text."""
    all_points = []
    for coord in COORD_REGEX.finditer(text):
        for point_grp in FRAME_REGEX.finditer(coord.group(1)):
            frame_id = float(point_grp.group(1))
            w, h = (image_w, image_h)
            for idx, x, y in _points_from_num_str(point_grp.group(2), w, h):
                if extract_ids:
                    all_points.append((frame_id, idx, x, y))
                else:
                    all_points.append((frame_id, x, y))
    return all_points

messages = [
    {
        "role": "user",
        "content": [
            dict(type="text", text="Point to the penguins."),
            dict(type="video", video="https://storage.googleapis.com/oe-training-public/demo_videos/many_penguins.mp4"),
        ],
    }
]

# process the video using `molmo_utils.process_vision_info`
_, videos, video_kwargs = process_vision_info(messages)
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

# apply the chat template to the input messages
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# process the video and text
inputs = processor(
    videos=videos,
    video_metadata=video_metadatas,
    text=text,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)

inputs = {k: v.to(model.device) for k, v in inputs.items()}

# generate output
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048)

# only get generated tokens; decode them to text
generated_tokens = generated_ids[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# decode video pointing outputs
points = extract_video_points(generated_text, image_w=video_metadatas[0]["width"], image_h=video_metadatas[0]["height"])
print(points)
```
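
Since this checkpoint targets counting as well as pointing, one may want to turn the flat list returned by `extract_video_points` into per-frame counts. The sketch below is only an illustration: `group_points_by_frame` is a hypothetical helper that is not part of `molmo_utils`, the sample triplets are made up rather than real model output, and counting the points in each frame is just one possible way to derive a count.

```python
from collections import defaultdict

def group_points_by_frame(points):
    """Group (t, x, y) triplets into {t: [(x, y), ...]}, sorted by t."""
    grouped = defaultdict(list)
    for t, x, y in points:
        grouped[t].append((x, y))
    return dict(sorted(grouped.items()))

# hypothetical sample in the (t, x, y) format returned by extract_video_points
sample_points = [(0.5, 120.0, 340.0), (0.5, 410.5, 220.0), (1.0, 130.0, 338.0)]

frames = group_points_by_frame(sample_points)
# number of points found in each frame
counts = {t: len(pts) for t, pts in frames.items()}
print(counts)
```

In the Quick Start script, `sample_points` would be replaced by the `points` variable. When `extract_ids=True` is passed to `extract_video_points`, each entry carries an extra index field, so the unpacking in the loop would need to change accordingly.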

## Evaluations

We report the accuracy and close accuracy on Molmo2-VideoCountEval here.
For details on the evals, refer to our [technical report](https://allenai.org/papers/molmo2).

| Model | Accuracy | Close Acc. |
|-----------------------------|-----------------------------------------|-----------------------------------------|
| GPT-5 | 35.8 | 50.3 |
| GPT-5 mini | 29.8 | 49.3 |
| Gemini 3 Pro | **37.1** | 53.1 |
| Gemini 2.5 Pro | 35.8 | **56.5** |
| Gemini 2.5 Flash | 31.9 | 48.2 |
| Claude Sonnet 4.5 | 27.2  | 45.1 |
| Qwen3-VL-4B | 25.3 | 44.3 |
| Qwen3-VL-8B | 29.6 | 47.7 |
| Molmo2-4B | 34.3 | <u>56.1</u> |
| Molmo2-8B | 35.5 | 53.3 |
| Molmo2-7B | 33.2 | 50.5 |
| **Molmo2-VideoPoint-4B (this model)** | <u>36.8</u> | **56.5** |


## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2’s [Responsible Use Guidelines](https://allenai.org/responsible-use).
This model is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources to determine whether this model is appropriate for your use case.