---
license: apache-2.0
datasets:
- allenai/Molmo2-VideoPoint
- allenai/pixmo-points
- allenai/pixmo-cap
language:
- en
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- olmo
- molmo
- molmo2
---
<img src="molmo_2_logo_RGB.png" alt="Logo for the Molmo2 Project" style="width: auto; height: 50px;">
# Molmo2-VideoPoint-4B
Molmo2 is a family of open vision-language models developed by the Allen Institute for AI (Ai2) that support image, video and multi-image understanding and grounding.
Molmo2 models are trained on publicly available third party datasets as referenced in [our technical report](https://allenai.org/papers/molmo2) and [Molmo2 data](https://huggingface.co/collections/allenai/molmo2-data),
a collection of datasets with highly-curated image-text and video-text pairs.
It has state-of-the-art performance among multimodal models of comparable size.
You can find all models in the Molmo2 family [here](https://huggingface.co/collections/allenai/molmo2).
**Learn more** about the Molmo2 family [in our announcement blog post](https://allenai.org/blog/molmo2).
Molmo2-VideoPoint-4B is based on [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and uses [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as vision backbone.
**Unlike the general checkpoints, Molmo2-VideoPoint-4B is fine-tuned only on the Molmo2-VideoPoint data, after pre-training on pixmo-cap, pixmo-points, and Tulu data. It is intended for video pointing and counting only.**
Ai2 is committed to open science. The Molmo2 datasets are available [here](https://huggingface.co/collections/allenai/molmo2-data).
All other artifacts used in creating Molmo2 (training code, evaluations, intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.
Quick links:
- [All Models](https://huggingface.co/collections/allenai/molmo2)
- [Paper](https://allenai.org/papers/molmo2)
- [Blog with Videos](https://allenai.org/blog/molmo2)
## Quick Start
### Setup Conda Environment
```shell
conda create --name transformers4571 python=3.11
conda activate transformers4571
pip install transformers==4.57.1
pip install torch pillow einops torchvision accelerate decord2 molmo_utils
```
### Pointing Video QA
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from molmo_utils import process_vision_info
import re

model_id = "allenai/Molmo2-VideoPoint-4B"

# load the processor
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto"
)

# load the model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto"
)

COORD_REGEX = re.compile(r"<(?:points|tracks).*? coords=\"([0-9\t:;, .]+)\"/?>")
FRAME_REGEX = re.compile(r"(?:^|\t|:|,|;)([0-9\.]+) ([0-9\. ]+)")
POINTS_REGEX = re.compile(r"([0-9]+) ([0-9]{3,4}) ([0-9]{3,4})")

def _points_from_num_str(text, image_w, image_h):
    """Yield (point_id, x, y) triplets parsed from a per-frame number string."""
    for points in POINTS_REGEX.finditer(text):
        ix, x, y = points.group(1), points.group(2), points.group(3)
        # the points format assumes coordinates are scaled by 1000
        x, y = float(x) / 1000 * image_w, float(y) / 1000 * image_h
        if 0 <= x <= image_w and 0 <= y <= image_h:
            yield ix, x, y

def extract_video_points(text, image_w, image_h, extract_ids=False):
    """Extract video pointing coordinates as a list of (t, x, y) triplets
    (or (t, id, x, y) triplets if extract_ids=True) from model output text."""
    all_points = []
    for coord in COORD_REGEX.finditer(text):
        for point_grp in FRAME_REGEX.finditer(coord.group(1)):
            frame_id = float(point_grp.group(1))
            for idx, x, y in _points_from_num_str(point_grp.group(2), image_w, image_h):
                if extract_ids:
                    all_points.append((frame_id, idx, x, y))
                else:
                    all_points.append((frame_id, x, y))
    return all_points

messages = [
    {
        "role": "user",
        "content": [
            dict(type="text", text="Point to the penguins."),
            dict(type="video", video="https://storage.googleapis.com/oe-training-public/demo_videos/many_penguins.mp4"),
        ],
    }
]

# process the video using `molmo_utils.process_vision_info`
_, videos, video_kwargs = process_vision_info(messages)
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

# apply the chat template to the input messages
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# process the video and text
inputs = processor(
    videos=videos,
    video_metadata=video_metadatas,
    text=text,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# generate output
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048)

# keep only the newly generated tokens and decode them to text
generated_tokens = generated_ids[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# decode video pointing outputs
points = extract_video_points(generated_text, image_w=video_metadatas[0]["width"], image_h=video_metadatas[0]["height"])
print(points)
```
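To sanity-check the parsing helpers without loading the model, here is a minimal standalone sketch. It copies the regexes and extraction logic from the snippet above; the `sample` string is a made-up illustration of the pointing format (two frames, one point each), not real model output.

```python
import re

# same patterns as in the snippet above; coordinates are scaled by 1000
COORD_REGEX = re.compile(r"<(?:points|tracks).*? coords=\"([0-9\t:;, .]+)\"/?>")
FRAME_REGEX = re.compile(r"(?:^|\t|:|,|;)([0-9\.]+) ([0-9\. ]+)")
POINTS_REGEX = re.compile(r"([0-9]+) ([0-9]{3,4}) ([0-9]{3,4})")

def extract_video_points(text, image_w, image_h):
    """Parse model output text into a list of (t, x, y) triplets in pixels."""
    all_points = []
    for coord in COORD_REGEX.finditer(text):
        for point_grp in FRAME_REGEX.finditer(coord.group(1)):
            frame_id = float(point_grp.group(1))  # timestamp in seconds
            for m in POINTS_REGEX.finditer(point_grp.group(2)):
                # rescale from thousandths of the frame size to pixels
                x = float(m.group(2)) / 1000 * image_w
                y = float(m.group(3)) / 1000 * image_h
                if 0 <= x <= image_w and 0 <= y <= image_h:
                    all_points.append((frame_id, x, y))
    return all_points

# a made-up pointing string: point id 0 at frame center at t=1.5s,
# point id 1 at (0.25, 0.75) of the frame at t=3.0s, on a 1280x720 video
sample = '<points coords="1.5 0 500 500;3.0 1 250 750"/>'
print(extract_video_points(sample, image_w=1280, image_h=720))
# -> [(1.5, 640.0, 360.0), (3.0, 320.0, 540.0)]
```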
## Evaluations
We report the accuracy and close accuracy on Molmo2-VideoCountEval here.
For details on the evals, refer to our [technical report](https://allenai.org/papers/molmo2).
| Model | Accuracy | Close Acc. |
|-----------------------------|-----------------------------------------|-----------------------------------------|
| GPT-5 | 35.8 | 50.3 |
| GPT-5 mini | 29.8 | 49.3 |
| Gemini 3 Pro | **37.1** | 53.1 |
| Gemini 2.5 Pro | 35.8 | **56.5** |
| Gemini 2.5 Flash | 31.9 | 48.2 |
| Claude Sonnet 4.5 | 27.2 | 45.1 |
| Qwen3-VL-4B | 25.3 | 44.3 |
| Qwen3-VL-8B | 29.6 | 47.7 |
| Molmo2-4B | 34.3 | <u>56.1</u> |
| Molmo2-8B | 35.5 | 53.3 |
| Molmo2-7B | 33.2 | 50.5 |
| **Molmo2-VideoPoint-4B (this model)** | <u>36.8</u> | **56.5** |
## License and Use
This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
This model is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources to determine if this model is appropriate for your use case.