Instructions to use allenai/Molmo2-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use allenai/Molmo2-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="allenai/Molmo2-8B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("allenai/Molmo2-8B", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use allenai/Molmo2-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "allenai/Molmo2-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "allenai/Molmo2-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/allenai/Molmo2-8B

SGLang

How to use allenai/Molmo2-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "allenai/Molmo2-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "allenai/Molmo2-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "allenai/Molmo2-8B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use allenai/Molmo2-8B with Docker Model Runner:
```
docker model run hf.co/allenai/Molmo2-8B
```

OOM error for 450 second video at 1 frame per second.

by hchanuk - opened Feb 13

Discussion

hchanuk

Feb 13

Hi,

I am getting an OOM error when running Molmo2 8B on a 450-second video at only 1 FPS. I have run other 8B VLMs at this frame rate without any issues. Molmo2 4B hits the same error.

Are there any suggestions on how to solve this without lowering the FPS?

For example, are there any input arguments that limit the spatial resolution, pixel count, or number of tokens per frame?

sanghol

Ai2 org Feb 15

Molmo2 resizes each video frame to a fixed resolution (378x378) before encoding it with the ViT.
Since the model was not trained with other resolutions, it's hard for us to recommend using a different input resolution.
It also applies fairly aggressive pooling (3x3) after encoding to reduce the number of visual tokens.

As a result, there isn't much room to further reduce the per-frame token count.
The most practical recommendation to reduce memory usage is to lower the maximum number of sampled frames, num_frames.
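
As a rough illustration of why the per-frame count is already small, the arithmetic can be sketched as below. The 378x378 frame size and the 3x3 pooling come from the reply above; the ViT patch size of 14 is an assumption that is not stated in this thread, and the estimate ignores any separator or special tokens the processor may add.

```python
import math


def tokens_per_frame(frame_size=378, patch_size=14, pool=3):
    """Estimate visual tokens for one frame after ViT patching and pooling.

    patch_size=14 is an assumption; frame_size and pool come from the thread.
    """
    patches_per_side = frame_size // patch_size            # 378 // 14 = 27
    pooled_per_side = math.ceil(patches_per_side / pool)   # ceil(27 / 3) = 9
    return pooled_per_side ** 2


def visual_tokens(num_frames, **kwargs):
    """Estimate total visual tokens for a video sampled into num_frames frames."""
    return num_frames * tokens_per_frame(**kwargs)


print(tokens_per_frame())   # 81 under the assumed patch size
print(visual_tokens(384))   # 31104 for the default num_frames
print(visual_tokens(128))   # 10368 after lowering num_frames
```

Under these assumptions a frame costs only 81 tokens, so the total is driven almost entirely by how many frames are sampled.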

For reference, Molmo2 samples video frames using a fixed max_fps, regardless of the input video's original frame rate.
You can find more details in the following code:
https://huggingface.co/allenai/Molmo2-8B/blob/main/video_processing_molmo2.py#L637-L651

With the default configuration (num_frames=384), your 450-second video will be uniformly sampled into 384 frames.
Reducing num_frames will proportionally reduce the number of visual tokens and therefore lower memory usage.
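
The sampling behavior described above can be sketched roughly as follows. This is not the code from video_processing_molmo2.py; it is a minimal illustration that assumes frames are taken at max_fps until the num_frames cap is reached, after which timestamps are spread uniformly over the video. The function name and the max_fps value of 2.0 are hypothetical.

```python
def sample_timestamps(duration_s, max_fps, num_frames):
    """Return the timestamps (in seconds) of the frames that would be kept.

    Hypothetical helper: sample at max_fps, but never keep more than
    num_frames frames; when capped, spread them uniformly over the video.
    """
    candidates = int(duration_s * max_fps)
    if candidates <= num_frames:
        # Short video: the cap is not reached, so sample at max_fps.
        return [i / max_fps for i in range(candidates)]
    # Long video: keep exactly num_frames uniformly spaced timestamps.
    step = duration_s / num_frames
    return [i * step for i in range(num_frames)]


# A 450-second video is capped at the default of 384 frames.
default_frames = sample_timestamps(450, max_fps=2.0, num_frames=384)
# Lowering num_frames reduces the frame count, and with it the visual tokens.
fewer_frames = sample_timestamps(450, max_fps=2.0, num_frames=128)
print(len(default_frames), len(fewer_frames))
```

In this sketch, dropping num_frames from 384 to 128 cuts the number of encoded frames to a third, which is the proportional saving the reply refers to.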

sanghol changed discussion status to closed Feb 24
