OOM error for 450 second video at 1 frame per second.
Hi,
I am getting an OOM error when running Molmo2 8B on a 450-second video at only 1 FPS. I've run other 8B VLMs at this frame rate without any issues. Molmo2 4B hits the same error.
Are there any suggestions on how to solve this without lowering the FPS?
For example:
Are there any input arguments for the model to set a limit to the spatial resolution/pixels/tokens per frame?
Molmo2 resizes each video frame to a fixed resolution (378x378) before encoding it with the ViT.
Since the model was not trained with other resolutions, it's hard for us to recommend using a different input resolution.
It also applies fairly aggressive pooling (3x3) after encoding to reduce the number of visual tokens.
As a result, there isn't much room to further reduce the per-frame token count.
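To make the per-frame budget concrete, here is a rough back-of-the-envelope sketch. The 378x378 frame size and 3x3 pooling come from the explanation above; the ViT patch size of 14 is an assumption on my part (it is not stated in this thread), and the helper name `tokens_per_frame` is hypothetical. Special tokens and any per-frame separators are ignored.

```python
import math


def tokens_per_frame(frame_size: int = 378, patch_size: int = 14, pool: int = 3) -> int:
    """Estimate visual tokens per frame after ViT patching and pooling.

    patch_size=14 is an ASSUMPTION, not confirmed in this thread.
    """
    # Number of ViT patches along one side of the resized frame.
    patches_per_side = frame_size // patch_size
    # Pooling merges pool x pool neighbouring patches into one token.
    pooled_per_side = math.ceil(patches_per_side / pool)
    return pooled_per_side ** 2


if __name__ == "__main__":
    # Under the assumed patch size: 27x27 patches, pooled to 9x9 tokens.
    print(tokens_per_frame())
```

If the assumed patch size is right, each frame already costs well under a hundred tokens, which is why the frame count, not the per-frame resolution, dominates memory for long videos.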
The most practical recommendation to reduce memory usage is to lower the maximum number of sampled frames, num_frames.
For reference, Molmo2 samples video frames using a fixed max_fps, regardless of the input video's original frame rate.
You can find more details in the following code:
https://huggingface.co/allenai/Molmo2-8B/blob/main/video_processing_molmo2.py#L637-L651
With the default configuration (num_frames=384), your 450-second video will be uniformly sampled into 384 frames.
Reducing num_frames will proportionally reduce the number of visual tokens and therefore lower memory usage.
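As a rough illustration of that trade-off, the sketch below estimates how many frames end up being sampled and how the visual-token load scales as `num_frames` is lowered. It is not the model's actual sampling code (see the linked `video_processing_molmo2.py` for that); the helper names, the value `max_fps=1.0`, and the midpoint-based uniform sampling are all assumptions made for illustration.

```python
def sampled_frame_count(duration_s: float, max_fps: float, num_frames: int = 384) -> int:
    """Frames kept after applying an fps limit and a hard cap of num_frames."""
    # Frames available when sampling at max_fps (at least one frame).
    candidates = max(1, int(duration_s * max_fps))
    # num_frames acts as the upper bound on what reaches the model.
    return min(candidates, num_frames)


def uniform_frame_times(duration_s: float, n: int) -> list[float]:
    """Timestamps (seconds) at the midpoints of n equal segments (assumed scheme)."""
    return [(i + 0.5) * duration_s / n for i in range(n)]


if __name__ == "__main__":
    duration = 450.0  # seconds, as in the question
    baseline = sampled_frame_count(duration, max_fps=1.0, num_frames=384)
    for cap in (384, 256, 128, 64):
        kept = sampled_frame_count(duration, max_fps=1.0, num_frames=cap)
        # Visual tokens scale linearly with the number of frames kept.
        print(f"num_frames={cap:>3}: {kept} frames, {kept / baseline:.0%} of baseline visual tokens")
```

Under these assumptions, halving `num_frames` roughly halves the visual tokens, at the cost of a coarser temporal spacing between sampled frames.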