amitha/molmo-clip-b16-olmo3
A Molmo-style vision-language model: a frozen CLIP ViT-B/16 vision encoder (pretrained on DataComp-medium, from amitha/clip-vit-b16-datacomp-medium) plus a trained multimodal connector and the OLMo-3-7B language model (Olmo3ForCausalLM).
The vision encoder was frozen during training; only the connector (a SwiGLU image projector + a CLS projector) and the language model were trained, following the Molmo recipe.
Vision weights are referenced, not stored. This repo ships only the connector and LLM weights. The vision tower is loaded at runtime from amitha/clip-vit-b16-datacomp-medium, so that repo must remain accessible. Loading requires trust_remote_code=True.
Checkpoints
Training ran for 4 epochs. The repo root is the final checkpoint (step14392, 4 epochs).
Three earlier checkpoints are available as subfolders:
| Checkpoint | Subfolder | Notes |
|---|---|---|
| step14392 | (root) | final (4 epochs) |
| step7196 | step7196 | 2 epochs |
| step13000 | step13000 | ~3.6 epochs |
| step14000 | step14000 | ~3.9 epochs |
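The epoch figures in the table follow from the final step count. The sketch below reproduces them, assuming every epoch has the same number of optimizer steps (the card does not state this explicitly). The helper names `epochs_at` and `subfolder_kwargs` are hypothetical and only illustrate how a checkpoint name maps to the `subfolder` argument of `from_pretrained`.

```python
# Final checkpoint and epoch count, as stated in this card.
FINAL_STEP = 14392
TOTAL_EPOCHS = 4

# Assumption: all epochs contain the same number of steps.
STEPS_PER_EPOCH = FINAL_STEP / TOTAL_EPOCHS  # 3598.0


def epochs_at(step: int) -> float:
    """Approximate number of epochs completed after `step` steps."""
    return step / STEPS_PER_EPOCH


def subfolder_kwargs(checkpoint: str) -> dict:
    """Hypothetical helper: extra from_pretrained kwargs for a checkpoint.

    The final checkpoint lives at the repo root, so it needs no subfolder.
    """
    if checkpoint == f"step{FINAL_STEP}":
        return {}
    return {"subfolder": checkpoint}


for step in (7196, 13000, 14000, 14392):
    name = f"step{step}"
    print(f"{name}: {epochs_at(step):.2f} epochs, kwargs={subfolder_kwargs(name)}")
```

The returned dictionary would be passed alongside the repo id and `trust_remote_code=True` when loading.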
Load an earlier checkpoint by passing its subfolder:
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "amitha/molmo-clip-b16-olmo3", subfolder="step13000", trust_remote_code=True)
Usage
import PIL.Image
import requests
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

repo = "amitha/molmo-clip-b16-olmo3"
model = AutoModelForImageTextToText.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.float32).eval()
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

# Download an example image and convert it to RGB.
image = PIL.Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    stream=True).raw).convert("RGB")

inputs = processor(text="Describe this image in detail.", images=[image], return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Prompt styles
The model was trained with several caption/QA styles. The processor exposes an optional
style argument (default: none) that prepends a "{style}: " prefix matching training:
inputs = processor(text="Describe this image.", images=[image],
                   style="long_caption", return_tensors="pt")
Known styles: long_caption, transcript, user_qa, synthetic_qa.
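To make the effect of the style argument concrete, here is a minimal sketch of the prefixing described above. It is not the processor's actual code: `apply_style` is a hypothetical function, and both the validation against the known styles and the exact placement of the prefix relative to any chat template are assumptions.

```python
from typing import Optional

# Style names listed in this card.
KNOWN_STYLES = ("long_caption", "transcript", "user_qa", "synthetic_qa")


def apply_style(text: str, style: Optional[str] = None) -> str:
    """Prepend a "{style}: " prefix to the prompt (hypothetical helper).

    With style=None the prompt is returned unchanged, mirroring the
    processor's default of no style.
    """
    if style is None:
        return text
    # Assumption: unknown styles are rejected; the real processor may not check.
    if style not in KNOWN_STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return f"{style}: {text}"


print(apply_style("Describe this image.", "long_caption"))
```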
Architecture notes
- Image tokens: single 224×224 crop, no pooling, CLS token included → 197 image tokens (1 CLS + 196 patches) inserted into the text stream.
- LLM: native Olmo3ForCausalLM (post-norm, YaRN RoPE), vocabulary padded to 100480; the 128 image-placeholder logits are masked during generation.
- Image preprocessing: resize so the short side is 224 (bicubic), center-crop 224, normalize with OpenAI CLIP statistics.
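The numbers in these notes can be checked with a little arithmetic. The following dependency-free sketch derives the 197-token count from the ViT-B/16 patch size, computes the resize and center-crop geometry, applies the OpenAI CLIP normalization constants to one pixel, and shows logit masking in the abstract. All function names are hypothetical, the rounding rule in the resize step is an assumption (image libraries differ), and the ids of the 128 placeholder tokens are not given in this card, so they are passed in as a parameter.

```python
import math

# OpenAI CLIP per-channel statistics (RGB, pixel values scaled to [0, 1]).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def num_image_tokens(crop_size=224, patch_size=16, include_cls=True):
    """Patches per crop, plus one CLS token if it is kept."""
    per_side = crop_size // patch_size  # 224 // 16 = 14
    return per_side * per_side + (1 if include_cls else 0)


def resize_then_center_crop(width, height, size=224):
    """Return the resized (w, h) and the crop box (left, top, right, bottom)."""
    scale = size / min(width, height)  # short side becomes `size`
    new_w = max(size, round(width * scale))  # rounding rule is an assumption
    new_h = max(size, round(height * scale))
    left, top = (new_w - size) // 2, (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)


def normalize_pixel(rgb):
    """Normalize one 8-bit RGB pixel with the CLIP statistics."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def mask_logits(logits, masked_ids):
    """Set the logits of the given token ids to -inf so they are never sampled."""
    masked = set(masked_ids)
    return [-math.inf if i in masked else x for i, x in enumerate(logits)]


print(num_image_tokens())
print(resize_then_center_crop(640, 480))
```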
Provenance
Converted from native Molmo training checkpoints to the Hugging Face format. The converter was verified against the original Molmo inference: input ids and image token layout are identical, vision features and logits match up to floating-point operation-ordering noise, and greedy generations are identical.
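A check of this kind can be sketched as follows. This is not the converter's actual test: `compare_runs` is a hypothetical function, the inputs are plain Python sequences standing in for tensors, and the tolerances are illustrative values rather than the ones used for this model.

```python
import math


def compare_runs(ref_ids, new_ids, ref_logits, new_logits,
                 rel_tol=1e-5, abs_tol=1e-6):
    """Compare a reference run with a converted-model run (hypothetical).

    Token ids must match exactly; logits only need to agree within a
    tolerance, since floating-point sums depend on operation order.
    """
    ids_identical = list(ref_ids) == list(new_ids)
    same_length = len(ref_logits) == len(new_logits)
    max_abs_diff = max(
        (abs(a - b) for a, b in zip(ref_logits, new_logits)), default=0.0)
    logits_close = same_length and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(ref_logits, new_logits))
    return {
        "ids_identical": ids_identical,
        "logits_close": logits_close,
        "max_abs_diff": max_abs_diff,
    }


report = compare_runs([1, 2, 3], [1, 2, 3], [0.5, -1.25], [0.5000001, -1.25])
print(report)
```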