Paper: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (arXiv:2409.17146)
This is a 4-bit AWQ-quantized (W4A16) version of allenai/Molmo-7B-D-0924, produced with LLM Compressor.
| Metric | Value |
|---|---|
| Original (FP16) | ~14.0 GB |
| Quantized (W4A16) | ~5.18 GB |
| Reduction | ~63.0% |
| Memory Saved | ~8.8 GB |
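The last two rows follow directly from the first two. As a quick check, the arithmetic can be reproduced in a few lines (the helper name `size_reduction` is only illustrative, and the inputs are the approximate sizes from the table):

```python
def size_reduction(original_gb: float, quantized_gb: float) -> tuple[float, float]:
    """Return (memory saved in GB, reduction in percent)."""
    saved_gb = original_gb - quantized_gb
    reduction_pct = 100.0 * saved_gb / original_gb
    return saved_gb, reduction_pct


# Approximate checkpoint sizes taken from the table above
saved_gb, reduction_pct = size_reduction(14.0, 5.18)
print(f"Memory saved: ~{saved_gb:.1f} GB")
print(f"Reduction: ~{reduction_pct:.1f}%")
```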
Quantized (4-bit):
Preserved (FP16):
This selective quantization keeps vision-understanding quality nearly identical to the original model while cutting the checkpoint size by about 63% (from ~14.0 GB to ~5.18 GB).
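To give a feel for what storing weights in 4 bits means, here is a minimal sketch of asymmetric round-to-nearest quantization of a single weight group. It is not the AWQ algorithm itself: AWQ additionally rescales weights using activation statistics before quantizing, and that step is omitted here. The function names and the example weight values are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # constant group: any non-zero scale works
    zero_point = round(-w_min / scale)
    # Map each weight to an integer code and clamp it to [0, qmax]
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate floating-point weights from integer codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weight group (real groups are much larger)
group = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.18, 0.00]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a shared scale and zero point per group, and is expanded back to 16-bit values at inference time, which is what the "W4A16" label refers to (4-bit weights, 16-bit activations).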
Evaluated on: AI2D, ChartQA, VQA v2.0, DocQA, InfographicVQA, TextVQA, RealWorldQA, MMMU, MathVista, CountBenchQA, Flickr Count
```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)
model = AutoModelForCausalLM.from_pretrained(
    "ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# Process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode only the newly generated tokens (skip the prompt)
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    max_model_len=2048
)
# vLLM reads the quantization settings (4-bit AWQ, W4A16) from the checkpoint's config,
# so no explicit quantization argument is passed here
```
Molmo-7B-D is part of the Molmo family of open vision-language models developed by the Allen Institute for AI.
If using transparent images, add a white or dark background first for best results.
Ensure images are in RGB format:
```python
from PIL import Image

image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")
```
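For transparent images, a plain `convert("RGB")` discards the alpha channel without compositing it. One way to add a background first is sketched below with Pillow. The helper name `flatten_transparency` and the default white background are assumptions for illustration, not part of the Molmo API.

```python
from PIL import Image


def flatten_transparency(image, background=(255, 255, 255)):
    """Composite an image with transparency onto a solid background, then return RGB."""
    has_alpha = image.mode in ("RGBA", "LA") or (
        image.mode == "P" and "transparency" in image.info
    )
    if not has_alpha:
        return image.convert("RGB")
    rgba = image.convert("RGBA")
    # Opaque canvas in the chosen background color
    canvas = Image.new("RGBA", rgba.size, background + (255,))
    # Blend the image over the canvas using its alpha channel
    return Image.alpha_composite(canvas, rgba).convert("RGB")


# Synthetic fully transparent image, used here instead of a real file
demo = Image.new("RGBA", (4, 4), (0, 0, 0, 0))
flattened = flatten_transparency(demo)
print(flattened.mode, flattened.getpixel((0, 0)))
```

Passing `background=(0, 0, 0)` gives a dark background instead. Which one works better likely depends on the image content.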
Apache 2.0 (same as base model)
@article{molmo,
title={Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models},
author={Deitke, Matt and Clark, Christopher and Lee, Sangho and others},
journal={arXiv preprint arXiv:2409.17146},
year={2024}
}
@misc{molmo-7b-d-awq,
title={Molmo-7B-D AWQ 4-bit},
author={Quantized by ronantakizawa},
year={2025},
url={https://huggingface.co/ronantakizawa/molmo-7b-d-awq-w4a16}
}
🤖 Generated with LLM Compressor