Molmo-7B-D AWQ 4-bit (Text-Only Quantization)

This is a 4-bit AWQ-quantized version of allenai/Molmo-7B-D-0924, produced with LLM Compressor. Only the text decoder is quantized; the vision components stay in FP16.

Key Features

✅ Qwen2 text decoder quantized (4-bit AWQ) - 63% size reduction
✅ OpenAI CLIP vision encoder preserved (FP16) - maintains image quality
✅ Base model performs between GPT-4V and GPT-4o on academic benchmarks
✅ Smart quantization - Only LLM layers quantized, vision parts untouched
✅ vLLM compatible - Fast inference with vLLM
✅ Base model powers the molmo.allenai.org demo

Model Details

Base Model: allenai/Molmo-7B-D-0924 (7B parameters)
Architecture: Molmo (Qwen2-7B decoder + OpenAI CLIP vision encoder)
Quantization Method: AWQ (Activation-aware Weight Quantization)
Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
Calibration Dataset: Flickr30k (128 samples)
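
To make the W4A16 scheme concrete, here is a minimal sketch of how a group of weights can be stored as 4-bit integers plus one floating-point scale, and then expanded again at inference time. It assumes symmetric round-to-nearest quantization and an 8-element group purely for illustration; the group size and exact rounding used for this checkpoint are not stated here, and the function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group (illustrative)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate 16-bit-style weights from integer codes and the scale."""
    return [c * scale for c in codes]


# Hypothetical weight values, not taken from the model.
group = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Activations are left in 16-bit precision, which is the "A16" half of the scheme; only the stored weights shrink.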

Size Comparison

Metric	Value
Original (FP16)	~14.0 GB
Quantized (W4A16)	~5.18 GB
Reduction	~63.0%
Memory Saved	~8.8 GB
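
The reduction and savings figures follow directly from the two checkpoint sizes in the table. A quick check, using only those two numbers:

```python
# Checkpoint sizes in GB, taken from the size comparison table.
original_gb = 14.0
quantized_gb = 5.18

saved_gb = original_gb - quantized_gb
reduction_pct = 100.0 * saved_gb / original_gb

print(f"Memory saved: ~{saved_gb:.1f} GB")
print(f"Reduction: ~{reduction_pct:.1f}%")
```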

What Was Quantized

Quantized (4-bit):

Qwen2DecoderLayer (Qwen2-7B text/language model)
Text processing linear layers in the decoder

Preserved (FP16):

OpenAI CLIP vision encoder (maintains image understanding quality)
Vision-text connectors
Embeddings
Language model head

This selective quantization ensures that vision understanding quality remains nearly identical to the original model while significantly reducing size.
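
One way to picture the selective scheme is as a name-based filter over the model's modules: anything matching a "keep in FP16" pattern is skipped, and everything else is quantized. The sketch below is only an illustration of that idea. The module names and patterns are hypothetical and do not reflect Molmo's real module paths or LLM Compressor's configuration syntax.

```python
import re

# Hypothetical module names, for illustration only.
module_names = [
    "vision_encoder.layers.0.mlp.fc1",
    "connector.proj",
    "decoder.embed_tokens",
    "decoder.layers.0.self_attn.q_proj",
    "decoder.layers.0.mlp.down_proj",
    "lm_head",
]

# Hypothetical patterns for the parts that stay in FP16.
KEEP_FP16_PATTERNS = [
    r"^vision_encoder\.",   # vision encoder
    r"^connector\.",        # vision-text connector
    r"embed_tokens$",       # embeddings
    r"^lm_head$",           # language model head
]


def should_quantize(name):
    """Return True when no keep-in-FP16 pattern matches the module name."""
    return not any(re.search(pattern, name) for pattern in KEEP_FP16_PATTERNS)


quantized = [n for n in module_names if should_quantize(n)]
preserved = [n for n in module_names if not should_quantize(n)]
print("4-bit:", quantized)
print("FP16:", preserved)
```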

Performance (Original Model)

Academic Benchmarks

Average Score: 77.3 across 11 benchmarks
Human Preference Elo: 1056
Position: Between GPT-4V (71.1) and GPT-4o (78.5)

Benchmark Details

Evaluated on: AI2D, ChartQA, VQA v2.0, DocQA, InfographicVQA, TextVQA, RealWorldQA, MMMU, MathVista, CountBenchQA, Flickr Count

Usage

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained(
    "ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

model = AutoModelForCausalLM.from_pretrained(
    "ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# Process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# Decode the generated tokens
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

vLLM Inference (Recommended for Production)

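from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="ronantakizawa/molmo-7b-d-awq-w4a16",
    trust_remote_code=True,
    max_model_len=2048
)

# vLLM picks up the AWQ (W4A16) quantization settings from the checkpoint config.
# Assumed usage: hypothetical local image path and a plain-text prompt
# (newer vLLM releases may expect a chat-style template for Molmo).
image = Image.open("example.jpg").convert("RGB")
outputs = llm.generate(
    {"prompt": "Describe this image.", "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=200, temperature=0.0),
)
print(outputs[0].outputs[0].text)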

Performance

Memory Usage: ~5-7 GB GPU VRAM (vs ~14 GB for FP16)
Inference Speed: Similar to FP16 on compatible hardware
Quality: Vision understanding ~100% preserved, text generation ~95-98% preserved
Recommended GPU: 16GB+ VRAM for optimal performance

About Molmo-7B-D

Molmo-7B-D is part of the Molmo family of open vision-language models developed by the Allen Institute for AI:

Training Data: PixMo dataset (1 million highly-curated image-text pairs)
Text Decoder: Qwen2-7B (state-of-the-art open LLM)
Vision Encoder: OpenAI CLIP (proven vision backbone)
Performance: Between GPT-4V and GPT-4o
Use Case: Powers the official demo at molmo.allenai.org

Quantization Details

Method: AWQ (Activation-aware Weight Quantization)
Pipeline: Independent pipeline, run with BasicPipeline for layer-by-layer quantization
Calibration: 128 Flickr30k image-text pairs
Max Sequence Length: 2048 tokens
Why AWQ: Activation-aware quantization preserves important weights
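
The "activation-aware" part can be sketched in a few lines. AWQ rescales each input channel of a weight matrix by a factor derived from how large that channel's activations are, and divides the activations by the same factor, so the layer output is unchanged before rounding while the salient channels lose less precision when quantized. The example below only demonstrates that rescaling identity. The exponent `alpha=0.5` and all values are illustrative assumptions (AWQ searches for a good exponent), and this is not LLM Compressor's implementation.

```python
def awq_channel_scales(mean_abs_activation, alpha=0.5):
    """Per-input-channel scales s_j = mean|x_j| ** alpha (alpha is illustrative)."""
    return [a ** alpha for a in mean_abs_activation]


def dot(row, x):
    """Plain dot product of one weight row with an activation vector."""
    return sum(w * v for w, v in zip(row, x))


# Hypothetical values: channel 1 has much larger activations, so it is "salient".
weights_row = [0.20, -0.10, 0.05, 0.30]
activations = [0.5, 8.0, 0.4, 0.6]

scales = awq_channel_scales([abs(v) for v in activations])
scaled_weights = [w * s for w, s in zip(weights_row, scales)]
scaled_activations = [v / s for v, s in zip(activations, scales)]

y_original = dot(weights_row, activations)
y_rescaled = dot(scaled_weights, scaled_activations)
print(y_original, y_rescaled)
```

Because the salient channel's weights are enlarged before rounding, their relative rounding error shrinks, which is the intuition behind "preserves important weights".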

Limitations

May have slight quality degradation in complex text generation compared to FP16
Vision encoder is NOT quantized (intentional for quality)
Requires vLLM or transformers with AWQ support
Use vLLM version <=0.7.2 until the preprocessing bug is fixed

Important Notes

Transparent Images

If using transparent images, add a white or dark background first for best results.
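
A minimal sketch of that step with Pillow, assuming a white background is wanted; the helper name `flatten_transparency` is hypothetical, and a dark background can be passed instead:

```python
from PIL import Image


def flatten_transparency(image, background=(255, 255, 255)):
    """Composite an image with transparency onto a solid RGB background."""
    rgba = image.convert("RGBA")
    canvas = Image.new("RGB", rgba.size, background)
    # Use the alpha channel as the paste mask so transparent pixels show the background.
    canvas.paste(rgba, mask=rgba.getchannel("A"))
    return canvas


# Tiny in-memory example: a fully transparent 2x2 image.
transparent = Image.new("RGBA", (2, 2), (10, 20, 30, 0))
flattened = flatten_transparency(transparent)
print(flattened.mode, flattened.size)
```

The result is already in RGB mode, so it can be passed straight to the processor.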

RGB Conversion

Ensure images are in RGB format:

from PIL import Image
image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")

License

Apache 2.0 (same as base model)

Citation

@article{molmo,
  title={Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models},
  author={Deitke, Matt and Clark, Christopher and Lee, Sangho and others},
  journal={arXiv preprint arXiv:2409.17146},
  year={2024}
}

@misc{molmo-7b-d-awq,
  title={Molmo-7B-D AWQ 4-bit},
  author={ronantakizawa},
  year={2025},
  url={https://huggingface.co/ronantakizawa/molmo-7b-d-awq-w4a16}
}

Acknowledgements

Base model by Allen Institute for AI
Quantization using LLM Compressor
Meta tensor fix by @ronantakizawa

🤖 Generated with LLM Compressor
