---
license: apache-2.0
base_model: allenai/Molmo2-8B
tags:
  - vision-language
  - multimodal
  - nvfp4
  - fp4
  - quantized
  - vllm
  - molmo
library_name: transformers
pipeline_tag: image-text-to-text
language:
  - en
---

# Molmo2-8B-NVFP4

An NVFP4 (4-bit NVIDIA floating point) quantized version of [allenai/Molmo2-8B](https://huggingface.co/allenai/Molmo2-8B) for efficient inference.

## Model Details

| Property | Value |
|---|---|
| Base Model | [allenai/Molmo2-8B](https://huggingface.co/allenai/Molmo2-8B) |
| Quantization | NVFP4 (4-bit floating point) |
| Format | `nvfp4-pack-quantized` (compressed-tensors) |
| Model Size | ~11 GB (vs. ~16 GB original) |
| Vision Backbone | Full precision (not quantized) |

## Quantization Details

- Method: NVFP4 quantization using `llmcompressor` (see the sketch below)
- Target layers: Linear layers (excluding the vision backbone, `lm_head`, and `mlp.gate`)
- Precision: 4-bit symmetric floating point
- Group size: 16
- Scale dtype: `torch.float8_e4m3fn`
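
For reference, here is a minimal sketch of how a recipe like the one shown in the Quantization Recipe section below can be applied with llmcompressor. This is an illustration rather than the exact script used to produce this checkpoint; the appropriate `Auto*` loader class for Molmo2 and the precise `oneshot`/save arguments may differ across library versions.

```python
# Illustrative only: applies an NVFP4 recipe like the one under "Quantization Recipe".
# Loader class and llmcompressor arguments may vary between versions.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B", torch_dtype="auto", trust_remote_code=True
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["re:.*lm_head", "re:.*vision_backbone.*", "re:.*mlp.gate$"],
)

# Apply the recipe; depending on the scheme, a small calibration dataset may
# also need to be passed to oneshot for activation scales.
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format (nvfp4-pack-quantized).
model.save_pretrained("Molmo2-8B-NVFP4", save_compressed=True)
```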

## Usage with vLLM

> **Important:** This model requires a custom vLLM build with NVFP4 quantized-weight mapping support for Molmo2.

### Step 1: Start the Docker Container

```bash
docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest
```

### Step 2: Build the Custom vLLM

Inside the container:

```bash
git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .
```

### Step 3: Serve the Model

```bash
vllm serve /workspace/models/Molmo2-8B-NVFP4 \
  --served-model-name Molmo2-8B-NVFP4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192
```
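
Once the server is running, you can confirm it is reachable and check the exact model name it serves; that name must match the `model` field used in Step 4:

```python
# Quick sanity check: list the model names served by the endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])
# Should include "Molmo2-8B-NVFP4" (or the model path if --served-model-name was not set).
```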

### Step 4: Query the Model

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
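
A local image can be sent the same way by embedding it as a base64 data URI. The snippet continues from the example above (it reuses `client` and the `base64` import); the file path is a placeholder:

```python
# With a local image file, sent as a base64-encoded data URI
with open("/path/to/local_image.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Molmo2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```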

## Why a Custom vLLM Build?

The official vLLM does not yet support NVFP4 quantized weight loading for Molmo2's vision backbone. The custom branch adds:

1. A `prefix` parameter on the vision layers for proper weight-name mapping
2. Extended `hf_to_vllm_mapper` patterns for the quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize
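
To see why the extra mapping patterns are needed: in a compressed-tensors NVFP4 checkpoint each quantized Linear layer is stored as several tensors (packed values plus scales) rather than a single `.weight`, so mapping rules that only handle `.weight` miss them. The sketch below is purely schematic; the tensor suffixes are representative of compressed-tensors NVFP4 output and the prefix mapping is hypothetical, not the branch's actual code.

```python
# Schematic illustration only: why quantized checkpoints need extended
# weight-name mapping. Suffixes are representative; the prefix map is hypothetical.
hf_names = [
    "model.layers.0.self_attn.q_proj.weight_packed",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.q_proj.weight_global_scale",
]

# A rule matching only ".weight" would leave the suffixed quantization
# tensors unmapped, so the patterns must cover every stored tensor name.
prefix_map = {"model.": "language_model.model."}

def remap(name: str) -> str:
    for old, new in prefix_map.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name

for name in hf_names:
    print(name, "->", remap(name))
```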

## Quantization Recipe

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: NVFP4
```
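
To check that a downloaded copy carries this recipe, you can inspect the quantization metadata that compressed-tensors writes into `config.json` (a quick sketch; the exact key layout may differ between compressed-tensors versions):

```python
# Inspect the quantization metadata stored alongside the weights.
import json

with open("/workspace/models/Molmo2-8B-NVFP4/config.json") as f:
    config = json.load(f)

quant_cfg = config.get("quantization_config", {})
print(quant_cfg.get("format"))   # expected: "nvfp4-pack-quantized"
print(quant_cfg.get("ignore"))   # modules excluded from quantization
```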

## Model Variants Comparison

| Model | Bits | Size | Quality |
|---|---|---|---|
| Molmo2-8B (original) | 16 | ~16 GB | Highest |
| Molmo2-8B-FP8 | 8 | ~13 GB | High |
| Molmo2-8B-NVFP4 | 4 | ~11 GB | Good |

## Limitations

- The vision backbone remains in full precision to preserve image understanding quality
- Requires the custom vLLM build described above (not compatible with stock vLLM)
- NVFP4 requires hardware support (NVIDIA Blackwell or newer recommended); a quick capability check is sketched below
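
The GPU's CUDA compute capability gives a rough indication of native FP4 support (Blackwell-class parts report a major version of 10 or higher); this is a heuristic convenience check, not an official support matrix:

```python
# Heuristic hardware check: native FP4 (NVFP4) kernels target NVIDIA Blackwell,
# which reports CUDA compute capability 10.x or newer.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Likely native FP4 support:", major >= 10)
```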

## License

This model inherits the Apache 2.0 license from the base model.