---
license: apache-2.0
base_model: allenai/Molmo2-8B
tags:
  - vision-language
  - multimodal
  - nvfp4
  - fp4
  - quantized
  - vllm
  - molmo
library_name: transformers
pipeline_tag: image-text-to-text
language:
  - en
---

# Molmo2-8B-NVFP4

An NVFP4 (4-bit NVIDIA floating point) quantized version of [allenai/Molmo2-8B](https://huggingface.co/allenai/Molmo2-8B) for efficient inference.

## Model Details

| Property | Value |
|---|---|
| Base Model | [allenai/Molmo2-8B](https://huggingface.co/allenai/Molmo2-8B) |
| Quantization | NVFP4 (4-bit floating point) |
| Format | `nvfp4-pack-quantized` (compressed-tensors) |
| Model Size | ~11 GB (vs. ~16 GB original) |
| Vision Backbone | Full precision (not quantized) |

## Quantization Details

- Method: NVFP4 quantization using `llmcompressor` (see the sketch below)
- Target layers: Linear layers (excluding the vision backbone, `lm_head`, and `mlp.gate`)
- Precision: 4-bit symmetric floating point
- Group size: 16
- Scale dtype: `torch.float8_e4m3fn`
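
For reference, here is a minimal sketch of how a recipe like the one shown in the Quantization Recipe section below can be applied with llmcompressor. This is an illustration rather than the exact script used to produce this checkpoint; the appropriate `Auto*` loader class for Molmo2 and the precise `oneshot`/save arguments may differ across library versions.

```python
# Illustrative only: applies an NVFP4 recipe like the one under "Quantization Recipe".
# Loader class and llmcompressor arguments may vary between versions.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B", torch_dtype="auto", trust_remote_code=True
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["re:.*lm_head", "re:.*vision_backbone.*", "re:.*mlp.gate$"],
)

# Apply the recipe; depending on the scheme, a small calibration dataset may
# also need to be passed to oneshot for activation scales.
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format (nvfp4-pack-quantized).
model.save_pretrained("Molmo2-8B-NVFP4", save_compressed=True)
```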

## Usage with vLLM

> **Important:** This model requires a custom vLLM build with NVFP4 quantized-weight mapping support for Molmo2.

### Step 1: Start the Docker Container

```bash
docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest
```

### Step 2: Build the Custom vLLM

Inside the container:

```bash
git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .
```

### Step 3: Serve the Model

```bash
vllm serve /workspace/models/Molmo2-8B-NVFP4 \
  --served-model-name Molmo2-8B-NVFP4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192
```
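
Once the server is running, you can confirm it is reachable and check the exact model name it serves; that name must match the `model` field used in Step 4:

```python
# Quick sanity check: list the model names served by the endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])
# Should include "Molmo2-8B-NVFP4" (or the model path if --served-model-name was not set).
```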

### Step 4: Query the Model

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
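
A local image can be sent the same way by embedding it as a base64 data URI. The snippet continues from the example above (it reuses `client` and the `base64` import); the file path is a placeholder:

```python
# With a local image file, sent as a base64-encoded data URI
with open("/path/to/local_image.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Molmo2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```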

## Why a Custom vLLM Build?

The official vLLM does not yet support NVFP4 quantized weight loading for Molmo2's vision backbone. The custom branch adds:

1. A `prefix` parameter on the vision layers for proper weight-name mapping
2. Extended `hf_to_vllm_mapper` patterns for the quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize
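
To see why the extra mapping patterns are needed: in a compressed-tensors NVFP4 checkpoint each quantized Linear layer is stored as several tensors (packed values plus scales) rather than a single `.weight`, so mapping rules that only handle `.weight` miss them. The sketch below is purely schematic; the tensor suffixes are representative of compressed-tensors NVFP4 output and the prefix mapping is hypothetical, not the branch's actual code.

```python
# Schematic illustration only: why quantized checkpoints need extended
# weight-name mapping. Suffixes are representative; the prefix map is hypothetical.
hf_names = [
    "model.layers.0.self_attn.q_proj.weight_packed",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.q_proj.weight_global_scale",
]

# A rule matching only ".weight" would leave the suffixed quantization
# tensors unmapped, so the patterns must cover every stored tensor name.
prefix_map = {"model.": "language_model.model."}

def remap(name: str) -> str:
    for old, new in prefix_map.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name

for name in hf_names:
    print(name, "->", remap(name))
```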

## Quantization Recipe

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: NVFP4
```
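
To check that a downloaded copy carries this recipe, you can inspect the quantization metadata that compressed-tensors writes into `config.json` (a quick sketch; the exact key layout may differ between compressed-tensors versions):

```python
# Inspect the quantization metadata stored alongside the weights.
import json

with open("/workspace/models/Molmo2-8B-NVFP4/config.json") as f:
    config = json.load(f)

quant_cfg = config.get("quantization_config", {})
print(quant_cfg.get("format"))   # expected: "nvfp4-pack-quantized"
print(quant_cfg.get("ignore"))   # modules excluded from quantization
```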

## Model Variants Comparison

| Model | Bits | Size | Quality |
|---|---|---|---|
| Molmo2-8B (original) | 16 | ~16 GB | Highest |
| Molmo2-8B-FP8 | 8 | ~13 GB | High |
| Molmo2-8B-NVFP4 | 4 | ~11 GB | Good |

## Limitations

- The vision backbone remains in full precision to preserve image understanding quality
- Requires the custom vLLM build described above (not compatible with stock vLLM)
- NVFP4 requires hardware support (NVIDIA Blackwell or newer recommended); a quick capability check is sketched below
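
The GPU's CUDA compute capability gives a rough indication of native FP4 support (Blackwell-class parts report a major version of 10 or higher); this is a heuristic convenience check, not an official support matrix:

```python
# Heuristic hardware check: native FP4 (NVFP4) kernels target NVIDIA Blackwell,
# which reports CUDA compute capability 10.x or newer.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Likely native FP4 support:", major >= 10)
```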

## License

This model inherits the Apache 2.0 license from the base model.