Molmo2-8B-NVFP4

NVFP4 (4-bit NVIDIA floating point) quantized version of allenai/Molmo2-8B for efficient inference.

Model Details

Property          Value
Base Model        allenai/Molmo2-8B
Quantization      NVFP4 (4-bit floating point)
Format            nvfp4-pack-quantized (compressed-tensors)
Model Size        ~11GB (vs ~16GB original)
Vision Backbone   Full precision (not quantized)

Quantization Details

  • Method: NVFP4 quantization using llmcompressor
  • Target Layers: Linear layers (excluding vision backbone, lm_head, mlp.gate)
  • Precision: 4-bit symmetric floating point
  • Group Size: 16
  • Scale Dtype: torch.float8_e4m3fn
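The arithmetic these settings imply can be sketched in plain Python: each group of 16 weights shares one scale, and each weight is rounded to the nearest 4-bit E2M1 value (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). This is an illustrative sketch, not llmcompressor's implementation, and it skips rounding the scale itself to float8_e4m3fn:

```python
# Illustrative sketch of NVFP4-style group quantization (not llmcompressor's code).
# FP4 E2M1 can represent these magnitudes, positive and negative:
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODEBOOK = sorted({s * v for v in E2M1 for s in (-1.0, 1.0)})

def quantize_group(group):
    """Quantize one group of 16 floats into one shared scale + 16 FP4 codes."""
    assert len(group) == 16
    amax = max(abs(x) for x in group)
    # Map the largest magnitude to FP4's max representable value (6.0).
    # Real NVFP4 additionally rounds this scale to float8_e4m3fn.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [min(CODEBOOK, key=lambda c: abs(x / scale - c)) for x in group]
    return codes, scale

def dequantize_group(codes, scale):
    """Reconstruct approximate weights from FP4 codes and the group scale."""
    return [c * scale for c in codes]
```

Packing two 4-bit codes per byte is what gives the `nvfp4-pack-quantized` format its size advantage; the per-group FP8 scale adds only half a byte per 16 weights.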

Usage with vLLM

Important: This model requires a custom vLLM build with NVFP4 quantized weight mapping support for Molmo2.

Step 1: Start Docker Container

docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Step 2: Build Custom vLLM

Inside the container:

git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .

Step 3: Serve the Model

vllm serve /workspace/models/Molmo2-8B-NVFP4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192

Step 4: Query the Model

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-8B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
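To send a local image instead of a URL, the file can be embedded as a base64 data URL in the same `image_url` field. A small helper, sketched (the `to_data_url` name and the `photo.jpg` path in the usage note are illustrative):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for the image_url content field."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
```

Then pass `{"type": "image_url", "image_url": {"url": to_data_url(open("photo.jpg", "rb").read())}}` in the message content, exactly as with the remote URL above.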

Why Custom vLLM Build?

The official vLLM does not yet support NVFP4 quantized weight loading for Molmo2's vision backbone. The custom branch adds:

  1. prefix parameter to vision layers for proper weight name mapping
  2. Extended hf_to_vllm_mapper patterns for quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: NVFP4
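The effect of the `ignore` patterns can be checked in isolation; a sketch assuming `re.match`-style semantics for the `re:`-prefixed patterns (the module names in the test below are illustrative):

```python
import re

# The ignore patterns from the recipe above, without the "re:" prefix.
IGNORE = [r".*lm_head", r".*vision_backbone.*", r".*mlp.gate$"]

def is_ignored(module_name: str) -> bool:
    """True if the module matches an ignore pattern and stays unquantized."""
    return any(re.match(p, module_name) for p in IGNORE)
```

Note the `$` anchor on `mlp.gate`: it excludes the MoE-style router `gate` module while still letting projections like `gate_proj` be quantized.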

Model Variants Comparison

Model                  Bits   Size    Quality
Molmo2-8B (original)   16     ~16GB   Highest
Molmo2-8B-FP8          8      ~13GB   High
Molmo2-8B-NVFP4        4      ~11GB   Good

Limitations

  • Vision backbone remains in full precision to preserve image understanding quality
  • Requires custom vLLM build (not compatible with stock vLLM)
  • NVFP4 requires hardware support (NVIDIA Blackwell or newer recommended)
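The hardware requirement can be checked programmatically. A sketch (the helper name is illustrative; the threshold assumes NVIDIA's numbering, where Blackwell GPUs report compute capability 10.x or 12.x):

```python
def supports_nvfp4(major: int, minor: int) -> bool:
    """Heuristic: Blackwell-or-newer GPUs report compute capability >= 10.0."""
    return (major, minor) >= (10, 0)
```

With PyTorch installed, pass `*torch.cuda.get_device_capability()` to this helper before attempting to serve the model.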

License

This model inherits the Apache 2.0 license from the base model.
