Molmo2-4B-NVFP4

An NVFP4 (4-bit NVIDIA floating-point) quantized version of allenai/Molmo2-4B for efficient inference.

Model Details

| Property | Value |
|---|---|
| Base Model | allenai/Molmo2-4B |
| Quantization | NVFP4 (4-bit floating point) |
| Format | nvfp4-pack-quantized (compressed-tensors) |
| Model Size | ~6.5 GB (vs ~16 GB original) |
| Vision Backbone | Full precision (not quantized) |

Quantization Details

  • Method: NVFP4 quantization using llmcompressor
  • Target Layers: Linear layers (excluding vision backbone, lm_head, mlp.gate)
  • Precision: 4-bit symmetric floating point
  • Group Size: 16
  • Scale Dtype: torch.float8_e4m3fn
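The scheme above can be sketched conceptually: each group of 16 weights shares one scale, and every weight is rounded to the nearest 4-bit FP4 (E2M1) value. This is a minimal NumPy illustration of that idea, not the llmcompressor implementation (the real format also stores the per-group scale in float8_e4m3fn and packs two 4-bit values per byte):

```python
import numpy as np

# The non-negative FP4 (E2M1) magnitudes; negative values mirror them.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one group of 16 weights to FP4 with a shared scale."""
    assert weights.size == 16
    scale = np.abs(weights).max() / 6.0  # 6.0 is the largest FP4 magnitude
    if scale == 0.0:
        return np.zeros_like(weights), 0.0
    scaled = weights / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_VALUES[idx]
    return q, scale

def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
q, s = quantize_group(w)
w_hat = dequantize_group(q, s)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Because the widest gap between adjacent FP4 values is 2 (between 4 and 6), the per-weight reconstruction error is bounded by the group scale.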

Usage with vLLM

Important: This model requires a custom vLLM build with NVFP4 quantized weight mapping support for Molmo2.

Step 1: Start Docker Container

docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Step 2: Build Custom vLLM

Inside the container:

git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .

Step 3: Serve the Model

vllm serve /workspace/models/Molmo2-4B-NVFP4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192

Step 4: Query the Model

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-4B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
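For local images, OpenAI-compatible servers such as vLLM also accept base64 data URLs in the `image_url` field. A small helper sketch (the file path and MIME type are illustrative):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read a local image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# With a local image (uncomment once the server from Step 3 is running):
# response = client.chat.completions.create(
#     model="Molmo2-4B-NVFP4",
#     messages=[{
#         "role": "user",
#         "content": [
#             {"type": "image_url",
#              "image_url": {"url": image_to_data_url("image.jpg")}},
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }],
#     max_tokens=512,
# )
```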

Why Custom vLLM Build?

Stock vLLM cannot yet load NVFP4-quantized Molmo2 checkpoints: the weight-name mapping fails around the vision backbone (which itself stays in full precision). The custom branch adds:

  1. prefix parameter to vision layers for proper weight name mapping
  2. Extended hf_to_vllm_mapper patterns for quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: NVFP4
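The `re:`-prefixed entries in `ignore` are regular expressions. A quick sketch of how they behave under Python's `re.match` (the module names below are illustrative, not taken from the actual model graph):

```python
import re

# The recipe's ignore patterns, without the "re:" prefix.
IGNORE = [r".*lm_head", r".*vision_backbone.*", r".*mlp.gate$"]

def is_ignored(name: str) -> bool:
    """True if a module name matches any ignore pattern (it stays unquantized)."""
    return any(re.match(p, name) for p in IGNORE)

examples = {
    "lm_head": True,                                  # output head excluded
    "model.vision_backbone.blocks.0.attn.qkv": True,  # vision tower excluded
    "model.layers.3.mlp.gate": True,                  # gate layer excluded
    "model.layers.3.mlp.gate_proj": False,            # the "$" anchor keeps gate_proj quantized
    "model.layers.3.self_attn.q_proj": False,         # ordinary Linear -> NVFP4
}
for name, expected in examples.items():
    assert is_ignored(name) == expected
```

Note the trailing `$` on the gate pattern: it excludes only layers whose name ends in `mlp.gate`, so sibling projections like `gate_proj` are still quantized.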

FP8 vs NVFP4

| Metric | FP8 | NVFP4 |
|---|---|---|
| Bits | 8 | 4 |
| Size | ~8 GB | ~6.5 GB |
| Quality | Higher | Lower |
| Speed | Fast | Faster |

Choose NVFP4 for maximum memory efficiency, FP8 for better quality-size balance.

Limitations

  • Vision backbone remains in full precision to preserve image understanding quality
  • Requires custom vLLM build (not compatible with stock vLLM)
  • NVFP4 requires hardware support (NVIDIA Blackwell or newer recommended)

License

This model inherits the Apache 2.0 license from the base model.
