Molmo2-4B-NVFP4

An NVFP4 (4-bit NVIDIA floating-point) quantized version of allenai/Molmo2-4B for efficient inference.

Model Details

| Property | Value |
|---|---|
| Base Model | allenai/Molmo2-4B |
| Quantization | NVFP4 (4-bit floating point) |
| Format | nvfp4-pack-quantized (compressed-tensors) |
| Model Size | ~6.5 GB (vs ~16 GB original) |
| Vision Backbone | Full precision (not quantized) |

Quantization Details

  • Method: NVFP4 quantization using llmcompressor
  • Target Layers: Linear layers (excluding vision backbone, lm_head, mlp.gate)
  • Precision: 4-bit symmetric floating point
  • Group Size: 16
  • Scale Dtype: torch.float8_e4m3fn
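The scheme above can be sketched conceptually: each group of 16 weights shares one scale, and every weight is rounded to the nearest 4-bit FP4 (E2M1) value. This is a minimal NumPy illustration of that idea, not the llmcompressor implementation (the real format also stores the per-group scale in float8_e4m3fn and packs two 4-bit values per byte):

```python
import numpy as np

# The non-negative FP4 (E2M1) magnitudes; negative values mirror them.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one group of 16 weights to FP4 with a shared scale."""
    assert weights.size == 16
    scale = np.abs(weights).max() / 6.0  # 6.0 is the largest FP4 magnitude
    if scale == 0.0:
        return np.zeros_like(weights), 0.0
    scaled = weights / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_VALUES[idx]
    return q, scale

def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
q, s = quantize_group(w)
w_hat = dequantize_group(q, s)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Because the widest gap between adjacent FP4 values is 2 (between 4 and 6), the per-weight reconstruction error is bounded by the group scale.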

Usage with vLLM

Important: This model requires a custom vLLM build with NVFP4 quantized weight mapping support for Molmo2.

Step 1: Start Docker Container

docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Step 2: Build Custom vLLM

Inside the container:

git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .

Step 3: Serve the Model

vllm serve /workspace/models/Molmo2-4B-NVFP4 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192

Step 4: Query the Model

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-4B-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
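For local images, OpenAI-compatible servers such as vLLM also accept base64 data URLs in the `image_url` field. A small helper sketch (the file path and MIME type are illustrative):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read a local image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# With a local image (uncomment once the server from Step 3 is running):
# response = client.chat.completions.create(
#     model="Molmo2-4B-NVFP4",
#     messages=[{
#         "role": "user",
#         "content": [
#             {"type": "image_url",
#              "image_url": {"url": image_to_data_url("image.jpg")}},
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }],
#     max_tokens=512,
# )
```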

Why Custom vLLM Build?

Stock vLLM cannot yet load NVFP4-quantized Molmo2 checkpoints: the weight-name mapping fails around the vision backbone (which itself stays in full precision). The custom branch adds:

  1. prefix parameter to vision layers for proper weight name mapping
  2. Extended hf_to_vllm_mapper patterns for quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: NVFP4
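The `re:`-prefixed entries in `ignore` are regular expressions. A quick sketch of how they behave under Python's `re.match` (the module names below are illustrative, not taken from the actual model graph):

```python
import re

# The recipe's ignore patterns, without the "re:" prefix.
IGNORE = [r".*lm_head", r".*vision_backbone.*", r".*mlp.gate$"]

def is_ignored(name: str) -> bool:
    """True if a module name matches any ignore pattern (it stays unquantized)."""
    return any(re.match(p, name) for p in IGNORE)

examples = {
    "lm_head": True,                                  # output head excluded
    "model.vision_backbone.blocks.0.attn.qkv": True,  # vision tower excluded
    "model.layers.3.mlp.gate": True,                  # gate layer excluded
    "model.layers.3.mlp.gate_proj": False,            # the "$" anchor keeps gate_proj quantized
    "model.layers.3.self_attn.q_proj": False,         # ordinary Linear -> NVFP4
}
for name, expected in examples.items():
    assert is_ignored(name) == expected
```

Note the trailing `$` on the gate pattern: it excludes only layers whose name ends in `mlp.gate`, so sibling projections like `gate_proj` are still quantized.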

FP8 vs NVFP4

| Metric | FP8 | NVFP4 |
|---|---|---|
| Bits | 8 | 4 |
| Size | ~8 GB | ~6.5 GB |
| Quality | Higher | Lower |
| Speed | Fast | Faster |

Choose NVFP4 for maximum memory efficiency, FP8 for better quality-size balance.

Limitations

  • Vision backbone remains in full precision to preserve image understanding quality
  • Requires custom vLLM build (not compatible with stock vLLM)
  • NVFP4 requires hardware support (NVIDIA Blackwell or newer recommended)

License

This model inherits the Apache 2.0 license from the base model.
