Qwen3-VL-Embedding-2B-NVFP4


Quantized Model Overview

This repository contains an NVFP4 derivative of Qwen/Qwen3-VL-Embedding-2B prepared for direct vLLM deployment through the modelopt_fp4 backend.

What Was Quantized

  • Quantization method: NVIDIA Model Optimizer NVFP4_DEFAULT_CFG
  • Export format: ModelOpt HF checkpoint with hf_quant_config.json
  • Runtime backend: vLLM modelopt_fp4
  • Weight format: NVFP4
  • Group size: 16
  • Quantized modules: text-side quantizable language-model weights under the default ModelOpt NVFP4 recipe
  • Left unquantized: model.visual* and lm_head
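NVFP4 stores weights as 4-bit E2M1 values with one scale per group of 16 elements. The group-wise scheme can be sketched conceptually as follows (illustration only, not the actual ModelOpt kernels: real NVFP4 also quantizes the per-group scales to FP8 E4M3 and runs fused CUDA kernels, while this sketch keeps the scale in full precision):

```python
# Conceptual sketch of NVFP4-style group quantization (group size 16).
# The E2M1 code grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} and absmax scaling are
# the standard FP4 conventions; everything else here is simplified.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_GRID = sorted(E2M1_GRID + [-v for v in E2M1_GRID if v > 0])

def quantize_group(weights, group_size=16):
    """Quantize a flat list of weights group-by-group with an absmax scale."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group) or 1.0
        scale = absmax / 6.0           # map the group's absmax onto E2M1's max code
        for w in group:
            code = min(E2M1_GRID, key=lambda v: abs(w / scale - v))
            out.append(code * scale)   # dequantized value
    return out

weights = [0.9, -0.35, 0.02, 0.61] * 4          # exactly one group of 16
deq = quantize_group(weights)
max_err = max(abs(a - b) for a, b in zip(weights, deq))
```

Per-group scaling is what keeps the 4-bit format usable: each group of 16 only has to cover its own dynamic range rather than the whole tensor's.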

Calibration Data

This checkpoint was calibrated on 1000 mixed retrieval samples from the local embedding benchmark workflow.

Calibration sources:

  • Polish text retrieval: mteb/MSMARCO-PL, mteb/NQ-PL, mteb/FiQA-PL
  • Multilingual text retrieval: MIRACL hard-negative slices for en, de, es, fr, ja
  • Multimodal retrieval: vidore/colpali_train_set and lmms-lab/flickr30k
  • Hard-negative augmentation: MIRACL-derived negatives

Local Benchmark Setup

The numbers below are from local full benchmark runs using the same harness for stock FP16 and quantized checkpoints.

Benchmark tasks:

  • mteb/MSMARCO-PL
  • mteb/NQ-PL
  • MIRACL hard-negative slices: en, de, es, fr, ja
  • vidore/vidore_v3_industrial
  • vidore/vidore_v3_computer_science

Metrics:

  • nDCG@10
  • Recall@10
  • MRR@10
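For reference, the three metrics can be computed from a ranked result list and a binary relevance set as below (a minimal sketch; the actual harness uses the standard MTEB-style implementations):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents retrieved in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
```

For example, with ranking `["d3", "d1", "d7"]` and relevant set `{"d1"}`, Recall@10 is 1.0 and MRR@10 is 0.5 (first hit at rank 2).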

Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | NVFP4 | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.55008 | -0.01214 |
| Recall@10 | 0.64934 | 0.63794 | -0.01141 |
| MRR@10 | 0.78883 | 0.77870 | -0.01013 |
| Benchmark wall time | 434.853 s | 377.707 s | 13.14% faster |
| Average request latency | 0.332726 s | 0.277620 s | -0.055106 s |
| Throughput | 18.4338 rps | 21.2228 rps | +2.7890 rps |

Notes:

  • This was the best speed/size trade-off among the quantized checkpoints we tested.
  • It matched or improved on the FP16 baseline for several text-heavy tasks, especially the multilingual MIRACL slices.
  • It regressed noticeably on the ViDoRe image benchmarks, so multimodal retrieval quality should be re-validated on your own workload before deployment.
  • The local validation smoke test passed with a mean probe cosine of 0.9172 against the FP16 checkpoint.
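The probe-cosine figure is the mean cosine similarity between FP16 and NVFP4 embeddings over a set of probe inputs. A minimal sketch of that check (the embedding lists are placeholders; how the probes are produced is up to your own harness):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_probe_cosine(ref_embs, quant_embs):
    """Average pairwise cosine between reference and quantized embeddings
    for the same probe inputs, in the same order."""
    sims = [cosine(a, b) for a, b in zip(ref_embs, quant_embs)]
    return sum(sims) / len(sims)
```

A value near 1.0 means the quantized model's embedding space is geometrically close to the FP16 reference for those probes; it is a sanity check, not a substitute for the retrieval benchmarks above.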

vLLM Usage

HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4 \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --limit-mm-per-prompt '{"image":1}'

If your vLLM build does not automatically pick up the bundled chat_template.jinja, download the repo locally and pass --chat-template /path/to/chat_template.jinja.
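Once the server is up, the model is exposed through vLLM's OpenAI-compatible /v1/embeddings endpoint. A minimal text-only client sketch (the localhost:8000 base URL is an assumption matching vLLM's default port; the model name matches the serve command above):

```python
import json
import urllib.request

def build_embedding_request(texts,
                            model="LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4"):
    """Build an OpenAI-style /v1/embeddings payload for a list of texts."""
    return {"model": model, "input": texts}

def embed(texts, base_url="http://localhost:8000"):
    """POST the payload and return one embedding vector per input text."""
    payload = json.dumps(build_embedding_request(texts)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Image inputs go through the chat-style multimodal request path instead; see the vLLM pooling-model documentation for the exact request shape supported by your build.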

Base Model Introduction

This model is a quantized derivative of Qwen/Qwen3-VL-Embedding-2B, the 2B member of Qwen’s multimodal embedding series.

Upstream model highlights:

  • Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
  • 30+ language support
  • 32k context length
  • Output dimension up to 2048, with support for smaller embedding dimensions
  • Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
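If smaller embedding dimensions are used (Matryoshka-style truncation), the vector should be re-normalized after truncation so cosine comparisons remain valid. A conceptual sketch (whether plain truncation matches the upstream training recipe should be verified against the base model card):

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

The trade-off is the usual one: smaller vectors cut index size and search cost at some loss in retrieval quality, which should be measured on your own workload.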

For the full base model card, broader benchmark tables, and upstream usage examples, see the upstream Qwen/Qwen3-VL-Embedding-2B repository.
