# Qwen3-VL-Embedding-2B-NVFP4

## Quantized Model Overview
This repository contains an NVFP4 derivative of Qwen/Qwen3-VL-Embedding-2B prepared for direct vLLM deployment through the modelopt_fp4 backend.
## What Was Quantized

- Quantization method: NVIDIA Model Optimizer `NVFP4_DEFAULT_CFG`
- Export format: ModelOpt HF checkpoint with `hf_quant_config.json`
- Runtime backend: vLLM `modelopt_fp4`
- Weight format: NVFP4
- Group size: 16
- Quantized modules: text-side quantizable language-model weights under the default ModelOpt NVFP4 recipe
- Left unquantized: `model.visual*` and `lm_head`
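To make the "NVFP4, group size 16" weight format concrete, here is an illustrative fake-quantization sketch, not the ModelOpt implementation: each group of 16 weights shares one scale, and each scaled weight snaps to the nearest representable FP4 (E2M1) magnitude. The grid values and the simple amax-based per-group scale are assumptions for illustration; the real recipe also quantizes the scales themselves.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(w, group_size=16):
    """Simulate NVFP4-style weight quantization on a flat vector:
    per-group scale plus nearest-neighbor snap to the FP4 grid."""
    w = np.asarray(w, dtype=np.float64)
    out = np.empty_like(w)
    for start in range(0, w.size, group_size):
        g = w[start:start + group_size]
        # Per-group scale maps the largest magnitude onto the FP4 max (6.0).
        amax = np.abs(g).max()
        scale = amax / 6.0 if amax > 0 else 1.0
        # Snap each scaled magnitude to the nearest FP4 grid point.
        idx = np.abs(np.abs(g / scale)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + group_size] = np.sign(g) * FP4_GRID[idx] * scale
    return out
```

Values that already land on the scaled grid survive exactly; everything else rounds to a neighbor, which is where the small quality deltas in the table below come from.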
## Calibration Data

This checkpoint was calibrated on 1000 mixed retrieval samples from the local embedding benchmark workflow.

Calibration sources:

- Polish text retrieval: `mteb/MSMARCO-PL`, `mteb/NQ-PL`, `mteb/FiQA-PL`
- Multilingual text retrieval: MIRACL hard-negative slices for `en`, `de`, `es`, `fr`, `ja`
- Multimodal retrieval: `vidore/colpali_train_set` and `lmms-lab/flickr30k`
- Hard-negative augmentation: MIRACL-derived negatives
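A fixed-size mixed calibration set like the one above can be drawn by weighted sampling across source pools. The helper below is a hypothetical sketch of that step, not the actual workflow code; the pool names, weights, and seed are all assumptions.

```python
import random

def build_calibration_mix(pools, total=1000, seed=0):
    """Draw `total` samples across several source pools, choosing a pool
    per draw in proportion to its weight (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(pools)
    weights = [pools[n]["weight"] for n in names]
    mix = []
    for _ in range(total):
        name = rng.choices(names, weights=weights, k=1)[0]
        mix.append((name, rng.choice(pools[name]["samples"])))
    return mix
```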
## Local Benchmark Setup

The numbers below are from local full benchmark runs using the same harness for the stock FP16 and quantized checkpoints.

Benchmark tasks:

- `mteb/MSMARCO-PL`
- `mteb/NQ-PL`
- MIRACL hard-negative slices: `en`, `de`, `es`, `fr`, `ja`
- `vidore/vidore_v3_industrial`
- `vidore/vidore_v3_computer_science`

Metrics: `nDCG@10`, `Recall@10`, `MRR@10`
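For reference, the three metrics have standard per-query definitions (binary relevance assumed here); a minimal sketch:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for one query: discounted gain of hits,
    normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents retrieved in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant hit within the top k."""
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
```

The benchmark figures below are means of these per-query values across all tasks.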
## Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | NVFP4 | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.55008 | -0.01214 |
| Recall@10 | 0.64934 | 0.63794 | -0.01141 |
| MRR@10 | 0.78883 | 0.77870 | -0.01013 |
| Benchmark wall time | 434.853 s | 377.707 s | 13.14% faster |
| Average request latency | 0.332726 s | 0.277620 s | -0.055106 s |
| Throughput | 18.4338 rps | 21.2228 rps | +2.7890 rps |
Notes:

- This was the best speed/size trade-off among the quantized checkpoints we tested.
- It improved on or matched the FP16 baseline on several text-heavy tasks, especially the multilingual MIRACL slices.
- It regressed noticeably on the ViDoRe image benchmarks, so multimodal retrieval quality should be rechecked on your own workload before deployment.
- The local validation smoke test passed with a mean probe cosine similarity of `0.9172` against the FP16 checkpoint.
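The cosine smoke test above can be reproduced in principle by embedding the same probe inputs with both checkpoints and averaging the pairwise cosine similarity; a minimal sketch (the probe set itself is not part of this repo, so the inputs are your own):

```python
import numpy as np

def mean_probe_cosine(fp16_embs, quant_embs):
    """Mean cosine similarity between row-paired embeddings produced by
    the FP16 and quantized checkpoints for the same probe inputs."""
    a = np.asarray(fp16_embs, dtype=np.float64)
    b = np.asarray(quant_embs, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())
```

A value near 1.0 means the quantized model's embedding geometry closely tracks FP16; values well below that warrant a full benchmark re-run.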
## vLLM Usage

```shell
HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4 \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --limit-mm-per-prompt '{"image":1}'
```

If your vLLM build does not automatically pick up the bundled `chat_template.jinja`, download the repo locally and pass `--chat-template /path/to/chat_template.jinja`.
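Once the server is up, embeddings can be requested through vLLM's OpenAI-compatible `/v1/embeddings` endpoint. A minimal stdlib-only client sketch, assuming the default host/port (`localhost:8000`):

```python
import json
import urllib.request

def build_payload(texts):
    """Assemble the OpenAI-style embeddings request body."""
    return {"model": "LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4",
            "input": texts}

def embed_texts(texts, base_url="http://localhost:8000"):
    """POST texts to the vLLM embeddings endpoint and return one
    embedding vector per input."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```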
## Base Model Introduction

This model is a quantized derivative of Qwen/Qwen3-VL-Embedding-2B, the 2B member of Qwen’s multimodal embedding series.

Upstream model highlights:

- Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
- Support for 30+ languages
- 32k context length
- Output dimension up to `2048`, with support for smaller embedding dimensions
- Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
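"Smaller embedding dimensions" is typically exposed Matryoshka-style: keep the first `dim` components of the full vector and re-normalize. A hedged sketch of that convention; check the upstream card for the dimensions the base model officially supports:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of a full-size embedding and
    re-normalize to unit length (Matryoshka-style truncation sketch)."""
    v = np.asarray(vec, dtype=np.float64)[:dim]
    return v / np.linalg.norm(v)
```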
For the full base model card, broader benchmark tables, and upstream usage examples, see:
- Base model: https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B
- GitHub: https://github.com/QwenLM/Qwen3-VL-Embedding
- Technical report: https://arxiv.org/abs/2601.04720