Qwen3-VL-Embedding-2B-NVFP4


Quantized Model Overview

This repository contains an NVFP4 derivative of Qwen/Qwen3-VL-Embedding-2B prepared for direct vLLM deployment through the modelopt_fp4 backend.

What Was Quantized

  • Quantization method: NVIDIA Model Optimizer NVFP4_DEFAULT_CFG
  • Export format: ModelOpt HF checkpoint with hf_quant_config.json
  • Runtime backend: vLLM modelopt_fp4
  • Weight format: NVFP4
  • Group size: 16
  • Quantized modules: text-side quantizable language-model weights under the default ModelOpt NVFP4 recipe
  • Left unquantized: model.visual* and lm_head
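NVFP4 stores weights as 4-bit E2M1 values with one scale per group of 16 elements. The group-wise scheme can be sketched conceptually as follows (illustration only, not the actual ModelOpt kernels: real NVFP4 also quantizes the per-group scales to FP8 E4M3 and runs fused CUDA kernels, while this sketch keeps the scale in full precision):

```python
# Conceptual sketch of NVFP4-style group quantization (group size 16).
# The E2M1 code grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} and absmax scaling are
# the standard FP4 conventions; everything else here is simplified.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_GRID = sorted(E2M1_GRID + [-v for v in E2M1_GRID if v > 0])

def quantize_group(weights, group_size=16):
    """Quantize a flat list of weights group-by-group with an absmax scale."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group) or 1.0
        scale = absmax / 6.0           # map the group's absmax onto E2M1's max code
        for w in group:
            code = min(E2M1_GRID, key=lambda v: abs(w / scale - v))
            out.append(code * scale)   # dequantized value
    return out

weights = [0.9, -0.35, 0.02, 0.61] * 4          # exactly one group of 16
deq = quantize_group(weights)
max_err = max(abs(a - b) for a, b in zip(weights, deq))
```

Per-group scaling is what keeps the 4-bit format usable: each group of 16 only has to cover its own dynamic range rather than the whole tensor's.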

Calibration Data

This checkpoint was calibrated on 1000 mixed retrieval samples from the local embedding benchmark workflow.

Calibration sources:

  • Polish text retrieval: mteb/MSMARCO-PL, mteb/NQ-PL, mteb/FiQA-PL
  • Multilingual text retrieval: MIRACL hard-negative slices for en, de, es, fr, ja
  • Multimodal retrieval: vidore/colpali_train_set and lmms-lab/flickr30k
  • Hard-negative augmentation: MIRACL-derived negatives

Local Benchmark Setup

The numbers below are from local full benchmark runs using the same harness for stock FP16 and quantized checkpoints.

Benchmark tasks:

  • mteb/MSMARCO-PL
  • mteb/NQ-PL
  • MIRACL hard-negative slices: en, de, es, fr, ja
  • vidore/vidore_v3_industrial
  • vidore/vidore_v3_computer_science

Metrics:

  • nDCG@10
  • Recall@10
  • MRR@10
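For reference, the three metrics can be computed from a ranked result list and a binary relevance set as below (a minimal sketch; the actual harness uses the standard MTEB-style implementations):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents retrieved in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
```

For example, with ranking `["d3", "d1", "d7"]` and relevant set `{"d1"}`, Recall@10 is 1.0 and MRR@10 is 0.5 (first hit at rank 2).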

Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | NVFP4 | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.55008 | -0.01214 |
| Recall@10 | 0.64934 | 0.63794 | -0.01141 |
| MRR@10 | 0.78883 | 0.77870 | -0.01013 |
| Benchmark wall time | 434.853 s | 377.707 s | 13.14% faster |
| Average request latency | 0.332726 s | 0.277620 s | -0.055106 s |
| Throughput | 18.4338 rps | 21.2228 rps | +2.7890 rps |

Notes:

  • This was the best speed/size trade-off among the quantized checkpoints we tested.
  • It matched or improved on the FP16 baseline for several text-heavy tasks, especially the multilingual MIRACL slices.
  • It regressed noticeably on the ViDoRe image benchmarks, so multimodal retrieval quality should be re-validated on your own workload before deployment.
  • The local validation smoke test passed with a mean probe cosine of 0.9172 against the FP16 checkpoint.
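The probe-cosine figure is the mean cosine similarity between FP16 and NVFP4 embeddings over a set of probe inputs. A minimal sketch of that check (the embedding lists are placeholders; how the probes are produced is up to your own harness):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_probe_cosine(ref_embs, quant_embs):
    """Average pairwise cosine between reference and quantized embeddings
    for the same probe inputs, in the same order."""
    sims = [cosine(a, b) for a, b in zip(ref_embs, quant_embs)]
    return sum(sims) / len(sims)
```

A value near 1.0 means the quantized model's embedding space is geometrically close to the FP16 reference for those probes; it is a sanity check, not a substitute for the retrieval benchmarks above.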

vLLM Usage

HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4 \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --limit-mm-per-prompt '{"image":1}'

If your vLLM build does not automatically pick up the bundled chat_template.jinja, download the repo locally and pass --chat-template /path/to/chat_template.jinja.
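Once the server is up, the model is exposed through vLLM's OpenAI-compatible /v1/embeddings endpoint. A minimal text-only client sketch (the localhost:8000 base URL is an assumption matching vLLM's default port; the model name matches the serve command above):

```python
import json
import urllib.request

def build_embedding_request(texts,
                            model="LifetimeMistake/Qwen3-VL-Embedding-2B-NVFP4"):
    """Build an OpenAI-style /v1/embeddings payload for a list of texts."""
    return {"model": model, "input": texts}

def embed(texts, base_url="http://localhost:8000"):
    """POST the payload and return one embedding vector per input text."""
    payload = json.dumps(build_embedding_request(texts)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Image inputs go through the chat-style multimodal request path instead; see the vLLM pooling-model documentation for the exact request shape supported by your build.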

Base Model Introduction

This model is a quantized derivative of Qwen/Qwen3-VL-Embedding-2B, the 2B member of Qwen’s multimodal embedding series.

Upstream model highlights:

  • Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
  • 30+ language support
  • 32k context length
  • Output dimension up to 2048, with support for smaller embedding dimensions
  • Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
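If smaller embedding dimensions are used (Matryoshka-style truncation), the vector should be re-normalized after truncation so cosine comparisons remain valid. A conceptual sketch (whether plain truncation matches the upstream training recipe should be verified against the base model card):

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

The trade-off is the usual one: smaller vectors cut index size and search cost at some loss in retrieval quality, which should be measured on your own workload.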

For the full base model card, broader benchmark tables, and upstream usage examples, see the upstream Qwen/Qwen3-VL-Embedding-2B repository.
