# Qwen3-VL-Embedding-2B-AWQ-4bit

## Quantized Model Overview

This repository contains a 4-bit AWQ derivative of [Qwen/Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B), prepared for direct vLLM deployment through the `compressed-tensors` backend.
## What Was Quantized

- Quantization method: `llm-compressor` AWQ (`W4A16_ASYM`)
- Export format: `compressed-tensors`
- Runtime backend: vLLM with `compressed-tensors`
- Weight format: 4-bit grouped asymmetric integer weights
- Group size: 128
- Calibration pipeline: `layer_sequential`
- Quantized modules: text-side `Linear` layers in the Qwen3-VL decoder
- Left unquantized: all `model.visual*` modules and `lm_head`
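As a rough sketch of the recipe above: the `llm-compressor` invocation is shown in comments only, because exact argument names vary by library version and are an assumption here. What the `ignore` list means in practice is concrete, though, and is expressed below as a module-name filter:

```python
import re

# Hypothetical reproduction sketch -- argument names are assumptions,
# check the llm-compressor docs for your installed version:
#
# from llmcompressor import oneshot
# from llmcompressor.modifiers.awq import AWQModifier
#
# recipe = AWQModifier(
#     scheme="W4A16_ASYM",   # 4-bit asymmetric grouped weights, 16-bit activations
#     targets=["Linear"],    # text-side Linear layers only
#     ignore=["lm_head", "re:model.visual.*"],
# )
# oneshot(model=model, dataset=calib_ds, recipe=recipe, pipeline="layer_sequential")

# The ignore list translates to a module filter like this:
IGNORE_PATTERNS = [re.compile(r"^lm_head$"), re.compile(r"^model\.visual.*")]

def should_quantize(module_name: str) -> bool:
    """True if a Linear module should receive 4-bit AWQ weights."""
    return not any(p.match(module_name) for p in IGNORE_PATTERNS)

print(should_quantize("model.layers.0.self_attn.q_proj"))  # True
print(should_quantize("model.visual.blocks.0.attn.qkv"))   # False
print(should_quantize("lm_head"))                          # False
```

Keeping the vision tower and `lm_head` in higher precision is what limits the quality loss on image-heavy benchmarks, at the cost of a smaller size reduction.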
## Calibration Data

This checkpoint was built from the same 1000-sample mixed retrieval manifest as the FP16 and NVFP4 workflows, but the final AWQ pass used only the 876 text-only samples and skipped the 124 image-bearing rows, since the vision stack was excluded from quantization.
Calibration sources:

- Polish text retrieval: `mteb/MSMARCO-PL`, `mteb/NQ-PL`, `mteb/FiQA-PL`
- Multilingual text retrieval: MIRACL hard-negative slices for `en`, `de`, `es`, `fr`, `ja`
- Multimodal retrieval in the master manifest: `vidore/colpali_train_set` and `lmms-lab/flickr30k`
- Hard-negative augmentation: MIRACL-derived negatives
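The 876/124 split described above amounts to partitioning the manifest on whether a row carries an image. A minimal sketch, assuming hypothetical `text`/`image` field names (the actual manifest schema is not documented here):

```python
# Split a calibration manifest into text-only rows (fed to the AWQ pass)
# and image-bearing rows (skipped, since the vision tower stays unquantized).
# Field names "text" and "image" are assumptions for illustration.

def split_manifest(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    text_only, image_bearing = [], []
    for row in rows:
        (image_bearing if row.get("image") else text_only).append(row)
    return text_only, image_bearing

rows = [
    {"text": "query about AWQ"},
    {"text": "document page", "image": "page_0.png"},
    {"text": "another query"},
]
text_only, image_bearing = split_manifest(rows)
print(len(text_only), len(image_bearing))  # 2 1
```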
## Local Benchmark Setup

The numbers below come from local full benchmark runs using the same harness for both the stock FP16 and quantized checkpoints.

Benchmark tasks:

- `mteb/MSMARCO-PL`
- `mteb/NQ-PL`
- MIRACL hard-negative slices: `en`, `de`, `es`, `fr`, `ja`
- `vidore/vidore_v3_industrial`
- `vidore/vidore_v3_computer_science`

Metrics:

- `nDCG@10`
- `Recall@10`
- `MRR@10`
## Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | AWQ 4-bit | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.54474 | -0.01748 |
| Recall@10 | 0.64934 | 0.63544 | -0.01390 |
| MRR@10 | 0.78883 | 0.80040 | +0.01157 |
| Benchmark wall time | 434.853 s | 435.140 s | 0.07% slower |
| Average request latency | 0.332726 s | 0.333469 s | +0.000743 s |
| Throughput | 18.4338 rps | 18.4217 rps | -0.0121 rps |
Notes:

- This was the better of the two multimodal quantized checkpoints we tested.
- It preserved the ViDoRe image benchmarks substantially better than NVFP4 and improved `vidore_v3_computer_science` over the FP16 baseline.
- It did not produce a meaningful runtime speedup versus the FP16 checkpoint in this harness.
- The AWQ export is larger than the NVFP4 export and took much longer to build.
## vLLM Usage

```shell
HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization compressed-tensors \
  --limit-mm-per-prompt '{"image":1}'
```

If your vLLM build does not automatically pick up the bundled `chat_template.jinja`, download the repo locally and pass `--chat-template /path/to/chat_template.jinja`.
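Once the server is up, embeddings can be requested through vLLM's OpenAI-compatible `/v1/embeddings` endpoint. A minimal stdlib-only client sketch, assuming the server runs on the default `localhost:8000` (adjust host/port to your deployment):

```python
import json
from urllib.request import Request, urlopen

# Assumed server address for the vllm serve command above.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit",
    "input": ["What does AWQ quantization change in a decoder?"],
}

def embed(payload: dict) -> list[float]:
    """POST to the OpenAI-compatible embeddings endpoint, return one vector."""
    req = Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        body = json.load(resp)
    return body["data"][0]["embedding"]

# vector = embed(payload)  # requires the server to be running
```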
## Base Model Introduction

This model is a quantized derivative of [Qwen/Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B), the 2B member of Qwen's multimodal embedding series.

Upstream model highlights:

- Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
- Support for 30+ languages
- 32k context length
- Output dimension up to 2048, with support for smaller embedding dimensions
- Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
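On the smaller-dimension support: the usual recipe for shrinking such embeddings is truncation followed by re-normalization. A minimal sketch of that mechanism; whether this exact path is the one the base model sanctions is an assumption, so consult the upstream model card for the official API:

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and L2-renormalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

# Toy 4-dim vector shrunk to 2 dims; real vectors here are up to 2048-dim.
print(truncate_embedding([3.0, 4.0, 0.1, 0.2], dim=2))  # [0.6, 0.8]
```

Cosine similarities computed on the truncated vectors then approximate those of the full-dimension embeddings at a fraction of the storage cost.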
For the full base model card, broader benchmark tables, and upstream usage examples, see:

- Base model: https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B
- GitHub: https://github.com/QwenLM/Qwen3-VL-Embedding
- Technical report: https://arxiv.org/abs/2601.04720