Qwen3-VL-Embedding-2B-AWQ-4bit


Quantized Model Overview

This repository contains a 4-bit AWQ derivative of Qwen/Qwen3-VL-Embedding-2B prepared for direct vLLM deployment through the compressed-tensors backend.

What Was Quantized

  • Quantization method: llm-compressor AWQ (W4A16_ASYM)
  • Export format: compressed-tensors
  • Runtime backend: vLLM compressed-tensors
  • Weight format: 4-bit grouped asymmetric integer weights
  • Group size: 128
  • Calibration pipeline: layer_sequential
  • Quantized modules: text-side Linear layers in the Qwen3-VL decoder
  • Left unquantized: all model.visual* modules and lm_head
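The quantized/ignored split above can be sketched as a module-name filter over the checkpoint's quantization config. The `ignore` patterns below are illustrative assumptions, not a verbatim excerpt of this repo's `config.json`; the `re:` prefix marks a regex pattern, as in llm-compressor's compressed-tensors exports.

```python
import re

# Hypothetical ignore list in compressed-tensors style ("re:" = regex pattern).
ignore = ["re:.*visual.*", "lm_head"]

def is_quantized(module_name: str) -> bool:
    """Return True if a Linear module would be quantized under this config."""
    for pat in ignore:
        if pat.startswith("re:"):
            if re.fullmatch(pat[3:], module_name):
                return False
        elif module_name == pat:
            return False
    return True

modules = [
    "model.layers.0.self_attn.q_proj",   # text-side Linear -> quantized
    "model.visual.blocks.0.attn.qkv",    # vision stack -> left in fp16
    "lm_head",                           # output head -> left in fp16
]
for m in modules:
    print(m, "->", "W4A16_ASYM" if is_quantized(m) else "unquantized")
```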

Calibration Data

This checkpoint was calibrated on the same 1000-sample mixed retrieval manifest used for the FP16 and NVFP4 workflows. Because the vision stack was excluded from quantization, the final AWQ pass used the 876 text-only samples and skipped the 124 image-bearing rows.

Calibration sources:

  • Polish text retrieval: mteb/MSMARCO-PL, mteb/NQ-PL, mteb/FiQA-PL
  • Multilingual text retrieval: MIRACL hard-negative slices for en, de, es, fr, ja
  • Multimodal retrieval in the master manifest: vidore/colpali_train_set and lmms-lab/flickr30k
  • Hard-negative augmentation: MIRACL-derived negatives
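The text-only/image-bearing split can be sketched as a simple manifest filter. The row schema and the `image` field name are assumptions for illustration; on the real manifest this filter produced the 876/124 split described above.

```python
# Toy stand-in for the 1000-sample mixed retrieval manifest (field names assumed).
manifest = [
    {"query": "czym jest nDCG?", "image": None},       # text-only retrieval row
    {"query": "find the revenue chart", "image": "page_17.png"},  # multimodal row
    {"query": "was ist BM25?", "image": None},
]

# Rows without an image go to the AWQ calibration pass;
# image-bearing rows are skipped since the vision stack stays unquantized.
text_only = [row for row in manifest if row.get("image") is None]
image_bearing = [row for row in manifest if row.get("image") is not None]

print(len(text_only), len(image_bearing))  # -> 2 1 on this toy manifest
```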

Local Benchmark Setup

The numbers below come from local runs of the full benchmark, using the same harness for both the stock FP16 checkpoint and the quantized checkpoints.

Benchmark tasks:

  • mteb/MSMARCO-PL
  • mteb/NQ-PL
  • MIRACL hard-negative slices: en, de, es, fr, ja
  • vidore/vidore_v3_industrial
  • vidore/vidore_v3_computer_science

Metrics:

  • nDCG@10
  • Recall@10
  • MRR@10
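For reference, the three metrics can be computed on a toy ranking as follows. These are plain-Python reference implementations with binary relevance, not the harness's actual code.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked, relevant, k=10):
    gains = [1.0 if d in relevant else 0.0 for d in ranked[:k]]
    ideal = [1.0] * min(len(relevant), k)  # all relevant docs ranked first
    return dcg(gains) / dcg(ideal) if relevant else 0.0

def recall_at_k(ranked, relevant, k=10):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    for i, d in enumerate(ranked[:k]):
        if d in relevant:
            return 1.0 / (i + 1)  # reciprocal rank of the first relevant hit
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]   # system ranking for one query
relevant = {"d1", "d2"}             # gold relevant documents
print(round(ndcg_at_k(ranked, relevant), 4))
print(recall_at_k(ranked, relevant))   # 1.0: both relevant docs in top 10
print(mrr_at_k(ranked, relevant))      # 0.5: first relevant doc at rank 2
```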

Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | AWQ 4-bit | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.54474 | -0.01748 |
| Recall@10 | 0.64934 | 0.63544 | -0.01390 |
| MRR@10 | 0.78883 | 0.80040 | +0.01157 |
| Benchmark wall time | 434.853 s | 435.140 s | +0.287 s (0.07% slower) |
| Average request latency | 0.332726 s | 0.333469 s | +0.000743 s |
| Throughput | 18.4338 rps | 18.4217 rps | -0.0121 rps |

Notes:

  • Of the two quantized multimodal checkpoints we tested, this was the better one.
  • It preserved the ViDoRe image benchmarks substantially better than NVFP4 and improved vidore_v3_computer_science over the FP16 baseline.
  • It did not produce a meaningful runtime speedup versus the FP16 checkpoint in this harness.
  • The AWQ export is larger than the NVFP4 export and took much longer to build.
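The deltas in the table can be reproduced from the reported raw numbers:

```python
# Quality metrics from the local full benchmark (copied from the table above).
fp16 = {"ndcg@10": 0.56222, "recall@10": 0.64934, "mrr@10": 0.78883}
awq  = {"ndcg@10": 0.54474, "recall@10": 0.63544, "mrr@10": 0.80040}

for metric in fp16:
    delta = awq[metric] - fp16[metric]
    print(f"{metric}: {delta:+.5f}")

# Wall-time overhead of the AWQ run relative to FP16.
wall_fp16, wall_awq = 434.853, 435.140
print(f"wall time overhead: {(wall_awq - wall_fp16) / wall_fp16:+.2%}")  # ~+0.07%
```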

vLLM Usage

HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization compressed-tensors \
  --limit-mm-per-prompt '{"image":1}'

If your vLLM build does not automatically pick up the bundled chat_template.jinja, download the repo locally and pass --chat-template /path/to/chat_template.jinja.
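With the server above running, embeddings can be requested through vLLM's OpenAI-compatible REST endpoint. The sketch below is a minimal text-only client; the host, port, and payload shape are assumptions, and the multimodal input format varies by vLLM version, so check your version's docs.

```python
import json
from urllib import request

def embed_payload(texts):
    # OpenAI-style /v1/embeddings request body (text-only sketch).
    return {
        "model": "LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit",
        "input": texts,
    }

payload = embed_payload(["jak dziala wyszukiwanie wektorowe?"])
req = request.Request(
    "http://localhost:8000/v1/embeddings",  # assumed default serve address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)              # uncomment with a running server
# vectors = [row["embedding"] for row in json.load(resp)["data"]]
```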

Base Model Introduction

This model is a quantized derivative of Qwen/Qwen3-VL-Embedding-2B, the 2B member of Qwen’s multimodal embedding series.

Upstream model highlights:

  • Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
  • 30+ language support
  • 32k context length
  • Output dimension up to 2048, with support for smaller embedding dimensions
  • Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
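Smaller embedding dimensions are commonly obtained by truncating the vector and re-normalizing (Matryoshka-style). The helper below is a sketch of that idea under the assumption that the upstream model's dimensions are ordered for truncation; it is not the upstream API.

```python
import math

def truncate_embedding(vec, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.5, 0.5, 0.5, 0.5]          # toy stand-in for a 2048-d embedding
small = truncate_embedding(vec, 2)  # unit-norm 2-d vector
print(small)
```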

For the full base model card, broader benchmark tables, and upstream usage examples, see the Qwen/Qwen3-VL-Embedding-2B model card.
