Qwen3-VL-Embedding-2B-AWQ-4bit


Quantized Model Overview

This repository contains a 4-bit AWQ derivative of Qwen/Qwen3-VL-Embedding-2B prepared for direct vLLM deployment through the compressed-tensors backend.

What Was Quantized

  • Quantization method: llm-compressor AWQ (W4A16_ASYM)
  • Export format: compressed-tensors
  • Runtime backend: vLLM compressed-tensors
  • Weight format: 4-bit grouped asymmetric integer weights
  • Group size: 128
  • Calibration pipeline: layer_sequential
  • Quantized modules: text-side Linear layers in the Qwen3-VL decoder
  • Left unquantized: all model.visual* modules and lm_head
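The quantized/ignored split above can be sketched as a module-name filter over the checkpoint's quantization config. The `ignore` patterns below are illustrative assumptions, not a verbatim excerpt of this repo's `config.json`; the `re:` prefix marks a regex pattern, as in llm-compressor's compressed-tensors exports.

```python
import re

# Hypothetical ignore list in compressed-tensors style ("re:" = regex pattern).
ignore = ["re:.*visual.*", "lm_head"]

def is_quantized(module_name: str) -> bool:
    """Return True if a Linear module would be quantized under this config."""
    for pat in ignore:
        if pat.startswith("re:"):
            if re.fullmatch(pat[3:], module_name):
                return False
        elif module_name == pat:
            return False
    return True

modules = [
    "model.layers.0.self_attn.q_proj",   # text-side Linear -> quantized
    "model.visual.blocks.0.attn.qkv",    # vision stack -> left in fp16
    "lm_head",                           # output head -> left in fp16
]
for m in modules:
    print(m, "->", "W4A16_ASYM" if is_quantized(m) else "unquantized")
```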

Calibration Data

This checkpoint was calibrated on the same 1000-sample mixed retrieval manifest used for the FP16 and NVFP4 workflows. Because the vision stack was excluded from quantization, the final AWQ pass used the 876 text-only samples and skipped the 124 image-bearing rows.

Calibration sources:

  • Polish text retrieval: mteb/MSMARCO-PL, mteb/NQ-PL, mteb/FiQA-PL
  • Multilingual text retrieval: MIRACL hard-negative slices for en, de, es, fr, ja
  • Multimodal retrieval in the master manifest: vidore/colpali_train_set and lmms-lab/flickr30k
  • Hard-negative augmentation: MIRACL-derived negatives
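The text-only/image-bearing split can be sketched as a simple manifest filter. The row schema and the `image` field name are assumptions for illustration; on the real manifest this filter produced the 876/124 split described above.

```python
# Toy stand-in for the 1000-sample mixed retrieval manifest (field names assumed).
manifest = [
    {"query": "czym jest nDCG?", "image": None},       # text-only retrieval row
    {"query": "find the revenue chart", "image": "page_17.png"},  # multimodal row
    {"query": "was ist BM25?", "image": None},
]

# Rows without an image go to the AWQ calibration pass;
# image-bearing rows are skipped since the vision stack stays unquantized.
text_only = [row for row in manifest if row.get("image") is None]
image_bearing = [row for row in manifest if row.get("image") is not None]

print(len(text_only), len(image_bearing))  # -> 2 1 on this toy manifest
```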

Local Benchmark Setup

The numbers below come from local runs of the full benchmark, using the same harness for both the stock FP16 checkpoint and the quantized checkpoints.

Benchmark tasks:

  • mteb/MSMARCO-PL
  • mteb/NQ-PL
  • MIRACL hard-negative slices: en, de, es, fr, ja
  • vidore/vidore_v3_industrial
  • vidore/vidore_v3_computer_science

Metrics:

  • nDCG@10
  • Recall@10
  • MRR@10
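For reference, the three metrics can be computed on a toy ranking as follows. These are plain-Python reference implementations with binary relevance, not the harness's actual code.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked, relevant, k=10):
    gains = [1.0 if d in relevant else 0.0 for d in ranked[:k]]
    ideal = [1.0] * min(len(relevant), k)  # all relevant docs ranked first
    return dcg(gains) / dcg(ideal) if relevant else 0.0

def recall_at_k(ranked, relevant, k=10):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    for i, d in enumerate(ranked[:k]):
        if d in relevant:
            return 1.0 / (i + 1)  # reciprocal rank of the first relevant hit
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]   # system ranking for one query
relevant = {"d1", "d2"}             # gold relevant documents
print(round(ndcg_at_k(ranked, relevant), 4))
print(recall_at_k(ranked, relevant))   # 1.0: both relevant docs in top 10
print(mrr_at_k(ranked, relevant))      # 0.5: first relevant doc at rank 2
```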

Baseline Comparison

Compared with the stock FP16 Qwen/Qwen3-VL-Embedding-2B checkpoint on the local full benchmark:

| Metric | Stock FP16 | AWQ 4-bit | Delta |
|---|---|---|---|
| nDCG@10 | 0.56222 | 0.54474 | -0.01748 |
| Recall@10 | 0.64934 | 0.63544 | -0.01390 |
| MRR@10 | 0.78883 | 0.80040 | +0.01157 |
| Benchmark wall time | 434.853 s | 435.140 s | +0.287 s (0.07% slower) |
| Average request latency | 0.332726 s | 0.333469 s | +0.000743 s |
| Throughput | 18.4338 rps | 18.4217 rps | -0.0121 rps |

Notes:

  • Of the two quantized multimodal checkpoints we tested, this was the better one.
  • It preserved the ViDoRe image benchmarks substantially better than NVFP4 and improved vidore_v3_computer_science over the FP16 baseline.
  • It did not produce a meaningful runtime speedup versus the FP16 checkpoint in this harness.
  • The AWQ export is larger than the NVFP4 export and took much longer to build.
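The deltas in the table can be reproduced from the reported raw numbers:

```python
# Quality metrics from the local full benchmark (copied from the table above).
fp16 = {"ndcg@10": 0.56222, "recall@10": 0.64934, "mrr@10": 0.78883}
awq  = {"ndcg@10": 0.54474, "recall@10": 0.63544, "mrr@10": 0.80040}

for metric in fp16:
    delta = awq[metric] - fp16[metric]
    print(f"{metric}: {delta:+.5f}")

# Wall-time overhead of the AWQ run relative to FP16.
wall_fp16, wall_awq = 434.853, 435.140
print(f"wall time overhead: {(wall_awq - wall_fp16) / wall_fp16:+.2%}")  # ~+0.07%
```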

vLLM Usage

HF_TOKEN=hf_xxx \
vllm serve LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit \
  --runner pooling \
  --convert embed \
  --trust-remote-code \
  --quantization compressed-tensors \
  --limit-mm-per-prompt '{"image":1}'

If your vLLM build does not automatically pick up the bundled chat_template.jinja, download the repo locally and pass --chat-template /path/to/chat_template.jinja.
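With the server above running, embeddings can be requested through vLLM's OpenAI-compatible REST endpoint. The sketch below is a minimal text-only client; the host, port, and payload shape are assumptions, and the multimodal input format varies by vLLM version, so check your version's docs.

```python
import json
from urllib import request

def embed_payload(texts):
    # OpenAI-style /v1/embeddings request body (text-only sketch).
    return {
        "model": "LifetimeMistake/Qwen3-VL-Embedding-2B-AWQ-4bit",
        "input": texts,
    }

payload = embed_payload(["jak dziala wyszukiwanie wektorowe?"])
req = request.Request(
    "http://localhost:8000/v1/embeddings",  # assumed default serve address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)              # uncomment with a running server
# vectors = [row["embedding"] for row in json.load(resp)["data"]]
```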

Base Model Introduction

This model is a quantized derivative of Qwen/Qwen3-VL-Embedding-2B, the 2B member of Qwen’s multimodal embedding series.

Upstream model highlights:

  • Multimodal inputs: text, images, screenshots, video, and mixed text+vision inputs
  • 30+ language support
  • 32k context length
  • Output dimension up to 2048, with support for smaller embedding dimensions
  • Instruction-aware retrieval behavior, with English instructions recommended even for multilingual tasks
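Smaller embedding dimensions are commonly obtained by truncating the vector and re-normalizing (Matryoshka-style). The helper below is a sketch of that idea under the assumption that the upstream model's dimensions are ordered for truncation; it is not the upstream API.

```python
import math

def truncate_embedding(vec, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.5, 0.5, 0.5, 0.5]          # toy stand-in for a 2048-d embedding
small = truncate_embedding(vec, 2)  # unit-norm 2-d vector
print(small)
```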

For the full base model card, broader benchmark tables, and upstream usage examples, see the Qwen/Qwen3-VL-Embedding-2B model card.
