# Qwen3-VL-Embedding-8B-W8A8

W8A8 INT8 quantization of Qwen/Qwen3-VL-Embedding-8B — the #1 multimodal embedding model on MMEB-V2 (score 77.9).

## Quantization Details

| Property | Value |
|---|---|
| Method | GPTQ W8A8 INT8 (weights INT8 per-channel, activations INT8 dynamic per-token) |
| Format | compressed-tensors (vLLM native) |
| Tool | llm-compressor |
| Calibration | 512 samples from ultrachat-200k (`train_sft` split), max 2048 tokens |
| Vision encoder | Kept in BF16 (not quantized) |
| Non-linear weights | Preserved exactly (norms, biases, `embed_tokens`) |
| Model size | ~10.5 GB (down from ~16 GB BF16) |

## What's NOT quantized

- Vision encoder (`model.visual.*`) — kept in full BF16
- All RMSNorm weights
- Embedding table (`embed_tokens`)
- `lm_head` (not used for embeddings)

No SmoothQuant was applied — it modifies the RMSNorm weights, which destroys embedding quality.
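The exclusions above correspond to the `ignore` list in the recipe further down. A rough sketch of how those patterns select modules (matching semantics simplified, module names illustrative — not llm-compressor's actual resolver):

```python
import re

# Ignore patterns used for this model: plain names match exactly,
# "re:"-prefixed entries are treated as regexes.
IGNORE = ["lm_head", "re:.*visual.*"]

def is_ignored(module_name: str) -> bool:
    """True if a Linear module would be left unquantized (illustrative helper)."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], module_name):
                return True
        elif module_name == pat:
            return True
    return False

print(is_ignored("model.visual.blocks.0.attn.qkv"))                   # vision tower
print(is_ignored("model.language_model.layers.0.self_attn.q_proj"))   # language model
```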

## Serving with vLLM

```bash
vllm serve collin-park/Qwen3-VL-Embedding-8B-W8A8 \
  --quantization compressed-tensors \
  --runner pooling \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --host 0.0.0.0 \
  --port 8100
```

Tested with vLLM 0.17.1 on RTX 3090 (24GB). Uses ~10.4 GB VRAM for model + ~10.9 GB KV cache at 0.90 GPU utilization.

## API Usage

```bash
curl http://localhost:8100/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is machine learning?", "model": "collin-park/Qwen3-VL-Embedding-8B-W8A8"}'
```
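The endpoint returns the OpenAI-style embeddings schema (`data[i].embedding`). A minimal sketch of consuming it and comparing two embeddings by cosine similarity — the HTTP call is left commented out so the snippet runs without a live server, with dummy vectors standing in:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# With the server from above running (uncomment):
# import requests
# resp = requests.post(
#     "http://localhost:8100/v1/embeddings",
#     json={"model": "collin-park/Qwen3-VL-Embedding-8B-W8A8",
#           "input": ["What is machine learning?", "Explain ML briefly."]},
# ).json()
# vec_a, vec_b = (d["embedding"] for d in resp["data"])

# Dummy stand-ins for real embeddings:
vec_a, vec_b = [0.1, 0.9, 0.3], [0.2, 0.8, 0.4]
print(round(cosine(vec_a, vec_b), 4))
```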

## Original Model

- Architecture: Qwen3-VL (8B params, 36 layers, 4096-dim embeddings)
- Context: 32K tokens
- Modalities: text, images, screenshots, video
- Benchmarks: MMEB-V2 77.9, MMTEB retrieval 81.08
- MRL: supports custom embedding dimensions (64-4096)

See Qwen/Qwen3-VL-Embedding-8B for full details.
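With MRL (Matryoshka Representation Learning), a shorter embedding is typically obtained by truncating the full 4096-dim vector to the leading dimensions and L2-renormalizing. A minimal sketch (pure Python, dummy vector):

```python
import math

def mrl_truncate(embedding, dim):
    """Truncate an MRL embedding to `dim` leading dimensions and L2-renormalize."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.01] * 4096                 # dummy stand-in for a real 4096-dim embedding
short = mrl_truncate(full, 256)
print(len(short))                    # 256 dimensions
print(round(sum(x * x for x in short), 6))  # unit norm after renormalization
```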

## Quantization Recipe

```python
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B", dtype="auto", device_map="auto", trust_remote_code=True
)

recipe = [GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*visual.*"])]

oneshot(
    model=model,
    dataset="ultrachat-200k",
    splits={"calibration": "train_sft"},
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3-VL-Embedding-8B-W8A8",
)
```

Key choices:

- `Qwen3VLForConditionalGeneration` (not `AutoModel`) — preserves the `model.*` weight prefix for vLLM compatibility
- No `SmoothQuantModifier` — it modifies RMSNorm weights, destroying embedding quality
- `ignore=["re:.*visual.*"]` — keeps the ViT in BF16 for quality
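A quick sanity check after quantization is confirming which tensors ended up INT8 versus BF16. A hypothetical helper encoding the rules above (tensor names illustrative; the real compressed-tensors checkpoint also stores per-channel scales alongside the INT8 weights):

```python
def expected_dtype(tensor_name: str) -> str:
    """Expected storage dtype for a tensor in this checkpoint (illustrative rules)."""
    if ".visual." in tensor_name or tensor_name.startswith("visual."):
        return "bfloat16"          # vision encoder kept in BF16
    if "norm" in tensor_name or "embed_tokens" in tensor_name or "bias" in tensor_name:
        return "bfloat16"          # norms, embedding table, biases preserved
    if tensor_name.endswith(".weight"):
        return "int8"              # quantized Linear weights
    return "bfloat16"

print(expected_dtype("model.visual.blocks.0.mlp.fc1.weight"))
print(expected_dtype("model.language_model.layers.0.mlp.gate_proj.weight"))
```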