# Qwen3-VL-Embedding-8B-W8A8
W8A8 INT8 quantization of Qwen/Qwen3-VL-Embedding-8B — the #1 multimodal embedding model on MMEB-V2 (77.9).
## Quantization Details
| Property | Value |
|---|---|
| Method | GPTQ W8A8 INT8 (weights INT8 per-channel, activations INT8 dynamic per-token) |
| Format | compressed-tensors (vLLM native) |
| Tool | llm-compressor |
| Calibration | 512 samples from ultrachat-200k (train_sft split), max 2048 tokens |
| Vision encoder | Kept in BF16 (not quantized) |
| Non-linear weights | Preserved exactly (norms, biases, embed_tokens) |
| Model size | ~10.5 GB (down from ~16 GB BF16) |
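To make the scheme in the table concrete, here is a minimal NumPy sketch of symmetric W8A8: INT8 weights with one scale per output channel, INT8 activations with one dynamic scale per token, and an INT32-accumulated matmul that is dequantized with both scales. This is illustrative only, not llm-compressor's actual kernel.

```python
import numpy as np

def w8a8_matmul(weight: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Illustrative W8A8: per-channel INT8 weights, dynamic per-token INT8 activations."""
    # One symmetric scale per output channel (row of the weight matrix).
    w_scale = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    w_q = np.clip(np.round(weight / w_scale), -127, 127).astype(np.int8)

    # One dynamic symmetric scale per token (row of the activation matrix).
    a_scale = np.abs(activations).max(axis=1, keepdims=True) / 127.0
    a_q = np.clip(np.round(activations / a_scale), -127, 127).astype(np.int8)

    # INT8 x INT8 matmul accumulated in INT32, then dequantized with both scales.
    acc = a_q.astype(np.int32) @ w_q.T.astype(np.int32)
    return acc.astype(np.float32) * a_scale * w_scale.T

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)).astype(np.float32)   # [out_features, in_features]
X = rng.normal(size=(4, 64)).astype(np.float32)    # [tokens, in_features]
err = np.abs(w8a8_matmul(W, X) - X @ W.T).max()    # quantization error vs. FP32 matmul
```

"Dynamic" here means the activation scales are computed per batch at inference time, which is why no activation calibration statistics need to be stored in the checkpoint.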
### What's NOT quantized

- Vision encoder (`model.visual.*`): kept in full BF16
- All RMSNorm weights
- Embedding table (`embed_tokens`)
- `lm_head` (not used for embeddings)

No SmoothQuant was applied: it corrupts norm weights, which destroys embedding quality.
## Serving with vLLM
```bash
vllm serve collin-park/Qwen3-VL-Embedding-8B-W8A8 \
  --quantization compressed-tensors \
  --runner pooling \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --host 0.0.0.0 \
  --port 8100
```
Tested with vLLM 0.17.1 on an RTX 3090 (24 GB). At 0.90 GPU memory utilization, the model weights use ~10.4 GB of VRAM and the KV cache ~10.9 GB.
## API Usage
```bash
curl http://localhost:8100/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is machine learning?", "model": "collin-park/Qwen3-VL-Embedding-8B-W8A8"}'
```
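The endpoint returns an OpenAI-style embeddings response, with the vector under `data[0].embedding`. Once you have two vectors back, cosine similarity is the usual way to compare them; a minimal sketch (the vectors below are hypothetical stand-ins for two API responses):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for data[0]["embedding"] from two /v1/embeddings calls.
query_vec = [0.10, 0.30, -0.20, 0.40]
doc_vec = [0.10, 0.25, -0.10, 0.50]
score = cosine(query_vec, doc_vec)
```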
## Original Model
- **Architecture:** Qwen3-VL (8B params, 36 layers, 4096-dim embeddings)
- **Context:** 32K tokens
- **Modalities:** Text, images, screenshots, video
- **Benchmarks:** MMEB-V2 77.9, MMTEB retrieval 81.08
- **MRL:** Supports custom embedding dimensions (64-4096)
See Qwen/Qwen3-VL-Embedding-8B for full details.
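MRL (Matryoshka Representation Learning) support means a full 4096-dim embedding can be cut down to a smaller dimension and re-normalized, trading accuracy for storage. A minimal sketch of the standard Matryoshka usage, keeping the leading components:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components, then L2-normalize
    so cosine similarity remains meaningful."""
    out = vec[:dim]
    return out / np.linalg.norm(out)

# Stand-in for a full 4096-dim embedding returned by the model.
full = np.random.default_rng(1).normal(size=4096).astype(np.float32)
small = truncate_embedding(full, 256)  # 16x smaller vector
```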
## Quantization Recipe
```python
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B", dtype="auto", device_map="auto", trust_remote_code=True
)

recipe = [GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*visual.*"])]

oneshot(
    model=model, dataset="ultrachat-200k", splits={"calibration": "train_sft"},
    recipe=recipe, max_seq_length=2048, num_calibration_samples=512,
    output_dir="Qwen3-VL-Embedding-8B-W8A8",
)
```
Key choices:

- `Qwen3VLForConditionalGeneration` (not `AutoModel`): preserves the `model.*` weight prefix for vLLM compatibility
- No `SmoothQuantModifier`: it modifies RMSNorm weights, destroying embedding quality
- `ignore=["re:.*visual.*"]`: keeps the ViT in BF16 for quality