# Qwen3-VL-Embedding-8B-W8A8

W8A8 INT8 quantization of Qwen/Qwen3-VL-Embedding-8B — the #1 multimodal embedding model on MMEB-V2 (score 77.9).

## Quantization Details

| Property | Value |
|---|---|
| Method | GPTQ W8A8 INT8 (weights INT8 per-channel, activations INT8 dynamic per-token) |
| Format | compressed-tensors (vLLM native) |
| Tool | llm-compressor |
| Calibration | 512 samples from ultrachat-200k (`train_sft` split), max 2048 tokens |
| Vision encoder | Kept in BF16 (not quantized) |
| Non-linear weights | Preserved exactly (norms, biases, `embed_tokens`) |
| Model size | ~10.5 GB (down from ~16 GB BF16) |

## What's NOT quantized

- Vision encoder (`model.visual.*`) — kept in full BF16
- All RMSNorm weights
- Embedding table (`embed_tokens`)
- `lm_head` (not used for embeddings)

No SmoothQuant was applied — it modifies the RMSNorm weights, which destroys embedding quality.
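The exclusions above correspond to the `ignore` list in the recipe further down. A rough sketch of how those patterns select modules (matching semantics simplified, module names illustrative — not llm-compressor's actual resolver):

```python
import re

# Ignore patterns used for this model: plain names match exactly,
# "re:"-prefixed entries are treated as regexes.
IGNORE = ["lm_head", "re:.*visual.*"]

def is_ignored(module_name: str) -> bool:
    """True if a Linear module would be left unquantized (illustrative helper)."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], module_name):
                return True
        elif module_name == pat:
            return True
    return False

print(is_ignored("model.visual.blocks.0.attn.qkv"))                   # vision tower
print(is_ignored("model.language_model.layers.0.self_attn.q_proj"))   # language model
```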

## Serving with vLLM

```bash
vllm serve collin-park/Qwen3-VL-Embedding-8B-W8A8 \
  --quantization compressed-tensors \
  --runner pooling \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --host 0.0.0.0 \
  --port 8100
```

Tested with vLLM 0.17.1 on RTX 3090 (24GB). Uses ~10.4 GB VRAM for model + ~10.9 GB KV cache at 0.90 GPU utilization.

## API Usage

```bash
curl http://localhost:8100/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "What is machine learning?", "model": "collin-park/Qwen3-VL-Embedding-8B-W8A8"}'
```
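The endpoint returns the OpenAI-style embeddings schema (`data[i].embedding`). A minimal sketch of consuming it and comparing two embeddings by cosine similarity — the HTTP call is left commented out so the snippet runs without a live server, with dummy vectors standing in:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# With the server from above running (uncomment):
# import requests
# resp = requests.post(
#     "http://localhost:8100/v1/embeddings",
#     json={"model": "collin-park/Qwen3-VL-Embedding-8B-W8A8",
#           "input": ["What is machine learning?", "Explain ML briefly."]},
# ).json()
# vec_a, vec_b = (d["embedding"] for d in resp["data"])

# Dummy stand-ins for real embeddings:
vec_a, vec_b = [0.1, 0.9, 0.3], [0.2, 0.8, 0.4]
print(round(cosine(vec_a, vec_b), 4))
```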

## Original Model

- Architecture: Qwen3-VL (8B params, 36 layers, 4096-dim embeddings)
- Context: 32K tokens
- Modalities: text, images, screenshots, video
- Benchmarks: MMEB-V2 77.9, MMTEB retrieval 81.08
- MRL: supports custom embedding dimensions (64-4096)

See Qwen/Qwen3-VL-Embedding-8B for full details.
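With MRL (Matryoshka Representation Learning), a shorter embedding is typically obtained by truncating the full 4096-dim vector to the leading dimensions and L2-renormalizing. A minimal sketch (pure Python, dummy vector):

```python
import math

def mrl_truncate(embedding, dim):
    """Truncate an MRL embedding to `dim` leading dimensions and L2-renormalize."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.01] * 4096                 # dummy stand-in for a real 4096-dim embedding
short = mrl_truncate(full, 256)
print(len(short))                    # 256 dimensions
print(round(sum(x * x for x in short), 6))  # unit norm after renormalization
```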

## Quantization Recipe

```python
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B", dtype="auto", device_map="auto", trust_remote_code=True
)

recipe = [GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*visual.*"])]

oneshot(
    model=model,
    dataset="ultrachat-200k",
    splits={"calibration": "train_sft"},
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3-VL-Embedding-8B-W8A8",
)
```

Key choices:

- `Qwen3VLForConditionalGeneration` (not `AutoModel`) — preserves the `model.*` weight prefix for vLLM compatibility
- No `SmoothQuantModifier` — it modifies RMSNorm weights, destroying embedding quality
- `ignore=["re:.*visual.*"]` — keeps the ViT in BF16 for quality
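A quick sanity check after quantization is confirming which tensors ended up INT8 versus BF16. A hypothetical helper encoding the rules above (tensor names illustrative; the real compressed-tensors checkpoint also stores per-channel scales alongside the INT8 weights):

```python
def expected_dtype(tensor_name: str) -> str:
    """Expected storage dtype for a tensor in this checkpoint (illustrative rules)."""
    if ".visual." in tensor_name or tensor_name.startswith("visual."):
        return "bfloat16"          # vision encoder kept in BF16
    if "norm" in tensor_name or "embed_tokens" in tensor_name or "bias" in tensor_name:
        return "bfloat16"          # norms, embedding table, biases preserved
    if tensor_name.endswith(".weight"):
        return "int8"              # quantized Linear weights
    return "bfloat16"

print(expected_dtype("model.visual.blocks.0.mlp.fc1.weight"))
print(expected_dtype("model.language_model.layers.0.mlp.gate_proj.weight"))
```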