# Qwen3-VL-Embedding-8B-FP8

Paper: *Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking* (arXiv:2601.04720)
This is an FP8-quantized version of [Qwen/Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B), optimized for efficient inference with vLLM.

## Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-Embedding-8B |
| Quantization | FP8 Dynamic (W8A8) |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~9 GB (FP8) |
| Memory Savings | ~45% |
| Embedding Dimension | 4096 |
| Supported Inputs | Text, Images, Videos, Multimodal |
| Context Length | 32K tokens |
## Quantization Scheme

| Component | Precision | Notes |
|---|---|---|
| Vision Encoder (ViT) | BF16 | Preserved for accuracy |
| LLM Decoder Layers | FP8 | Quantized for efficiency |
| Embeddings | BF16 | Preserved |
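The size figures in the tables above are mutually consistent: at 2 bytes per parameter for BF16 and 1 byte for FP8, a ~16 GB → ~9 GB reduction implies that roughly 1B of the ~8B parameters (the vision encoder plus embeddings) stay in BF16. A back-of-envelope sketch (the 8B/1B split is an inference from the stated sizes, not an official figure):

```python
# Rough size estimate: BF16 = 2 bytes/param, FP8 = 1 byte/param.
# Assumed split: ~1B params kept in BF16 (vision encoder + embeddings),
# ~7B decoder params quantized to FP8. Counts are illustrative.
TOTAL_PARAMS = 8e9
BF16_PARAMS = 1e9  # assumption inferred from the stated checkpoint sizes

bf16_size_gb = TOTAL_PARAMS * 2 / 1e9  # full BF16 checkpoint
fp8_size_gb = BF16_PARAMS * 2 / 1e9 + (TOTAL_PARAMS - BF16_PARAMS) * 1 / 1e9
savings = 1 - fp8_size_gb / bf16_size_gb

print(f"BF16: ~{bf16_size_gb:.0f} GB, mixed FP8: ~{fp8_size_gb:.0f} GB, "
      f"savings: ~{savings:.0%}")
```

This lands on ~16 GB, ~9 GB, and ~44% savings, matching the "~45%" in the table.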
## Usage with vLLM (Python API)

```python
from vllm import LLM, EngineArgs
import numpy as np

# Initialize vLLM with the pooling runner
engine_args = EngineArgs(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",
    dtype="bfloat16",
    trust_remote_code=True,
)
llm = LLM(**vars(engine_args))

# Prepare inputs
tokenizer = llm.get_tokenizer()

def format_input(text, instruction="Represent the user's input."):
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": [{"type": "text", "text": text}]},
    ]
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    return {"prompt": prompt}

# Get embeddings
inputs = [
    format_input("A woman playing with her dog on the beach."),
    format_input("Machine learning for image classification."),
]
outputs = llm.embed(inputs)

# Extract embeddings
embeddings = np.array([o.outputs.embedding for o in outputs])
print(f"Embeddings shape: {embeddings.shape}")  # (2, 4096)

# Compute similarity
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
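The dot product above equals cosine similarity only when the vectors are unit-norm; vLLM's embedding pooler normally L2-normalizes its outputs, but a defensive variant that normalizes explicitly costs little (a minimal sketch, independent of the model):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity that does not assume pre-normalized embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([2.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```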
## Usage with vLLM (OpenAI-Compatible Server)

```bash
# Start the server
vllm serve RamManavalan/Qwen3-VL-Embedding-8B-FP8 --task embed

# Query via the API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here", "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8"}'
```
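The same endpoint can be called from Python without extra dependencies. A minimal stdlib sketch, assuming the server above is running on `localhost:8000` and returns the standard OpenAI embeddings response shape (`embed_request`/`parse_embeddings` are illustrative helper names, not part of any API):

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/embeddings"
MODEL = "RamManavalan/Qwen3-VL-Embedding-8B-FP8"

def embed_request(texts):
    """Build the JSON body for the OpenAI-compatible /v1/embeddings endpoint."""
    return json.dumps({"input": texts, "model": MODEL}).encode()

def parse_embeddings(response_json):
    """Extract vectors from an OpenAI-style response, in input order."""
    data = sorted(response_json["data"], key=lambda d: d["index"])
    return [d["embedding"] for d in data]

def embed(texts):
    """POST the texts to the running server and return their embeddings."""
    req = urllib.request.Request(
        ENDPOINT,
        data=embed_request(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_embeddings(json.load(resp))
```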
## Usage with Transformers

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    trust_remote_code=True,
)

# Prepare input
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Represent the user's input."}]},
    {"role": "user", "content": [{"type": "text", "text": "Your text here"}]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.device)

# Run the backbone (no LM head) to get hidden states
with torch.no_grad():
    outputs = model.model(**inputs, output_hidden_states=True)

# Last-token pooling: take the hidden state at the last non-padding token
seq_len = inputs["attention_mask"].sum(dim=1) - 1
embedding = outputs.last_hidden_state[0, seq_len[0]]
embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)
print(f"Embedding shape: {embedding.shape}")  # (4096,)
```
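The pooling above handles a single sequence; for a batch with padding, gather each row's last non-padding position before normalizing. A framework-agnostic sketch in NumPy (the same indexing works on torch tensors; it assumes right-padding, so check your tokenizer's `padding_side`):

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Last-token pooling for a right-padded batch.

    hidden_states: (B, T, D) final hidden states.
    attention_mask: (B, T) with 1 for real tokens, 0 for padding.
    Returns (B, D) L2-normalized embeddings.
    """
    last_idx = attention_mask.sum(axis=1) - 1  # index of last real token per row
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Toy batch: two sequences of length 3; the second is padded after 2 tokens
h = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 1], [1, 1, 0]])
emb = last_token_pool(h, mask)
print(emb.shape)  # (2, 4)
```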
## Helper Script

This repository includes a helper class for easier embedding extraction:

```python
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

# Initialize
model = Qwen3VLEmbedder(model_name_or_path="RamManavalan/Qwen3-VL-Embedding-8B-FP8")

# Get embeddings for text, image, or multimodal inputs
inputs = [
    {"text": "A dog on the beach"},
    {"image": "path/to/image.jpg"},
    {"text": "What is in this image?", "image": "path/to/image.jpg"},
]
embeddings = model.process(inputs)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 4096)
```
## Performance

The base model achieves state-of-the-art performance on multimodal benchmarks:
| Benchmark | Score |
|---|---|
| MMEB-V2 Overall | 77.9 |
| MMTEB Mean | 67.88 |
FP8 quantization typically preserves >95% of the original model's accuracy.
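One way to check this claim on your own data is to embed the same inputs with both the BF16 and FP8 checkpoints and compare the resulting vectors row by row. A small sketch of the comparison itself (the `ref`/`quant` matrices below are simulated stand-ins for embeddings produced by the two models):

```python
import numpy as np

def mean_cosine_agreement(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean pairwise cosine similarity between matched rows of two (N, D)
    embedding matrices, e.g. BF16 vs FP8 outputs for the same texts."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Simulated check: treat FP8 embeddings as slightly perturbed BF16 embeddings
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 4096))      # stand-in for BF16 embeddings
quant = ref + 0.01 * rng.normal(size=ref.shape)  # stand-in for FP8 embeddings
print(f"mean cosine agreement: {mean_cosine_agreement(ref, quant):.4f}")
```

A value close to 1.0 indicates the quantized model produces nearly the same embedding geometry as the original.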
## Quantization Recipe

This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep the vision encoder in BF16
    ],
)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save in compressed format
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```
## Citation

If you use this model, please cite the original Qwen3-VL-Embedding paper:

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}
```
## License

Apache 2.0 (same as the base model).