arXiv:2601.04720

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Published on Jan 8
· Submitted by Dingkun Long on Jan 12

Abstract

AI-generated summary

The Qwen3-VL-Embedding and Qwen3-VL-Reranker models form an end-to-end multimodal search pipeline, leveraging multi-stage training and cross-attention mechanisms to achieve high-precision retrieval across diverse modalities.

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich, high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
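The abstract's mention of Matryoshka Representation Learning (MRL) means an embedding can be truncated to a shorter prefix and re-normalized while retaining most of its semantic signal. Below is a minimal NumPy sketch of how a consumer might exploit this; the 4096-dimensional full vector and the 256-dimensional target are illustrative assumptions, not figures from the report:

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding
    and re-normalize, so cosine similarity remains meaningful."""
    head = embedding[:dim]
    return head / np.linalg.norm(head)

# Illustrative only: the full output dimension here is an assumption.
full = np.random.randn(4096)
full /= np.linalg.norm(full)

compact = truncate_mrl(full, 256)  # far cheaper to store and index
print(compact.shape)               # (256,)
```

The trade-off is standard for MRL-trained models: shorter prefixes reduce index size and search latency at a modest cost in retrieval quality.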

Community


🚀 Introducing Qwen3-VL-Embedding and Qwen3-VL-Reranker – advancing the state of the art in multimodal retrieval and cross-modal understanding!
✨ Highlights:
✅ Built upon the robust Qwen3-VL foundation model
✅ Processes text, images, screenshots, videos, and mixed-modality inputs
✅ Supports 30+ languages
✅ Achieves state-of-the-art performance on multimodal retrieval benchmarks
✅ Open source and available on Hugging Face, GitHub, and ModelScope
✅ API deployment on Alibaba Cloud coming soon!

🎯 Two-stage retrieval architecture (a sketch of this pattern follows below):
📊 Embedding Model – generates semantically rich vector representations in a unified embedding space
🎯 Reranker Model – computes fine-grained relevance scores for enhanced retrieval accuracy
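A minimal, self-contained sketch of the retrieve-then-rerank pattern described above. The `embed` and `rerank` functions are hypothetical stand-ins for the real model calls (see the GitHub repository for actual inference code); their toy logic exists only to make the example runnable:

```python
import numpy as np

# Hypothetical stand-ins for the real model calls.
def embed(texts: list[str]) -> np.ndarray:
    """Return one L2-normalized vector per input."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def rerank(query: str, docs: list[str]) -> list[float]:
    """Return one relevance score per (query, doc) pair."""
    return [float(len(set(query.split()) & set(d.split()))) for d in docs]

corpus = ["a red bicycle leaning on a wall",
          "chart of quarterly revenue",
          "a video of a cat chasing a laser"]
query = "photo of a bike"

# Stage 1: dense retrieval with the embedding model (coarse, fast).
doc_vecs = embed(corpus)
q_vec = embed([query])[0]
scores = doc_vecs @ q_vec
top_k = np.argsort(-scores)[:2]

# Stage 2: rerank the shortlist with the cross-encoder (fine, slower).
shortlist = [corpus[i] for i in top_k]
final = sorted(zip(shortlist, rerank(query, shortlist)),
               key=lambda x: -x[1])
print(final[0][0])  # best-ranked document
```

The division of labor is the usual one: the bi-encoder makes the whole corpus searchable with a single vector comparison per document, while the cross-encoder spends its quadratic attention budget only on the short candidate list.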

๐Ÿ” Key application scenarios:
Image-text retrieval, video search, multimodal RAG, visual question answering, multimodal content clustering, multilingual visual search, and more!

🌟 Developer-friendly capabilities:
• Configurable embedding dimensions
• Task-specific instruction customization
• Embedding quantization support for efficient and cost-effective downstream deployment (a sketch of one such scheme follows below)
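On the quantization bullet: the post does not spell out the exact scheme, but a common choice for cutting the storage cost of dense vectors is symmetric int8 scalar quantization. A minimal sketch under that assumption:

```python
import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 scalar quantization: 4x smaller than float32."""
    scale = np.abs(vec).max() / 127.0
    return np.round(vec / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

v = np.random.randn(256).astype(np.float32)
q, s = quantize_int8(v)
err = np.abs(v - dequantize(q, s)).max()
print(q.nbytes, v.nbytes, err)  # 256 vs. 1024 bytes, small max error
```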
Hugging Face:
https://huggingface.co/collections/Qwen/qwen3-vl-embedding
https://huggingface.co/collections/Qwen/qwen3-vl-reranker

GitHub: https://github.com/QwenLM/Qwen3-VL-Embedding
Blog: https://qwen.ai/blog?id=qwen3-vl-embedding
Tech Report: https://www.arxiv.org/abs/2601.04720

