arXiv:2601.04720

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Published on Jan 8
· Submitted by Dingkun Long on Jan 12

Abstract

AI-generated summary

The Qwen3-VL-Embedding and Qwen3-VL-Reranker models form an end-to-end multimodal search pipeline, leveraging multi-stage training and cross-attention mechanisms to achieve high-precision retrieval across diverse modalities.

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich, high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
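The abstract's mention of Matryoshka Representation Learning (MRL) means an embedding can be truncated to a shorter prefix and re-normalized while retaining most of its semantic signal. Below is a minimal NumPy sketch of how a consumer might exploit this; the 4096-dimensional full vector and the 256-dimensional target are illustrative assumptions, not figures from the report:

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding
    and re-normalize, so cosine similarity remains meaningful."""
    head = embedding[:dim]
    return head / np.linalg.norm(head)

# Illustrative only: the full output dimension here is an assumption.
full = np.random.randn(4096)
full /= np.linalg.norm(full)

compact = truncate_mrl(full, 256)  # far cheaper to store and index
print(compact.shape)               # (256,)
```

The trade-off is standard for MRL-trained models: shorter prefixes reduce index size and search latency at a modest cost in retrieval quality.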

Community


🚀 Introducing Qwen3-VL-Embedding and Qwen3-VL-Reranker – advancing the state of the art in multimodal retrieval and cross-modal understanding!
✨ Highlights:
✅ Built upon the robust Qwen3-VL foundation model
✅ Processes text, images, screenshots, videos, and mixed-modality inputs
✅ Supports 30+ languages
✅ Achieves state-of-the-art performance on multimodal retrieval benchmarks
✅ Open source and available on Hugging Face, GitHub, and ModelScope
✅ API deployment on Alibaba Cloud coming soon!

🎯 Two-stage retrieval architecture (a sketch of this pattern follows below):
📊 Embedding Model – generates semantically rich vector representations in a unified embedding space
🎯 Reranker Model – computes fine-grained relevance scores for enhanced retrieval accuracy
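A minimal, self-contained sketch of the retrieve-then-rerank pattern described above. The `embed` and `rerank` functions are hypothetical stand-ins for the real model calls (see the GitHub repository for actual inference code); their toy logic exists only to make the example runnable:

```python
import numpy as np

# Hypothetical stand-ins for the real model calls.
def embed(texts: list[str]) -> np.ndarray:
    """Return one L2-normalized vector per input."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def rerank(query: str, docs: list[str]) -> list[float]:
    """Return one relevance score per (query, doc) pair."""
    return [float(len(set(query.split()) & set(d.split()))) for d in docs]

corpus = ["a red bicycle leaning on a wall",
          "chart of quarterly revenue",
          "a video of a cat chasing a laser"]
query = "photo of a bike"

# Stage 1: dense retrieval with the embedding model (coarse, fast).
doc_vecs = embed(corpus)
q_vec = embed([query])[0]
scores = doc_vecs @ q_vec
top_k = np.argsort(-scores)[:2]

# Stage 2: rerank the shortlist with the cross-encoder (fine, slower).
shortlist = [corpus[i] for i in top_k]
final = sorted(zip(shortlist, rerank(query, shortlist)),
               key=lambda x: -x[1])
print(final[0][0])  # best-ranked document
```

The division of labor is the usual one: the bi-encoder makes the whole corpus searchable with a single vector comparison per document, while the cross-encoder spends its quadratic attention budget only on the short candidate list.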

๐Ÿ” Key application scenarios:
Image-text retrieval, video search, multimodal RAG, visual question answering, multimodal content clustering, multilingual visual search, and more!

🌟 Developer-friendly capabilities:
• Configurable embedding dimensions
• Task-specific instruction customization
• Embedding quantization support for efficient and cost-effective downstream deployment (a sketch of one such scheme follows below)
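On the quantization bullet: the post does not spell out the exact scheme, but a common choice for cutting the storage cost of dense vectors is symmetric int8 scalar quantization. A minimal sketch under that assumption:

```python
import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 scalar quantization: 4x smaller than float32."""
    scale = np.abs(vec).max() / 127.0
    return np.round(vec / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

v = np.random.randn(256).astype(np.float32)
q, s = quantize_int8(v)
err = np.abs(v - dequantize(q, s)).max()
print(q.nbytes, v.nbytes, err)  # 256 vs. 1024 bytes, small max error
```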
Hugging Face:
https://huggingface.co/collections/Qwen/qwen3-vl-embedding
https://huggingface.co/collections/Qwen/qwen3-vl-reranker

GitHub: https://github.com/QwenLM/Qwen3-VL-Embedding
Blog: https://qwen.ai/blog?id=qwen3-vl-embedding
Tech Report: https://www.arxiv.org/abs/2601.04720

