Add pipeline tag, library name, and paper link

#6 opened by nielsr (HF Staff)

Files changed (1): README.md (+28 −68)
@@ -1,40 +1,41 @@
 ---
-license: apache-2.0
 base_model:
 - Qwen/Qwen3-VL-8B-Instruct
 tags:
-- transformers
 - multimodal rerank
 ---

 # Qwen3-VL-Reranker-8B

 <p align="center">
     <img src="https://model-demo.oss-cn-hangzhou.aliyuncs.com/Qwen3-VL-Reranker.png" width="400"/>
 <p>

 ## Highlights

-The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.

 While the Embedding model generates high-dimensional vectors for broad applications like retrieval and clustering, the Reranker model is engineered to refine these results, establishing a comprehensive pipeline for state-of-the-art multimodal search.

 - **Multimodal Versatility**: Both models seamlessly handle a wide range of inputs—including text, images, screenshots, and video—within a unified framework. They deliver state-of-the-art performance across diverse multimodal tasks such as image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
-
-- **Unified Representation Learning (Embedding)**: By leveraging the Qwen3-VL architecture, the Embedding model generates semantically rich vectors that capture both visual and textual information in a shared space. This facilitates efficient similarity computation and retrieval across different modalities.
-
-- **High-Precision Reranking (Reranker)**: We also introduce the Qwen3-VL-Reranker series to complement the embedding model. The reranker takes a (query, document) pair as input—where both query and document may contain arbitrary single or mixed modalities—and outputs a precise relevance score. In retrieval pipelines, the two models are typically used in tandem: the embedding model performs efficient initial recall, while the reranker refines results in a subsequent re-ranking stage. This two-stage approach significantly boosts retrieval accuracy.
-
-- **Exceptional Practicality**: Inheriting Qwen3-VL’s multilingual capabilities, the series supports over 30 languages, making it ideal for global applications. It is highly practical for real-world scenarios, offering flexible vector dimensions, customizable instructions for specific use cases, and strong performance even with quantized embeddings. These capabilities enable developers to seamlessly integrate both models into existing pipelines, unlocking powerful cross-lingual and cross-modal understanding.

 **Qwen3-VL-Reranker-8B** has the following features:

-- Model Type: MultiModal Rerank
-- Supported Languages: 30+ Languages
-- Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video)
-- Number of Parameters: 8B
-- Context Length: 32k

-For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3-embedding/), [GitHub](https://github.com/QwenLM/Qwen3-VL-Embedding).

 ## Qwen3-VL-Embedding and Qwen3-VL-Reranker Model list

@@ -49,12 +50,9 @@ For more details, including benchmark evaluation, hardware requirements, and inf
 > - `Quantization Support` indicates the supported quantization post process for the output embedding.
 > - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
 > - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
-> Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

 ## Model Performance

-We utilize retrieval task datasets from various subtasks of [MMEB-v2](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard) and [MMTEB](https://huggingface.co/spaces/mteb/leaderboard) retrieval benchmarks. For visual document retrieval, we employ [JinaVDR](https://huggingface.co/collections/jinaai/jinavdr-visual-document-retrieval) and [ViDoRe v3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3) datasets. Our results demonstrate that all Qwen3-VL-Reranker models consistently outperform the base embedding model and baseline rerankers, with the 8B variant achieving the best performance across most tasks.
-
 | Model | Size | MMEB-v2(Retrieval) - Avg | MMEB-v2(Retrieval) - Image | MMEB-v2(Retrieval) - Video | MMEB-v2(Retrieval) - VisDoc | MMTEB(Retrieval) | JinaVDR | ViDoRe(v3) |
 |-------|------|--------------------------|----------------------------|----------------------------|------------------------------|------------------|---------|------------|
 | Qwen3-VL-Embedding-2B | 2B | 73.4 | 74.8 | 53.6 | 79.2 | 68.1 | 71.0 | 52.9 |
@@ -64,7 +62,7 @@ We utilize retrieval task datasets from various subtasks of [MMEB-v2](https://hu
 ## Usage

-- **requirements**
 ```text
 transformers>=4.57.0
 qwen-vl-utils>=0.0.14
@@ -79,11 +77,8 @@ from scripts.qwen3_vl_reranker import Qwen3VLReranker
 # Specify the model path
 model_name_or_path = "Qwen/Qwen3-VL-Reranker-8B"

-# Initialize the Qwen3VLEmbedder model
 model = Qwen3VLReranker(model_name_or_path=model_name_or_path)
-# We recommend enabling flash_attention_2 for better acceleration and memory saving,
-# model = Qwen3VLReranker(model_name_or_path=model_name_or_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")
-# Combine queries and documents into a single input list

 inputs = {
     "instruction": "Retrieval relevant image or text with user's query",
@@ -98,7 +93,6 @@ inputs = {

 scores = model.process(inputs)
 print(scores)
-# [0.7838293313980103, 0.585621178150177, 0.6147719025611877]
 ```

 ### vLLM Basic Usage Example
@@ -110,7 +104,6 @@ from typing import Dict, Any
 from vllm import LLM, EngineArgs
 from vllm.entrypoints.score_utils import ScoreMultiModalParam

-
 queries = [
     {"text": "A woman playing with her dog on a beach at sunset."}
 ]
@@ -122,55 +115,30 @@ documents = [
      "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
 ]

-
 def format_document_to_score_param(doc_dict: Dict[str, Any]) -> ScoreMultiModalParam:
     content = []
-
     text = doc_dict.get('text')
     image = doc_dict.get('image')

     if text:
-        content.append({
-            "type": "text",
-            "text": text
-        })

     if image:
         image_url = image
         if isinstance(image, str) and not image.startswith(('http', 'https', 'oss')):
             abs_image_path = os.path.abspath(image)
             image_url = 'file://' + abs_image_path
-
-        content.append({
-            "type": "image_url",
-            "image_url": {
-                "url": image_url
-            }
-        })

     if not content:
-        content.append({
-            "type": "text",
-            "text": ""
-        })
-
     return {"content": content}

-
 def main():
-    parser = argparse.ArgumentParser(description="Offline Reranker with vLLM")
-    parser.add_argument("--model-path", type=str, default="models/Qwen3-VL-Reranker-8B", help="Path to the reranker model")
-    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., bfloat16)")
-    parser.add_argument("--template-path", type=str, default="vllm/examples/pooling/score/template/qwen3_vl_reranker.jinja",
-                        help="Path to chat template file")
-    args = parser.parse_args()
-
-    print(f"Loading model from {args.model_path}...")
-
     engine_args = EngineArgs(
-        model=args.model_path,
         runner="pooling",
-        dtype=args.dtype,
         trust_remote_code=True,
         hf_overrides={
             "architectures": ["Qwen3VLForSequenceClassification"],
@@ -181,37 +149,29 @@ def main():
     llm = LLM(**vars(engine_args))

-    template_path = Path(args.template_path)
-    chat_template = template_path.read_text() if template_path.exists() else None
-
     for query_dict in queries:
         query_text = query_dict.get('text', '')
-        print(f"\nQuery: {query_text}")

         scores = []
         for doc_dict in documents:
             doc_param = format_document_to_score_param(doc_dict)
-            outputs = llm.score(query_text, doc_param, chat_template=chat_template)
             score = outputs[0].outputs.score
             scores.append(score)
-
         print(scores)

-
 if __name__ == "__main__":
     main()
-
 ```

-For more usage examples, please visit our [GitHub repository](https://github.com/QwenLM/Qwen3-VL-Embedding).

 ## Citation

-If you find our work helpful, feel free to give us a cite.
-
-```
 @article{qwen3vlembedding,
   title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
-  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
   journal={arXiv},
   year={2026}
 }
 
 ---
 base_model:
 - Qwen/Qwen3-VL-8B-Instruct
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-ranking
 tags:
 - multimodal rerank
 ---
+
 # Qwen3-VL-Reranker-8B

 <p align="center">
     <img src="https://model-demo.oss-cn-hangzhou.aliyuncs.com/Qwen3-VL-Reranker.png" width="400"/>
 <p>

+[Paper](https://huggingface.co/papers/2601.04720) | [GitHub](https://github.com/QwenLM/Qwen3-VL-Embedding) | [Blog](https://qwenlm.github.io/blog/qwen3-embedding/)
+
 ## Highlights

+The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are the latest additions to the Qwen family, built upon the powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.

 While the Embedding model generates high-dimensional vectors for broad applications like retrieval and clustering, the Reranker model is engineered to refine these results, establishing a comprehensive pipeline for state-of-the-art multimodal search.

 - **Multimodal Versatility**: Both models seamlessly handle a wide range of inputs—including text, images, screenshots, and video—within a unified framework. They deliver state-of-the-art performance across diverse multimodal tasks such as image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
+- **Unified Representation Learning (Embedding)**: By leveraging the Qwen3-VL architecture, the Embedding model generates semantically rich vectors that capture both visual and textual information in a shared space.
+- **High-Precision Reranking (Reranker)**: The reranker takes a (query, document) pair as input—where both query and document may contain arbitrary single or mixed modalities—and outputs a precise relevance score. In retrieval pipelines, the two models are typically used in tandem: the embedding model performs efficient initial recall, while the reranker refines results in a subsequent re-ranking stage.
+- **Exceptional Practicality**: Inheriting Qwen3-VL’s multilingual capabilities, the series supports over 30 languages. It offers flexible vector dimensions, customizable instructions, and strong performance even with quantized embeddings.
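The recall-then-rerank flow described above can be sketched in plain Python. The vectors, document names, and reranker scores below are made up for illustration; they stand in for real Qwen3-VL-Embedding and Qwen3-VL-Reranker model calls.

```python
from math import sqrt

# Toy corpus: each document has a (hypothetical) 2-d embedding.
doc_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.2, 0.8],
    "doc_c": [0.7, 0.3],
}
query_vector = [0.8, 0.2]  # hypothetical query embedding

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Stage 1: cheap vector recall keeps only the top-2 candidates.
recalled = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)[:2]

# Stage 2: the slower, more accurate reranker rescores only that short list.
rerank_scores = {"doc_a": 0.61, "doc_b": 0.12, "doc_c": 0.87}  # hypothetical reranker outputs
final = sorted(recalled, key=lambda d: rerank_scores[d], reverse=True)
print(final)  # ['doc_c', 'doc_a']
```

The split of work is the design point: the embedding stage bounds how many expensive reranker calls are made, while the reranker decides the final order.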
 
 
 

 **Qwen3-VL-Reranker-8B** has the following features:

+- **Model Type**: MultiModal Rerank
+- **Supported Languages**: 30+ Languages
+- **Supported Input Modalities**: Text, images, screenshots, videos, and arbitrary multimodal combinations
+- **Number of Parameters**: 8B
+- **Context Length**: 32k

+For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to the [blog](https://qwenlm.github.io/blog/qwen3-embedding/) and [GitHub repository](https://github.com/QwenLM/Qwen3-VL-Embedding).

 ## Qwen3-VL-Embedding and Qwen3-VL-Reranker Model list

 > - `Quantization Support` indicates the supported quantization post process for the output embedding.
 > - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
 > - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.

 ## Model Performance

 | Model | Size | MMEB-v2(Retrieval) - Avg | MMEB-v2(Retrieval) - Image | MMEB-v2(Retrieval) - Video | MMEB-v2(Retrieval) - VisDoc | MMTEB(Retrieval) | JinaVDR | ViDoRe(v3) |
 |-------|------|--------------------------|----------------------------|----------------------------|------------------------------|------------------|---------|------------|
 | Qwen3-VL-Embedding-2B | 2B | 73.4 | 74.8 | 53.6 | 79.2 | 68.1 | 71.0 | 52.9 |

 ## Usage

+- **Requirements**
 ```text
 transformers>=4.57.0
 qwen-vl-utils>=0.0.14

 # Specify the model path
 model_name_or_path = "Qwen/Qwen3-VL-Reranker-8B"

+# Initialize the Qwen3VLReranker model
 model = Qwen3VLReranker(model_name_or_path=model_name_or_path)

 inputs = {
     "instruction": "Retrieval relevant image or text with user's query",

 scores = model.process(inputs)
 print(scores)
 ```
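The list returned by `model.process` holds one relevance score per candidate document, in input order; downstream code typically thresholds and sorts them. A minimal sketch in plain Python, reusing the illustrative score values from the example output above:

```python
# Consume reranker relevance scores: drop weak candidates, present the rest best-first.
# The score/candidate values are illustrative, not real model output.
scores = [0.7838, 0.5856, 0.6148]  # one score per candidate document
candidates = ["image: dog on beach", "text: capybara", "text: dog running"]

threshold = 0.6
kept = [(s, c) for s, c in zip(scores, candidates) if s >= threshold]
kept.sort(reverse=True)  # highest relevance first
print(kept)  # [(0.7838, 'image: dog on beach'), (0.6148, 'text: dog running')]
```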

 ### vLLM Basic Usage Example

 from vllm import LLM, EngineArgs
 from vllm.entrypoints.score_utils import ScoreMultiModalParam

 queries = [
     {"text": "A woman playing with her dog on a beach at sunset."}
 ]

      "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
 ]

 def format_document_to_score_param(doc_dict: Dict[str, Any]) -> ScoreMultiModalParam:
     content = []
     text = doc_dict.get('text')
     image = doc_dict.get('image')

     if text:
+        content.append({"type": "text", "text": text})

     if image:
         image_url = image
         if isinstance(image, str) and not image.startswith(('http', 'https', 'oss')):
             abs_image_path = os.path.abspath(image)
             image_url = 'file://' + abs_image_path
+        content.append({"type": "image_url", "image_url": {"url": image_url}})

     if not content:
+        content.append({"type": "text", "text": ""})
     return {"content": content}

 def main():
     engine_args = EngineArgs(
+        model="Qwen/Qwen3-VL-Reranker-8B",
         runner="pooling",
+        dtype="bfloat16",
         trust_remote_code=True,
         hf_overrides={
             "architectures": ["Qwen3VLForSequenceClassification"],

     llm = LLM(**vars(engine_args))

     for query_dict in queries:
         query_text = query_dict.get('text', '')
+        print(f"\nQuery: {query_text}")

         scores = []
         for doc_dict in documents:
             doc_param = format_document_to_score_param(doc_dict)
+            outputs = llm.score(query_text, doc_param)
             score = outputs[0].outputs.score
             scores.append(score)
         print(scores)

 if __name__ == "__main__":
     main()
 ```
 

 ## Citation

+```bibtex
 @article{qwen3vlembedding,
   title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
+  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
   journal={arXiv},
   year={2026}
 }
 ```