# RzenEmbed-v2-7B
RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).
[Paper](https://arxiv.org/abs/2510.27350)
[Code](https://github.com/360CVGroup/RzenEmbed)
[MMEB Leaderboard](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)
### MMEB-V2
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | --------- | ------------- | ------------- | -------------- |
| RzenEmbed-v2-7B | 8.29 | **71.61** | 75.92 | **55.73** | **77.06** |
| seed-1.6-embedding | unknown | 71.27 | **77.78** | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.4 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |
### MMEB-Image
| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | ------------- | ------------- | --------- | --------- | -------- | -------- |
| seed-1.6-embedding | unknown | **77.78** | **76.06** | **73.97** | 77.9 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | **78.5** | **92.1** |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.1 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |
### MMEB-Video
| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| ------------------------ | ------------- | ------------- | --------- | -------- | --------- | --------- |
| RzenEmbed-v2-7B | 8.29 | **55.73** | 58.82 | **63.5** | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | **51.33** | **53.45** |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | **59.68** | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.6 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |
### MMEB-Visdoc
| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| ------------------------ | ------------- | -------------- | --------- | --------- | -------- | ---------- |
| RzenEmbed-v2-7B | 8.29 | **77.06** | **89.7** | **60.7** | **88.7** | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | **44.4** |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |
## Usage
### Text-to-Image Retrieval
Retrieve images that match text captions.
```python
from rzen_embed_inference import RzenEmbed
rzen = RzenEmbed("qihoo360/RzenEmbed")
queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks.",
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."
# Generate embeddings and compute similarity
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)
# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
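The similarity matrix above has one row per query and one column per candidate, so retrieval reduces to sorting each row. The following is a minimal sketch of that ranking step, not part of the RzenEmbed API: `rank_candidates` and the example scores are hypothetical, and it assumes the scores can be converted to a NumPy array (a PyTorch tensor would typically need `.float().cpu().numpy()` first).

```python
import numpy as np


def rank_candidates(similarity, candidate_ids, top_k=1):
    """Return, for each query row, the top_k (candidate_id, score) pairs.

    Hypothetical helper for illustration; not provided by rzen_embed_inference.
    """
    scores = np.asarray(similarity, dtype=np.float32)
    # Sorting the negated scores yields descending order within each row.
    order = np.argsort(-scores, axis=1)[:, :top_k]
    return [
        [(candidate_ids[j], float(scores[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Made-up 2x2 score matrix standing in for `similarity_scores`.
example_scores = [[0.62, 0.11], [0.09, 0.57]]
paths = ["assets/example1.jpg", "assets/example2.jpg"]
ranked = rank_candidates(example_scores, paths, top_k=1)
print(ranked)
```

Raising `top_k` returns a longer ranked list per query, which is the usual input for metrics such as recall@k.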
### Image-to-Text Retrieval
Find text captions that best match given images.
```python
from rzen_embed_inference import RzenEmbed
rzen = RzenEmbed("qihoo360/RzenEmbed")
queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks.",
]
query_instruction = "Find an image caption describing the given everyday image."
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)
# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
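The dot product above equals cosine similarity only if the embeddings are unit-length. This README does not say whether `get_fused_embeddings` normalizes its output, so the sketch below shows how the scores could be normalized explicitly under the assumption that it might not. The helper names and the example vectors are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against zero vectors."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_similarity_matrix(queries, candidates):
    """Pairwise cosine similarity: rows are queries, columns are candidates."""
    return l2_normalize(queries) @ l2_normalize(candidates).T


# Made-up 2-D vectors standing in for real embeddings.
q = [[3.0, 4.0], [1.0, 0.0]]
c = [[6.0, 8.0], [0.0, 2.0]]
cos = cosine_similarity_matrix(q, c)
print(cos)
```

If the embeddings are already unit-length, this normalization leaves the scores unchanged, so applying it is harmless.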
### Document Retrieval
Match text queries with document images for information retrieval.
```python
from rzen_embed_inference import RzenEmbed
rzen = RzenEmbed("qihoo360/RzenEmbed")
queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What is the personnel costs in the 4th year?",
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]
query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."
# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)
# Calculate text-to-document similarity
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
### Video Retrieval
Retrieve videos based on text captions.
```python
import cv2
import numpy as np
from PIL import Image
from rzen_embed_inference import RzenEmbed


def extract_frames(video_path, num_frames):
    """Uniformly sample up to `num_frames` RGB frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        # Seek to the sampled frame, then decode it
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; convert to RGB before wrapping in a PIL image
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames


rzen = RzenEmbed("qihoo360/RzenEmbed")
queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water.",
]
# Extract frames from videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]
query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."
# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)
# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
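The frame extraction above samples evenly spaced indices with `np.linspace`. The sketch below isolates that index computation so it can be checked without a video file. `uniform_frame_indices` is a hypothetical helper, and the guards for empty or very short videos are an added assumption, not something the original `extract_frames` does.

```python
import numpy as np


def uniform_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices from 0 to total_frames - 1 (inclusive)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    # Avoid requesting more frames than the video contains (assumed policy).
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()


indices = uniform_frame_indices(100, 8)
print(indices)
```

Because both endpoints are included, the first and last frames are always sampled, and the integer cast truncates the fractional positions in between.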
## Citation
If you find RzenEmbed useful for your research and applications, please cite using this BibTeX:
```
@article{jian2025rzenembed,
title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2510.27350},
year={2025}
}
```