---
license: mit
language:
- en
pipeline_tag: visual-document-retrieval
tags:
- rag
- embedding
---

# RzenEmbed-v2-7B

RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).

[Paper](https://arxiv.org/abs/2510.27350)
[Code](https://github.com/360CVGroup/RzenEmbed)
[MMEB Leaderboard](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)

### MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | --------- | ------------- | ------------- | -------------- |
| RzenEmbed-v2-7B | 8.29 | **71.61** | 75.92 | **55.73** | **77.06** |
| seed-1.6-embedding | unknown | 71.27 | **77.78** | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.40 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |

### MMEB-Image

| Models | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | --------- | --------- | --------- | --------- |
| seed-1.6-embedding | unknown | **77.78** | **76.06** | **73.97** | 77.90 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | **78.50** | **92.10** |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.10 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

### MMEB-Video

| Models | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| ------------------------ | -------------- | ------------- | --------- | --------- | --------- | --------- |
| RzenEmbed-v2-7B | 8.29 | **55.73** | 58.82 | **63.50** | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | **51.33** | **53.45** |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | **59.68** | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.60 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

### MMEB-Visdoc

| Models | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| ------------------------ | -------------- | -------------- | --------- | --------- | --------- | ---------- |
| RzenEmbed-v2-7B | 8.29 | **77.06** | **89.70** | **60.70** | **88.70** | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | **44.40** |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

## Usage

### Text-to-Image Retrieval

Retrieve images that match text captions.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks.",
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]

query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."

# Generate query and candidate embeddings
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

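Each row of the similarity matrix scores one query against every candidate, so ranking candidates per query is just a sort along the rows. A minimal sketch with dummy NumPy embeddings (the shapes, and the explicit L2 normalization, are assumptions here; `get_fused_embeddings` may already return normalized vectors):

```python
import numpy as np

# Dummy stand-ins for real embeddings: 2 queries, 3 candidates, dim 4.
rng = np.random.default_rng(0)
query_embeds = rng.normal(size=(2, 4))
candidate_embeds = rng.normal(size=(3, 4))

# L2-normalize so the dot product equals cosine similarity.
query_embeds /= np.linalg.norm(query_embeds, axis=1, keepdims=True)
candidate_embeds /= np.linalg.norm(candidate_embeds, axis=1, keepdims=True)

similarity_scores = query_embeds @ candidate_embeds.T  # shape (2, 3)

# Rank candidates per query, best match first.
ranking = np.argsort(-similarity_scores, axis=1)
best = ranking[:, 0]  # index of the top candidate for each query
print(ranking)
```

`ranking[i]` lists candidate indices for query `i` from most to least similar; take `ranking[:, :k]` for a top-k cut.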
### Image-to-Text Retrieval

Find the text captions that best match given images.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks.",
]

query_instruction = "Find an image caption describing the given everyday image."

query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

### Document Retrieval

Match text queries with document images for information retrieval.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What are the personnel costs in the 4th year?",
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]

query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."

# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-document similarity
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

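For a document corpus larger than a couple of pages, the candidate side is typically embedded in fixed-size chunks and the results stacked before scoring. A hedged sketch of that bookkeeping (the helper name, batch size, and dummy embedder are illustrative, not part of the RzenEmbed API):

```python
import numpy as np

def embed_in_batches(embed_fn, items, batch_size=32):
    """Embed a long list in fixed-size chunks and stack into one matrix.

    embed_fn is any callable mapping a list of items to a (len(items), dim)
    array, e.g. a thin wrapper around rzen.get_fused_embeddings.
    """
    chunks = [
        embed_fn(items[i:i + batch_size])
        for i in range(0, len(items), batch_size)
    ]
    return np.vstack(chunks)

# Demo with a dummy embedder that maps every item to a constant 4-dim vector.
dummy_embed = lambda batch: np.ones((len(batch), 4))
embeds = embed_in_batches(dummy_embed, [f"doc{i}" for i in range(70)], batch_size=32)
print(embeds.shape)  # (70, 4): 32 + 32 + 6 items stacked
```

Chunking keeps peak GPU memory bounded by `batch_size` while producing the same candidate matrix as a single call.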
### Video Retrieval

Retrieve videos based on text captions.

```python
import cv2
import numpy as np
from PIL import Image
from rzen_embed_inference import RzenEmbed

def extract_frames(video_path, num_frames):
    """Sample num_frames evenly spaced frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water.",
]

# Extract frames from videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."

# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

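The `extract_frames` helper samples frames at evenly spaced indices rather than decoding the whole clip. The index arithmetic in isolation, assuming a hypothetical 100-frame clip and the same 8-frame budget:

```python
import numpy as np

# Mirror the np.linspace call in extract_frames for an assumed 100-frame clip.
total_frames = 100
num_frames = 8
frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

# The first and last frames are always included; intermediate floats are
# truncated to integer frame indices.
print(frame_indices)
```

Note that `dtype=int` truncates the evenly spaced floats, so spacing between consecutive indices can vary by one frame.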
## Citation

If you find RzenEmbed useful for your research and applications, please cite it using this BibTeX entry:

```bibtex
@article{jian2025rzenembed,
  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2510.27350},
  year={2025}
}
```