	# RzenEmbed-v2-7B

	RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).


	[![arXiv](https://img.shields.io/badge/arXiv-2510.27350-b31b1b.svg)](https://arxiv.org/abs/2510.27350)
	[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/360CVGroup/RzenEmbed)
	[![Benchmark](https://img.shields.io/badge/MMEB-Benchmark-blue.svg)](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)

	### MMEB-V2

	| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
	| ------------------------ | -------------- | --------- | ------------- | ------------- | -------------- |
	| RzenEmbed-v2-7B | 8.29 | 71.61 | 75.92 | 55.73 | 77.06 |
	| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
	| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
	| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
	| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.40 | 63.92 |
	| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.58 | 65.36 |
	| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
	| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |
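
The Overall column appears to be a task-count-weighted mean of the three modality scores rather than a simple average. The sketch below assumes 36 image, 18 video, and 24 visual-document tasks (78 in total). These counts are not stated in this README and are an assumption, although they reproduce the reported Overall values for the two rows checked.

```python
# Assumed number of tasks per modality in MMEB-V2 (not stated in this README).
TASK_COUNTS = {"image": 36, "video": 18, "visdoc": 24}


def weighted_overall(image, video, visdoc, counts=TASK_COUNTS):
    """Task-count-weighted mean of the three per-modality scores."""
    total = counts["image"] + counts["video"] + counts["visdoc"]
    weighted_sum = (
        counts["image"] * image
        + counts["video"] * video
        + counts["visdoc"] * visdoc
    )
    return weighted_sum / total


# Per-modality scores copied from the MMEB-V2 table.
print(round(weighted_overall(75.92, 55.73, 77.06), 2))  # RzenEmbed-v2-7B, reported 71.61
print(round(weighted_overall(77.78, 55.34, 73.44), 2))  # seed-1.6-embedding, reported 71.27
```

Under this weighting, a strong Image-Overall score moves the Overall value roughly twice as much as an equally strong Video-Overall score.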

	### MMEB-Image

	| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
	| ---------------------- | ------------- | ------------- | --------- | --------- | -------- | -------- |
	| seed-1.6-embedding | unknown | 77.78 | 76.06 | 73.97 | 77.9 | 91.25 |
	| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | 78.5 | 92.1 |
	| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
	| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
	| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.1 | 73.84 | 88.25 |
	| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
	| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
	| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

	### MMEB-Video

	| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
	| ------------------------ | ------------- | ------------- | --------- | -------- | --------- | --------- |
	| RzenEmbed-v2-7B | 8.29 | 55.73 | 58.82 | 63.5 | 50.97 | 45.54 |
	| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | 51.33 | 53.45 |
	| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | 59.68 | 62.22 | 45.72 | 43.21 |
	| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
	| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
	| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
	| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.6 | 24.26 | 32.84 |
	| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

	### MMEB-Visdoc

	| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
	| ------------------------ | ------------- | -------------- | --------- | --------- | -------- | ---------- |
	| RzenEmbed-v2-7B | 8.29 | 77.06 | 89.7 | 60.7 | 88.7 | 44.38 |
	| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | 44.4 |
	| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
	| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
	| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
	| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
	| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
	| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

	## Usage

	### Text-to-Image Retrieval

	Retrieve images that match text captions.

	```python
	from rzen_embed_inference import RzenEmbed

	rzen = RzenEmbed("qihoo360/RzenEmbed")

	queries = [
	    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
	    "Fresh fridge full of berries yogurt milk and snacks.",
	]
	candidates = [
	    "assets/example1.jpg",
	    "assets/example2.jpg",
	]

	query_instruction = "Find me an everyday image that matches the given caption: "
	candidate_instruction = "Represent the given image."

	# Generate embeddings and compute similarity
	query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
	candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

	# Calculate text-to-image similarity scores
	similarity_scores = query_embeds @ candidate_embeds.T
	print(similarity_scores)
	```
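
The printed matrix has one row per query and one column per candidate, so ranking reduces to sorting each row. The sketch below shows one way to pick the top-scoring candidates. The score values are hypothetical stand-ins, and it assumes the scores have already been converted to a NumPy array (the return type of `get_fused_embeddings` is not documented here).

```python
import numpy as np

# Hypothetical similarity matrix: rows are queries, columns are candidates.
# In practice this would come from `query_embeds @ candidate_embeds.T`.
similarity_scores = np.array([
    [0.42, 0.11],
    [0.09, 0.37],
])


def rank_candidates(scores, top_k=1):
    """Return, for each query row, the indices of the top_k highest-scoring candidates."""
    order = np.argsort(-scores, axis=1)  # sort each row in descending order
    return order[:, :top_k]


best = rank_candidates(similarity_scores, top_k=1)
for q_idx, cand_idx in enumerate(best[:, 0]):
    score = similarity_scores[q_idx, cand_idx]
    print(f"query {q_idx} -> candidate {cand_idx} (score {score:.2f})")
```

With a larger candidate pool, raising `top_k` returns a ranked shortlist per query instead of a single match.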

	### Image-to-Text Retrieval

	Find text captions that best match given images.

	```python
	from rzen_embed_inference import RzenEmbed

	rzen = RzenEmbed("qihoo360/RzenEmbed")

	queries = [
	    "assets/example1.jpg",
	    "assets/example2.jpg",
	]
	candidates = [
	    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
	    "Fresh fridge full of berries yogurt milk and snacks.",
	]

	query_instruction = "Find an image caption describing the given everyday image."

	query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
	candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

	# Calculate image-to-text similarity scores
	similarity_scores = query_embeds @ candidate_embeds.T
	print(similarity_scores)
	```
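
The dot product `query_embeds @ candidate_embeds.T` equals cosine similarity only when both sets of embeddings are L2-normalized. Whether `get_fused_embeddings` already normalizes its output is not stated in this README, so the sketch below shows an explicit normalization step on toy NumPy vectors. The helper names are hypothetical and not part of `rzen_embed_inference`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return l2_normalize(a) @ l2_normalize(b).T


# Toy stand-ins for image and text embeddings.
image_embeds = np.array([[3.0, 4.0], [1.0, 0.0]])
text_embeds = np.array([[6.0, 8.0], [0.0, 2.0]])

scores = cosine_similarity_matrix(image_embeds, text_embeds)
print(scores)
```

Normalizing vectors that are already unit length leaves them unchanged, so applying this step is harmless when the model output is already normalized.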

	### Document Retrieval

	Match text queries with document images for information retrieval.

	```python
	from rzen_embed_inference import RzenEmbed

	rzen = RzenEmbed("qihoo360/RzenEmbed")

	queries = [
	    "What is the main variable being analyzed on the x-axis of these graphs?",
	    "What is the personnel costs in the 4th year?",
	]
	candidates = [
	    "assets/example3.jpg",
	    "assets/example4.jpg",
	]

	query_instruction = "Find a document image that matches the given query: "
	candidate_instruction = "Understand the content of the provided document image."

	# Generate embeddings for document retrieval
	query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
	candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

	# Calculate text-to-document similarity
	similarity_scores = query_embeds @ candidate_embeds.T
	print(similarity_scores)
	```
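
When ground-truth query/document pairs are available, a similarity matrix like this can be summarized with recall@k. The sketch below is a generic illustration on hypothetical scores and labels. This README does not specify an evaluation protocol, so the function should not be read as the metric behind the benchmark tables.

```python
import numpy as np


def recall_at_k(scores, relevant, k):
    """Fraction of queries whose relevant candidate index appears in the top k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(relevant)[:, None]).any(axis=1)
    return float(hits.mean())


# Hypothetical scores for 3 queries against 3 document images.
scores = np.array([
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.6],
    [0.5, 0.1, 0.3],
])
relevant = [0, 1, 2]  # hypothetical ground truth: index of the correct document per query

print(recall_at_k(scores, relevant, k=1))
print(recall_at_k(scores, relevant, k=2))
```

In this toy case only the first query ranks its document first, while all three place it within the top two.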

	### Video Retrieval

	Retrieve videos based on text captions.

	```python
	import cv2
	import numpy as np
	from PIL import Image
	from rzen_embed_inference import RzenEmbed

	def extract_frames(video_path, num_frames):
	    """Sample up to `num_frames` evenly spaced frames and return them as RGB PIL images."""
	    cap = cv2.VideoCapture(video_path)
	    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
	    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
	    frames = []
	    for idx in frame_indices:
	        # Seek to the target frame before reading it
	        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
	        ret, frame = cap.read()
	        if ret:
	            # OpenCV decodes frames as BGR; convert to RGB for PIL
	            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
	        else:
	            break
	    cap.release()
	    return frames

	rzen = RzenEmbed("qihoo360/RzenEmbed")

	queries = [
	    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
	    "Tiny ginger kitten meows cutely by the water.",
	]

	# Extract frames from videos
	video_path_list = [
	    "assets/example5.mp4",
	    "assets/example6.mp4",
	]
	candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

	query_instruction = "Find the video snippet that corresponds to the given caption: "
	candidate_instruction = "Understand the content of the provided video."

	# Generate embeddings for video retrieval
	query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
	candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

	# Calculate text-to-video similarity scores
	similarity_scores = query_embeds @ candidate_embeds.T
	print(similarity_scores)
	```
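
The frame sampling in `extract_frames` relies on `np.linspace` with an integer dtype. The sketch below isolates that call to show which frame indices are selected, including the case where the clip has fewer frames than requested. `sample_frame_indices` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices, mirroring the np.linspace call in extract_frames."""
    return np.linspace(0, total_frames - 1, num_frames, dtype=int)


# A 100-frame clip sampled at 8 positions.
print(sample_frame_indices(100, 8).tolist())  # [0, 14, 28, 42, 56, 70, 84, 99]

# A 3-frame clip sampled at 8 positions yields repeated indices.
short_clip = sample_frame_indices(3, 8)
print(short_clip.tolist())  # [0, 0, 0, 0, 1, 1, 1, 2]

# Deduplicating keeps each available frame once.
print(np.unique(short_clip).tolist())  # [0, 1, 2]
```

For very short clips, deduplicating the indices avoids passing the same frame to the model several times, at the cost of a variable number of frames per video.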

	## Citation
	If you find RzenEmbed useful for your research or applications, please cite it using this BibTeX entry:

	```bibtex
	@article{jian2025rzenembed,
	  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
	  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
	  journal={arXiv preprint arXiv:2510.27350},
	  year={2025}
	}
	```