|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- WeiChow/merit |
|
|
language: |
|
|
- en |
|
|
- id |
|
|
- ms |
|
|
- th |
|
|
- vi
|
|
base_model: |
|
|
- Qwen/Qwen2.5-3B-Instruct |
|
|
--- |
|
|
|
|
|
<h1 align="center" style="line-height: 50px;"> |
|
|
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query |
|
|
</h1> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[Paper](https://arxiv.org/abs/2506.03144) • [Dataset](https://huggingface.co/datasets/WeiChow/merit) • [Model](https://huggingface.co/Bia/CORAL) • [Code](https://github.com/weichow23/merit) • [Project Page](https://merit-2025.github.io/)
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
## Model Details |
|
|
We introduce CORAL, a multi-modal embedding model built upon Qwen2.5-3B-Instruct. CORAL supports semantic retrieval with interleaved multi-condition queries and was trained on MERIT, the dataset introduced in our paper, [MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query](https://arxiv.org/abs/2506.03144).
|
|
|
|
|
CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. The loss function of CORAL consists of three components: Contrastive Learning Loss, Vision Reconstruction Loss, and Masked Language Modeling Loss. During training, we reconstruct both the query and its corresponding positive sample. |
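As a rough, non-authoritative sketch of how these pieces fit together (the exact formulation, temperature, and loss weights below are placeholders for illustration, not the released training code), the contrastive term can be written as an in-batch InfoNCE loss over normalized query and positive embeddings, and the overall objective as a weighted sum of the three components:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over L2-normalized embeddings.

    Row i of `pos_emb` is the positive for row i of `query_emb`; the other
    rows in the batch act as negatives. `temperature` is a placeholder value.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    logits = query_emb @ pos_emb.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def coral_objective(l_contrastive, l_vision_recon, l_mlm, w_recon=1.0, w_mlm=1.0):
    # Weighted sum of the three components; the weights are assumptions.
    return l_contrastive + w_recon * l_vision_recon + w_mlm * l_mlm
```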
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Overview" style="width: 100%; max-width: 600px;"> |
|
|
</p> |
|
|
|
|
|
<p align="center"><b>Overview for CORAL</b></p> |
|
|
|
|
|
<p align="center"> |
|
|
<img src="images/example.jpg" alt="Example" style="width: 100%; max-width: 600px;"> |
|
|
</p> |
|
|
|
|
|
<p align="center"><b>Example Query and Ground Truth</b></p> |
|
|
|
|
|
## Usage |
|
|
|
|
|
**Transformers** |
|
|
|
|
|
We provide the CORAL checkpoint on Hugging Face. You can load the model with the following code:
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
|
|
|
|
|
|
# Initialize Model and Processor
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"Bia/CORAL", torch_dtype="auto", device_map="auto" |
|
|
) |
|
|
|
|
|
processor = AutoProcessor.from_pretrained("Bia/CORAL") |
|
|
|
|
|
# Prepare Inputs
|
|
query = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "}, |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "CORAL/images/product_1.png", |
|
|
}, |
|
|
{"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \times 12 \times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "}, |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "CORAL/images/product_2.png", |
|
|
}, |
|
|
{"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."} |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
candidate = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "text", "text": "Represent the given product: "}, |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "CORAL/images/product_3.png", |
|
|
}, |
|
|
{"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \times 9.5 \times 15 \mathrm{~cm}$."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
query_text = processor.apply_chat_template( |
|
|
query, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
|
|
|
candidate_text = processor.apply_chat_template( |
|
|
candidate, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
|
|
|
query_image_inputs, query_video_inputs = process_vision_info(query) |
|
|
|
|
|
candidate_image_inputs, candidate_video_inputs = process_vision_info(candidate) |
|
|
|
|
|
query_inputs = processor( |
|
|
text=[query_text], |
|
|
images=query_image_inputs, |
|
|
videos=query_video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
).to("cuda") |
|
|
|
|
|
candidate_inputs = processor( |
|
|
text=[candidate_text], |
|
|
images=candidate_image_inputs, |
|
|
videos=candidate_video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
).to("cuda") |
|
|
|
|
|
|
|
|
# Encode Embeddings |
|
|
with torch.inference_mode(): |
|
|
query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True) |
|
|
query_embedding = query_outputs.hidden_states[-1][:,-1,:] |
|
|
query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1) |
|
|
print(query_embedding.shape) # torch.Size([1, 2048]) |
|
|
|
|
|
candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True) |
|
|
candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:] |
|
|
candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1) |
|
|
print(candidate_embedding.shape) # torch.Size([1, 2048]) |
|
|
|
|
|
# Compute Similarity |
|
|
similarity = torch.matmul(query_embedding, candidate_embedding.T) |
|
|
print(similarity) # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16) |
|
|
``` |
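In practice a query is scored against many candidates rather than a single one. The snippet below is a minimal sketch, assuming you have encoded additional candidates with the same procedure as above and stacked their embeddings into one tensor; because the embeddings are L2-normalized, a single matrix product yields cosine similarities that can be sorted to rank the candidates.

```python
# Hypothetical ranking over several pre-computed candidate embeddings.
# `candidate_embeddings` is assumed to be a (num_candidates, 2048) tensor built
# by encoding each candidate as above and concatenating the results.
candidate_embeddings = torch.cat([candidate_embedding], dim=0)

scores = query_embedding @ candidate_embeddings.T         # (1, num_candidates)
ranking = torch.argsort(scores, dim=-1, descending=True)  # indices, best match first
print(scores)
print(ranking)
```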
|
|
|
|
|
## Evaluation |
|
|
|
|
|
We report experimental results for CORAL on the MERIT dataset.
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://merit-2025.github.io/static/images/part3/ablation.png" alt="CORAL Structure" style="width: 100%; max-width: 500px;"> |
|
|
</p> |
|
|
|
|
|
<p align="center"><b>Performance of CORAL on MERIT</b></p> |
|
|
|
|
|
## Citation |
|
|
Chow W, Gao Y, Li L, et al. MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query[J]. arXiv preprint arXiv:2506.03144, 2025. |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@article{chow2025merit, |
|
|
title={MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query}, |
|
|
author={Chow, Wei and Gao, Yuan and Li, Linfeng and Wang, Xian and Xu, Qi and Song, Hang and Kong, Lingdong and Zhou, Ran and Zeng, Yi and Cai, Yidong and others}, |
|
|
journal={arXiv preprint arXiv:2506.03144}, |
|
|
year={2025} |
|
|
}
```