---
license: apache-2.0
datasets:
- WeiChow/merit
language:
- en
- id
- ms
- th
- vi
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
<h1 align="center" style="line-height: 50px;">
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query
</h1>
<div align="center">
[Paper](https://arxiv.org/abs/2506.03144)
[Dataset](https://huggingface.co/datasets/WeiChow/merit)
[Model](https://huggingface.co/Bia/CORAL)
[Code](https://github.com/weichow23/merit)
[Project Page](https://merit-2025.github.io/)
</div>
## Model Details
We introduce CORAL (Contrastive Reconstruction for Multimodal Retrieval), a multimodal embedding model built upon Qwen2.5-3B-Instruct. CORAL supports interleaved multi-condition semantic retrieval queries and was trained on MERIT, a new dataset introduced in our paper, [MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query](http://arxiv.org/abs/2506.03144).
The training objective of CORAL combines three components: a contrastive learning loss, a vision reconstruction loss, and a masked language modeling loss. During training, we reconstruct both the query and its corresponding positive sample.
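To make the three-part objective concrete, here is a minimal sketch of how such a combined loss could be assembled. This is an illustration, not the authors' released implementation: the InfoNCE form of the contrastive term, the MSE form of the vision-reconstruction term, the loss weights, and all function/argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def coral_style_loss(q_emb, pos_emb, vis_recon, vis_target,
                     mlm_logits, mlm_labels,
                     temperature=0.07, w_vis=1.0, w_mlm=1.0):
    """Sketch of a three-part objective: contrastive + vision
    reconstruction + masked language modeling (weights are assumptions)."""
    # Contrastive term: in-batch InfoNCE over normalized query/positive pairs
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0))
    l_con = F.cross_entropy(logits, targets)
    # Vision reconstruction term: regress reconstructed visual features
    # onto target features (MSE is an assumed choice)
    l_vis = F.mse_loss(vis_recon, vis_target)
    # MLM term: cross-entropy over masked token positions (-100 = ignored)
    l_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    return l_con + w_vis * l_vis + w_mlm * l_mlm
```

In practice the three terms would be computed from the model's retrieval head, vision decoder, and language-modeling head respectively; only the overall structure (a weighted sum of the three losses) is taken from the description above.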
<p align="center">
<img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Overview" style="width: 100%; max-width: 600px;">
</p>
<p align="center"><b>Overview of CORAL</b></p>
<p align="center">
<img src="images/example.jpg" alt="Example" style="width: 100%; max-width: 600px;">
</p>
<p align="center"><b>Example Query and Ground Truth</b></p>
## Usage
**Transformers**
We provide the checkpoint of CORAL on Hugging Face. You can load the model with the following code:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Initialize model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Bia/CORAL", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Bia/CORAL")

# Prepare inputs. Backslashes in the LaTeX-style dimensions are escaped
# so that e.g. "\times" is not read as a tab escape followed by "imes".
query = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "},
            {"type": "image", "image": "CORAL/images/product_1.png"},
            # Product title in Indonesian: "MOSSDOOM polyester backpack with a laptop
            # compartment and large storage, size 30 x 12 x 38 cm, weight 0.32 kg."
            {"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \\times 12 \\times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "},
            {"type": "image", "image": "CORAL/images/product_2.png"},
            {"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."},
        ],
    }
]
candidate = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Represent the given product: "},
            {"type": "image", "image": "CORAL/images/product_3.png"},
            {"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \\times 9.5 \\times 15 \\mathrm{~cm}$."},
        ],
    }
]

# Build chat-formatted prompt strings
query_text = processor.apply_chat_template(
    query, tokenize=False, add_generation_prompt=True
)
candidate_text = processor.apply_chat_template(
    candidate, tokenize=False, add_generation_prompt=True
)

# Extract image/video inputs from the messages
query_image_inputs, query_video_inputs = process_vision_info(query)
candidate_image_inputs, candidate_video_inputs = process_vision_info(candidate)

query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")
candidate_inputs = processor(
    text=[candidate_text],
    images=candidate_image_inputs,
    videos=candidate_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# Encode embeddings: take the final token's last hidden state, then L2-normalize
with torch.inference_mode():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1, :]
    query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
    print(query_embedding.shape)  # torch.Size([1, 2048])

    candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True)
    candidate_embedding = candidate_outputs.hidden_states[-1][:, -1, :]
    candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
    print(candidate_embedding.shape)  # torch.Size([1, 2048])

# Compute similarity (dot product of unit-normalized embeddings = cosine similarity)
similarity = torch.matmul(query_embedding, candidate_embedding.T)
print(similarity)  # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16)
```
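To go from scoring a single candidate to retrieving from a gallery, you can stack the normalized candidate embeddings and rank them by dot product, which equals cosine similarity after normalization. The helper below is a small sketch with hypothetical names; it is not part of the released code.

```python
import torch

def rank_candidates(query_embedding, candidate_embeddings, top_k=5):
    """Rank unit-normalized candidate embeddings against one query embedding.

    query_embedding: tensor of shape [1, D]
    candidate_embeddings: tensor of shape [N, D]
    Returns (indices, scores) of the top_k candidates, best first.
    """
    # Dot product against every candidate in one matmul -> [N] scores
    scores = candidate_embeddings @ query_embedding.squeeze(0)
    k = min(top_k, scores.numel())
    top_scores, top_idx = torch.topk(scores, k)
    return top_idx, top_scores
```

For large galleries, candidate embeddings would typically be precomputed once and stored, so each query costs only a single matrix-vector product.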
## Evaluation
We report experimental results of CORAL on the MERIT dataset below.
<p align="center">
<img src="https://merit-2025.github.io/static/images/part3/ablation.png" alt="CORAL Structure" style="width: 100%; max-width: 500px;">
</p>
<p align="center"><b>Performance of CORAL on MERIT</b></p>
## Citation
Chow W, Gao Y, Li L, et al. MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query[J]. arXiv preprint arXiv:2506.03144, 2025.
**BibTeX:**
```bibtex
@article{chow2025merit,
title={MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query},
author={Chow, Wei and Gao, Yuan and Li, Linfeng and Wang, Xian and Xu, Qi and Song, Hang and Kong, Lingdong and Zhou, Ran and Zeng, Yi and Cai, Yidong and others},
journal={arXiv preprint arXiv:2506.03144},
year={2025}
}
```