---
license: apache-2.0
datasets:
- WeiChow/merit
language:
- en
- id
- ms
- th
- vi
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

<h1 align="center" style="line-height: 50px;">
  MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query
</h1>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2506.03144-b31b1b.svg)](https://arxiv.org/abs/2506.03144)
[![Dataset](https://img.shields.io/badge/🤗%20Huggingface-Dataset-yellow)](https://huggingface.co/datasets/WeiChow/merit)
[![Checkpoint](https://img.shields.io/badge/🤗%20Huggingface-CKPT-blue)](https://huggingface.co/Bia/CORAL)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github)](https://github.com/weichow23/merit)
[![Page](https://img.shields.io/badge/Home-Page-b3.svg)](https://merit-2025.github.io/)

</div>


## Model Details
We introduce CORAL, a multi-modal embedding model built on Qwen2.5-3B-Instruct. CORAL supports interleaved multi-condition semantic retrieval queries and was trained on MERIT, the dataset introduced in our paper, [MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query](http://arxiv.org/abs/2506.03144).

CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. Its loss function combines three components: a contrastive learning loss, a vision reconstruction loss, and a masked language modeling loss. During training, we reconstruct both the query and its corresponding positive sample.
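Given the three components described above, the overall training objective can be sketched as a weighted sum. Note that the loss names and the weights $\lambda_1, \lambda_2$ below are our shorthand, not necessarily the paper's exact notation:

```latex
% Sketch of the three-part CORAL objective (notation is ours, not the paper's):
% contrastive learning (CL), vision reconstruction (VR), masked language modeling (MLM).
\mathcal{L}_{\mathrm{CORAL}}
  = \mathcal{L}_{\mathrm{CL}}
  + \lambda_{1}\,\mathcal{L}_{\mathrm{VR}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{MLM}}
```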

<p align="center">
  <img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Overview" style="width: 100%; max-width: 600px;">
</p>

<p align="center"><b>Overview of CORAL</b></p>

<p align="center">
  <img src="images/example.jpg" alt="Example" style="width: 100%; max-width: 600px;">
</p>

<p align="center"><b>Example Query and Ground Truth</b></p>

## Usage

**Transformers**

We provide the CORAL checkpoint on Hugging Face. You can load the model with the following code:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


# Initialize Model and Processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Bia/CORAL", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Bia/CORAL")

# Prepare Inputs
query = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "},
            {
                "type": "image",
                "image": "CORAL/images/product_1.png",
            },
            # Indonesian: "MOSSDOOM polyester backpack with a computer compartment and
            # large storage, size 30 x 12 x 38 cm, weight 0.32 kg."
            # Note the escaped backslashes: "\times" would otherwise contain a tab character.
            {"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \\times 12 \\times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "},
            {
                "type": "image",
                "image": "CORAL/images/product_2.png",
            },
            {"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."}
            ],
    }
]

candidate = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Represent the given product: "},
            {
                "type": "image",
                "image": "CORAL/images/product_3.png",
            },
            # Escaped backslashes: "\times" would otherwise contain a tab character.
            {"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \\times 9.5 \\times 15 \\mathrm{~cm}$."},
        ],
    }
]

query_text = processor.apply_chat_template(
    query, tokenize=False, add_generation_prompt=True
)

candidate_text = processor.apply_chat_template(
    candidate, tokenize=False, add_generation_prompt=True
)

query_image_inputs, query_video_inputs = process_vision_info(query)

candidate_image_inputs, candidate_video_inputs = process_vision_info(candidate)

query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

candidate_inputs = processor(
    text=[candidate_text],
    images=candidate_image_inputs,
    videos=candidate_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")


# Encode Embeddings
with torch.inference_mode():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
    query_embedding = query_outputs.hidden_states[-1][:,-1,:]
    query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
    print(query_embedding.shape)  # torch.Size([1, 2048])

    candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True)
    candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:]
    candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
    print(candidate_embedding.shape)  # torch.Size([1, 2048])

# Compute Similarity
similarity = torch.matmul(query_embedding, candidate_embedding.T)
print(similarity)  # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16)
```
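In practice, retrieval scores a query against a pool of candidates rather than a single one. The sketch below illustrates the ranking step only; the random tensors are stand-ins for embeddings produced by the model as shown above, and `rank_candidates` is a hypothetical helper, not part of the CORAL release:

```python
import torch

torch.manual_seed(0)

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted best-first by cosine similarity.

    Both inputs are assumed L2-normalized, so the dot product equals
    cosine similarity (as in the snippet above).
    """
    scores = candidate_embs @ query_emb.squeeze(0)  # shape: (num_candidates,)
    return torch.argsort(scores, descending=True)

# Stand-in embeddings with the same 2048-dim shape as the model outputs above.
query_emb = torch.nn.functional.normalize(torch.randn(1, 2048), dim=-1)
candidate_embs = torch.nn.functional.normalize(torch.randn(5, 2048), dim=-1)

order = rank_candidates(query_emb, candidate_embs)
print(order.tolist())  # candidate indices, best match first
```

Because the embeddings are normalized, ranking by dot product is equivalent to ranking by cosine similarity, which matches the `torch.matmul` scoring used above.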

## Evaluation

We report the experimental results of CORAL on the MERIT dataset.

<p align="center">
  <img src="https://merit-2025.github.io/static/images/part3/ablation.png" alt="CORAL Structure" style="width: 100%; max-width: 500px;">
</p>

<p align="center"><b>Performance of CORAL on MERIT</b></p>

## Citation 
Chow W, Gao Y, Li L, et al. MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query[J]. arXiv preprint arXiv:2506.03144, 2025.

**BibTeX:**
```bibtex
@article{chow2025merit,
  title={MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query},
  author={Chow, Wei and Gao, Yuan and Li, Linfeng and Wang, Xian and Xu, Qi and Song, Hang and Kong, Lingdong and Zhou, Ran and Zeng, Yi and Cai, Yidong and others},
  journal={arXiv preprint arXiv:2506.03144},
  year={2025}
}
```