---
license: apache-2.0
datasets:
- WeiChow/merit
language:
- en
- id
- ms
- th
- vi
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

<h1 align="center" style="line-height: 50px;">
  MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query
</h1>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2506.03144-b31b1b.svg)](https://arxiv.org/abs/2506.03144)
[![Dataset](https://img.shields.io/badge/🤗%20Huggingface-Dataset-yellow)](https://huggingface.co/datasets/WeiChow/merit)
[![Checkpoint](https://img.shields.io/badge/🤗%20Huggingface-CKPT-blue)](https://huggingface.co/Bia/CORAL)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github)](https://github.com/weichow23/merit)
[![Page](https://img.shields.io/badge/Home-Page-b3.svg)](https://merit-2025.github.io/)

</div>


## Model Details
We introduce CORAL, a multi-modal embedding model built on Qwen2.5-3B-Instruct. CORAL supports interleaved multi-condition semantic retrieval queries and was trained on MERIT, the dataset introduced in our paper, [MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query](http://arxiv.org/abs/2506.03144).

CORAL is short for Contrastive Reconstruction for Multimodal Retrieval. Its loss function combines three components: a contrastive learning loss, a vision reconstruction loss, and a masked language modeling loss. During training, we reconstruct both the query and its corresponding positive sample.
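Given the three components described above, the overall training objective can be sketched as a weighted sum. Note that the loss names and the weights $\lambda_1, \lambda_2$ below are our shorthand, not necessarily the paper's exact notation:

```latex
% Sketch of the three-part CORAL objective (notation is ours, not the paper's):
% contrastive learning (CL), vision reconstruction (VR), masked language modeling (MLM).
\mathcal{L}_{\mathrm{CORAL}}
  = \mathcal{L}_{\mathrm{CL}}
  + \lambda_{1}\,\mathcal{L}_{\mathrm{VR}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{MLM}}
```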

<p align="center">
  <img src="https://merit-2025.github.io/static/images/part3/method.png" alt="CORAL Overview" style="width: 100%; max-width: 600px;">
</p>

<p align="center"><b>Overview of CORAL</b></p>

<p align="center">
  <img src="images/example.jpg" alt="Example" style="width: 100%; max-width: 600px;">
</p>

<p align="center"><b>Example Query and Ground Truth</b></p>

## Usage

**Transformers**

We provide the CORAL checkpoint on Hugging Face. You can load the model with the following code:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


# Initialize Model and Processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Bia/CORAL", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Bia/CORAL")

# Prepare Inputs
query = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a product of backpack that have the same brand with <Product 1> \n "},
            {
                "type": "image",
                "image": "CORAL/images/product_1.png",
            },
            # Indonesian: "MOSSDOOM polyester backpack with a computer compartment and
            # large storage, size 30 x 12 x 38 cm, weight 0.32 kg."
            # Note the escaped backslashes: "\times" would otherwise contain a tab character.
            {"type": "text", "text": "\n Ransel MOSSDOOM Polyester dengan Ruang Komputer dan Penyimpanan Besar, Ukuran $30 \\times 12 \\times 38$ cm , Berat 0.32 kg. </Product 1> and the same fashion style with <Product 2> "},
            {
                "type": "image",
                "image": "CORAL/images/product_2.png",
            },
            {"type": "text", "text": "\n Elegant Pink Flats with Low Heel and Buckle Closure for Stylish Party Wear </Product 2> with a quilted texture and a chain strap."}
            ],
    }
]

candidate = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Represent the given product: "},
            {
                "type": "image",
                "image": "CORAL/images/product_3.png",
            },
            # Escaped backslashes: "\times" would otherwise contain a tab character.
            {"type": "text", "text": "\n MOSSDOOM Elegant Pink PU Leather Handbag with Chain Strap and Large Capacity, Compact Size $18 \\times 9.5 \\times 15 \\mathrm{~cm}$."},
        ],
    }
]

query_text = processor.apply_chat_template(
    query, tokenize=False, add_generation_prompt=True
)

candidate_text = processor.apply_chat_template(
    candidate, tokenize=False, add_generation_prompt=True
)

query_image_inputs, query_video_inputs = process_vision_info(query)

candidate_image_inputs, candidate_video_inputs = process_vision_info(candidate)

query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

candidate_inputs = processor(
    text=[candidate_text],
    images=candidate_image_inputs,
    videos=candidate_video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")


# Encode Embeddings
with torch.inference_mode():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
    query_embedding = query_outputs.hidden_states[-1][:,-1,:]
    query_embedding = torch.nn.functional.normalize(query_embedding, dim=-1)
    print(query_embedding.shape)  # torch.Size([1, 2048])

    candidate_outputs = model(**candidate_inputs, return_dict=True, output_hidden_states=True)
    candidate_embedding = candidate_outputs.hidden_states[-1][:,-1,:]
    candidate_embedding = torch.nn.functional.normalize(candidate_embedding, dim=-1)
    print(candidate_embedding.shape)  # torch.Size([1, 2048])

# Compute Similarity
similarity = torch.matmul(query_embedding, candidate_embedding.T)
print(similarity)  # tensor([[0.6992]], device='cuda:0', dtype=torch.bfloat16)
```
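In practice, retrieval scores a query against a pool of candidates rather than a single one. The sketch below illustrates the ranking step only; the random tensors are stand-ins for embeddings produced by the model as shown above, and `rank_candidates` is a hypothetical helper, not part of the CORAL release:

```python
import torch

torch.manual_seed(0)

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted best-first by cosine similarity.

    Both inputs are assumed L2-normalized, so the dot product equals
    cosine similarity (as in the snippet above).
    """
    scores = candidate_embs @ query_emb.squeeze(0)  # shape: (num_candidates,)
    return torch.argsort(scores, descending=True)

# Stand-in embeddings with the same 2048-dim shape as the model outputs above.
query_emb = torch.nn.functional.normalize(torch.randn(1, 2048), dim=-1)
candidate_embs = torch.nn.functional.normalize(torch.randn(5, 2048), dim=-1)

order = rank_candidates(query_emb, candidate_embs)
print(order.tolist())  # candidate indices, best match first
```

Because the embeddings are normalized, ranking by dot product is equivalent to ranking by cosine similarity, which matches the `torch.matmul` scoring used above.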

## Evaluation

We report the experimental results of CORAL on the MERIT dataset.

<p align="center">
  <img src="https://merit-2025.github.io/static/images/part3/ablation.png" alt="CORAL Structure" style="width: 100%; max-width: 500px;">
</p>

<p align="center"><b>Performance of CORAL on MERIT</b></p>

## Citation 
Chow W, Gao Y, Li L, et al. MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query[J]. arXiv preprint arXiv:2506.03144, 2025.

**BibTeX:**
```bibtex
@article{chow2025merit,
  title={MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query},
  author={Chow, Wei and Gao, Yuan and Li, Linfeng and Wang, Xian and Xu, Qi and Song, Hang and Kong, Lingdong and Zhou, Ran and Zeng, Yi and Cai, Yidong and others},
  journal={arXiv preprint arXiv:2506.03144},
  year={2025}
}
```