---
license: apache-2.0
language:
- en
- zh
---
# ReMatch: Boosting Representation through Matching for Multimodal Retrieval

<p>
  <a href="https://arxiv.org/abs/2511.19278"><img src="https://img.shields.io/badge/arXiv-2511.19278-b31b1b.svg" alt="arXiv"></a>
  <a href="https://github.com/FireRedTeam/ReMatch"><img src="https://img.shields.io/badge/Code-ReMatch-green.svg" alt="Code"></a>
  <a href="https://huggingface.co/FireRedTeam/ReMatch-3B"><img src="https://img.shields.io/badge/Model-ReMatch--3B-yellow.svg" alt="Model"></a>
</p>

This repository contains the official implementation of **ReMatch**, accepted to **CVPR 2026**.

ReMatch turns a multimodal large language model into a stronger multimodal retriever by adding a chat-style generative matching objective during training. The same MLLM learns to judge query-document relevance from both raw multimodal inputs and projected embeddings, complementing standard contrastive learning with instance-wise supervision on hard negatives. ReMatch also augments each input with multiple learnable representation tokens and fuses them into an efficient single-vector embedding for retrieval.
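As a rough illustration of what retrieval with a single-vector embedding looks like at query time, the sketch below ranks documents by cosine similarity. It is not taken from the ReMatch codebase: the function names `l2_normalize` and `rank_documents` and the toy 3-dimensional vectors are invented for this example, and it uses plain Python lists instead of real model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by descending cosine similarity."""
    q = l2_normalize(query_emb)
    scores = []
    for idx, doc in enumerate(doc_embs):
        d = l2_normalize(doc)
        # After normalization, the dot product equals the cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    # Highest score first; ties are broken by the lower index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores]


# Toy embeddings, invented for illustration only.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
```

Because each query and document is a single vector, scoring a candidate costs one dot product, which is what keeps this style of retrieval compatible with standard dense indexes.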

## 👥 Authors

[Qianying Liu](https://scholar.google.com/citations?hl=zh-TW&user=QnMV-uYAAAAJ&view_op=list_works&sortby=pubdate)\*, Xiao Liang\*, Zhiqiang Zhang#, Yibo Chen, Xu Tang, Zhongfei Qing, Fengfan Zhou, Yao Hu, Paul Henderson

University of Glasgow, Xiaohongshu Inc., Huazhong University of Science and Technology

\* Equal contribution. # Project leader.

## πŸ” Method

![ReMatch framework](assets/rematch_framework.png)

ReMatch is built around two core ideas:

- **Query-Document Matching**: an additional autoregressive matching stage that predicts relevance from the query, document, and their projected embeddings.
- **Learnable Multi-Token Embeddings**: multiple learnable tokens capture fine-grained contextual signals; an orthogonality regularizer encourages complementary representations, and the fused output remains a standard dense embedding.
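The second idea can be sketched in a few lines. The exact formulation used by ReMatch is not spelled out in this README, so everything below is an assumption made for illustration: the penalty is taken to be the mean squared cosine similarity over distinct token pairs (the upper triangle of the Gram matrix), and the fusion is taken to be "base embedding plus the mean of the token embeddings, then L2-normalize". The names `orthogonality_penalty`, `residual_average_fusion`, and `base_emb` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def orthogonality_penalty(token_embs):
    """Mean squared cosine similarity over distinct token pairs.

    Assumed form of the regularizer: 0.0 when all tokens are mutually
    orthogonal, 1.0 when they all point in the same direction.
    """
    unit = [normalize(t) for t in token_embs]
    k = len(unit)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    return sum(dot(unit[i], unit[j]) ** 2 for i, j in pairs) / len(pairs)


def residual_average_fusion(base_emb, token_embs):
    """Add the mean token embedding to a base embedding, then normalize.

    Assumed form of the fusion: the output is one dense vector of the same
    dimension as the inputs.
    """
    k = len(token_embs)
    mean = [sum(t[d] for t in token_embs) / k for d in range(len(base_emb))]
    return normalize([b + m for b, m in zip(base_emb, mean)])


# Two orthogonal toy tokens incur no penalty; two parallel ones incur the maximum.
penalty_orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
penalty_parallel = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])
fused = residual_average_fusion([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Whatever the precise loss, the intent described above is the same: pushing the pairwise similarities toward zero discourages the tokens from collapsing onto one direction, while the fused output stays a single vector.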

## 🔥 News

- **2026-05**: ReMatch code, the **ReMatch-3B** checkpoint, and evaluation scripts are released.
- **2026-02**: ReMatch is accepted to **CVPR 2026**.
- **2025-11**: The ReMatch technical report is available on arXiv.

## πŸ› οΈ Installation

```bash
conda create -n rematch python=3.10 -y
conda activate rematch
pip install -r requirements.txt
```

`flash-attn` can be sensitive to CUDA, PyTorch, and compiler versions. If installation fails, install the wheel matching your environment by following the official FlashAttention release instructions, then rerun `pip install -r requirements.txt` to install the remaining dependencies.

## 🤗 Checkpoints

We release **ReMatch-3B**, a Qwen2.5-VL-3B based checkpoint trained with the ReMatch recipe:

- [FireRedTeam/ReMatch-3B](https://huggingface.co/FireRedTeam/ReMatch-3B)

For local checkpoints, pass the base model through `--model_name` and the adapter/full checkpoint through `--checkpoint_path` when evaluating.

## 🚀 Training

The public ReMatch-3B training entry point is:

```bash
bash experiments/public/rematch/train-rematch-itm.sh
```

Before training, download the mmE5 hard-negative MMEB training data from Hugging Face:

- [intfloat/mmE5-MMEB-hardneg](https://huggingface.co/datasets/intfloat/mmE5-MMEB-hardneg)

In addition to mmE5, please follow the original [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec) data preparation instructions to download the corresponding MMEB training and evaluation data used by the public configs in this repository.

Then edit [experiments/public/rematch/train_image_mme5_hardneg.yaml](experiments/public/rematch/train_image_mme5_hardneg.yaml) and replace every `DATASET_BASE_PATH` with the directory that contains your `mmE5/` folder. The expected layout is:

```text
DATASET_BASE_PATH/
└── mmE5/
    └── mmE5-MMEB-hardneg/
```
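If you prefer to script the placeholder replacement instead of editing the YAML by hand, a minimal sketch could look like the following. The helper names `fill_dataset_base_path` and `has_expected_layout` are hypothetical, and the sample config line is invented: the real keys in `train_image_mme5_hardneg.yaml` may differ. The replacement is a plain text substitution, so it does not depend on a YAML parser.

```python
from pathlib import Path

PLACEHOLDER = "DATASET_BASE_PATH"


def fill_dataset_base_path(config_text, base_dir):
    """Replace every placeholder occurrence with the given base directory."""
    return config_text.replace(PLACEHOLDER, str(base_dir).rstrip("/"))


def has_expected_layout(base_dir):
    """Check that <base_dir>/mmE5/mmE5-MMEB-hardneg/ exists as a directory."""
    return (Path(base_dir) / "mmE5" / "mmE5-MMEB-hardneg").is_dir()


# Hypothetical config fragment; the real YAML keys may differ.
sample = "image_dir: DATASET_BASE_PATH/mmE5/mmE5-MMEB-hardneg/images\n"
filled = fill_dataset_base_path(sample, "/data")
```

To apply it to the real file, read the YAML with `Path(...).read_text()`, pass the text through `fill_dataset_base_path`, and write the result back with `Path(...).write_text()`. Calling `has_expected_layout` first helps catch a wrong base directory before training starts.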

The default script trains a Qwen2.5-VL-3B based ReMatch model with LoRA, 16 learnable query tokens, residual average fusion, orthogonal regularization, and the matching objective enabled. You can override common paths without editing the script:

```bash
EXP_DIR=/path/to/output \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bash experiments/public/rematch/train-rematch-itm.sh
```

## 📊 Evaluation

Evaluation configs are under [experiments/public/eval](experiments/public/eval):

- [image.yaml](experiments/public/eval/image.yaml)

Please prepare the MMEB evaluation data following the original [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec) instructions, then set `DATA_BASEDIR` to the directory containing the downloaded evaluation files.

> **Note:** Evaluation scores may vary slightly across environments, as different PyTorch, CUDA, and `flash-attn` versions can introduce small numerical differences.

For checkpoints produced by this repository, we recommend using `eval_all.py`. It reads the experiment name and automatically selects the matching ReMatch evaluation configuration: backbone type, target-side instruction prefix, chat template, learnable query tokens, and residual embedding fusion. For example, an experiment name containing `Qwen2.5vl`, `TgtInstruction`, `Queries16`, `ResidualAvg`, and `ChatTemplate` is evaluated with the `qwen2_5_vl` backbone, the target instruction prefix, 16 learnable tokens, average residual fusion, and the chat template enabled.
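The name-based matching described above can be pictured with the sketch below. It is an illustration only and not the actual logic in `eval_all.py`: the function `infer_eval_settings` is hypothetical, the regular expressions are assumptions, and the dictionary keys simply mirror the `eval.py` flags shown later in this README.

```python
import re


def infer_eval_settings(experiment_name):
    """Map markers in an experiment name to evaluation settings (illustrative)."""
    settings = {
        "model_backbone": "qwen2_5_vl" if "Qwen2.5vl" in experiment_name else None,
        "tgt_prefix_instruction": "TgtInstruction" in experiment_name,
        "enable_chat_template": "ChatTemplate" in experiment_name,
        "learnable_queries": False,
        "num_queries": 0,
        "residual_embedding": False,
        "residual_embedding_method": None,
    }
    # "Queries16" -> 16 learnable query tokens.
    queries = re.search(r"Queries(\d+)", experiment_name)
    if queries:
        settings["learnable_queries"] = True
        settings["num_queries"] = int(queries.group(1))
    # "ResidualAvg" -> residual fusion with method "avg".
    residual = re.search(r"Residual([A-Za-z]+)", experiment_name)
    if residual:
        settings["residual_embedding"] = True
        settings["residual_embedding_method"] = residual.group(1).lower()
    return settings


example_name = (
    "Rematch_Qwen2.5vl_3B.image.autoresize.lora32.loraAlpha64.BS1024.IB64."
    "GCq32p32NormTemp002.lr1e4.step3kwarm100.lrCosine.TgtInstruction.mmE5H1."
    "Queries16.ResidualAvg.OrthTriu0.2.ChatTemplate.ITM.V1.Ratio0.1"
)
example_settings = infer_eval_settings(example_name)
```

The practical consequence is that renaming an experiment directory can change how it is evaluated, so keep the markers in the name consistent with the options used for training.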

Evaluate one experiment checkpoint:

```bash
DATA_BASEDIR=/path/to/vlm2vec_eval \
MODEL_BASEDIR=/path/to/training/outputs \
OUTPUT_BASEDIR=/path/to/eval/outputs \
MODALITIES="image" \
python eval_all.py \
  --model_name Rematch_Qwen2.5vl_3B.image.autoresize.lora32.loraAlpha64.BS1024.IB64.GCq32p32NormTemp002.lr1e4.step3kwarm100.lrCosine.TgtInstruction.mmE5H1.Queries16.ResidualAvg.OrthTriu0.2.ChatTemplate.ITM.V1.Ratio0.1 \
  --checkpoint_name checkpoint-2200
```

If no arguments are provided, `eval_all.py` scans `outputs/<model_name>/<checkpoint_name>/`, evaluates every checkpoint directory, and writes summaries to:

```text
outputs/evals/<model_name>/<checkpoint_name>/final_results.json
```
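To locate summaries programmatically, a small helper along these lines may be convenient. It is a sketch under stated assumptions: the functions `list_checkpoints` and `results_path` are hypothetical, checkpoint directories are assumed to be named `checkpoint-<step>` (as in the `checkpoint-2200` example above), and the default `outputs` root follows the path pattern shown above. The contents of `final_results.json` are not assumed.

```python
import re
from pathlib import Path

# Assumed naming scheme for checkpoint directories, e.g. "checkpoint-2200".
CHECKPOINT_PATTERN = re.compile(r"checkpoint-(\d+)$")


def list_checkpoints(model_dir):
    """Return checkpoint directory names under model_dir, sorted by step."""
    found = []
    for child in Path(model_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child.name))
    return [name for _, name in sorted(found)]


def results_path(model_name, checkpoint_name, root="outputs"):
    """Build the summary path for one evaluated checkpoint."""
    return Path(root) / "evals" / model_name / checkpoint_name / "final_results.json"
```

For example, iterating over `list_checkpoints("outputs/<model_name>")` and calling `results_path` for each entry gives the files to open once evaluation has finished.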

For the released **ReMatch-3B** checkpoint, use `eval.py` directly and pass the matching ReMatch configuration explicitly:

```bash
torchrun --nproc_per_node=8 --master_port=2277 eval.py \
  --lora True \
  --pooling eos \
  --normalize true \
  --tgt_prefix_instruction True \
  --learnable_queries True \
  --residual_embedding True \
  --residual_embedding_method avg \
  --enable_chat_template True \
  --num_queries 16 \
  --per_device_eval_batch_size 16 \
  --model_backbone qwen2_5_vl \
  --model_name ReMatch-3B-PATH \
  --checkpoint_path ReMatch-3B-PATH \
  --dataset_config experiments/public/eval/image.yaml \
  --encode_output_path outputs/evals/ReMatch-3B/image \
  --data_basedir /path/to/MMEB
```

## πŸ™ Acknowledgements

This codebase is built on top of [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec). We sincerely thank the VLM2Vec authors for releasing their training and evaluation infrastructure for massive multimodal embedding tasks.

We also thank the authors of Qwen2.5-VL, MMEB, and mmE5 for their open models, benchmarks, and data resources.

## 📚 Citation

```bibtex
@article{liu2025rematch,
  title={ReMatch: Boosting Representation through Matching for Multimodal Retrieval},
  author={Liu, Qianying and Liang, Xiao and Zhang, Zhiqiang and Chen, Yibo and Tang, Xu and Qing, Zhongfei and Zhou, Fengfan and Hu, Yao and Henderson, Paul},
  journal={arXiv preprint arXiv:2511.19278},
  year={2025}
}
```