---
license: apache-2.0
language:
- en
- zh
---
# ReMatch: Boosting Representation through Matching for Multimodal Retrieval
<p>
<a href="https://arxiv.org/abs/2511.19278"><img src="https://img.shields.io/badge/arXiv-2511.19278-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/FireRedTeam/ReMatch"><img src="https://img.shields.io/badge/Code-ReMatch-green.svg" alt="Code"></a>
<a href="https://huggingface.co/FireRedTeam/ReMatch-3B"><img src="https://img.shields.io/badge/Model-ReMatch--3B-yellow.svg" alt="Model"></a>
</p>
This repository contains the official implementation of **ReMatch**, accepted to **CVPR 2026**.
ReMatch turns a multimodal large language model into a stronger multimodal retriever by adding a chat-style generative matching objective during training. The same MLLM learns to judge query-document relevance from both raw multimodal inputs and projected embeddings, complementing standard contrastive learning with instance-wise supervision on hard negatives. ReMatch also augments each input with multiple learnable representation tokens and fuses them into an efficient single-vector embedding for retrieval.
## Authors
[Qianying Liu](https://scholar.google.com/citations?hl=zh-TW&user=QnMV-uYAAAAJ&view_op=list_works&sortby=pubdate)\*, Xiao Liang\*, Zhiqiang Zhang#, Yibo Chen, Xu Tang, Zhongfei Qing, Fengfan Zhou, Yao Hu, Paul Henderson
University of Glasgow, Xiaohongshu Inc., Huazhong University of Science and Technology
\* Equal contribution. # Project leader.
## Method

ReMatch is built around two core ideas:
- **Query-Document Matching**: an additional autoregressive matching stage that predicts relevance from the query, document, and their projected embeddings.
- **Learnable Multi-Token Embeddings**: multiple learnable tokens capture fine-grained contextual signals; an orthogonality regularizer encourages complementary representations, and the fused output remains a standard dense embedding.
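
To make the second idea more concrete, here is a minimal sketch in plain Python of what an orthogonality penalty and an average fusion over several token embeddings could look like. It is an illustration under assumptions, not the repository's implementation: the names `orthogonality_penalty` and `fuse_average` are hypothetical, the penalty shown (mean squared cosine similarity over distinct token pairs) is only one common choice, and the fusion actually used by ReMatch may differ.

```python
import math

def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def orthogonality_penalty(tokens):
    """Mean squared cosine similarity over all distinct token pairs (i < j).

    Hypothetical regularizer: 0.0 when the token embeddings are mutually
    orthogonal, 1.0 when they are all parallel.
    """
    unit = [_normalize(t) for t in tokens]
    sims = [
        sum(a * b for a, b in zip(unit[i], unit[j])) ** 2
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    ]
    return sum(sims) / len(sims) if sims else 0.0

def fuse_average(tokens):
    """Fuse several token embeddings into one unit-norm vector by averaging."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return _normalize(mean)

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
parallel = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
print(orthogonality_penalty(orthogonal))  # 0.0
print(orthogonality_penalty(parallel))    # 1.0
print(len(fuse_average(orthogonal)))      # 3: still a single dense vector
```

The penalty is low when the tokens point in different directions, which is the sense in which a regularizer of this kind encourages complementary representations while the fused output stays a single vector.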
## News
- **2026-05**: ReMatch code, the **ReMatch-3B** checkpoint, and evaluation scripts are released.
- **2026-02**: ReMatch is accepted to **CVPR 2026**.
- **2025-11**: The ReMatch technical report is available on arXiv.
## Installation
```bash
conda create -n rematch python=3.10 -y
conda activate rematch
pip install -r requirements.txt
```
`flash-attn` can be sensitive to CUDA, PyTorch, and compiler versions. If installation fails, install the wheel matching your environment from the official FlashAttention release instructions, then rerun the remaining dependencies.
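
When diagnosing a failed `flash-attn` install, it can help to first record which versions are actually present. The snippet below is a small sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of this repository.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

for name, version in installed_versions(["torch", "flash-attn"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If PyTorch is installed, `torch.version.cuda` additionally reports the CUDA version it was built against, which is the value to match when picking a wheel.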
## Checkpoints
We release **ReMatch-3B**, a Qwen2.5-VL-3B based checkpoint trained with the ReMatch recipe:
- [FireRedTeam/ReMatch-3B](https://huggingface.co/FireRedTeam/ReMatch-3B)
For local checkpoints, pass the base model through `--model_name` and the adapter/full checkpoint through `--checkpoint_path` when evaluating.
## Training
The public ReMatch-3B training entry point is:
```bash
bash experiments/public/rematch/train-rematch-itm.sh
```
Before training, download the mmE5 hard-negative MMEB training data from Hugging Face:
- [intfloat/mmE5-MMEB-hardneg](https://huggingface.co/datasets/intfloat/mmE5-MMEB-hardneg)
In addition to mmE5, please follow the original [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec) data preparation instructions to download the corresponding MMEB training and evaluation data used by the public configs in this repository.
Then edit [experiments/public/rematch/train_image_mme5_hardneg.yaml](experiments/public/rematch/train_image_mme5_hardneg.yaml) and replace every `DATASET_BASE_PATH` with the directory that contains your `mmE5/` folder. The expected layout is:
```text
DATASET_BASE_PATH/
└── mmE5/
    └── mmE5-MMEB-hardneg/
```
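
Replacing the placeholder by hand is error-prone when it appears many times, so the substitution can also be scripted. The following is a sketch that treats the config as plain text; the helper names `fill_dataset_base_path` and `patch_config` are hypothetical, and the YAML line in the example is made up for illustration rather than copied from the real config.

```python
from pathlib import Path

PLACEHOLDER = "DATASET_BASE_PATH"

def fill_dataset_base_path(config_text, base_dir):
    """Replace every placeholder occurrence with base_dir (trailing slash dropped)."""
    return config_text.replace(PLACEHOLDER, str(base_dir).rstrip("/"))

def patch_config(config_path, base_dir):
    """Rewrite the config file in place and return the number of replacements."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(PLACEHOLDER)
    path.write_text(fill_dataset_base_path(text, base_dir), encoding="utf-8")
    return count

example = "image_dir: DATASET_BASE_PATH/mmE5/images\n"  # made-up YAML line
print(fill_dataset_base_path(example, "/data/"), end="")
# prints: image_dir: /data/mmE5/images
```

Calling `patch_config("experiments/public/rematch/train_image_mme5_hardneg.yaml", "/your/data/root")` would apply the same substitution to the training config; keep a copy of the original file if you want to retain the placeholders.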
The default script trains a Qwen2.5-VL-3B based ReMatch model with LoRA, 16 learnable query tokens, residual average fusion, orthogonal regularization, and the matching objective enabled. You can override common paths without editing the script:
```bash
EXP_DIR=/path/to/output \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bash experiments/public/rematch/train-rematch-itm.sh
```
## Evaluation
Evaluation configs are under [experiments/public/eval](experiments/public/eval):
- [image.yaml](experiments/public/eval/image.yaml)
Please prepare the MMEB evaluation data following the original [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec) instructions, then set `DATA_BASEDIR` to the directory containing the downloaded evaluation files.
> **Note:** Evaluation scores may vary slightly across environments, as different PyTorch, CUDA, and `flash-attn` versions can introduce small numerical differences.
For checkpoints produced by this repository, we recommend using `eval_all.py`. It reads the experiment name and automatically matches the evaluation configuration used by ReMatch, including backbone type, target-side instruction prefix, chat template, learnable query tokens, and residual embedding fusion. For example, an experiment name containing `Qwen2.5vl`, `TgtInstruction`, `Queries16`, `ResidualAvg`, and `ChatTemplate` will be evaluated with the corresponding `qwen2_5_vl`, target instruction, 16 learnable tokens, average residual fusion, and chat-template settings.
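
The name-to-configuration matching described above can be pictured with a short sketch. This is not the code in `eval_all.py`: the function `parse_experiment_name` is hypothetical, it relies on simple substring and regex checks, and the dictionary keys are borrowed from the `eval.py` flags shown later in this README purely for readability.

```python
import re

def parse_experiment_name(name):
    """Infer evaluation settings from substrings of an experiment name."""
    queries = re.search(r"Queries(\d+)", name)
    residual = re.search(r"Residual([A-Z][a-z]+)", name)
    return {
        "model_backbone": "qwen2_5_vl" if "Qwen2.5vl" in name else None,
        "tgt_prefix_instruction": "TgtInstruction" in name,
        "enable_chat_template": "ChatTemplate" in name,
        "learnable_queries": queries is not None,
        "num_queries": int(queries.group(1)) if queries else 0,
        "residual_embedding": residual is not None,
        "residual_embedding_method": residual.group(1).lower() if residual else None,
    }

# Shortened, hypothetical experiment name for illustration.
config = parse_experiment_name(
    "Rematch_Qwen2.5vl_3B.TgtInstruction.Queries16.ResidualAvg.ChatTemplate"
)
print(config["model_backbone"], config["num_queries"], config["residual_embedding_method"])
# prints: qwen2_5_vl 16 avg
```

Substring matching is used rather than splitting on `.` because tokens such as `Qwen2.5vl` themselves contain dots.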
Evaluate one experiment checkpoint:
```bash
DATA_BASEDIR=/path/to/vlm2vec_eval \
MODEL_BASEDIR=/path/to/training/outputs \
OUTPUT_BASEDIR=/path/to/eval/outputs \
MODALITIES="image" \
python eval_all.py \
--model_name Rematch_Qwen2.5vl_3B.image.autoresize.lora32.loraAlpha64.BS1024.IB64.GCq32p32NormTemp002.lr1e4.step3kwarm100.lrCosine.TgtInstruction.mmE5H1.Queries16.ResidualAvg.OrthTriu0.2.ChatTemplate.ITM.V1.Ratio0.1 \
--checkpoint_name checkpoint-2200
```
If no arguments are provided, `eval_all.py` scans `outputs/<model_name>/<checkpoint_name>/`, evaluates every checkpoint directory, and writes summaries to:
```text
outputs/evals/<model_name>/<checkpoint_name>/final_results.json
```
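
To compare several checkpoints, the summary files can be gathered with a few lines of Python. This is a sketch under the assumption that the files follow the path pattern above; the helper name `collect_results` is hypothetical, and because the JSON schema is not documented here, the contents are loaded as-is without interpreting any fields.

```python
import json
from pathlib import Path

def collect_results(evals_root="outputs/evals"):
    """Map (model_name, checkpoint_name) to the parsed final_results.json."""
    results = {}
    for path in sorted(Path(evals_root).glob("*/*/final_results.json")):
        # The two parent directories encode the model and checkpoint names.
        model_name, checkpoint_name = path.parts[-3], path.parts[-2]
        with path.open(encoding="utf-8") as handle:
            results[(model_name, checkpoint_name)] = json.load(handle)
    return results
```

Calling `collect_results()` from the repository root returns a dictionary keyed by `(model_name, checkpoint_name)`, or an empty dictionary if no evaluation has been run yet.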
For the released **ReMatch-3B** checkpoint, use `eval.py` directly and pass the matching ReMatch configuration explicitly:
```bash
torchrun --nproc_per_node=8 --master_port=2277 eval.py \
--lora True \
--pooling eos \
--normalize true \
--tgt_prefix_instruction True \
--learnable_queries True \
--residual_embedding True \
--residual_embedding_method avg \
--enable_chat_template True \
--num_queries 16 \
--per_device_eval_batch_size 16 \
--model_backbone qwen2_5_vl \
--model_name ReMatch-3B-PATH \
--checkpoint_path ReMatch-3B-PATH \
--dataset_config experiments/public/eval/image.yaml \
--encode_output_path outputs/evals/ReMatch-3B/image \
--data_basedir /path/to/MMEB
```
## Acknowledgements
This codebase is built on top of [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec). We sincerely thank the VLM2Vec authors for releasing their training and evaluation infrastructure for massive multimodal embedding tasks.
We also thank the authors of Qwen2.5-VL, MMEB, and mmE5 for their open models, benchmarks, and data resources.
## Citation
```bibtex
@article{liu2025rematch,
title={ReMatch: Boosting Representation through Matching for Multimodal Retrieval},
author={Liu, Qianying and Liang, Xiao and Zhang, Zhiqiang and Chen, Yibo and Tang, Xu and Qing, Zhongfei and Zhou, Fengfan and Hu, Yao and Henderson, Paul},
journal={arXiv preprint arXiv:2511.19278},
year={2025}
}
```