---
license: apache-2.0
tags:
- pytorch
---
<a id="top"></a>
<div align="center">
<h1>MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval</h1>
<p>
<b>Kun Wang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Lirong Jie</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>
<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>
This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a framework explicitly designed to mitigate semantic and relationship redundancy in Image-Text Retrieval (ITR).
**Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)

**GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)
---
## Model Information
### 1. Model Name
**MEET** (iMage-text retrieval rEdundancy miTigation)
### 2. Task Type & Applicable Tasks
- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. It specifically addresses redundancy by mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.
### 3. Project Introduction
Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. **MEET** introduces an iMage-text retrieval rEdundancy miTigation framework to explicitly analyze and address the ITR problem from a redundancy perspective. This approach helps the model effectively produce compact yet highly discriminative representations for accurate and efficient retrieval.
> **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.
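To make the compact-code idea concrete, the sketch below binarizes continuous embeddings with a random projection and retrieves by Hamming distance over the resulting hash codes. This is a generic deep-hashing toy, not MEET's actual model; all names, dimensions, and the synthetic data are illustrative assumptions:

```python
import numpy as np

def binary_hash(feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Binarize projected features into compact codes via sign thresholding."""
    return (feats @ proj > 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, bits, n = 512, 64, 100
proj = rng.standard_normal((dim, bits))                    # shared random projection
img_emb = rng.standard_normal((n, dim))                    # stand-in image embeddings
txt_emb = img_emb + 0.01 * rng.standard_normal((n, dim))   # matched captions lie nearby

img_codes = binary_hash(img_emb, proj)
txt_codes = binary_hash(txt_emb, proj)

# Hamming-distance retrieval: count differing bits for every image/text pair
hamming = (img_codes[:, None, :] != txt_codes[None, :, :]).sum(axis=-1)
top1 = hamming.argmin(axis=1)                              # nearest caption per image
recall_at_1 = (top1 == np.arange(n)).mean()
```

The point of the sketch is the representation size: each item collapses from 512 floats to 64 bits, yet near-duplicate pairs still land at near-zero Hamming distance while unrelated pairs sit near 32 bits apart.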
### 4. Training Data Source
The model is evaluated using features from Bi-GRU and BERT on standard ITR datasets:
- **MSCOCO** (1K and 5K splits)
- **Flickr30K**
*(Splits produced by HREM)*
---
## Usage & Basic Inference
### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (the code was evaluated with Python >= 3.8 and PyTorch >= 1.7.0):
```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd TCSVT25-MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
- Obtain pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
- Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA) as an example).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`).
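A plausible layout for the data directory, inferred only from the folder names above (the authoritative structure is the data tree in the GitHub repository's README):

```
data/
├── coco_precomp/   # precomputed MSCOCO features
├── f30k_precomp/   # precomputed Flickr30K features
├── vocab/          # vocabulary files
└── VSE/            # pretrained VSE model checkpoints
```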
### Step 3: Run Training & Evaluation
**Evaluation:**
Depending on the text features in use, open the corresponding script (`at/lib/test.py` for Bi-GRU, `at_bert/lib/test.py` for BERT) and set `RUN_PATH` accordingly.

For the MSCOCO 1K 5-fold splits, first generate the folds:
```bash
python scripts/make_coco_1k_folds.py
```

Then run testing (make sure `MODEL_PATH` points to the correct VSE weights):
```bash
PYTHONPATH=. python -m lib.test
```
**Training from Scratch:**
Specify the dataset name (`coco_precomp` or `f30k_precomp`) after the `--data_name` flag:
```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name <dataset_name>
```
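One plausible reading of the quantization flags, given the paper's use of hashing and quantization, is that `--H` sets the hash length while `--M` and `--K` set the number of codebooks and the entries per codebook; that mapping is an assumption, not documented behavior. The toy NumPy sketch below shows how M sub-codebooks of K entries each encode a vector into M small integer codes, product-quantization style:

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: list) -> np.ndarray:
    """Product-quantization-style encoding: split each vector into M
    sub-vectors; each picks its nearest codeword among K entries."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [
        ((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        for s, cb in zip(subs, codebooks)
    ]
    return np.stack(codes, axis=1)               # shape (N, M), entries in [0, K)

rng = np.random.default_rng(0)
N, D, M, K = 32, 64, 8, 8                        # M=8, K=8 mirror the flags above (assumed meaning)
x = rng.standard_normal((N, D))
codebooks = [rng.standard_normal((K, D // M)) for _ in range(M)]
codes = pq_encode(x, codebooks)                  # each vector compresses to M log2(K)-bit codes
```

With M = K = 8, each 64-dimensional float vector compresses to eight 3-bit codes, which is the kind of compact representation the redundancy-mitigation framing targets.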
---
## Limitations & Notes
**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (MSCOCO, Flickr30K) for full evaluation.
- While designed for redundancy mitigation, performance may still degrade under extreme domain shifts not covered by the training distribution.
---
## Acknowledgements & Contact
- **Acknowledgement:** Thanks to the [HREM](https://github.com/crossmodalgroup/hrem) open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.
---
## Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
```