---
license: apache-2.0
tags:
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval</h1>

<p>
<b>Kun Wang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Lirong Jie</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a novel framework explicitly designed to address semantic and relationship redundancy in Image-Text Retrieval (ITR).

**Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)
**GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)

---

## Model Information

### 1. Model Name
**MEET** (iMage-text retrieval rEdundancy miTigation)

### 2. Task Type & Applicable Tasks
- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. MEET specifically targets semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.

### 3. Project Introduction
Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. **MEET** introduces an iMage-text retrieval rEdundancy miTigation framework that explicitly analyzes and addresses the ITR problem from a redundancy perspective, helping the model produce compact yet highly discriminative representations for accurate and efficient retrieval.

> **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end training, diverse feature encoders, and unified optimization.
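
To make the alignment-refinement idea concrete, below is a minimal PyTorch sketch of the *general* technique the highlight describes: filtering overly similar (likely false) negatives and reweighting the remaining hard negatives. It is an illustration under assumed names and thresholds, not the official MEET loss.

```python
import torch

def filtered_reweighted_triplet_loss(sim, margin=0.2, false_neg_thresh=0.9, tau=10.0):
    """Illustrative sketch, NOT the official MEET objective.

    sim: (B, B) image-text similarity matrix; the diagonal holds matched pairs.
    """
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    pos = sim.diag().view(B, 1)

    # Hinge violations for image->text (rows) and text->image (columns).
    cost_i2t = (margin + sim - pos).clamp(min=0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)

    # Keep only plausible true negatives: off-diagonal pairs that are not so
    # similar to the anchor that they are likely unlabeled semantic matches.
    keep = (sim < false_neg_thresh) & ~eye
    cost_i2t = cost_i2t.masked_fill(~keep, 0)
    cost_t2i = cost_t2i.masked_fill(~keep, 0)

    # Adaptively reweight surviving negatives by difficulty (harder -> heavier).
    w_i2t = torch.softmax(tau * cost_i2t, dim=1).detach()
    w_t2i = torch.softmax(tau * cost_t2i, dim=0).detach()
    return (w_i2t * cost_i2t).sum(1).mean() + (w_t2i * cost_t2i).sum(0).mean()

# Example: random similarities for a batch of 8 image-text pairs.
loss = filtered_reweighted_triplet_loss(torch.randn(8, 8))
```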

### 4. Training Data Source
The model is trained and evaluated with Bi-GRU and BERT text features on standard ITR datasets:
- **MSCOCO** (1K and 5K test splits)
- **Flickr30K**

*(Splits produced by HREM.)*

---

## Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0). Note that the version specifiers must be quoted so the shell does not interpret `>=` as a redirection:

```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd TCSVT25-MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```
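
As a quick, repository-agnostic sanity check that the installed versions meet the minimums stated above:

```python
import torch
import torchvision
import transformers

print("torch:", torch.__version__)                # expect >= 1.7.0
print("torchvision:", torchvision.__version__)    # expect >= 0.8.0
print("transformers:", transformers.__version__)  # expect >= 2.1.1
print("CUDA available:", torch.cuda.is_available())
```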

### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
   - Obtain the pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
   - Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA)).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`); an illustrative layout is sketched below.
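
The authoritative tree lives in the GitHub README; purely as an illustration of the directory names mentioned above, the layout might look like:

```
data/
├── coco_precomp/   # precomputed MSCOCO features
├── f30k_precomp/   # precomputed Flickr30K features
├── vocab/          # vocabulary files
└── VSE/            # pretrained VSE checkpoints (e.g., ESA)
```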

### Step 3: Run Training & Evaluation

**Evaluation:**
Depending on the text features you are using, open the corresponding script (`at/lib/test.py` for BiGRU, `at_bert/lib/test.py` for BERT) and modify `RUN_PATH`:

```bash
# For the MSCOCO 1K 5-fold splits, first generate the folds:
python scripts/make_coco_1k_folds.py

# Run testing (ensure MODEL_PATH is set to the correct VSE weights):
PYTHONPATH=. python -m lib.test
```
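
For orientation, a hypothetical edit at the top of the test script might look like the following; `RUN_PATH` and `MODEL_PATH` are the names referenced above, while the paths themselves are placeholders, not files shipped with the repository:

```python
# Hypothetical values inside at/lib/test.py (BiGRU) or at_bert/lib/test.py (BERT).
RUN_PATH = "logs/f30k_meet/checkpoint.pth"   # MEET checkpoint to evaluate (placeholder)
MODEL_PATH = "data/VSE/esa_f30k.pth"         # pretrained VSE weights, e.g. ESA (placeholder)
```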

**Training from Scratch:**
Make sure to specify the dataset name (`coco_precomp` or `f30k_precomp`) after the `--data_name` flag:

```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name <dataset_name>
```
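
Given the method highlight's mention of deep hashing and quantization, `--M` and `--K` plausibly follow the usual product-quantization convention (M sub-codebooks of K codewords each) with `--H` a code length; this reading is an assumption, not documented behavior. A minimal PyTorch sketch of product quantization, for intuition only:

```python
import torch

def product_quantize(x, codebooks):
    """Illustrative product quantization, NOT the MEET implementation.

    x: (B, D) embeddings; codebooks: (M, K, D // M) learned sub-codebooks.
    Returns (B, M) integer codes and the (B, D) quantized reconstruction.
    """
    M, K, d = codebooks.shape
    B = x.size(0)
    sub = x.view(B, M, d)                                  # split into M sub-vectors
    # Squared distance from each sub-vector to every codeword: (B, M, K).
    dist = ((sub.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)
    codes = dist.argmin(-1)                                # nearest codeword per sub-space
    quantized = codebooks[torch.arange(M), codes].view(B, -1)
    return codes, quantized

# Example with M=8 codebooks of K=8 codewords over 256-dim embeddings.
codes, xq = product_quantize(torch.randn(4, 256), torch.randn(8, 8, 32))
print(codes.shape, xq.shape)  # torch.Size([4, 8]) torch.Size([4, 256])
```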

---

## Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- Full evaluation requires access to the original source datasets (MSCOCO, Flickr30K).
- Although the framework is designed for redundancy mitigation, performance may still fluctuate under extreme domain shifts not covered by the training distribution.

---
## Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [HREM](https://github.com/crossmodalgroup/hrem) open-source community for strong baselines and tooling, and to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
```