license: apache-2.0
tags:
  - composed-image-retrieval
  - vision-language
  - multimodal
  - disentanglement
  - pytorch

PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval

Zhiheng Fu1  Zixu Li1  Zhiwei Chen1  Chunxiao Wang3  Xuemeng Song2  Yupeng Hu1✉  Liqiang Nie4

1School of Software, Shandong University  2School of Computer Science and Technology, Shandong University
3Qilu University of Technology (Shandong Academy of Sciences)  4Harbin Institute of Technology (Shenzhen)

This repository hosts the official pre-trained model weights for PAIR, a framework for Composed Image Retrieval (CIR) based on complementarity-guided disentanglement.

🔗 Paper: [Accepted by ICASSP 2025] 🔗 GitHub Repository: ZhihFu/PAIR 🔗 Project Website: PAIR Webpage


📌 Model Information

1. Model Name

PAIR (Complementarity-guided Disentanglement for Composed Image Retrieval) Checkpoints.

2. Task Type & Applicable Tasks

  • Task Type: Composed Image Retrieval (CIR) / Vision-Language / Multimodal Alignment
  • Applicable Tasks: Retrieving target images based on a reference image combined with a relative text modification.
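
To make the task concrete, here is a minimal, self-contained sketch of how a composed query (reference image plus modification text) can be scored against a gallery of candidate images. It is not PAIR's method: the element-wise sum in `compose_query` is a placeholder fusion chosen only for illustration, and the three-dimensional embeddings, file names, and function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def compose_query(ref_image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Fuse the reference-image and modification-text embeddings.

    ASSUMPTION: a plain element-wise sum stands in for the model's fusion;
    PAIR's complementarity-guided disentanglement is NOT implemented here.
    """
    fused = [a + b for a, b in zip(ref_image_emb, text_emb)]
    return l2_normalize(fused)


def rank_candidates(query: Sequence[float], gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery names sorted by cosine similarity to the query, best first."""
    scores = {
        name: sum(q * c for q, c in zip(query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Toy embeddings (hypothetical): axis 0 = "dress", axis 1 = "red", axis 2 = "shoe".
reference = [1.0, 0.0, 0.0]      # reference image: a dress
modification = [0.0, 1.0, 0.0]   # relative text: "make it red"
gallery = {
    "same_dress.jpg": [1.0, 0.0, 0.0],
    "red_dress.jpg": [1.0, 1.0, 0.0],
    "unrelated_shoe.jpg": [0.0, 0.0, 1.0],
}

ranking = rank_candidates(compose_query(reference, modification), gallery)
print(ranking)  # ['red_dress.jpg', 'same_dress.jpg', 'unrelated_shoe.jpg']
```

The target image ranks first because it matches both the reference content and the requested change, while the unmodified reference image scores lower.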

3. Project Introduction

Existing methods for Composed Image Retrieval (CIR) often suffer from semantic entanglement between multimodal queries and target images.

PAIR addresses this limitation by exploring the inherent relationships between these modalities. Guided by their complementarity, it disentangles the visual and textual representations, which yields more precise multimodal alignment and improves retrieval performance.

4. Training Data Source

The pre-trained checkpoints are primarily trained and evaluated on three standard CIR datasets:

  • CIRR (Open Domain)
  • FashionIQ (Fashion Domain)
  • Shoes (Fashion Domain)

🚀 Usage & Basic Inference

These weights are designed to be used directly with the official PAIR GitHub repository.

Step 1: Prepare the Environment

Clone the GitHub repository and install the dependencies (tested with Python 3.8.10 and PyTorch 2.0.0):

git clone https://github.com/ZhihFu/PAIR
cd PAIR
conda create -n pair python=3.8.10 -y
conda activate pair

# Install PyTorch
pip install torch==2.0.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core dependencies
pip install -r requirements.txt
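
After installation, a quick check along the following lines can confirm that the active environment matches the versions quoted above. This is only a sketch: the helper names are hypothetical, and the comparison drops any local build suffix (such as `+cu118`) that a CUDA wheel may append to the PyTorch version string.

```python
import sys
from importlib import metadata
from typing import Dict, Optional

# Versions quoted in this README.
EXPECTED = {"python": "3.8.10", "torch": "2.0.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(found: Optional[str], expected: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu118'."""
    if found is None:
        return False
    return found.split("+")[0] == expected


def environment_report() -> Dict[str, bool]:
    """Report whether Python and PyTorch match the expected versions."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    return {
        "python": matches(python_version, EXPECTED["python"]),
        "torch": matches(installed_version("torch"), EXPECTED["torch"]),
    }


print(environment_report())
```

A `False` entry does not necessarily mean the code will fail; it only signals that the environment differs from the one the weights were tested with.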

Step 2: Download Model Weights & Data

Download the checkpoint files (e.g., PAIR_CIRR.pt) from this Hugging Face repository and place them in the checkpoints/ directory of your cloned GitHub repo. Ensure you also download and structure the dataset images as specified in the GitHub repo's Data Preparation section.
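
As an alternative to downloading through the browser, the checkpoint could be fetched programmatically. The sketch below assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`) and uses its `hf_hub_download` function. The helper names are hypothetical, and `repo_id` is left as a required argument because this README does not spell out the repository id.

```python
from pathlib import Path


def checkpoint_destination(repo_root: str, filename: str = "PAIR_CIRR.pt") -> Path:
    """Path where the checkpoint should end up inside the cloned GitHub repo."""
    return Path(repo_root) / "checkpoints" / filename


def download_checkpoint(repo_id: str, repo_root: str, filename: str = "PAIR_CIRR.pt") -> Path:
    """Download one checkpoint file into <repo_root>/checkpoints/.

    ASSUMPTION: `repo_id` is the "<namespace>/<name>" id of this Hugging Face
    repository, supplied by the caller.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    target_dir = checkpoint_destination(repo_root, filename).parent
    target_dir.mkdir(parents=True, exist_ok=True)
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=str(target_dir),
    )
    return Path(local_path)


print(checkpoint_destination("PAIR").as_posix())  # PAIR/checkpoints/PAIR_CIRR.pt
```

Calling `download_checkpoint("<namespace>/<name>", "PAIR")` with this repository's id should place `PAIR_CIRR.pt` where the test command in Step 3 expects it.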

Step 3: Run Testing / Inference

To evaluate the model or generate prediction files using the downloaded checkpoint (for example, on the CIRR dataset), run:

python src/cirr_test_submission.py checkpoints/PAIR_CIRR.pt

To train from scratch, please refer to the train.py instructions in the official repository.


⚠️ Limitations & Notes

Disclaimer: This framework and its pre-trained weights are intended for academic research and multimodal evaluation.

  • The model requires access to the original source datasets (CIRR, FashionIQ, Shoes) for full evaluation. Users must comply with the original licenses of those respective datasets.

πŸ“β­οΈ Citation

If you find our work or these model weights useful in your research, please consider leaving a Star ⭐️ on our GitHub repo and citing our paper:

@article{PAIR2025,
    title={PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval},
    author={Fu, Zhiheng and Li, Zixu and Chen, Zhiwei and Wang, Chunxiao and Song, Xuemeng and Hu, Yupeng and Nie, Liqiang},
    journal={IEEE},
    year={2025}
}