---
license: apache-2.0
tags:
  - composed-image-retrieval
  - vision-language
  - multimodal
  - disentanglement
  - pytorch
---

# PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval

Zhiheng Fu¹  Zixu Li¹  Zhiwei Chen¹  Chunxiao Wang³  Xuemeng Song²  Yupeng Hu¹✉  Liqiang Nie⁴

¹School of Software, Shandong University  ²School of Computer Science and Technology, Shandong University
³Qilu University of Technology (Shandong Academy of Sciences)  ⁴Harbin Institute of Technology (Shenzhen)

These are the official pre-trained model weights for **PAIR**, a novel framework for Composed Image Retrieval (CIR) via complementarity-guided disentanglement.

🔗 **Paper:** Accepted by ICASSP 2025
🔗 **GitHub Repository:** [ZhihFu/PAIR](https://github.com/ZhihFu/PAIR)
🔗 **Project Website:** [PAIR Webpage](https://zhihfu.github.io/PAIR.github.io/)

---

## 📌 Model Information

### 1. Model Name

**PAIR** (Complementarity-guided Disentanglement for Composed Image Retrieval) checkpoints.

### 2. Task Type & Applicable Tasks

- **Task Type:** Composed Image Retrieval (CIR) / Vision-Language / Multimodal Alignment
- **Applicable Tasks:** Retrieving a target image from a gallery given a reference image combined with a relative text modification (a toy scoring sketch is provided in the appendix below).

### 3. Project Introduction

Existing CIR methods often suffer from semantic entanglement between the multimodal query and the target image. **PAIR** addresses this limitation by exploiting the inherent relationship between the two query modalities: guided by their complementarity, it **disentangles the visual and textual representations**, yielding more precise multimodal alignment and consistently stronger retrieval performance.

### 4. Training Data Source

The pre-trained checkpoints are trained and evaluated on three standard CIR datasets:

- **CIRR** (open domain)
- **FashionIQ** (fashion domain)
- **Shoes** (fashion domain)

---

## 🚀 Usage & Basic Inference

These weights are designed to be used directly with the official PAIR GitHub repository.

### Step 1: Prepare the Environment

Clone the GitHub repository and install the dependencies (evaluated with Python 3.8.10 and PyTorch 2.0.0):

```bash
git clone https://github.com/ZhihFu/PAIR
cd PAIR
conda create -n pair python=3.8.10 -y
conda activate pair

# Install PyTorch (CUDA 11.8 build)
pip install torch==2.0.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install core dependencies
pip install -r requirements.txt
```

### Step 2: Download Model Weights & Data

Download the checkpoint files (e.g., `PAIR_CIRR.pt`) from this Hugging Face repository and place them in the `checkpoints/` directory of your cloned GitHub repo (the appendix below sketches a programmatic download). Also download and structure the dataset images as specified in the [GitHub repo's Data Preparation section](https://github.com/ZhihFu/PAIR).

### Step 3: Run Testing / Inference

To evaluate the model or generate prediction files with a downloaded checkpoint (for example, on the CIRR dataset), run:

```bash
python src/cirr_test_submission.py checkpoints/PAIR_CIRR.pt
```

To train from scratch, refer to the `train.py` instructions in the official repository. (A minimal checkpoint-inspection sketch is also included in the appendix below.)

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research and multimodal evaluation**.

- Full evaluation requires access to the original source datasets (CIRR, FashionIQ, Shoes). Users must comply with the licenses of those respective datasets.

---

## 📝⭐️ Citation

If you find our work or these model weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repo and citing our paper:

```bibtex
@inproceedings{PAIR2025,
  title     = {PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval},
  author    = {Fu, Zhiheng and Li, Zixu and Chen, Zhiwei and Wang, Chunxiao and Song, Xuemeng and Hu, Yupeng and Nie, Liqiang},
  booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2025}
}
```
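---

## 🧩 Appendix: Minimal Python Sketches

### A.1 Programmatic checkpoint download

As an alternative to downloading `PAIR_CIRR.pt` by hand (Step 2 above), the snippet below fetches it with the official `huggingface_hub` client. The `repo_id` is an assumed placeholder; substitute the actual id of this model repository as shown in the page header.

```python
from huggingface_hub import hf_hub_download

# NOTE: "ZhihFu/PAIR" is an assumed repo id -- replace it with the id of
# this Hugging Face repository as shown in the page header.
ckpt_path = hf_hub_download(
    repo_id="ZhihFu/PAIR",
    filename="PAIR_CIRR.pt",     # checkpoint name used in Step 3 above
    local_dir="checkpoints",     # matches the directory the repo expects
)
print(f"Checkpoint saved to: {ckpt_path}")
```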
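### A.2 Inspecting a downloaded checkpoint

Before wiring a checkpoint into your own evaluation code, it can help to confirm what the `.pt` file actually contains. The sketch below assumes nothing about PAIR's internal layout; it only prints the top-level structure so you can see whether the weights sit in a raw `state_dict` or under a wrapper key.

```python
import torch

# Load on CPU so this works without a GPU.
ckpt = torch.load("checkpoints/PAIR_CIRR.pt", map_location="cpu")

if isinstance(ckpt, dict):
    # Typical layouts are either a raw state_dict (parameter-name keys)
    # or a wrapper dict with keys like "state_dict" / "model" / "epoch".
    print("Top-level keys:", list(ckpt.keys())[:10])
else:
    print("Loaded object of type:", type(ckpt))
```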
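### A.3 Toy composed-query scoring

To make the CIR task in Section 2 concrete: a composed query (reference image plus relative text) is embedded and ranked against a gallery of candidate target images. The additive fusion below is a deliberately naive stand-in for PAIR's complementarity-guided disentanglement (defined in the paper), and all tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

dim, n_gallery = 512, 1000
ref_img_feat = torch.randn(dim)              # reference-image embedding
mod_text_feat = torch.randn(dim)             # relative-text embedding
gallery_feats = torch.randn(n_gallery, dim)  # candidate target embeddings

# Naive additive fusion of the two query modalities -- NOT PAIR's method,
# just an illustration of the composed-query -> gallery ranking setup.
query = F.normalize(ref_img_feat + mod_text_feat, dim=-1)
gallery = F.normalize(gallery_feats, dim=-1)

# Rank gallery images by cosine similarity to the composed query.
scores = gallery @ query
print("Top-5 retrieved indices:", scores.topk(5).indices.tolist())
```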