---
license: apache-2.0
task_categories:
- image-retrieval
- vision-language-navigation
tags:
- composed-image-retrieval
- multimodal-retrieval
- pytorch
- aaai-2025
---

# (AAAI 2025) ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval (Model Weights)

Zixu Li¹, Zhiwei Chen¹, Haokun Wen²,³, Zhiheng Fu¹, Yupeng Hu¹✉, Weili Guan²

¹School of Software, Shandong University
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
³School of Data Science, City University of Hong Kong
✉Corresponding author

This repository hosts the official pre-trained model weights for **ENCODER**, a novel network designed to explicitly mine visual entities and modification actions, and securely bind implicit modification relations in Composed Image Retrieval (CIR).

---

## 📌 Model Information

### 1. Model Name

**ENCODER** (Entity miNing and modifiCation relation binDing nEtwoRk) Checkpoints.

### 2. Task Type & Applicable Tasks

- **Task Type:** Composed Image Retrieval (CIR).
- **Applicable Tasks:** Retrieving a target image based on a reference image and a corresponding modification text. The model excels at capturing fine-grained modification relations through multimodal semantic alignment.

### 3. Project Introduction

Existing CIR approaches often struggle with the modification relation between visual entities and modification actions due to irrelevant factor perturbation, vague semantic boundaries, and implicit modification relations. **ENCODER** introduces three innovative modules to achieve precise multimodal semantic alignment:

- 🔍 **Latent Factor Filter (LFF):** Filters out irrelevant visual and textual factors.
- 🔗 **Entity-Action Binding (EAB):** Employs modality-shared Learnable Relation Queries (LRQ) to mine visual entities and actions, learning their implicit relations to bind them effectively.
- 🧩 **Multi-scale Composition (MSC):** Performs multi-scale feature composition to precisely push the retrieved feature closer to the target image.

### 4. Training Data Source & Hosted Weights

The models were trained across four widely-used CIR datasets: **FashionIQ**, **Shoes**, **Fashion200K**, and **CIRR**. This Hugging Face repository provides the pre-trained `.pt` checkpoint files for each corresponding dataset:

* 📄 `cirr.pt`: Checkpoint trained on the open-domain CIRR dataset.
* 📄 `fashion200k.pt`: Checkpoint trained on the Fashion200K dataset.
* 📄 `fashioniq.pt`: Checkpoint trained on the FashionIQ dataset.
* 📄 `shoes.pt`: Checkpoint trained on the Shoes dataset.

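
The checkpoint files listed above can also be fetched programmatically. The following is a minimal sketch and not part of the official repository: the helper names `checkpoint_filename` and `download_checkpoint` are hypothetical, the `repo_id` argument must be set to this repository's actual id (it is not stated here, so no default is given), and the `huggingface_hub` package is assumed to be installed. Files are saved into a local `checkpoints/` directory by default.

```python
from pathlib import Path

# Checkpoint file names exactly as listed in this model card, keyed by dataset.
CHECKPOINTS = {
    "cirr": "cirr.pt",
    "fashion200k": "fashion200k.pt",
    "fashioniq": "fashioniq.pt",
    "shoes": "shoes.pt",
}


def checkpoint_filename(dataset: str) -> str:
    """Return the hosted checkpoint file name for a dataset (case-insensitive)."""
    key = dataset.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"Unknown dataset {dataset!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


def download_checkpoint(dataset: str, repo_id: str, local_dir: str = "checkpoints") -> Path:
    """Download one checkpoint into `local_dir` and return its local path.

    `repo_id` is an assumption the caller must supply: the "<user>/<name>" id
    of the Hugging Face repository that hosts these weights.
    """
    # Imported lazily so the name lookup above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=checkpoint_filename(dataset),
        local_dir=local_dir,
    )
    return Path(local_path)


# Example call (hypothetical repo id, replace before running):
# path = download_checkpoint("fashioniq", repo_id="<user>/<repo-name>")
```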
---

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated with the official [ENCODER GitHub repository](https://github.com/Lee-zixu/ENCODER).

### Step 1: Prepare the Environment

Clone the GitHub repository and install dependencies:

```bash
git clone https://github.com/Lee-zixu/ENCODER.git
cd ENCODER
conda create -n encoder_env python=3.9
conda activate encoder_env
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

### Step 2: Download Model Weights

Download the `.pt` files you wish to evaluate from this Hugging Face repository and place them in a `checkpoints/` directory inside your cloned GitHub repo.

### Step 3: Run Evaluation

To evaluate a downloaded checkpoint on the validation set, run the evaluation script and point it to the weights:

```bash
python3 evaluation_model.py \
    --model_dir checkpoints/fashioniq.pt \
    --dataset fashioniq \
    --fashioniq_path "path/to/FashionIQ"
```

To generate the predictions file for uploading to the [CIRR Evaluation Server](https://cirr.cecs.anu.edu.au/), run:

```bash
python src/cirr_test_submission.py checkpoints/cirr.pt
```

---

## ⚠️ Limitations & Notes

- **Version Compatibility:** Different versions of `open_clip` can affect model performance. To reproduce the state-of-the-art results reported in the paper, use exactly the dependency versions specified in the `requirements.txt` file of the official repository.
- **State Dict Version:** The hosted weights are the updated `state_dict` version, provided for stable evaluation.

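
Because the hosted files are described as `state_dict` checkpoints, a quick sanity check before evaluation is to load one on the CPU and summarize its contents. The sketch below is illustrative only and is not taken from the official repository: the helper names `summarize_state_dict` and `load_state_dict_file` are hypothetical, and it assumes each `.pt` file stores a flat mapping from parameter names to tensors. If a file wraps the weights in another structure, the loader raises an error instead of guessing.

```python
import math
from collections import Counter
from collections.abc import Mapping
from typing import Any


def summarize_state_dict(state_dict: Mapping[str, Any]) -> dict:
    """Count tensors and parameters, grouped by top-level module prefix."""
    per_prefix = Counter()
    for name, tensor in state_dict.items():
        # torch.Size is a tuple subclass, so math.prod works on tensor.shape;
        # a scalar tensor has shape () and counts as one element.
        prefix = name.split(".", 1)[0]
        per_prefix[prefix] += math.prod(tensor.shape)
    return {
        "num_tensors": len(state_dict),
        "num_parameters": sum(per_prefix.values()),
        "per_prefix": dict(per_prefix),
    }


def load_state_dict_file(path: str) -> Mapping[str, Any]:
    """Load a checkpoint onto the CPU.

    Assumption: the file contains a plain state_dict (name -> tensor mapping).
    """
    # Imported lazily so the summary helper can be used without PyTorch.
    import torch

    obj = torch.load(path, map_location="cpu")
    if not isinstance(obj, Mapping):
        raise TypeError(f"Expected a state_dict mapping, got {type(obj).__name__}")
    return obj


# Example call (path follows the checkpoints/ layout from Step 2):
# summary = summarize_state_dict(load_state_dict_file("checkpoints/fashioniq.pt"))
# print(summary["num_tensors"], summary["num_parameters"])
```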
---

## 📝⭐️ Citation

If you find this code or our paper useful for your research, please consider leaving a **Star** ⭐️ on our GitHub repository and citing our AAAI 2025 paper:

```bibtex
@inproceedings{ENCODER,
  title={Encoder: Entity mining and modification relation binding for composed image retrieval},
  author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={5},
  pages={5101--5109},
  year={2025}
}
```