---
license: apache-2.0
task_categories:
  - image-retrieval
  - vision-language-navigation
tags:
  - composed-image-retrieval
  - multimodal-retrieval
  - pytorch
  - aaai-2025
---

# (AAAI 2025) ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval (Model Weights)

¹School of Software, Shandong University  
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)  
³School of Data Science, City University of Hong Kong  
✉ Corresponding author

AAAI 2025 Paper · Project Page · GitHub

This repository hosts the official pre-trained model weights for ENCODER, a novel network designed to explicitly mine visual entities and modification actions, and securely bind implicit modification relations in Composed Image Retrieval (CIR).


## 📌 Model Information

### 1. Model Name

ENCODER (Entity miNing and modifiCation relation binDing nEtwoRk) Checkpoints.

### 2. Task Type & Applicable Tasks

- Task Type: Composed Image Retrieval (CIR).
- Applicable Tasks: Retrieving a target image based on a reference image and a corresponding modification text. The model excels at capturing fine-grained modification relations through multimodal semantic alignment.
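As a toy illustration of the CIR setting (not of the ENCODER architecture itself), a composed query can be formed by fusing reference-image and text features and ranking gallery images by cosine similarity. The `compose` function below is a hypothetical placeholder for the model's learned composition:

```python
import numpy as np

def compose(ref_feat, text_feat):
    # Hypothetical fusion: element-wise sum, then L2-normalisation.
    fused = ref_feat + text_feat
    return fused / np.linalg.norm(fused)

def retrieve(ref_feat, text_feat, gallery):
    # gallery: (N, D) array of L2-normalised candidate image features.
    query = compose(ref_feat, text_feat)
    scores = gallery @ query           # cosine similarity per candidate
    return np.argsort(-scores)         # candidate indices, best match first
```

In a real pipeline, `compose` is replaced by the trained network and the gallery features come from the image encoder.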

### 3. Project Introduction

Existing CIR approaches often struggle with the modification relation between visual entities and modification actions due to irrelevant factor perturbation, vague semantic boundaries, and implicit modification relations.

ENCODER introduces three innovative modules to achieve precise multimodal semantic alignment:

- 🔍 Latent Factor Filter (LFF): Filters out irrelevant visual and textual factors.
- 🔗 Entity-Action Binding (EAB): Employs modality-shared Learnable Relation Queries (LRQ) to mine visual entities and actions, learning their implicit relations to bind them effectively.
- 🧩 Multi-scale Composition (MSC): Performs multi-scale feature composition to precisely push the retrieved feature closer to the target image.
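A purely illustrative numpy sketch of how these three stages could chain together; every shape and operation here is invented for illustration, so consult the paper for the actual design:

```python
import numpy as np

def latent_factor_filter(tokens, keep_mask):
    # LFF sketch: suppress feature dimensions deemed irrelevant.
    return tokens * keep_mask

def entity_action_binding(vis_tokens, txt_tokens, queries):
    # EAB sketch: modality-shared queries attend over visual and textual
    # tokens, binding entity/action evidence into per-query relation vectors.
    def attend(q, tokens):
        w = np.exp(q @ tokens.T)
        w /= w.sum(axis=-1, keepdims=True)  # softmax attention weights
        return w @ tokens
    return attend(queries, vis_tokens) + attend(queries, txt_tokens)

def multi_scale_composition(relations):
    # MSC sketch: fuse a coarse (mean) and a fine (max) pooling of the
    # relation vectors into one retrieval feature.
    return relations.mean(axis=0) + relations.max(axis=0)
```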

### 4. Training Data Source & Hosted Weights

The models were trained across four widely-used CIR datasets: FashionIQ, Shoes, Fashion200K, and CIRR. This Hugging Face repository provides the pre-trained .pt checkpoint files for each corresponding dataset:

- 📄 `cirr.pt`: Checkpoint trained on the open-domain CIRR dataset.
- 📄 `fashion200k.pt`: Checkpoint trained on the Fashion200K dataset.
- 📄 `fashioniq.pt`: Checkpoint trained on the FashionIQ dataset.
- 📄 `shoes.pt`: Checkpoint trained on the Shoes dataset.

## 🚀 Usage & Basic Inference

These weights are designed to be evaluated seamlessly using the official ENCODER GitHub repository.

### Step 1: Prepare the Environment

Clone the GitHub repository and install dependencies:

```shell
git clone https://github.com/Lee-zixu/ENCODER.git
cd ENCODER
conda create -n encoder_env python=3.9
conda activate encoder_env
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

### Step 2: Download Model Weights

Download the specific .pt files you wish to evaluate from this Hugging Face repository. Place them into a checkpoints/ directory within your cloned GitHub repo.
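One convenient way to fetch the files is the `huggingface_hub` client. The repo id below is assumed from this model card's name; adjust it if yours differs:

```python
def checkpoint_filename(dataset):
    # Map a dataset name to its hosted checkpoint file.
    checkpoints = {
        "cirr": "cirr.pt",
        "fashion200k": "fashion200k.pt",
        "fashioniq": "fashioniq.pt",
        "shoes": "shoes.pt",
    }
    return checkpoints[dataset]

def download_checkpoint(dataset, local_dir="checkpoints"):
    # Lazy import so the mapping above works without the hub client installed.
    from huggingface_hub import hf_hub_download
    repo_id = "Lee-zixu/AAAI25-ENCODER"  # assumed repo id
    return hf_hub_download(repo_id, checkpoint_filename(dataset),
                           local_dir=local_dir)
```

For example, `download_checkpoint("fashioniq")` would place `fashioniq.pt` under `checkpoints/`.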

### Step 3: Run Evaluation

To evaluate a checkpoint on the validation set, use the `evaluate_model.py` script and point it to the downloaded weights:

```shell
python3 evaluate_model.py \
    --model_dir checkpoints/fashioniq.pt \
    --dataset fashioniq \
    --fashioniq_path "path/to/FashionIQ"
```

To generate the predictions file for uploading to the CIRR Evaluation Server, run:

```shell
python src/cirr_test_submission.py checkpoints/cirr.pt
```

## ⚠️ Limitations & Notes

- Version Compatibility: Different versions of `open_clip` can affect model performance. To reproduce the state-of-the-art results reported in the paper, strictly follow the environment dependencies specified in the official repository's `requirements.txt`.
- State Dict Version: These hosted weights are the updated "state_dict" version for stable evaluation.
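Because the files store a plain `state_dict`, a generic loading helper along these lines should work once the model class from the official repo is instantiated; the nesting check is a defensive assumption, not something the checkpoints are known to need:

```python
import torch

def load_checkpoint(model, path):
    # Load weights on CPU so the helper works without a GPU.
    state = torch.load(path, map_location="cpu")
    # Defensive: some checkpoints nest weights under a "state_dict" key.
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # strict=False reports, rather than raises on, any key mismatches.
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected
```

Empty `missing`/`unexpected` lists indicate the checkpoint matched the model definition exactly.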

πŸ“β­οΈ Citation

If you find this code or our paper useful for your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our AAAI 2025 paper:

```bibtex
@inproceedings{ENCODER,
  title={Encoder: Entity mining and modification relation binding for composed image retrieval},
  author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={5},
  pages={5101--5109},
  year={2025}
}
```