	---
	license: apache-2.0
	task_categories:
	- image-retrieval
	- vision-language-navigation
	tags:
	- composed-image-retrieval
	- multimodal-retrieval
	- pytorch
	- aaai-2025
	---

	<a id="top"></a>
	<div align="center">
	<h1>(AAAI 2025) ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval (Model Weights)</h1>
	<div>
	<a target="_blank" href="https://lee-zixu.github.io/">Zixu Li</a><sup>1</sup>,
	<a target="_blank" href="https://zivchen-ty.github.io/">Zhiwei Chen</a><sup>1</sup>,
	<a target="_blank" href="https://haokunwen.github.io">Haokun Wen</a><sup>2,3</sup>,
	<a target="_blank" href="https://zhihfu.github.io/">Zhiheng Fu</a><sup>1</sup>,
	<a target="_blank" href="https://faculty.sdu.edu.cn/huyupeng1/zh_CN/index.htm">Yupeng Hu</a><sup>1&#9993;</sup>,
	<a target="_blank" href="https://homepage.hit.edu.cn/guanweili">Weili Guan</a><sup>2</sup>
	</div>
	<sup>1</sup>School of Software, Shandong University <br />
	<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) <br />
	<sup>3</sup>School of Data Science, City University of Hong Kong
	<br />
	<sup>&#9993;</sup>Corresponding author
	<br/>

	<p>
	<a href="https://aaai.org/Conferences/AAAI-25/"><img src="https://img.shields.io/badge/AAAI-2025-blue.svg?style=flat-square" alt="AAAI 2025"></a>
	<a href="https://ojs.aaai.org/index.php/AAAI/article/view/32541"><img alt='Paper' src="https://img.shields.io/badge/Paper-AAAI.32541-green.svg"></a>
	<a href="https://sdu-l.github.io/ENCODER.github.io/"><img alt='Project Page' src="https://img.shields.io/badge/Website-orange"></a>
	<a href="https://github.com/Lee-zixu/ENCODER"><img alt='GitHub' src="https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github"></a>
	</p>
	</div>

	This repository hosts the official pre-trained model weights for ENCODER, a network for Composed Image Retrieval (CIR) that explicitly mines visual entities and modification actions and binds the implicit modification relations between them.

	---

	## 📌 Model Information

	### 1. Model Name
	ENCODER (Entity miNing and modifiCation relation binDing nEtwoRk) Checkpoints.

	### 2. Task Type & Applicable Tasks
	- Task Type: Composed Image Retrieval (CIR).
	- Applicable Tasks: Retrieving a target image based on a reference image and a corresponding modification text. The model excels at capturing fine-grained modification relations through multimodal semantic alignment.
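
As a rough illustration of the retrieval step, the sketch below ranks candidate images by cosine similarity to a composed query feature. It is a generic CIR ranking example, not ENCODER's actual composition or scoring code; the function names and the toy 3-dimensional features are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as matching nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query: Sequence[float], gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery image ids sorted from most to least similar to the query."""
    scores = {image_id: cosine_similarity(query, feat) for image_id, feat in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical composed query feature (reference image + modification text).
    composed_query = [0.9, 0.1, 0.4]
    # Hypothetical gallery features keyed by image id.
    gallery = {
        "img_a": [0.1, 0.9, 0.0],
        "img_b": [0.8, 0.2, 0.5],
        "img_c": [-0.5, 0.3, 0.1],
    }
    print(rank_candidates(composed_query, gallery))
```

In a real pipeline the query and gallery features would come from the model rather than hand-written lists; only the ranking logic is shown here.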

	### 3. Project Introduction
	Existing CIR approaches often struggle with the modification relation between visual entities and modification actions due to irrelevant factor perturbation, vague semantic boundaries, and implicit modification relations.

	ENCODER introduces three innovative modules to achieve precise multimodal semantic alignment:
	- 🔍 Latent Factor Filter (LFF): Filters out irrelevant visual and textual factors.
	- 🔗 Entity-Action Binding (EAB): Employs modality-shared Learnable Relation Queries (LRQ) to mine visual entities and actions, learning their implicit relations to bind them effectively.
	- 🧩 Multi-scale Composition (MSC): Performs multi-scale feature composition to precisely push the retrieved feature closer to the target image.

	### 4. Training Data Source & Hosted Weights
	A separate model was trained on each of four widely used CIR datasets: FashionIQ, Shoes, Fashion200K, and CIRR. This Hugging Face repository provides one pre-trained `.pt` checkpoint file per dataset:

	* 📄 `cirr.pt`: Checkpoint trained on the open-domain CIRR dataset.
	* 📄 `fashion200k.pt`: Checkpoint trained on the Fashion200K dataset.
	* 📄 `fashioniq.pt`: Checkpoint trained on the FashionIQ dataset.
	* 📄 `shoes.pt`: Checkpoint trained on the Shoes dataset.
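
To check what a downloaded checkpoint contains, a minimal sketch such as the following may help. It assumes each `.pt` file stores a plain PyTorch `state_dict` (a mapping from parameter names to tensors), as the notes in this card suggest; the helper names are hypothetical, and if a file wraps the weights in an outer dictionary the loading step would need adjusting.

```python
from typing import Any, List, Mapping, Tuple


def load_checkpoint(path: str) -> Mapping[str, Any]:
    """Load a checkpoint onto the CPU (requires PyTorch to be installed)."""
    import torch  # imported lazily so the summary helper works without torch

    # weights_only=True restricts unpickling to tensors and plain containers.
    return torch.load(path, map_location="cpu", weights_only=True)


def summarize_state_dict(
    state_dict: Mapping[str, Any], limit: int = 5
) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (name, shape) pairs for the first `limit` tensor-like entries."""
    summary = []
    for name, value in state_dict.items():
        shape = getattr(value, "shape", None)
        if shape is None:
            continue  # skip non-tensor entries such as counters
        summary.append((name, tuple(shape)))
        if len(summary) >= limit:
            break
    return summary


# Example (assumes the file has been placed at this hypothetical path):
#   for name, shape in summarize_state_dict(load_checkpoint("checkpoints/cirr.pt")):
#       print(name, shape)
```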

	---

	## 🚀 Usage & Basic Inference

	These weights are intended to be evaluated with the official [ENCODER GitHub repository](https://github.com/Lee-zixu/ENCODER).

	### Step 1: Prepare the Environment
	Clone the GitHub repository and install dependencies:
	```bash
	git clone https://github.com/Lee-zixu/ENCODER.git
	cd ENCODER
	conda create -n encoder_env python=3.9
	conda activate encoder_env
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
	pip install -r requirements.txt
	```

	### Step 2: Download Model Weights
	Download the specific `.pt` files you wish to evaluate from this Hugging Face repository. Place them into a `checkpoints/` directory within your cloned GitHub repo.
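
The download can also be scripted. The sketch below uses `hf_hub_download` from the `huggingface_hub` package and the file names listed in this card. The helper names are hypothetical, and `repo_id` is left as a parameter because this card does not state the repository id; pass the id of this Hugging Face repository.

```python
from pathlib import Path

# Checkpoint file names as listed in this model card.
CHECKPOINT_FILES = {
    "cirr": "cirr.pt",
    "fashion200k": "fashion200k.pt",
    "fashioniq": "fashioniq.pt",
    "shoes": "shoes.pt",
}


def checkpoint_filename(dataset: str) -> str:
    """Map a dataset name (case-insensitive) to its checkpoint file name."""
    key = dataset.lower()
    if key not in CHECKPOINT_FILES:
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of {sorted(CHECKPOINT_FILES)}"
        )
    return CHECKPOINT_FILES[key]


def download_checkpoint(repo_id: str, dataset: str, target_dir: str = "checkpoints") -> Path:
    """Download one checkpoint into `target_dir` and return its local path."""
    # Requires `pip install huggingface_hub`; imported lazily on purpose.
    from huggingface_hub import hf_hub_download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=checkpoint_filename(dataset),
        local_dir=target_dir,
    )
    return Path(local_path)


# Example, run from the root of the cloned GitHub repo
# (replace the placeholder with this repository's id):
#   download_checkpoint("<user>/<repo>", "fashioniq")
```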

	### Step 3: Run Evaluation
	To evaluate a downloaded checkpoint on the validation set, run the evaluation script and point `--model_dir` at the weights:
	```bash
	python3 evaluation_model.py \
	--model_dir checkpoints/fashioniq.pt \
	--dataset fashioniq \
	--fashioniq_path "path/to/FashionIQ"
	```

	To generate the predictions file for uploading to the [CIRR Evaluation Server](https://cirr.cecs.anu.edu.au/), run:
	```bash
	python src/cirr_test_submission.py checkpoints/cirr.pt
	```

	---

	## ⚠️ Limitations & Notes

	- Version Compatibility: Different versions of `open_clip` can affect model performance. To reproduce the results reported in the paper, use the exact dependency versions specified in the `requirements.txt` file of the official repository.
	- State Dict Version: The hosted weights are the updated `state_dict` version, provided for stable evaluation.
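
One way to catch version drift before evaluating is to compare the installed packages against the pins in `requirements.txt`. The sketch below is a minimal, hypothetical helper: it assumes the file uses simple `name==version` pins (lines with other specifiers, extras, or environment markers are ignored) and compares versions as plain strings.

```python
import re
from importlib import metadata
from typing import Dict, List, Tuple


def parse_pinned_requirements(text: str) -> Dict[str, str]:
    """Extract simple `name==version` pins from requirements.txt content."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.fullmatch(r"([A-Za-z0-9_.\-]+)\s*==\s*([^\s;]+)", line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def find_mismatches(pins: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (package, pinned, installed) for every pin that is not satisfied."""
    mismatches = []
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches


# Example, run from the root of the cloned repository:
#   with open("requirements.txt") as fh:
#       for name, pinned, installed in find_mismatches(parse_pinned_requirements(fh.read())):
#           print(f"{name}: pinned {pinned}, installed {installed}")
```

Because the comparison is a string match, a build with a local suffix (for example a CUDA-specific wheel) is reported as a mismatch even when the base version agrees.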

	---

	## 📝⭐️ Citation

	If you find this code or our paper useful for your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our AAAI 2025 paper:

	```bibtex
	@inproceedings{ENCODER,
	title={Encoder: Entity mining and modification relation binding for composed image retrieval},
	author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
	booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
	volume={39},
	number={5},
	pages={5101--5109},
	year={2025}
	}
	```