license: apache-2.0
tags:
  - composed-image-retrieval
  - vision-language
  - multimodal
  - pytorch

⚓ TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Zixu Li¹ Yupeng Hu¹✉ Zhiheng Fu¹ Zhiwei Chen¹ Yongqi Li² Liqiang Nie³

¹Shandong University ²Hong Kong Polytechnic University ³Harbin Institute of Technology (Shenzhen)

These are the official model and data resources for TEMA (Text-oriented Entity Mapping Architecture), the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios.

🔗 Paper: [Accepted by ACL 2026]
🔗 GitHub Repository: lee-zixu/ACL26-TEMA

📌 Model Information

1. Model Name

TEMA (Text-oriented Entity Mapping Architecture) Checkpoints.

2. Task Type & Applicable Tasks

Task Type: Composed Image Retrieval (CIR) / Vision-Language / Multimodal Alignment
Applicable Tasks: Retrieving target images based on a reference image and complex Multi-Modification Texts (MMT), seamlessly accommodating both simple and complex modifications.
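
To make the retrieval setting concrete, here is a minimal sketch of how a composed query can be ranked against a gallery of candidate images. The embeddings are toy values, the `rank_gallery` helper is hypothetical, and the element-wise sum is only a stand-in for composition; it is not TEMA's fusion method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query_emb, gallery):
    """Return gallery image ids sorted by decreasing similarity to the query."""
    scored = [(cosine(query_emb, emb), image_id) for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]

# Toy 3-d embeddings (hypothetical values, not produced by TEMA).
reference_emb = [1.0, 0.0, 0.0]     # reference image
modification_emb = [0.0, 1.0, 0.0]  # multi-modification text

# Placeholder composition: element-wise sum (an assumption for illustration).
query_emb = [r + m for r, m in zip(reference_emb, modification_emb)]

gallery = {
    "target": [0.7, 0.7, 0.0],
    "reference_lookalike": [1.0, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
ranking = rank_gallery(query_emb, gallery)
print(ranking)
```

In this toy setup, the candidate that reflects both the reference image and the modification ranks above a candidate that only resembles the reference.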

3. Project Introduction

Prevailing CIR setups rely on simple modification texts, inducing two critical limitations in practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.

TEMA brings CIR closer to real-world use cases by introducing:

🧠 MMT Parsing Assistant (PA): Utilizes an LLM-based text summarizer and a Consistency Detector during training to enhance the exposure and coverage of modified entities.
🔗 MMT-oriented Entity Mapping (EM): Introduces learnable queries to consolidate multiple clauses of the same entity on the text side and align them with corresponding visual entities on the image side.
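
The idea of queries that consolidate several clauses about the same entity can be illustrated with attention pooling, a common way to use learnable queries. The sketch below is pure Python with fixed toy vectors; in a real model the queries would be trained parameters inside a PyTorch module. The `query_pool` helper, the dimensions, and the values are assumptions, and this is not the EM module's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_pool(queries, tokens):
    """Each query attends over all token vectors and returns a weighted average.

    queries: list of d-dim vectors (learned parameters in a real model).
    tokens:  list of d-dim embeddings from the modification text.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    pooled, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / scale for t in tokens]
        attn = softmax(scores)
        pooled.append([sum(a * t[j] for a, t in zip(attn, tokens)) for j in range(dim)])
        weights.append(attn)
    return pooled, weights

# Hypothetical 2-d embeddings: two clauses about one entity, one clause about another.
tokens = [[4.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
# Two fixed queries standing in for learned ones, one per entity.
queries = [[2.0, 0.0], [0.0, 2.0]]

entity_slots, attn_weights = query_pool(queries, tokens)
for slot, attn in zip(entity_slots, attn_weights):
    print([round(v, 3) for v in slot], [round(a, 3) for a in attn])
```

Here the first query places almost all of its attention on the two clauses that describe the same entity, so they end up merged into a single slot, while the second query picks up the remaining clause.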

4. Training Data Source

The model is trained and evaluated on our newly proposed instruction-rich multi-modification datasets:

M-FashionIQ (Fashion domain)
M-CIRR (Open domain)

These datasets replace short, simplistic texts with Multi-Modification Texts (MMT) generated by a multimodal large language model (MLLM) and verified by human annotators.

🚀 Usage & Basic Inference

The weights and code are designed for use with the official TEMA GitHub repository.

Step 1: Prepare the Environment

Clone the GitHub repository and install the required dependencies (tested with Python 3.10.8 and PyTorch 2.5.1):

git clone https://github.com/lee-zixu/ACL26-TEMA
cd ACL26-TEMA
conda create -n tema python=3.10.8 -y
conda activate tema

# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install transformers==4.25.0
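
After installation, a quick check that the active environment matches the pinned versions can save debugging time. The following is a small sketch using only the standard library; the helper names are hypothetical and the script is not part of the official repository.

```python
import sys
from importlib import metadata

# Versions taken from the setup commands above.
EXPECTED = {"torch": "2.5.1", "transformers": "4.25.0"}

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

def mismatches(expected, found):
    """Return {package: (expected, found)} for every package that differs."""
    report = {}
    for name, want in expected.items():
        have = found.get(name)
        # Ignore local build suffixes such as "+cu121".
        base = have.split("+")[0] if have else None
        if base != want:
            report[name] = (want, have)
    return report

if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    problems = mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, (want, have) in problems.items():
        print(f"{name}: expected {want}, found {have}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated `tema` environment; a package reported as `None` is not installed.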

Step 2: Download Model & Data

Please refer to the GitHub Repository for detailed instructions on downloading the base image datasets (FashionIQ and CIRR) and replacing their captions with our provided mmt_captions to construct the M-FashionIQ and M-CIRR datasets.

Ensure your folder structure matches the requirements in the official codebase.
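
Conceptually, the caption replacement in this step is a keyed lookup: each original annotation keeps its image pair and swaps its short caption for the matching MMT. The sketch below shows that idea on in-memory data. The record schema (`reference`, `target`, `caption`), the lookup keyed by image pair, and the `replace_captions` helper are all assumptions for illustration; the actual file formats and scripts are defined in the GitHub repository.

```python
def replace_captions(records, mmt_captions):
    """Return copies of records whose caption is swapped for its MMT version.

    records:      list of dicts with "reference", "target", "caption" keys
                  (hypothetical schema).
    mmt_captions: dict mapping (reference, target) to a multi-modification text
                  (hypothetical schema).
    Records without an MMT entry keep their original caption.
    """
    updated = []
    for record in records:
        key = (record["reference"], record["target"])
        new_record = dict(record)  # copy so the original annotations stay untouched
        new_record["caption"] = mmt_captions.get(key, record["caption"])
        updated.append(new_record)
    return updated

# Toy annotations (made-up ids and captions).
original = [
    {"reference": "img_001", "target": "img_002", "caption": "make it red"},
    {"reference": "img_003", "target": "img_004", "caption": "add a dog"},
]
mmt = {
    ("img_001", "img_002"): "make it red, shorten the sleeves, and remove the logo",
}

merged = replace_captions(original, mmt)
for record in merged:
    print(record["reference"], "->", record["target"], ":", record["caption"])
```

Keeping the original records unmodified makes it easy to compare results on the base captions and on the MMT captions side by side.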

Step 3: Run Training / Inference

Once the environment and datasets are prepared, you can start the training or evaluation process:

python3 train.py

⚠️ Limitations & Notes

Disclaimer: This framework and the constructed M-FashionIQ/M-CIRR datasets are intended for academic research and multimodal evaluation.

The datasets build upon existing public datasets (FashionIQ and CIRR); users must also comply with the original licenses of those datasets.
The model's performance relies heavily on the quality of instruction parsing, and real-world multi-modification accuracy may vary based on domain-specific data.

📝⭐️ Citation

If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider leaving a Star ⭐️ on our GitHub repo and citing our work:

@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}