license: apache-2.0
tags:
- composed-image-retrieval
- vision-language
- multimodal
- pytorch
TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
Zixu Li¹, Yupeng Hu¹, Zhiheng Fu¹, Zhiwei Chen¹, Yongqi Li², Liqiang Nie³
¹Shandong University, ²Hong Kong Polytechnic University, ³Harbin Institute of Technology (Shenzhen)
These are the official model and data resources for TEMA (Text-oriented Entity Mapping Architecture), the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios.
Paper: [Accepted by ACL 2026]
GitHub Repository: lee-zixu/ACL26-TEMA
Model Information
1. Model Name
TEMA (Text-oriented Entity Mapping Architecture) Checkpoints.
2. Task Type & Applicable Tasks
- Task Type: Composed Image Retrieval (CIR) / Vision-Language / Multimodal Alignment
- Applicable Tasks: Retrieving target images based on a reference image and complex Multi-Modification Texts (MMT), seamlessly accommodating both simple and complex modifications.
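To make the retrieval step concrete, the sketch below ranks a small gallery by cosine similarity to a composed query embedding. It is a generic illustration of embedding-based retrieval, not TEMA's scoring code: the function names, the toy 3-dimensional vectors, and the idea that the query vector already fuses the reference image with the MMT are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_gallery(query_embedding, gallery):
    """Return gallery image ids sorted from most to least similar to the query."""
    scored = [(cosine(query_embedding, emb), image_id) for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]

# Toy 3-d embeddings (hypothetical); in practice they would come from a trained model.
composed_query = [0.9, 0.1, 0.4]  # assumed fusion of reference image + modification text
gallery = {
    "img_a": [1.0, 0.0, 0.5],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.5],
}

ranking = rank_gallery(composed_query, gallery)
print(ranking)
```

The first entries of `ranking` are the retrieved candidates; a real gallery would hold one embedding per candidate image.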
3. Project Introduction
Prevailing CIR setups rely on simple modification texts, inducing two critical limitations in practical applications: Insufficient Entity Coverage and Clause-Entity Misalignment.
TEMA brings CIR closer to real-world use cases by introducing:
- MMT Parsing Assistant (PA): Utilizes an LLM-based text summarizer and a Consistency Detector during training to enhance the exposure and coverage of modified entities.
- MMT-oriented Entity Mapping (EM): Introduces learnable queries to consolidate multiple clauses of the same entity on the text side and align them with corresponding visual entities on the image side.
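The idea of learnable queries can be illustrated with plain attention pooling: each query scores every token feature, and the softmax-weighted sum becomes one consolidated slot. The sketch below is a stand-in under stated assumptions, not the EM module from the official repository. The names `query_pool` and `entity_slots` are hypothetical, and the queries are random vectors here, whereas in a real model they would be trained parameters.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_pool(queries, tokens):
    """Each query attends over all token features and returns their weighted sum."""
    dim = len(tokens[0])
    pooled, attention = [], []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        attention.append(weights)
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return pooled, attention

random.seed(0)
dim, num_queries, num_tokens = 4, 2, 5
# Stand-ins (assumptions): trained queries and encoder outputs replaced by random vectors.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

entity_slots, attention = query_pool(queries, tokens)
print(len(entity_slots), len(entity_slots[0]))  # one pooled vector per query
```

Because every query sees all tokens, clauses that mention the same entity in different places can contribute to a single slot; applying the same pooling to image features would give slots that can be compared across modalities.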
4. Training Data Source
The model is trained and evaluated on our newly proposed instruction-rich multi-modification datasets:
- M-FashionIQ (Fashion domain)
- M-CIRR (Open domain)
These datasets replace short, simplistic texts with Multi-Modification Texts (MMT) generated by a multimodal large language model (MLLM) and verified by human annotators.
Usage & Basic Inference
These weights and code are designed to be used with the official TEMA GitHub repository.
Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (evaluated on Python 3.10.8 and PyTorch 2.5.1):
git clone https://github.com/lee-zixu/ACL26-TEMA
cd ACL26-TEMA
conda create -n tema python=3.10.8 -y
conda activate tema
# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install core dependencies
pip install transformers==4.25.0
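After installation, a short check like the one below can confirm that the pinned versions are actually the ones in the active environment. This is an optional helper sketched for this README, not a script shipped with the repository. The expected versions are copied from the commands above, and the function names are hypothetical.

```python
from importlib import metadata

# Versions taken from the install commands above.
EXPECTED = {"torch": "2.5.1", "transformers": "4.25.0"}

def find_mismatches(expected, installed):
    """Return {package: (expected, found)} for every missing or different version."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed.get(name)
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

def installed_versions(names):
    """Look up installed versions; packages that are absent map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    problems = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found}")
    if not problems:
        print("Environment matches the pinned versions.")
```

Run it inside the activated `tema` environment; any line it prints names a package whose version differs from the pinned one.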
Step 2: Download Model & Data
Please refer to the GitHub Repository for detailed instructions on downloading the base image datasets (FashionIQ and CIRR) and replacing their captions with our provided mmt_captions to construct the M-FashionIQ and M-CIRR datasets.
Ensure your folder structure matches the requirements in the official codebase.
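The caption replacement amounts to swapping each record's original caption for its MMT counterpart. The sketch below shows that idea on toy data only. The record layout, the `pair_id` and `caption` keys, and the shape of `mmt_captions` are assumptions made for illustration; the actual file formats and the supported procedure are defined in the GitHub repository.

```python
import json

def replace_captions(records, mmt_captions, id_key="pair_id", caption_key="caption"):
    """Return copies of `records` whose captions are swapped for MMT captions.

    Records without an entry in `mmt_captions` are kept unchanged and their ids
    are reported, so gaps are visible rather than silently ignored.
    """
    updated, missing = [], []
    for record in records:
        new_record = dict(record)  # shallow copy; the input list is not mutated
        key = str(record[id_key])
        if key in mmt_captions:
            new_record[caption_key] = mmt_captions[key]
        else:
            missing.append(key)
        updated.append(new_record)
    return updated, missing

# Toy data in an ASSUMED layout; the real annotation files may be organised differently.
records = [
    {"pair_id": 1, "reference": "ref_001", "caption": "make it red"},
    {"pair_id": 2, "reference": "ref_002", "caption": "add a collar"},
]
mmt_captions = {"1": "make it red, shorten the sleeves, and remove the logo"}

updated, missing = replace_captions(records, mmt_captions)
print(json.dumps(updated, indent=2))
print("records without an MMT caption:", missing)
```

Reporting the unmatched ids makes it easy to spot an incomplete or mismatched caption file before training starts.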
Step 3: Run Training / Inference
Once the environment and datasets are prepared, you can start the training or evaluation process:
python3 train.py
Limitations & Notes
Disclaimer: This framework and the constructed M-FashionIQ/M-CIRR datasets are intended for academic research and multimodal evaluation.
- The datasets build upon existing public datasets (FashionIQ and CIRR); users must also comply with the original licenses of those datasets.
- The model's performance relies heavily on the quality of instruction parsing, and real-world multi-modification accuracy may vary based on domain-specific data.
Citation
If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider leaving a star on our GitHub repo and citing our work:
@inproceedings{TEMA,
title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2026}
}