---
license: apache-2.0
tags:
- composed-image-retrieval
- vision-language
- multimodal
- pytorch
---

# ⚓ TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Zixu Li<sup>1</sup>  Yupeng Hu<sup>1</sup>✉  Zhiheng Fu<sup>1</sup>  Zhiwei Chen<sup>1</sup>  Yongqi Li<sup>2</sup>  Liqiang Nie<sup>3</sup>

<sup>1</sup>Shandong University  <sup>2</sup>Hong Kong Polytechnic University  <sup>3</sup>Harbin Institute of Technology (Shenzhen)

These are the official model and data resources for **TEMA** (Text-oriented Entity Mapping Architecture), the first Composed Image Retrieval (CIR) framework designed explicitly for multi-modification scenarios.

🔗 **Paper:** [Accepted by ACL 2026]

🔗 **GitHub Repository:** [lee-zixu/ACL26-TEMA](https://github.com/lee-zixu/ACL26-TEMA)

---

## 📌 Model Information

### 1. Model Name
**TEMA** (Text-oriented Entity Mapping Architecture) Checkpoints.

### 2. Task Type & Applicable Tasks
- **Task Type:** Composed Image Retrieval (CIR) / Vision-Language / Multimodal Alignment
- **Applicable Tasks:** Retrieving target images from a reference image and complex Multi-Modification Texts (MMT), while accommodating both simple and complex modifications.

### 3. Project Introduction
Prevailing CIR setups rely on simple modification texts, which leads to two critical limitations in practical applications: **Insufficient Entity Coverage** and **Clause-Entity Misalignment**.

**TEMA** brings CIR closer to real-world use cases by introducing:
- 🧠 **MMT Parsing Assistant (PA):** Uses an LLM-based text summarizer and a Consistency Detector during training to increase the exposure and coverage of modified entities.
- 🔗 **MMT-oriented Entity Mapping (EM):** Introduces learnable queries that consolidate multiple clauses about the same entity on the text side and align them with the corresponding visual entities on the image side.

### 4. Training Data Source
The model is trained and evaluated on our newly proposed instruction-rich multi-modification datasets:
- **M-FashionIQ** (Fashion domain)
- **M-CIRR** (Open domain)

These datasets replace short, simplistic texts with Multi-Modification Texts (MMT) generated by an MLLM and verified by human annotators.

---

## 🚀 Usage & Basic Inference

These weights and code are designed to be used with the official TEMA GitHub repository.
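
For intuition about the Entity Mapping component described in the Project Introduction, the toy sketch below shows how a small set of query vectors can pool several clauses about the same entity through scaled dot-product attention. It is a dependency-free illustration and not the TEMA implementation: the function names, the 2-d embeddings, and the fixed queries (which would be learned parameters in a real model) are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, tokens):
    """Let each query pool the tokens it matches best (scaled dot-product attention).

    queries: list of query vectors (learned in a real model, fixed here).
    tokens:  list of clause embeddings from one modification text.
    Returns (slots, weights): one pooled vector and one weight row per query.
    """
    dim = len(tokens[0])
    slots, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the clause embeddings for this query.
        slots.append([sum(w_i * t[d] for w_i, t in zip(w, tokens)) for d in range(dim)])
    return slots, weights


# Hypothetical embeddings: two clauses about a "sleeve" entity, one about a "collar".
tokens = [[4.0, 0.0], [3.5, 0.5], [0.0, 4.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]  # one query per entity slot (assumed)
slots, weights = attend(queries, tokens)
```

In this toy setup the first query places most of its attention on the two sleeve clauses and the second on the collar clause, so each slot summarizes one entity. The official repository remains the reference for how the actual module is built and trained.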
### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (tested with Python 3.10.8 and PyTorch 2.5.1):

```bash
git clone https://github.com/lee-zixu/ACL26-TEMA
# git clone names the directory after the repository
cd ACL26-TEMA
conda create -n tema python=3.10.8 -y
conda activate tema

# Install PyTorch
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
pip install transformers==4.25.0
```

### Step 2: Download Model & Data
Please refer to the [GitHub Repository](https://github.com/lee-zixu/ACL26-TEMA) for detailed instructions on downloading the base image datasets (FashionIQ and CIRR) and replacing their captions with our provided `mmt_captions` to construct the **M-FashionIQ** and **M-CIRR** datasets. Ensure that your folder structure matches the requirements of the official codebase.

### Step 3: Run Training / Inference
Once the environment and datasets are prepared, you can start the training or evaluation process:

```bash
python3 train.py
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and the constructed M-FashionIQ/M-CIRR datasets are intended for **academic research and multimodal evaluation**.
- The datasets build upon existing public datasets (FashionIQ and CIRR); users must also comply with the original licenses of those datasets.
- The model's performance relies heavily on the quality of instruction parsing, so accuracy on real-world multi-modification queries may vary across domains.
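
As a practical note on the Step 1 environment, a short check like the one below can confirm that the installed packages match the pinned versions before `train.py` is launched. This is a sketch and not part of the official codebase: the `EXPECTED` table and the `check_versions` helper are hypothetical names, and only the two versions pinned in Step 1 are compared. It relies solely on the standard-library `importlib.metadata` module, so it runs even when a package is missing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from Step 1; torchvision and torchaudio are left unpinned there.
EXPECTED = {"torch": "2.5.1", "transformers": "4.25.0"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # CUDA wheels report versions such as "2.5.1+cu121";
        # compare only the part before the local "+..." suffix.
        matches = installed is not None and installed.split("+")[0] == pin
        report[name] = (installed, matches)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "mismatch"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} [{status}]")
```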
---

## 📝⭐️ Citation

If you find our paper, the M-FashionIQ/M-CIRR datasets, or this codebase useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repo and citing our work:

```bibtex
@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}
```