---
license: apache-2.0
tags:
- composed-image-retrieval
- zero-shot-learning
- text-image-retrieval
- multimodal
- textual-inversion
- pytorch
---

# 🔍 FTI4CIR: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval (Model Weights)

Haoqiang Lin¹  Haokun Wen²  Xuemeng Song¹\*  Meng Liu³  Yupeng Hu¹  Liqiang Nie²

¹Shandong University  ²Harbin Institute of Technology (Shenzhen)  ³Shandong Jianzhu University

This repository hosts the official pre-trained model weights for **FTI4CIR**, a fine-grained textual inversion framework for **Zero-Shot Composed Image Retrieval (CIR)**. The model maps reference images into subject-oriented and attribute-oriented pseudo-word tokens, enabling zero-shot composed retrieval without any annotated training triplets.

🔗 **Paper:** [SIGIR 2024](https://dl.acm.org/doi/10.1145/3626772.3657831)
🔗 **GitHub Repository:** [iLearn-Lab/SIGIR24-FTI4CIR](https://github.com/iLearn-Lab/SIGIR24-FTI4CIR)

---

## 📌 Model Information

### 1. Model Name

**FTI4CIR** (Fine-grained Textual Inversion for Composed Image Retrieval)

### 2. Task Type & Applicable Tasks

- **Task Type:** Multimodal Retrieval / Zero-Shot Composed Image Retrieval / Textual Inversion
- **Applicable Tasks:**
  - Zero-shot composed image retrieval (reference image + modification text → target image)
  - Text-image retrieval with fine-grained image decomposition
  - Open-domain composed retrieval on fashion, general objects, and real-world scenes

### 3. Model Overview

Existing CIR methods often rely on expensive annotated `<reference image, modification text, target image>` triplets and use only coarse-grained image representations. **FTI4CIR** instead decomposes each image into:

- a **subject-oriented pseudo-word token** for the main entities, and
- **attribute-oriented pseudo-word tokens** for appearance, style, background, etc.

The image is then represented as a natural sentence:

`"a photo of [S*] with [A1*, A2*, ..., Ar*]"`

By concatenating this sentence with the modification text, CIR is reduced to standard text-image retrieval, yielding strong zero-shot generalization.

Key designs:

- Fine-grained pseudo-word token mapping
- Dynamic local attribute feature extraction
- Tri-wise caption-based semantic regularization (subject / attribute / whole-image)

### 4. Training Data

The model is trained on **unlabeled open-domain images** (ImageNet) without any manually annotated CIR triplets.
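The query-composition step described above can be sketched in plain Python: the pseudo-word sentence for the reference image is concatenated with the modification text to form a single text query. This is a minimal illustration only; the helper names and bracketed token placeholders below are assumptions, not the repository's actual API.

```python
# Illustrative sketch (not the official FTI4CIR API): assembling a composed
# text query from pseudo-word tokens and a modification text.

def build_image_sentence(subject_token: str, attribute_tokens: list[str]) -> str:
    """Render an image as a pseudo-word sentence, e.g. 'a photo of [S*] with ...'."""
    attrs = ", ".join(attribute_tokens)
    return f"a photo of {subject_token} with {attrs}"

def build_composed_query(subject_token: str,
                         attribute_tokens: list[str],
                         modification_text: str) -> str:
    """Concatenate the pseudo-word sentence with the modification text,
    reducing composed image retrieval to plain text-to-image retrieval."""
    return f"{build_image_sentence(subject_token, attribute_tokens)}, {modification_text}"

query = build_composed_query("[S*]", ["[A1*]", "[A2*]"], "but in red and outdoors")
print(query)
# a photo of [S*] with [A1*], [A2*], but in red and outdoors
```

The resulting string is what a frozen text encoder (e.g. CLIP's) would consume to rank candidate target images.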
Evaluation is performed on standard benchmarks:

- FashionIQ
- CIRR
- CIRCO

---

## 🚀 Usage & Inference

These weights are designed to be used directly with the official FTI4CIR codebase.

### Step 1: Environment Setup

```bash
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
conda create -n fti4cir python=3.9 -y
conda activate fti4cir
pip install -r requirements.txt
```
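Once the environment is set up, the released weights should load as a standard PyTorch state dict. The round trip below is a minimal sketch under that assumption: the file name `fti4cir_checkpoint.pt` and the stand-in module are placeholders, not the official checkpoint layout, so inspect the keys before wiring the weights into the codebase.

```python
# Hedged sketch: loading weights as a PyTorch state dict. The module here is
# a dummy stand-in; the real mapping networks live in the FTI4CIR codebase.
import torch
import torch.nn as nn

# Placeholder for the subject/attribute mapping networks.
mapper = nn.Linear(512, 512)

# Typical round trip; the released .pt file would replace the dummy one here.
torch.save(mapper.state_dict(), "fti4cir_checkpoint.pt")
state = torch.load("fti4cir_checkpoint.pt", map_location="cpu")
mapper.load_state_dict(state)

# Always inspect what the checkpoint actually contains:
print(sorted(state.keys()))
# ['bias', 'weight']
```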