---
license: apache-2.0
tags:
- composed-image-retrieval
- zero-shot-learning
- text-image-retrieval
- multimodal
- textual-inversion
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>FTI4CIR: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval (Model Weights)</h1>

<p>
<b>Haoqiang Lin</b><sup>1</sup>
<b>Haokun Wen</b><sup>2</sup>
<b>Xuemeng Song</b><sup>1</sup>\*
<b>Meng Liu</b><sup>3</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>Shandong University
<sup>2</sup>Harbin Institute of Technology (Shenzhen)
<sup>3</sup>Shandong Jianzhu University
</p>
</div>

This repository hosts the official pre-trained model weights for **FTI4CIR**, a fine-grained textual inversion framework for **Zero-Shot Composed Image Retrieval (CIR)**.
The model maps reference images into subject-oriented and attribute-oriented pseudo-word tokens, enabling zero-shot composed retrieval without any annotated training triplets.

**Paper:** [SIGIR 2024](https://dl.acm.org/doi/10.1145/3626772.3657831)
**GitHub Repository:** [iLearn-Lab/SIGIR24-FTI4CIR](https://github.com/iLearn-Lab/SIGIR24-FTI4CIR)

---

## Model Information

### 1. Model Name
**FTI4CIR** (Fine-grained Textual Inversion for Composed Image Retrieval)

### 2. Task Type & Applicable Tasks
- **Task Type:** Multimodal Retrieval / Zero-Shot Composed Image Retrieval / Textual Inversion
- **Applicable Tasks:**
  - Zero-shot composed image retrieval (reference image + modification text → target image)
  - Text-image retrieval with fine-grained image decomposition
  - Open-domain composed retrieval on fashion, general objects, and real-world scenes
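
In this zero-shot protocol, retrieval ultimately means ranking gallery images by the similarity between a composed text-query embedding and pre-computed image embeddings. The following is a minimal sketch of that ranking step with toy vectors standing in for the model's CLIP-based features, not the actual FTI4CIR code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_gallery(query_emb, gallery):
    """Return gallery image ids sorted by similarity to the query embedding."""
    return sorted(gallery,
                  key=lambda img_id: cosine(query_emb, gallery[img_id]),
                  reverse=True)

# Toy embeddings (hypothetical values) standing in for CLIP image features.
gallery = {
    "img_beach": [0.9, 0.1, 0.0],
    "img_city":  [0.1, 0.9, 0.1],
    "img_snow":  [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of the composed text query
print(rank_gallery(query, gallery)[0])  # img_beach ranks first
```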

### 3. Model Overview
Existing CIR methods often rely on expensive annotated `<image, text, target>` triplets and use only coarse-grained image representations.
**FTI4CIR** instead decomposes each image into:
- **Subject-oriented pseudo-word token** for main entities
- **Attribute-oriented pseudo-word tokens** for appearance, style, background, etc.

The image is then represented as a natural sentence:
`"a photo of [S*] with [A1*, A2*, ..., Ar*]"`

By concatenating this sentence with the modification text, CIR is reduced to standard text-image retrieval, achieving strong zero-shot generalization.
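
Conceptually, this composition step is just templating over learned tokens. A toy illustration follows; note that in the real model `[S*]` and `[Ai*]` are learned pseudo-word embeddings injected into the text encoder's token sequence, not literal strings:

```python
def image_to_sentence(subject: str, attributes: list[str]) -> str:
    """Render an image's fine-grained pseudo-word representation as text."""
    return f"a photo of [{subject}] with [{', '.join(attributes)}]"

def compose_query(sentence: str, modification: str) -> str:
    """Append the modification text, turning CIR into text-image retrieval."""
    return f"{sentence}, {modification}"

sentence = image_to_sentence("S*", ["A1*", "A2*"])
query = compose_query(sentence, "change the background to a beach")
print(query)
# a photo of [S*] with [A1*, A2*], change the background to a beach
```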

Key designs:
- Fine-grained pseudo-word token mapping
- Dynamic local attribute feature extraction
- Tri-wise caption-based semantic regularization (subject / attribute / whole-image)
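
The tri-wise regularization aligns the subject-, attribute-, and whole-image-level pseudo-word representations with corresponding captions. The paper's exact losses live in the official codebase; a generic InfoNCE-style contrastive term of the kind commonly used for such alignment (my sketch, not the paper's formulation) might look like:

```python
import math

def info_nce(sims: list[float], pos_idx: int, temperature: float = 0.07) -> float:
    """Contrastive loss for one pseudo-word embedding.

    `sims` holds its similarities to a batch of caption embeddings; the
    matching caption sits at `pos_idx`. Lower loss means the pseudo-word
    representation is closer to its own caption than to the others.
    """
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[pos_idx] / sum(exps))

# When the matching caption is the most similar, the loss is small.
print(info_nce([0.9, 0.2, 0.1], pos_idx=0))
```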

### 4. Training Data
The model is trained on **unlabeled open-domain images** (ImageNet) without any manually annotated CIR triplets.
Evaluation is performed on standard benchmarks:
- FashionIQ
- CIRR
- CIRCO

---

## Usage & Inference

These weights are designed to be directly used with the official FTI4CIR codebase.

### Step 1: Environment Setup
```bash
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
conda create -n fti4cir python=3.9 -y
conda activate fti4cir
pip install -r requirements.txt