---
license: apache-2.0
tags:
- composed-image-retrieval
- zero-shot-learning
- text-image-retrieval
- multimodal
- textual-inversion
- pytorch
---
<a id="top"></a>
<div align="center">
<h1>πŸ” FTI4CIR: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval (Model Weights)</h1>
<p>
<b>Haoqiang Lin</b><sup>1</sup>&nbsp;
<b>Haokun Wen</b><sup>2</sup>&nbsp;
<b>Xuemeng Song</b><sup>1</sup><sup>*</sup>&nbsp;
<b>Meng Liu</b><sup>3</sup>&nbsp;
<b>Yupeng Hu</b><sup>1</sup>&nbsp;
<b>Liqiang Nie</b><sup>2</sup>
</p>
<p>
<sup>1</sup>Shandong University&nbsp;&nbsp;
<sup>2</sup>Harbin Institute of Technology (Shenzhen)&nbsp;&nbsp;
<sup>3</sup>Shandong Jianzhu University
</p>
</div>
This repository hosts the official pre-trained model weights for **FTI4CIR**, a fine-grained textual inversion framework for **Zero-Shot Composed Image Retrieval (CIR)**.
The model maps each reference image into subject-oriented and attribute-oriented pseudo-word tokens, enabling zero-shot composed retrieval without any annotated training triplets.
πŸ”— **Paper:** [SIGIR 2024](https://dl.acm.org/doi/10.1145/3626772.3657831)
πŸ”— **GitHub Repository:** [iLearn-Lab/SIGIR24-FTI4CIR](https://github.com/iLearn-Lab/SIGIR24-FTI4CIR)
---
## πŸ“Œ Model Information
### 1. Model Name
**FTI4CIR** (Fine-grained Textual Inversion for Composed Image Retrieval)
### 2. Task Type & Applicable Tasks
- **Task Type:** Multimodal Retrieval / Zero-Shot Composed Image Retrieval / Textual Inversion
- **Applicable Tasks:**
- Zero-shot composed image retrieval (reference image + modification text β†’ target image)
- Text-image retrieval with fine-grained image decomposition
- Open-domain composed retrieval on fashion, general objects, and real-world scenes
### 3. Model Overview
Existing CIR methods often rely on expensive annotated `<image, text, target>` triplets and use only coarse-grained image representations.
**FTI4CIR** innovatively decomposes each image into:
- **Subject-oriented pseudo-word token** for main entities
- **Attribute-oriented pseudo-word tokens** for appearance, style, background, etc.
The image is then represented as a natural sentence:
`"a photo of [S*] with [A1*, A2*, ..., Ar*]"`
By concatenating this sentence with the modification text, CIR reduces to standard text-image retrieval, achieving strong zero-shot generalization.
Key designs:
- Fine-grained pseudo-word token mapping
- Dynamic local attribute feature extraction
- Tri-wise caption-based semantic regularization (subject / attribute / whole-image)
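The template-and-concatenation scheme described above can be sketched as follows. This is a minimal illustration only; the function names and the way tokens are joined are hypothetical, not taken from the official codebase, where the pseudo-word tokens are learned embeddings rather than literal strings:

```python
# Illustrative sketch of FTI4CIR-style query construction.
# Names and string formatting are hypothetical, not from the official code.

def build_pseudo_sentence(subject_token: str, attribute_tokens: list) -> str:
    """Render a reference image's pseudo-word tokens as a natural sentence,
    e.g. "a photo of [S*] with [A1*], [A2*]"."""
    attrs = ", ".join(attribute_tokens)
    return f"a photo of {subject_token} with {attrs}"

def build_composed_query(subject_token: str, attribute_tokens: list,
                         modification_text: str) -> str:
    """Concatenate the pseudo-sentence with the modification text, reducing
    composed image retrieval to standard text-image retrieval."""
    sentence = build_pseudo_sentence(subject_token, attribute_tokens)
    return f"{sentence}, {modification_text}"

query = build_composed_query("[S*]", ["[A1*]", "[A2*]"], "make it red")
print(query)  # a photo of [S*] with [A1*], [A2*], make it red
```

In the actual model, the resulting token sequence would be fed to a frozen text encoder (e.g. CLIP's) and matched against candidate image embeddings by cosine similarity.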
### 4. Training Data
The model is trained on **unlabeled open-domain images** (ImageNet) without any manually annotated CIR triplets.
Evaluation is performed on standard benchmarks:
- FashionIQ
- CIRR
- CIRCO
---
## πŸš€ Usage & Inference
These weights are designed to be directly used with the official FTI4CIR codebase.
### Step 1: Environment Setup
```bash
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
conda create -n fti4cir python=3.9 -y
conda activate fti4cir
pip install -r requirements.txt