---
license: apache-2.0
tags:
- composed-image-retrieval
- zero-shot-learning
- text-image-retrieval
- multimodal
- textual-inversion
- pytorch
---

# 🔍 FTI4CIR: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval (Model Weights)

Haoqiang Lin¹  Haokun Wen²  Xuemeng Song¹\*  Meng Liu³  Yupeng Hu¹  Liqiang Nie²

¹Shandong University  ²Harbin Institute of Technology (Shenzhen)  ³Shandong Jianzhu University

This repository hosts the official pre-trained model weights for **FTI4CIR**, a fine-grained textual inversion framework for **Zero-Shot Composed Image Retrieval (CIR)**. The model maps reference images into subject-oriented and attribute-oriented pseudo-word tokens, enabling zero-shot composed retrieval without any annotated training triplets.

🔗 **Paper:** [SIGIR 2024](https://dl.acm.org/doi/10.1145/3626772.3657831)
🔗 **GitHub Repository:** [iLearn-Lab/SIGIR24-FTI4CIR](https://github.com/iLearn-Lab/SIGIR24-FTI4CIR)

---

## 📌 Model Information

### 1. Model Name

**FTI4CIR** (Fine-grained Textual Inversion for Composed Image Retrieval)

### 2. Task Type & Applicable Tasks

- **Task Type:** Multimodal Retrieval / Zero-Shot Composed Image Retrieval / Textual Inversion
- **Applicable Tasks:**
  - Zero-shot composed image retrieval (reference image + modification text → target image)
  - Text-image retrieval with fine-grained image decomposition
  - Open-domain composed retrieval on fashion, general objects, and real-world scenes

### 3. Model Overview

Existing CIR methods often rely on expensive annotated `<reference image, modification text, target image>` triplets and use only coarse-grained image representations. **FTI4CIR** instead decomposes each image into:

- a **subject-oriented pseudo-word token** for the main entities, and
- **attribute-oriented pseudo-word tokens** for appearance, style, background, etc.

The image is then represented as a natural sentence:

`"a photo of [S*] with [A1*, A2*, ..., Ar*]"`

By concatenating this sentence with the modification text, CIR is reduced to standard text-image retrieval, yielding strong zero-shot generalization.

Key designs:

- Fine-grained pseudo-word token mapping
- Dynamic local attribute feature extraction
- Tri-wise caption-based semantic regularization (subject / attribute / whole-image)

### 4. Training Data

The model is trained on **unlabeled open-domain images** (ImageNet) without any manually annotated CIR triplets.
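The query-composition step described above can be sketched in plain Python: the pseudo-word sentence for the reference image is concatenated with the modification text to form a single text query. This is a minimal illustration only; the helper names and bracketed token placeholders below are assumptions, not the repository's actual API.

```python
# Illustrative sketch (not the official FTI4CIR API): assembling a composed
# text query from pseudo-word tokens and a modification text.

def build_image_sentence(subject_token: str, attribute_tokens: list[str]) -> str:
    """Render an image as a pseudo-word sentence, e.g. 'a photo of [S*] with ...'."""
    attrs = ", ".join(attribute_tokens)
    return f"a photo of {subject_token} with {attrs}"

def build_composed_query(subject_token: str,
                         attribute_tokens: list[str],
                         modification_text: str) -> str:
    """Concatenate the pseudo-word sentence with the modification text,
    reducing composed image retrieval to plain text-to-image retrieval."""
    return f"{build_image_sentence(subject_token, attribute_tokens)}, {modification_text}"

query = build_composed_query("[S*]", ["[A1*]", "[A2*]"], "but in red and outdoors")
print(query)
# a photo of [S*] with [A1*], [A2*], but in red and outdoors
```

The resulting string is what a frozen text encoder (e.g. CLIP's) would consume to rank candidate target images.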
Evaluation is performed on standard benchmarks:

- FashionIQ
- CIRR
- CIRCO

---

## 🚀 Usage & Inference

These weights are designed to be used directly with the official FTI4CIR codebase.

### Step 1: Environment Setup

```bash
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
conda create -n fti4cir python=3.9 -y
conda activate fti4cir
pip install -r requirements.txt
```
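Once the environment is set up, the released weights should load as a standard PyTorch state dict. The round trip below is a minimal sketch under that assumption: the file name `fti4cir_checkpoint.pt` and the stand-in module are placeholders, not the official checkpoint layout, so inspect the keys before wiring the weights into the codebase.

```python
# Hedged sketch: loading weights as a PyTorch state dict. The module here is
# a dummy stand-in; the real mapping networks live in the FTI4CIR codebase.
import torch
import torch.nn as nn

# Placeholder for the subject/attribute mapping networks.
mapper = nn.Linear(512, 512)

# Typical round trip; the released .pt file would replace the dummy one here.
torch.save(mapper.state_dict(), "fti4cir_checkpoint.pt")
state = torch.load("fti4cir_checkpoint.pt", map_location="cpu")
mapper.load_state_dict(state)

# Always inspect what the checkpoint actually contains:
print(sorted(state.keys()))
# ['bias', 'weight']
```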