---
license: apache-2.0
tags:
- composed-image-retrieval
- zero-shot-learning
- text-image-retrieval
- multimodal
- textual-inversion
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>FTI4CIR: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval (Model Weights)</h1>

<p>
<b>Haoqiang Lin</b><sup>1</sup>
<b>Haokun Wen</b><sup>2</sup>
<b>Xuemeng Song</b><sup>1</sup>\*
<b>Meng Liu</b><sup>3</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>Shandong University
<sup>2</sup>Harbin Institute of Technology (Shenzhen)
<sup>3</sup>Shandong Jianzhu University
</p>
</div>

This repository hosts the official pre-trained model weights for **FTI4CIR**, a fine-grained textual inversion framework for **Zero-Shot Composed Image Retrieval (CIR)**.
The model maps reference images into subject-oriented and attribute-oriented pseudo-word tokens, enabling zero-shot composed retrieval without any annotated training triplets.

**Paper:** [SIGIR 2024](https://dl.acm.org/doi/10.1145/3626772.3657831)
**GitHub Repository:** [iLearn-Lab/SIGIR24-FTI4CIR](https://github.com/iLearn-Lab/SIGIR24-FTI4CIR)

---

## Model Information

### 1. Model Name
**FTI4CIR** (Fine-grained Textual Inversion for Composed Image Retrieval)

### 2. Task Type & Applicable Tasks
- **Task Type:** Multimodal Retrieval / Zero-Shot Composed Image Retrieval / Textual Inversion
- **Applicable Tasks:**
  - Zero-shot composed image retrieval (reference image + modification text → target image)
  - Text-image retrieval with fine-grained image decomposition
  - Open-domain composed retrieval on fashion, general objects, and real-world scenes
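
In this zero-shot protocol, retrieval ultimately means ranking gallery images by the similarity between a composed text-query embedding and pre-computed image embeddings. The following is a minimal sketch of that ranking step with toy vectors standing in for the model's CLIP-based features, not the actual FTI4CIR code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_gallery(query_emb, gallery):
    """Return gallery image ids sorted by similarity to the query embedding."""
    return sorted(gallery,
                  key=lambda img_id: cosine(query_emb, gallery[img_id]),
                  reverse=True)

# Toy embeddings (hypothetical values) standing in for CLIP image features.
gallery = {
    "img_beach": [0.9, 0.1, 0.0],
    "img_city":  [0.1, 0.9, 0.1],
    "img_snow":  [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of the composed text query
print(rank_gallery(query, gallery)[0])  # img_beach ranks first
```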

### 3. Model Overview
Existing CIR methods often rely on expensive annotated `<image, text, target>` triplets and use only coarse-grained image representations.
**FTI4CIR** instead decomposes each image into:
- **Subject-oriented pseudo-word token** for main entities
- **Attribute-oriented pseudo-word tokens** for appearance, style, background, etc.

The image is then represented as a natural sentence:
`"a photo of [S*] with [A1*, A2*, ..., Ar*]"`

By concatenating this sentence with the modification text, CIR is reduced to standard text-image retrieval, achieving strong zero-shot generalization.
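
Conceptually, this composition step is just templating over learned tokens. A toy illustration follows; note that in the real model `[S*]` and `[Ai*]` are learned pseudo-word embeddings injected into the text encoder's token sequence, not literal strings:

```python
def image_to_sentence(subject: str, attributes: list[str]) -> str:
    """Render an image's fine-grained pseudo-word representation as text."""
    return f"a photo of [{subject}] with [{', '.join(attributes)}]"

def compose_query(sentence: str, modification: str) -> str:
    """Append the modification text, turning CIR into text-image retrieval."""
    return f"{sentence}, {modification}"

sentence = image_to_sentence("S*", ["A1*", "A2*"])
query = compose_query(sentence, "change the background to a beach")
print(query)
# a photo of [S*] with [A1*, A2*], change the background to a beach
```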

Key designs:
- Fine-grained pseudo-word token mapping
- Dynamic local attribute feature extraction
- Tri-wise caption-based semantic regularization (subject / attribute / whole-image)
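
The tri-wise regularization aligns the subject-, attribute-, and whole-image-level pseudo-word representations with corresponding captions. The paper's exact losses live in the official codebase; a generic InfoNCE-style contrastive term of the kind commonly used for such alignment (my sketch, not the paper's formulation) might look like:

```python
import math

def info_nce(sims: list[float], pos_idx: int, temperature: float = 0.07) -> float:
    """Contrastive loss for one pseudo-word embedding.

    `sims` holds its similarities to a batch of caption embeddings; the
    matching caption sits at `pos_idx`. Lower loss means the pseudo-word
    representation is closer to its own caption than to the others.
    """
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[pos_idx] / sum(exps))

# When the matching caption is the most similar, the loss is small.
print(info_nce([0.9, 0.2, 0.1], pos_idx=0))
```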

### 4. Training Data
The model is trained on **unlabeled open-domain images** (ImageNet) without any manually annotated CIR triplets.
Evaluation is performed on standard benchmarks:
- FashionIQ
- CIRR
- CIRCO

---

## Usage & Inference

These weights are designed to be directly used with the official FTI4CIR codebase.

### Step 1: Environment Setup
```bash
git clone https://github.com/ZiChao111/FTI4CIR.git
cd FTI4CIR
conda create -n fti4cir python=3.9 -y
conda activate fti4cir
pip install -r requirements.txt