# Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-ICLR_2026-blue)](https://openreview.net/forum?id=yOEmEXmbV8)
[![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Model-yellow)](https://huggingface.co/Johnny050407/QCQC/)

[Jianglin Lu](https://jianglin954.github.io/), [Simon Jenni](https://sjenni.github.io/), [Kushal Kafle](https://kushalkafle.com/), [Jing Shi](https://jshi31.github.io/jingshi/), [Handong Zhao](https://hdzhao.github.io/), [Yun Fu](https://www1.ece.neu.edu/~yunfu/)

</div>

---

## Why QCQC?

Text-to-image retrieval usually optimizes for **relevance** only. In practice you often care about **quality** too: more aesthetic photos, fewer blurry or otherwise low-quality images (as measured by image quality assessment, IQA), or a custom trade-off. We call this **Quality-Controllable Retrieval (QCR)**, a new setting where retrieval can be explicitly conditioned on user-defined quality requirements.

We propose **Quality-Conditioned Query Completion (QCQC)**, a query completion framework that leverages LLMs to enrich short queries with quality-aware descriptive details. Specify desired quality (e.g., aesthetic, relevance, image quality), and QCQC completes your query so retrieval returns results that match both meaning and quality.

- **Quality control** — Describe desired quality as the condition; no separate filters or post-hoc ranking.
- **Multi-dimensional quality** — Aesthetic, image quality (IQA), and relevance, composable in one framework (adapt to any quality definition).
- **Reproducible** — MS-COCO workflow, clear data pipeline, and training/inference scripts.
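
To make the conditioning idea concrete, here is a minimal sketch of how a quality condition might be serialized together with a short query before it is handed to the language model. The prompt template, the `build_conditioned_prompt` helper, and the level names (`high`, `low`) are illustrative assumptions, not the format used by the scripts in this repository.

```python
from typing import Mapping


def build_conditioned_prompt(query: str, quality: Mapping[str, str]) -> str:
    """Serialize quality conditions and a short query into one prompt string.

    Hypothetical template, for illustration only.
    """
    # Sort the keys so the same condition always yields the same prompt.
    condition = ", ".join(
        f"{name}: {level}" for name, level in sorted(quality.items())
    )
    return f"[{condition}] query: {query.strip()} completion:"


prompt = build_conditioned_prompt(
    "a dog on the beach",
    {"aesthetic": "high", "iqa": "high", "relevance": "high"},
)
print(prompt)
```

In this sketch the language model would continue the text after `completion:`, and the completed query would then be used for retrieval.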

---

## Overview

We use **MS-COCO** and **GPT-2** as the running example: download data, build a search index, generate auxiliary quality scores (aesthetic, IQA, relevance), tokenize the data, train the QCQC model, and then run retrieval. The steps below walk through the full pipeline.

---

## Environment Installation

```bash
bash ./src/setup_envir.sh
conda activate QCQC
```

---

## Dataset Preparation

### Download MS-COCO dataset

```bash
python ./src/download_coco.py
unzip ./coco_data/train2017.zip -d ./coco_data/
unzip ./coco_data/annotations_trainval2017.zip -d ./coco_data/
```
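
To sanity-check the download, you can load the caption annotations and group them by image. The sketch below relies only on the standard COCO caption format (an `annotations` list whose entries carry `image_id` and `caption`); the `group_captions` helper is a hypothetical name and is not part of this repository.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_captions(coco: dict) -> dict[int, list[str]]:
    """Map each image id to the list of captions annotated for it."""
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        captions[ann["image_id"]].append(ann["caption"].strip())
    return dict(captions)


# Path produced by the unzip commands above.
ann_path = Path("./coco_data/annotations/captions_train2017.json")
if ann_path.exists():
    with ann_path.open() as f:
        captions = group_captions(json.load(f))
    print(f"{len(captions)} images with captions")
```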

### Build search index

```bash
CUDA_VISIBLE_DEVICES=0 python ./src/search_preparation.py
```

---

## Auxiliary Data Generation

Quality conditioning relies on precomputed scores. Follow the steps below for each type.
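
One common way to turn continuous scores into conditions a language model can consume is to bucket them into a few discrete levels. The sketch below uses tercile cut points over the whole score list; the level names, the number of levels, and the `to_levels` helper are assumptions for illustration and may differ from what the pipeline actually does.

```python
from bisect import bisect_right
from statistics import quantiles

LEVELS = ("low", "medium", "high")  # hypothetical level names


def to_levels(scores: list[float]) -> list[str]:
    """Assign each score to a level based on quantile cut points."""
    # With n = len(LEVELS), quantiles() returns len(LEVELS) - 1 cut points.
    cuts = quantiles(scores, n=len(LEVELS))
    # bisect_right counts how many cut points lie at or below the score.
    return [LEVELS[bisect_right(cuts, s)] for s in scores]


aesthetic_scores = [4.1, 5.0, 5.6, 6.3, 7.2, 4.8]  # made-up example values
print(list(zip(aesthetic_scores, to_levels(aesthetic_scores))))
```

Quantile-based cut points keep the levels roughly balanced regardless of the scale each scorer uses, which is why they are a convenient default for a sketch like this.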

### Image Aesthetic Scores

Follow the setup in [improved-aesthetic-predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor).

Install extra dependencies:

```bash
conda run -n QCQC pip install webdataset pytorch-lightning
```

Generate aesthetic scores:

```bash
CUDA_VISIBLE_DEVICES=0 python ./improved-aesthetic-predictor/simple_inference_coco.py
```

### IQA Scores

Follow the setup in [DeQA-Score](https://github.com/zhiyuanyou/DeQA-Score). Create a separate environment:

```bash
conda create -yn DeQA python=3.10
conda activate DeQA
cd DeQA-Score
pip install -e .
pip install pycocotools numpy==1.26.4 protobuf
```

Generate IQA scores:

```bash
CUDA_VISIBLE_DEVICES=0 python ./src/evaluate/scorer_coco.py
```

### Relevance Scores

Relevance scores are computed with CLIP. If you are still inside `DeQA-Score`, return to the QCQC repository root first, then switch back to the QCQC environment:

```bash
conda activate QCQC
CUDA_VISIBLE_DEVICES=0 python ./src/generate_relevance_scores.py
```
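
A CLIP-style relevance score is typically the cosine similarity between an image embedding and a text embedding. The sketch below shows that computation on toy vectors standing in for real CLIP embeddings; the `cosine_similarity` helper is hypothetical, and `generate_relevance_scores.py` may scale or post-process its scores differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy 3-d vectors standing in for CLIP image and text embeddings.
image_embedding = [0.6, 0.8, 0.0]
matching_caption = [0.6, 0.8, 0.0]
unrelated_caption = [0.0, 0.0, 1.0]

print(cosine_similarity(image_embedding, matching_caption))
print(cosine_similarity(image_embedding, unrelated_caption))
```

The matching caption scores close to 1 and the orthogonal one scores 0, which is the ordering a relevance score needs to provide.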

---

## Training & Testing

### 1. Data tokenization

```bash
CUDA_VISIBLE_DEVICES=0 python ./src/run_tokenize.py
```

### 2. Model training

Multi-GPU example (8 GPUs):

```bash
torchrun --nproc_per_node=8 --master_port=1221 ./src/train.py \
    --lr 2e-3 --warmup 100 --epochs 20 --bs 256 \
    --logstep 100 --evalstep 100 --savestep 100 \
    --project_name GPT2_COCO --run_name prompt_gpt2coco
```

### 3. Model testing

```bash
bash ./src/inference.sh
```

### 4. Upload to Hugging Face

```bash
# Run from the parent directory so the local QCQC/ folder is uploaded
cd ..
hf upload Johnny050407/QCQC QCQC
```

---

## Pretrained Checkpoints and Processed Data

Pretrained checkpoints and preprocessed auxiliary data for MS-COCO are publicly available on Hugging Face:

https://huggingface.co/Johnny050407/QCQC/


---


## Results

Qualitative examples of quality-conditioned retrieval:

|  |  |
|:---:|:---:|
| ![Results 1](./imgs/results1.png) | ![Results 2](./imgs/results2.png) |
| *Quality-conditioned retrieval examples (1)* | *Quality-conditioned retrieval examples (2)* |

---

## Citation

If you use this code or these ideas in your work, please cite:

```bibtex
@inproceedings{JianglinQCQC2026,
  title     = {Seeing Through Words: Controlling Visual Retrieval Quality with Language Models},
  author    = {Jianglin Lu and Simon Jenni and Kushal Kafle and Jing Shi and Handong Zhao and Yun Fu},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=yOEmEXmbV8},
}
```

---

## Acknowledgement

We use the following open-source projects and thank the authors:

- [improved-aesthetic-predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor) for aesthetic quality evaluation
- [DeQA-Score](https://github.com/zhiyuanyou/DeQA-Score) for IQA score prediction