---
license: apache-2.0
---
<p align="center"> <h1 align="center">CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding</h1>
  <p align="center">
    <b> IEEE Transactions on Multimedia, 2023 </b>
    <br />
    <a href="https://scholar.google.com.hk/citations?user=4rTE4ogAAAAJ&hl=zh-CN&oi=sra"><strong>Linhui Xiao</strong></a>
    ·
    <a href="https://yangxs.ac.cn/home"><strong>Xiaoshan Yang</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=HBZ9plsAAAAJ&hl=zh-CN"><strong>Fang Peng</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=uIUfGxYAAAAJ&hl=zh-CN"><strong>Ming Yan</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=o_DllmIAAAAJ&hl=zh-CN"><strong>Yaowei Wang</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=hI9NRDkAAAAJ&hl=zh-CN"><strong>Changsheng Xu</strong></a>
  </p>

<p align="center">
  <a href='https://arxiv.org/pdf/2305.08685'>
    <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'>
  </a>
  <a href='https://ieeexplore.ieee.org/abstract/document/10269126'>
    <img src='https://img.shields.io/badge/IEEE TMM-blue' alt='IEEE TMM'>
  </a>

<br />

<p align="center"> <img src='docs/model.jpg' align="center" width="70%"> </p>

**<p align="center"> CLIP for Unsupervised and Fully Supervised Visual Grounding. </p>**

This repository is the official PyTorch implementation of the paper [**CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding**](https://ieeexplore.ieee.org/abstract/document/10269126).

If you have any questions, please feel free to open an issue or contact me by email: <xiaolinhui16@mails.ucas.ac.cn>.

<h3 align="left">
Links: <a href="https://ieeexplore.ieee.org/abstract/document/10269126">IEEE Transactions on Multimedia (2023)</a>,
<a href="https://arxiv.org/abs/2305.08685">ArXiv</a>,
[<a href="https://mp.weixin.qq.com/s/fwbamVr5P5Vcj5XheopQOg">Chinese interpretation</a>]
</h3>

**Please leave a <font color='orange'>STAR ⭐</font> if you like this project!**

## News

- 🔥🔥🔥 **Our grounding survey ([TPAMI](https://doi.org/10.1109/TPAMI.2025.3630635), [Arxiv](https://arxiv.org/abs/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)) was accepted by TPAMI on October 30, 2025!**
- :fire: **Update on 2024/12/28: We conducted a survey of visual grounding over the past decade, entitled "Towards Visual Grounding: A Survey" ([Paper](https://arxiv.org/pdf/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)). Comments are welcome!**
- :fire: **Update on 2024/09/26: Our advanced grounding work OneRef ([Paper](https://openreview.net/pdf?id=siPdcro6uD), [Code](https://github.com/linhuixiao/OneRef)) was accepted by NeurIPS 2024!**
- :fire: **Update on 2024/07/16: Our advanced grounding work HiVG ([Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) was accepted by ACM MM 2024!**
- **Update on 2024/04/20: We released an advanced version of CLIP-VG, namely HiVG ([paper](https://arxiv.org/abs/2404.13400), [github](https://github.com/linhuixiao/HiVG)).**
- **Update on 2023/12/13: All of the code, models, and datasets have been released.**
- **Update on 2023/09/25: Our paper was accepted by IEEE Transactions on Multimedia!**
- Update on 2023/05/18: Released the repository and training code.

## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@article{xiao2023clip,
  title={CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding},
  author={Xiao, Linhui and Yang, Xiaoshan and Peng, Fang and Yan, Ming and Wang, Yaowei and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2023},
  publisher={IEEE}
}
```

## Contents

1. [Introduction](#introduction)
2. [Usage](#usage)
3. [Results](#results)
4. [Contacts](#contacts)
5. [Acknowledgments](#acknowledgments)


## Highlight

- **CLIP for Visual Grounding.** A state-of-the-art baseline for unsupervised and fully supervised visual grounding with the CLIP model.
- **Single-source and multi-source pseudo-language labels.** The generation and usage of multi-source pseudo-labels.
- **Self-paced Curriculum Adapting algorithm.** A plugin-like algorithmic idea that can be applied to any pseudo-label scenario.


## TODO

- [x] Release model code and inference code.
- [x] Release unsupervised and fully supervised checkpoints.
- [x] Release the complete multi-source pseudo-language labels and their generation code.
- [x] Release the reliability measurement code.


## Introduction

In order to utilize vision-and-language pre-trained models to address the grounding problem, and to reasonably take advantage of pseudo-labels, we propose **CLIP-VG**, **a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels.**

We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity of the pseudo-language labels.

Our method outperforms the current state-of-the-art unsupervised method, Pseudo-Q, by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios. Furthermore, our approach even outperforms existing weakly supervised methods. In comparison with the fully supervised SOTA model QRNet, we achieve comparable results with only **7.7%** of its updated parameters, while obtaining significant speedups in both training and inference, up to **26.84×** and **7.41×**, respectively.

In summary, **the contributions of this work are four-fold**:

- As far as we know, **we are the first to adapt CLIP to realize unsupervised visual grounding.** Our method can transfer the cross-modal learning ability of CLIP to visual grounding with only a small training cost.
- **We are the first to introduce self-paced curriculum learning into unsupervised visual grounding.** Our proposed reliability measurement and single-source self-paced adapting can progressively enhance the CLIP-based visual grounding model by utilizing pseudo-labels in an easy-to-hard learning paradigm.
- **We are the first to propose a multi-source self-paced adapting algorithm, extending our method to access multiple sources of pseudo-labels,** which can flexibly improve the diversity of the language taxonomy.
- We conduct extensive experiments to evaluate the effectiveness of our approach. Results show that our method obtains significant improvements in the unsupervised setting and is also competitive in the fully supervised setting.

For more details, please refer to [our paper](https://arxiv.org/abs/2305.08685).

## Usage
### Dependencies
- Python 3.9.10
- PyTorch 1.9.0 + cu111 + cp39
- Check [requirements.txt](requirements.txt) for other dependencies.

Our model is **easy to deploy** in a variety of environments and has been successfully tested on multiple PyTorch versions.
If you are interested in the pseudo-language label generation module, more detailed instructions can be found in its [usage instructions](pseudo_label_generation_module/README.md).

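As a quick sanity check of the environment before training, the following minimal snippet (not part of the repository) verifies the installed PyTorch build and GPU visibility:

```python
# Quick environment sanity check (illustrative, not part of the repository).
import torch

print("PyTorch version:", torch.__version__)         # e.g., 1.9.0+cu111
print("CUDA available:", torch.cuda.is_available())  # should be True for GPU training
print("GPU count:", torch.cuda.device_count())
```
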
### Image Data Preparation
1. You can download the images from the original sources and place them in a folder on your disk, such as `$/path_to_image_data`:
- [MS COCO 2014](download_mscoco2014.sh) (for the RefCOCO, RefCOCO+, and RefCOCOg datasets; about 13.0 GB)
- [ReferItGame](https://drive.google.com/drive/folders/1D4shieeoKly6FswpdjSpaOrxJQNKTyTv)
- [Flickr30K Entities](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in)

We provide a script to download the MSCOCO 2014 dataset; just run it in a terminal with the following command:
```
bash download_mscoco2014.sh
```
Alternatively, you can follow the data preparation of TransVG, described in [GETTING_STARTED.md](https://github.com/djiajunustc/TransVG/blob/main/docs/GETTING_STARTED.md).

Only the image data in these datasets is used, and it is easy to find in similar visual grounding repositories, such as [TransVG](https://github.com/linhuixiao/TransVG).
Finally, the `$/path_to_image_data` folder will have the following structure:

```
|-- image_data
    |-- Flickr30k
        |-- flickr30k-images
    |-- other
        |-- images
    |-- mscoco
        |-- images
            |-- train2014
    |-- referit
        |-- images
```
- ```$/path_to_image_data/image_data/Flickr30k/flickr30k-images/```: Image data for the Flickr30K dataset. Please download it from this [link](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in); fill in the form to obtain the images.
- ```$/path_to_image_data/image_data/other/images/```: Image data for RefCOCO/RefCOCO+/RefCOCOg, i.e., mscoco2014.
- ```$/path_to_image_data/image_data/referit/images/```: Image data for ReferItGame.

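To confirm that your layout matches before moving on, a minimal hedged check such as the following can be used; the relative paths mirror the tree above, and `path_to_image_data` is a placeholder to adjust:

```python
# Minimal sketch: verify the expected image folders exist under your data root.
import os

path_to_image_data = "/path_to_image_data"  # adjust to your own disk folder
expected = [
    "image_data/Flickr30k/flickr30k-images",
    "image_data/other/images",
    "image_data/mscoco/images/train2014",
    "image_data/referit/images",
]
for rel in expected:
    full = os.path.join(path_to_image_data, rel)
    status = "ok" if os.path.isdir(full) else "MISSING"
    print(f"{status:8s} {full}")
```
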
## Text-Box Annotations / Pseudo-Label Preparation
The following are the **pseudo-language labels** generated by the pseudo-language label generation module in the unsupervised setting.

The **single-source scenario** includes pseudo-template labels derived from [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q).

The **multi-source scenario** includes pseudo-template labels, pseudo-relation labels, and pseudo-caption labels.
If you are interested in how they are generated, please refer to the [pseudo-language label generation module](pseudo_label_generation_module/README.md) for details.

Additionally, we provide the pseudo-labels selected by our single-source self-paced curriculum adapting (SSA) and multi-source self-paced curriculum adapting (MSA) algorithms, which can be used directly and conveniently in follow-up research.

The labels in the fully supervised scenario are consistent with previous works such as [TransVG](https://github.com/linhuixiao/TransVG).
It is worth noting that the test splits in the unsupervised scenario are exactly the same as those used in the fully supervised scenario.

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1G5VK8uNbAepyrQiI_DLQaN_02tYyOQq2/view?usp=drive_link">All six datasets</a>, 36.7 MB </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ekEWR-gYMMOrWPDB7R8lxZfDJbO8KGQt/view?usp=drive_link">All six datasets</a>, 31.4 MB </th>
</tr>
</table>


#### Multi-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1X9F5n7M0Zm4jhOIf1tjHj6bzMh6A1ZkE/view?usp=drive_link">All six datasets</a>, 144.7 MB; each dataset contains 3 sources of pseudo-labels </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1IBReTahxkOdKW_fKvplw3PGlI8PdHPUW/view?usp=drive_link">All six datasets</a>, 87.3 MB; each dataset contains 3 sources of pseudo-labels </th>
</tr>
</table>

### Fully supervised setting
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ituKSxWU5aXsGnXePd7twv7ImJoFiATc/view?usp=drive_link">All six datasets</a>, 89.0 MB </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1eSGr-sTqZ6z_Jy7APnJXNxegt2Q-pbqE/view?usp=drive_link">dataset</a> </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
</tr>
</table>

\* Since we observed a relatively clear performance increase on the RefCOCOg-u dataset in the fully supervised setting, we provide the data for this dataset after applying our SSA algorithm for curriculum selecting. Typically, using this filtered data yields an increase of roughly 1.0 in performance on both val-u and test-u.

Download the above annotations to a disk directory such as `$/path_to_split`; it will then have a directory structure similar to the following:

```
|-- /unsup_single_source/unsup_single_source_ssa/
|-- unsup_multi_source/unsup_multi_source_msa/full_sup_data
    ├── flickr
    │   ├── flickr_test.pth
    │   ├── flickr_train_pseudo.pth
    │   └── flickr_val.pth
    ├── gref
    │   ├── gref_train_pseudo.pth
    │   └── gref_val.pth
    ├── gref_umd
    │   ├── gref_umd_test.pth
    │   ├── gref_umd_train_pseudo.pth
    │   └── gref_umd_val.pth
    ├── referit
    │   ├── referit_test.pth
    │   ├── referit_train_pseudo.pth
    │   └── referit_val.pth
    ├── unc
    │   ├── unc_testA.pth
    │   ├── unc_testB.pth
    │   ├── unc_train_pseudo.pth
    │   └── unc_val.pth
    └── unc+
        ├── unc+_testA.pth
        ├── unc+_testB.pth
        ├── unc+_train_pseudo.pth
        └── unc+_val.pth
In the multi-source setting, there is an additional train_separate directory for further research:
    ├── train_separate
        ├── 1_unc+_train_pseudo_template_0_5.pth
        │── 2_unc+_train_pseudo_relation_0_5.pth
        └── 3_unc+_train_pseudo_caption_0_5.pth
```
\* The number at the end of each filename in the train_separate directory represents the reliability threshold defined in the paper.

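These split files are ordinary PyTorch pickles, so they can be inspected with `torch.load`. The sketch below only assumes each file holds a list of samples; the exact per-sample fields (image file, box, expression, etc.) should be read off the loaded data rather than taken from this example:

```python
# Minimal sketch: peek into a downloaded split file with torch.load.
import torch

split_file = "/path_to_split/full_sup_data/unc/unc_val.pth"  # adjust to your path
samples = torch.load(split_file)

print("number of samples:", len(samples))
print("first sample:", samples[0])  # inspect the actual field layout here
```
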
## Pre-trained Checkpoints

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/14b-lc7zNniy4EEcJoBdXY9gNv2d20yxU/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

\* Note that the performance of the provided model on the RefCOCOg val-g split in the unsupervised single-source scenario is approximately 2.0 points higher than reported in the paper, i.e., 54.16 → 56.46.

#### Multi-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1NU35UhAqx2YLehG5ni59rG4sWWaAaXGm/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

### Fully supervised setting

<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> separate </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1ZyQkPDBG33FPVlyVmzcCf5wD_Ct2hLr8/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/18M-Mmu_TaMLKrpdxksoroe3DIeHmmguN/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1E80T3nz6YETqYU8ZZImCuX76TM1OOxNp/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1bR5WIwaNiu0ShgEafw10BwC3boT-bLRW/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1g8U5Q-KUcGPVq1iKMyFui65lXn9Dwfws/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1Zm98Bf7ulKxXsi-UhoEGkyUtF8-v-Ohp/view?usp=drive_link">model</a> </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1vUC4swZM3ho_5olO--Y3PdKzMBW_iBJG/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

\* Note that the performance of the provided model on the RefCOCO+ dataset in the fully supervised setting is approximately 2.0 points higher than reported in the paper, i.e., (69.55, 77.33, 57.62) → (71.08, 79.17, 59.40).


## Training and Evaluation

You only need to change ```$/path_to_split```, ```$/path_to_image_data```, and ```$/path_to_output``` to your own directories to execute the following commands.
The first time you run them, it will take some time to download the CLIP model.

1. Training on RefCOCO in the unsupervised setting.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28887 --use_env train_clip_vg.py --num_workers 2 --epochs 110 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for training commands on other datasets.

2. Training on RefCOCO in the fully supervised setting. The only difference is the additional control flag ```--sup_type full```:
```
CUDA_VISIBLE_DEVICES=3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=5 --master_port 28887 --use_env train_clip_vg.py --num_workers 32 --epochs 120 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --sup_type full --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_full_sup.sh](train_and_eval_script/train_and_eval_full_sup.sh) for training commands on other datasets.

3. Evaluation on RefCOCO. The instructions are the same for the unsupervised and fully supervised settings.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for evaluation commands on other splits or datasets.

4. We strongly recommend using the following scripts to train or test on the different datasets and splits, which significantly reduces the manual workload:
```
bash train_and_eval_script/train_and_eval_unsup.sh
bash train_and_eval_script/train_and_eval_full_sup.sh
```

5. Curriculum reliability measurement (scoring) for the pseudo-language labels. You only need to change ```eval.py``` to ```eval_for_reliability_distribution.py``` and rename the training pseudo-labels to ```test.pth``` in the corresponding dataset during evaluation:
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval_for_reliability_distribution.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Besides, if you need to merge the pseudo train splits for further research, just run the following commands:
```
python ./pseudo_label_generation_module/utils/merge_file.py $/path_to_split/unsup_multi_source/unc/train_separate unc;
cp $/path_to_split/full_sup_data/unc/unc_val.pth $/path_to_split/unsup_multi_source/unc/train_separate/unc/unc_val.pth
```
Then you can construct a new pseudo-label training split.

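Conceptually, the curriculum selecting step keeps only pseudo-labels whose reliability score exceeds a threshold (the `_0_5` suffix in the train_separate filenames denotes 0.5). The following is a minimal sketch of that idea; the score file, its one-score-per-sample format, and the paths are hypothetical placeholders for illustration, not the repository's exact interface:

```python
# Minimal sketch of reliability-based curriculum selecting (illustrative only).
import torch

samples = torch.load("/path_to_split/unsup_multi_source/unc/train_separate/1_unc_train_pseudo_template_0_5.pth")
scores = torch.load("/path_to_output/reliability_scores.pth")  # hypothetical: one reliability score per sample

threshold = 0.5  # reliability threshold, as encoded in the *_0_5.pth filenames
selected = [s for s, r in zip(samples, scores) if r > threshold]

print(f"kept {len(selected)}/{len(samples)} pseudo-labels")
torch.save(selected, "/path_to_split/unc_train_pseudo_selected.pth")
```
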
## Results

<details open>
<summary><font size="4">
RefCOCO, RefCOCO+, and RefCOCOg datasets
</font></summary>
<img src="docs/refcoco.png" alt="RefCOCO/+/g results" width="100%">
</details>

<details open>
<summary><font size="4">
ReferIt and Flickr datasets
</font></summary>
<div align=center>
<img src="docs/referit.png" alt="ReferIt and Flickr results" width="50%"></div>
</details>

<details open>
<summary><font size="4">
Our model also has significant energy efficiency advantages.
</font></summary>
<div align=center>
<img src="docs/efficiency.jpg" alt="Efficiency comparison" width="85%"></div>
</details>

Compared to QRNet, we update **only 7.7%** of its parameters and achieve impressive training and inference speedups, up to **26.84×** and **7.41×** respectively, while also obtaining competitive results.


## Methods
<p align="center"> <img src='docs/algorithm.jpg' align="center" width="100%"> </p>

## Visualization
<p align="center"> <img src='docs/fig5.jpg' align="center" width="100%"> </p>

The figure presents histograms of single-source reliability (SR) and cross-source reliability (CR) for pseudo-language labels in the range (0.0, 1.0] with 1000 bins, where each bin counts the number of samples. It illustrates that different sources exhibit distinct distributions due to the specific quality and language taxonomy of their pseudo-language labels (e.g., Fig. 5-(a1)-(b2)-(c3)), while different reliability measures have varying discrimination abilities on the same source (e.g., Fig. 5-(a1)-(b1)-(c1)).

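For reference, a histogram like those in Fig. 5 can be reproduced from any list of per-sample reliability scores with a few lines of matplotlib; here `scores` is a synthetic placeholder for the values produced by the reliability measurement step:

```python
# Minimal sketch: histogram of reliability scores in (0.0, 1.0] with 1000 bins.
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.beta(2, 5, size=10_000)  # placeholder for real reliability scores

plt.hist(scores, bins=1000, range=(0.0, 1.0))
plt.xlabel("reliability score")
plt.ylabel("number of samples")
plt.title("Reliability distribution (cf. Fig. 5)")
plt.savefig("reliability_hist.png", dpi=150)
```
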
<p align="center"> <img src='docs/fig6.jpg' align="center" width="100%"> </p>

Before MSA is executed, the distributions of the pseudo-language labels and the ground-truth query labels are quite different; after MSA, the distribution discrepancy becomes significantly smaller. This shows that MSA can effectively select pseudo-labels that are more reliable, i.e., closer to the distribution of the ground-truth query labels.

<p align="center"> <img src='docs/sample1.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample2.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample3.jpg' align="center" width="100%"> </p>

Among the various types of unreliable pseudo-language labels, referring ambiguity is the most frequent, particularly in images that contain multiple objects of the same class. If future research aims to further enhance model performance, addressing this ambiguity is a critical issue.

## Contacts
Email: <xiaolinhui16@mails.ucas.ac.cn>.
Any kind of discussion is welcome!

## Acknowledgement

Our model builds on [CLIP](https://github.com/openai/CLIP), [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q), and [TransVG](https://github.com/linhuixiao/TransVG). Thanks for their great work!

We also thank the great previous work, including [DETR](https://github.com/facebookresearch/detr), [QRNet](https://github.com/LukeForeverYoung/QRNet), [M2](https://github.com/aimagelab/meshed-memory-transformer), [CLIPCap](https://github.com/rmokady/CLIP_prefix_caption), [RelTR](https://github.com/yrcong/RelTR), [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention), [ReSC](https://github.com/zyang-ur/ReSC), etc.

Thanks to [OpenAI](https://github.com/openai) for their awesome models.


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=linhuixiao/CLIP-VG&type=Date)](https://star-history.com/#linhuixiao/CLIP-VG&Date)