---
license: apache-2.0
---
<p align="center"> <h1 align="center">CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding</h1>
<p align="center">
<b> IEEE Transactions on Multimedia, 2023 </b>
<br />
<a href="https://scholar.google.com.hk/citations?user=4rTE4ogAAAAJ&hl=zh-CN&oi=sra"><strong>Linhui Xiao</strong></a>
·
<a href="https://yangxs.ac.cn/home"><strong>Xiaoshan Yang</strong></a>
·
<a href="https://scholar.google.com.hk/citations?user=HBZ9plsAAAAJ&hl=zh-CN"><strong>Fang Peng</strong></a>
·
<a href="https://scholar.google.com.hk/citations?user=uIUfGxYAAAAJ&hl=zh-CN"><strong>Ming Yan</strong></a>
·
<a href="https://scholar.google.com.hk/citations?user=o_DllmIAAAAJ&hl=zh-CN"><strong>Yaowei Wang</strong></a>
·
<a href="https://scholar.google.com.hk/citations?user=hI9NRDkAAAAJ&hl=zh-CN"><strong>Changsheng Xu</strong></a>
</p>

<p align="center">
<a href='https://arxiv.org/pdf/2305.08685'>
<img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'>
</a>
<a href='https://ieeexplore.ieee.org/abstract/document/10269126'>
<img src='https://img.shields.io/badge/IEEE TMM-blue' alt='IEEE TMM'>
</a>

<br />

<p align="center"> <img src='docs/model.jpg' align="center" width="70%"> </p>

**<p align="center"> CLIP for Unsupervised and Fully Supervised Visual Grounding. </p>**

This repository is the official PyTorch implementation of the paper [**CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding**](https://ieeexplore.ieee.org/abstract/document/10269126).

If you have any questions, please feel free to open an issue or contact me via email: <xiaolinhui16@mails.ucas.ac.cn>.

<h3 align="left">
Links: <a href="https://ieeexplore.ieee.org/abstract/document/10269126">IEEE Transactions on Multimedia (2023)</a>,
<a href="https://arxiv.org/abs/2305.08685">arXiv</a>,
[<a href="https://mp.weixin.qq.com/s/fwbamVr5P5Vcj5XheopQOg">explainer in Chinese</a>]
</h3>

**Please leave a <font color='orange'>STAR ⭐</font> if you like this project!**

## News

- 🔥🔥🔥 **Our grounding survey ([TPAMI](https://doi.org/10.1109/TPAMI.2025.3630635), [arXiv](https://arxiv.org/abs/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)) was accepted by TPAMI on October 30, 2025!**

- :fire: **Update on 2024/12/28: We conducted a survey of visual grounding over the past decade, entitled "Towards Visual Grounding: A Survey" ([Paper](https://arxiv.org/pdf/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)). Comments are welcome!**
- :fire: **Update on 2024/09/26: Our advanced grounding work OneRef ([Paper](https://openreview.net/pdf?id=siPdcro6uD), [Code](https://github.com/linhuixiao/OneRef)) was accepted by the top-tier conference NeurIPS 2024!**
- :fire: **Update on 2024/07/16: Our advanced grounding work HiVG ([Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) was accepted by the top-tier conference ACM MM 2024!**
- **Update on 2024/04/20: We released an advanced version of CLIP-VG, namely HiVG ([paper](https://arxiv.org/abs/2404.13400), [github](https://github.com/linhuixiao/HiVG)).**
- **Update on 2023/12/13: All of the code, models, and datasets have been released.**
- **Update on 2023/09/25: Our paper was accepted by the top-tier journal IEEE Transactions on Multimedia (2023)!**
- Update on 2023/05/18: Released the repository and training code.



## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@article{xiao2023clip,
  title={CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding},
  author={Xiao, Linhui and Yang, Xiaoshan and Peng, Fang and Yan, Ming and Wang, Yaowei and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2023},
  publisher={IEEE}
}
```

## Contents

1. [Introduction](#introduction)
2. [Usage](#usage)
3. [Results](#results)
4. [Contacts](#contacts)
5. [Acknowledgement](#acknowledgement)


## Highlight
- **CLIP for Visual Grounding.** A state-of-the-art baseline for unsupervised and fully supervised visual grounding with the CLIP model.
- **Single-source and multi-source pseudo-language labels.** The generation and usage of multi-source pseudo-labels.
- **Self-paced Curriculum Adapting algorithm.** A plug-in-like algorithmic idea that can be applied to any pseudo-label scenario.


## TODO
- [x] Release model code and inference code.
- [x] Release unsupervised and fully supervised checkpoints.
- [x] Release the complete multi-source pseudo-language labels and their generation code.
- [x] Release the reliability measurement code.




## Introduction

In order to utilize vision and language pre-trained models to address the grounding problem, and to reasonably take
advantage of pseudo-labels, we propose **CLIP-VG**, **a novel method that conducts self-paced curriculum adapting of CLIP
with pseudo-language labels.**

We propose a simple yet efficient end-to-end network architecture to realize the transfer
of CLIP to visual grounding. Based on the CLIP-based architecture, we further propose single-source and
multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an
optimal model, thereby achieving a balance between reliability and diversity of the pseudo-language labels.

Our method outperforms the current state-of-the-art unsupervised method, Pseudo-Q, by a significant margin on the RefCOCO/+/g datasets in both
single-source and multi-source scenarios. Furthermore, our approach even outperforms existing weakly supervised methods.
In comparison with the fully supervised SOTA model QRNet, we achieve comparable results with only **7.7%** of its
updated parameters, while obtaining significant speedups in both training and inference, up to **26.84×** and **7.41×**, respectively.

In summary, **the contributions of this work are four-fold**:

- As far as we know, **we are the first to adapt CLIP to realize unsupervised visual grounding.** Our method can
transfer the cross-modal learning ability of CLIP to visual grounding with only a small training cost.
- **We are the first to introduce self-paced curriculum learning into unsupervised visual grounding.** Our proposed reliability measurement and single-source self-paced adapting
can progressively enhance the CLIP-based visual grounding model by utilizing pseudo-labels in an easy-to-hard
learning paradigm.
- **We are the first to propose a multi-source self-paced adapting algorithm, which extends our method to access multiple
sources of pseudo-labels** and can flexibly improve the diversity of the language taxonomy.
- We conduct extensive experiments to evaluate the effectiveness of our approach. Results show that our method
obtains significant improvements in the unsupervised setting and is also competitive in the fully supervised setting.

For more details, please refer to [our paper](https://arxiv.org/abs/2305.08685).

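The easy-to-hard idea behind the self-paced curriculum adapting can be sketched in a few lines. This is only a schematic illustration of the loop, not the repository's actual implementation; `score_fn`, `train_fn`, and the decreasing threshold schedule are hypothetical placeholders:

```python
def self_paced_adapting(model, pseudo_labels, thresholds, train_fn, score_fn):
    """Schematic single-source self-paced curriculum adapting (SSA) loop.

    thresholds: a decreasing sequence of reliability thresholds, e.g.
    [0.9, 0.7, 0.5], so training proceeds from easy (highly reliable)
    pseudo-labels to harder ones.
    """
    for tau in thresholds:
        # Score every pseudo-label with the current model ...
        scored = [(score_fn(model, x), x) for x in pseudo_labels]
        # ... keep only the labels deemed reliable at this curriculum step ...
        selected = [x for (s, x) in scored if s >= tau]
        # ... and adapt the model on the selected subset.
        model = train_fn(model, selected)
    return model
```

In each round the reliability threshold is lowered, so the training set grows from a small reliable core toward the full (noisier) pseudo-label pool.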
## Usage
### Dependencies
- Python 3.9.10
- PyTorch 1.9.0 + cu111 + cp39
- Check [requirements.txt](requirements.txt) for other dependencies.

Our model is **easy to deploy** in a variety of environments and has been successfully tested on multiple PyTorch versions.
If you are interested in the pseudo-language label generation module, detailed instructions can be found in its [usage instructions](pseudo_label_generation_module/README.md).


### Image Data Preparation
1. You can download the images from the original sources and place them in a folder on your disk, such as `$/path_to_image_data`:
- [MS COCO 2014](download_mscoco2014.sh) (for the RefCOCO, RefCOCO+, and RefCOCOg datasets, about 13.0 GB)
- [ReferItGame](https://drive.google.com/drive/folders/1D4shieeoKly6FswpdjSpaOrxJQNKTyTv)
- [Flickr30K Entities](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in)

We provide a script to download the mscoco2014 dataset; you only need to run it in a terminal with the following command:
```
bash download_mscoco2014.sh
```
Alternatively, you can follow the data preparation of TransVG, which can be found in [GETTING_STARTED.md](https://github.com/djiajunustc/TransVG/blob/main/docs/GETTING_STARTED.md).

Only the image data of these datasets is used, and it is easy to find in similar visual grounding repositories, such as [TransVG](https://github.com/linhuixiao/TransVG).
Finally, the `$/path_to_image_data` folder will have the following structure:

```
|-- image_data
    |-- Flickr30k
        |-- flickr30k-images
    |-- other
        |-- images
    |-- mscoco
        |-- images
            |-- train2014
    |-- referit
        |-- images
```
- ```$/path_to_image_data/image_data/Flickr30k/flickr30k-images/```: Image data for the Flickr30K dataset; please download it from this [link](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in). Fill in the form and download the images.
- ```$/path_to_image_data/image_data/other/images/```: Image data for RefCOCO/RefCOCO+/RefCOCOg, i.e., mscoco2014.
- ```$/path_to_image_data/image_data/referit/images/```: Image data for ReferItGame.

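To catch path mistakes before launching training, the expected layout above can be verified with a few lines of plain Python (an optional convenience sketch, not part of the repository):

```python
import os

# Sub-directories expected under $/path_to_image_data, per the layout above.
EXPECTED_DIRS = [
    "image_data/Flickr30k/flickr30k-images",
    "image_data/other/images",
    "image_data/mscoco/images/train2014",
    "image_data/referit/images",
]

def check_image_data(root):
    """Return the expected sub-directories (relative to root) that are missing."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
```

Run `check_image_data("/path_to_image_data")` after downloading; an empty list means every dataset folder is in place.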
## Text-Box Annotations / Pseudo-Label Preparation
The following are the **pseudo-language labels** generated by the pseudo-language label generation module in the unsupervised setting.

The **single-source scenario** includes pseudo-template labels derived from [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q).

The **multi-source scenario** includes pseudo-template labels, pseudo-relation labels, and pseudo-caption labels.
If interested, please refer to the [pseudo-language label generation module](pseudo_label_generation_module/README.md) for the specific details
of how they are generated.

Additionally, we provide the pseudo-labels selected by our single-source self-paced curriculum adapting (SSA)
and multi-source self-paced curriculum adapting (MSA) algorithms, which can be conveniently and directly used by subsequent researchers.

The labels in the fully supervised scenario are consistent with previous works such as [TransVG](https://github.com/linhuixiao/TransVG).
It is worth noting that the test splits in the unsupervised scenario are exactly the same as those used in the fully supervised scenario.

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1G5VK8uNbAepyrQiI_DLQaN_02tYyOQq2/view?usp=drive_link">All of six datasets</a>, 36.7MB </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ekEWR-gYMMOrWPDB7R8lxZfDJbO8KGQt/view?usp=drive_link">All of six datasets</a>, 31.4MB </th>
</tr>
</table>


#### Multi-source scenario
<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1X9F5n7M0Zm4jhOIf1tjHj6bzMh6A1ZkE/view?usp=drive_link">All of six datasets</a>, 144.7MB, each dataset contains 3 sources of pseudo-labels </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1IBReTahxkOdKW_fKvplw3PGlI8PdHPUW/view?usp=drive_link">All of six datasets</a>, 87.3MB, each dataset contains 3 sources of pseudo-labels </th>
</tr>
</table>

### Fully supervised setting
<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ituKSxWU5aXsGnXePd7twv7ImJoFiATc/view?usp=drive_link">All of six datasets</a>, 89.0MB </th>
</tr>
<tr>
<th style="text-align:center" > with curriculum selecting </th>
<th style="text-align:center" > - </th>
<th style="text-align:center" > - </th>
<th style="text-align:center" > - </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1eSGr-sTqZ6z_Jy7APnJXNxegt2Q-pbqE/view?usp=drive_link">dataset</a> </th>
<th style="text-align:center" > - </th>
<th style="text-align:center" > - </th>
</tr>
</table>

\* Since we observed a relatively clear performance increase on the RefCOCOg-u dataset in the fully supervised setting,
we provide data for this dataset after applying our SSA algorithm for curriculum selecting. Typically, using this
filtered data yields an increase of roughly 1.0 point on both val-u and test-u.

Download the above annotations to a disk directory such as `$/path_to_split`, which will then have the following structure:

```
|-- /unsup_single_source/unsup_single_source_ssa/
|-- unsup_multi_source/unsup_multi_source_msa/full_sup_data
├── flickr
│   ├── flickr_test.pth
│   ├── flickr_train_pseudo.pth
│   └── flickr_val.pth
├── gref
│   ├── gref_train_pseudo.pth
│   └── gref_val.pth
├── gref_umd
│   ├── gref_umd_test.pth
│   ├── gref_umd_train_pseudo.pth
│   └── gref_umd_val.pth
├── referit
│   ├── referit_test.pth
│   ├── referit_train_pseudo.pth
│   └── referit_val.pth
├── unc
│   ├── unc_testA.pth
│   ├── unc_testB.pth
│   ├── unc_train_pseudo.pth
│   └── unc_val.pth
└── unc+
    ├── unc+_testA.pth
    ├── unc+_testB.pth
    ├── unc+_train_pseudo.pth
    └── unc+_val.pth
```
In the multi-source setting, there is an additional `train_separate` directory for further research purposes:
```
├── train_separate
    ├── 1_unc+_train_pseudo_template_0_5.pth
    ├── 2_unc+_train_pseudo_relation_0_5.pth
    └── 3_unc+_train_pseudo_caption_0_5.pth
```
\* The number at the end of each filename in the `train_separate` directory represents the reliability threshold as defined in the paper.

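For reference, the naming convention of the `train_separate` files can be decoded like this. The parsing helper below is purely illustrative and infers the convention from the three example filenames above; it is not part of the repository:

```python
import re

def parse_separate_filename(name):
    """Parse a train_separate filename such as
    '1_unc+_train_pseudo_template_0_5.pth'.

    Returns (source_index, dataset, source_type, reliability_threshold),
    where the trailing digits encode the threshold, e.g. '0_5' -> 0.5.
    """
    m = re.match(r"(\d+)_(.+)_train_pseudo_([a-z]+)_(\d+)_(\d+)\.pth$", name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    idx, dataset, source, t_int, t_frac = m.groups()
    return int(idx), dataset, source, float(f"{t_int}.{t_frac}")
```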
## Pre-trained Checkpoints

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/14b-lc7zNniy4EEcJoBdXY9gNv2d20yxU/view?usp=drive_link">All of six models</a>, 3.0GB </th>
</tr>
</table>

\* Note that the performance of our provided model on the RefCOCOg val-g split in the unsupervised single-source scenario is about 2.0 points higher
than reported in the paper, i.e., (54.16) --> (56.46).

#### Multi-source scenario
<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1NU35UhAqx2YLehG5ni59rG4sWWaAaXGm/view?usp=drive_link">All of six models</a>, 3.0GB </th>
</tr>
</table>

### Fully supervised setting

<table>
<tr>
<th style="text-align:center" > Datasets </th>
<th style="text-align:center" > RefCOCO </th>
<th style="text-align:center" > RefCOCO+ </th>
<th style="text-align:center" > RefCOCOg-g </th>
<th style="text-align:center" > RefCOCOg-u </th>
<th style="text-align:center" > ReferIt </th>
<th style="text-align:center" > Flickr </th>
</tr>
<tr>
<th style="text-align:center" > separate </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1ZyQkPDBG33FPVlyVmzcCf5wD_Ct2hLr8/view?usp=drive_link">model</a> </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/18M-Mmu_TaMLKrpdxksoroe3DIeHmmguN/view?usp=drive_link">model</a> </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1E80T3nz6YETqYU8ZZImCuX76TM1OOxNp/view?usp=drive_link">model</a> </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1bR5WIwaNiu0ShgEafw10BwC3boT-bLRW/view?usp=drive_link">model</a> </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1g8U5Q-KUcGPVq1iKMyFui65lXn9Dwfws/view?usp=drive_link">model</a> </th>
<th style="text-align:center" > <a href="https://drive.google.com/file/d/1Zm98Bf7ulKxXsi-UhoEGkyUtF8-v-Ohp/view?usp=drive_link">model</a> </th>
</tr>
<tr>
<th style="text-align:center" rowspan="1"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1vUC4swZM3ho_5olO--Y3PdKzMBW_iBJG/view?usp=drive_link">All of six models</a>, 3.0GB </th>
</tr>
</table>

\* Note that the performance of our provided model on the RefCOCO+ dataset in the fully supervised setting is about 2.0 points higher
than reported in the paper, i.e., (69.55, 77.33, 57.62) --> (71.08, 79.17, 59.40).



## Training and Evaluation

You only need to change ```$/path_to_split```, ```$/path_to_image_data```, and ```$/path_to_output``` to your own directories to execute the following commands.
The first time you run the command below, it will take some time for the repository to download the CLIP model.

1. Training on RefCOCO in the unsupervised setting.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28887 --use_env train_clip_vg.py --num_workers 2 --epochs 110 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for training commands on other datasets.

2. Training on RefCOCO in the fully supervised setting.
The only difference is an additional control flag: ```--sup_type full```
```
CUDA_VISIBLE_DEVICES=3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=5 --master_port 28887 --use_env train_clip_vg.py --num_workers 32 --epochs 120 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --sup_type full --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_full_sup.sh](train_and_eval_script/train_and_eval_full_sup.sh) for training commands on other datasets.

3. Evaluation on RefCOCO. The instructions are the same for the unsupervised and fully supervised settings.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for evaluation commands on other splits or datasets.

4. We strongly recommend using the following scripts for training or testing on the different datasets and splits,
which will significantly reduce the manual effort:
```
bash train_and_eval_script/train_and_eval_unsup.sh
bash train_and_eval_script/train_and_eval_full_sup.sh
```

5. Curriculum reliability measurement or scoring for the pseudo-language labels:

You only need to change ```eval.py``` to ```eval_for_reliability_distribution.py``` and rename the training pseudo-labels to ```test.pth```
in the corresponding dataset during evaluation:
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval_for_reliability_distribution.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Besides, if you need to merge the pseudo train splits for further research, just run the following commands:
```
python ./pseudo_label_generation_module/utils/merge_file.py $/path_to_split/unsup_multi_source/unc/train_separate unc;
cp $/path_to_split/full_sup_data/unc/unc_val.pth $/path_to_split/unsup_multi_source/unc/train_separate/unc/unc_val.pth
```
Then, you can construct a new pseudo-label training split.

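Conceptually, the merge step concatenates the per-source pseudo-label lists into a single training split. A minimal sketch of that idea, assuming each `.pth` file stores a Python list of samples (an assumption about the data format; in the repository the lists would come from `torch.load` and the result would be written back with `torch.save`):

```python
def merge_pseudo_splits(splits):
    """Concatenate several per-source pseudo-label lists into one training split.

    splits: an iterable of sample lists, e.g. the template, relation, and
    caption sources loaded from the train_separate directory.
    """
    merged = []
    for samples in splits:
        merged.extend(samples)  # keep source order: template, relation, caption
    return merged
```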
## Results

<details open>
<summary><font size="4">
RefCOCO, RefCOCO+, and RefCOCOg datasets
</font></summary>
<img src="docs/refcoco.png" alt="RefCOCO results" width="100%">
</details>

<details open>
<summary><font size="4">
ReferIt and Flickr datasets
</font></summary>
<div align=center>
<img src="docs/referit.png" alt="ReferIt and Flickr results" width="50%"></div>
</details>

<details open>
<summary><font size="4">
Our model also has significant energy-efficiency advantages.
</font></summary>
<div align=center>
<img src="docs/efficiency.jpg" alt="Efficiency comparison" width="85%"></div>
</details>

Compared to QRNet, we update **only 7.7%** of its parameters and achieve impressive training and inference speedups,
up to **26.84×** and **7.41×**, respectively, while also obtaining competitive results.


## Methods
<p align="center"> <img src='docs/algorithm.jpg' align="center" width="100%"> </p>

## Visualization
<p align="center"> <img src='docs/fig5.jpg' align="center" width="100%"> </p>

The figure presents the histograms of single-source reliability (SR) and cross-source reliability (CR) for pseudo-language
labels in the range (0.0, 1.0] with 1000 bins, where each bin counts the number of samples. The figure illustrates
that different sources exhibit distinct distributions due to the specific quality and language taxonomy of their pseudo-language
labels (e.g., Fig. 5-(a1)-(b2)-(c3)), while different reliability measures have varying discrimination abilities on the
same source (e.g., Fig. 5-(a1)-(b1)-(c1)).

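As a toy illustration of how such a reliability histogram is built, the binning can be written in plain Python (independent of the repository code):

```python
import math

def reliability_histogram(scores, num_bins=1000):
    """Bin reliability scores from (0.0, 1.0] into fixed-width bins.

    Bin i counts scores s with i/num_bins < s <= (i+1)/num_bins,
    mirroring the 1000-bin histograms shown in the figure.
    """
    counts = [0] * num_bins
    for s in scores:
        if not 0.0 < s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # ceil(s * num_bins) - 1 maps a score to its left-open bin index
        counts[math.ceil(s * num_bins) - 1] += 1
    return counts
```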
<p align="center"> <img src='docs/fig6.jpg' align="center" width="100%"> </p>
Before the execution of MSA, the distributions of the pseudo-language labels and the ground-truth query labels are quite
different, but after the execution of MSA, the distribution discrepancy becomes significantly smaller. This shows that
MSA can effectively select pseudo-labels that are more reliable, i.e., closer to the distribution of the ground-truth query labels.

<p align="center"> <img src='docs/sample1.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample2.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample3.jpg' align="center" width="100%"> </p>
Among the various types of unreliable pseudo-language labels, referential ambiguity is the most frequent, particularly in
images containing several objects of the same class. If future research aims to further enhance model performance, addressing this
ambiguity is a critical issue.

## Contacts
Email: <xiaolinhui16@mails.ucas.ac.cn>.
Any kind of discussion is welcome!

## Acknowledgement

Our model builds upon [CLIP](https://github.com/openai/CLIP), [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q), and [TransVG](https://github.com/linhuixiao/TransVG). Thanks for their great work!

We also thank the great previous works, including [DETR](https://github.com/facebookresearch/detr), [QRNet](https://github.com/LukeForeverYoung/QRNet), [M2](https://github.com/aimagelab/meshed-memory-transformer), [CLIPCap](https://github.com/rmokady/CLIP_prefix_caption), [RelTR](https://github.com/yrcong/RelTR), [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention), [ReSC](https://github.com/zyang-ur/ReSC), etc.

Thanks to [OpenAI](https://github.com/openai) for their awesome models.


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=linhuixiao/CLIP-VG&type=Date)](https://star-history.com/#linhuixiao/CLIP-VG&Date)