---
license: apache-2.0
---
<p align="center"> <h1 align="center">CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding</h1>
  <p align="center">
    <b> IEEE Transactions on Multimedia, 2023 </b>
    <br />
    <a href="https://scholar.google.com.hk/citations?user=4rTE4ogAAAAJ&hl=zh-CN&oi=sra"><strong> Linhui Xiao </strong></a>
    ·
    <a href="https://yangxs.ac.cn/home"><strong>Xiaoshan Yang </strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=HBZ9plsAAAAJ&hl=zh-CN"><strong>Fang Peng </strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=uIUfGxYAAAAJ&hl=zh-CN"><strong>Ming Yan </strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=o_DllmIAAAAJ&hl=zh-CN"><strong>Yaowei Wang </strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=hI9NRDkAAAAJ&hl=zh-CN"><strong>Changsheng Xu</strong></a>
  </p>

  <p align="center">
    <a href='https://arxiv.org/pdf/2305.08685'>
      <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'>
    </a>
    <a href='https://ieeexplore.ieee.org/abstract/document/10269126'>
      <img src='https://img.shields.io/badge/IEEE TMM-blue' alt='IEEE TMM'>
    </a>
  </p>

<br />


<p align="center"> <img src='docs/model.jpg' align="center" width="70%"> </p>

**<p align="center"> CLIP for Unsupervised and Fully Supervised Visual Grounding.  </p>**

This repository is the official PyTorch implementation for the paper [**CLIP-VG: Self-paced Curriculum Adapting of CLIP 
for Visual Grounding**](https://ieeexplore.ieee.org/abstract/document/10269126). 

If you have any questions, please feel free to open an issue or contact me by email: <xiaolinhui16@mails.ucas.ac.cn>.

<h3 align="left">
Links: <a href="https://ieeexplore.ieee.org/abstract/document/10269126">IEEE Transactions on Multimedia (2023)</a>,   
<a href="https://arxiv.org/abs/2305.08685">ArXiv</a>,
[<a href="https://mp.weixin.qq.com/s/fwbamVr5P5Vcj5XheopQOg">Chinese Interpretation</a>]
</h3>

**Please leave a <font color='orange'>STAR ⭐</font> if you like this project!**

## News
 
- 🔥🔥🔥 **Our grounding survey ([TPAMI](https://doi.org/10.1109/TPAMI.2025.3630635), [Arxiv](https://arxiv.org/abs/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)) was accepted by TPAMI on October 30, 2025!!!**

- :fire: **Update on 2024/12/28: We conducted a survey of visual grounding over the past decade, entitled "Towards Visual Grounding: A Survey" ([Paper](https://arxiv.org/pdf/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)). Comments are welcome!!!**
- :fire: **Update on 2024/09/26: Our advanced grounding work OneRef ([Paper](https://openreview.net/pdf?id=siPdcro6uD), [Code](https://github.com/linhuixiao/OneRef)) was accepted by the top conference NeurIPS 2024 in October 2024!**
- :fire: **Update on 2024/07/16: Our advanced grounding work HiVG ([Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) was accepted by the top conference ACM MM 2024 in July 2024!**
- **Update on 2024/04/20: We released an advanced version of CLIP-VG, namely HiVG ([paper](https://arxiv.org/abs/2404.13400), [github](https://github.com/linhuixiao/HiVG)).**
- **Update on 2023/12/13: All of the code, models, and datasets have been released.**
- **Update on 2023/9/25: Our paper has been accepted by the top journal IEEE Transactions on Multimedia (2023)!**
- Update on 2023/05/18: Released the repository and training code.


## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.   

```bibtex
@article{xiao2023clip,
  title={CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding},
  author={Xiao, Linhui and Yang, Xiaoshan and Peng, Fang and Yan, Ming and Wang, Yaowei and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2023},
  publisher={IEEE}
}
```

## Contents

1. [Introduction](#introduction)
2. [Usage](#usage)
3. [Results](#results)
4. [Contacts](#contacts)
5. [Acknowledgments](#acknowledgments)


## Highlight
- **CLIP for Visual Grounding.** A state-of-the-art baseline for unsupervised and fully supervised visual grounding with the CLIP model.
- **Single-source and multi-source pseudo-language labels.** The generation and usage of both single-source and multi-source pseudo-labels.
- **Self-paced Curriculum Adapting Algorithm.** A plugin-like algorithmic idea that can be applied to any pseudo-label scenario.


## TODO
- [x] Release model code and inference code.
- [x] Release unsupervised and fully supervised checkpoints.
- [x] Release the complete multi-source pseudo-language labels and their generation code.
- [x] Release the reliability measurement code.




## Introduction

To leverage vision-language pre-trained models for the grounding problem and make reasonable use of pseudo-labels, we propose **CLIP-VG**, **a novel method that conducts self-paced curriculum adapting of CLIP 
with pseudo-language labels.** 

We propose a simple yet efficient end-to-end network architecture to realize the transfer 
of CLIP to visual grounding. Based on this CLIP-based architecture, we further propose single-source and 
multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an 
optimal model, thereby achieving a balance between reliability and diversity of the pseudo-language labels. 

Our method outperforms the current state-of-the-art unsupervised method, Pseudo-Q, by a significant margin on the RefCOCO/+/g datasets in both 
single-source and multi-source scenarios. Furthermore, our approach even outperforms existing weakly supervised methods. 
In comparison with the fully supervised SOTA model QRNet, we achieve comparable results with only **7.7%** of its 
updated parameters, while obtaining significant speedups in both training and inference, up to **26.84×** and **7.41×**, respectively.

In summary, **the contributions of this work are four-fold**:

- As far as we know, **we are the first to adapt CLIP to realize unsupervised visual grounding.** Our method can
transfer the cross-modal learning ability of CLIP to visual grounding with only a small training cost.
- **We are the first to introduce self-paced curriculum learning into unsupervised visual grounding.** Our proposed reliability measurement and single-source self-paced adapting
can progressively enhance the CLIP-based visual grounding model by utilizing pseudo-labels in an easy-to-hard
learning paradigm.
- **We are the first to propose a multi-source self-paced adapting algorithm that extends our method to multiple
sources of pseudo-labels,** which can flexibly improve the diversity of the language taxonomy.
- We conduct extensive experiments to evaluate the effectiveness of our approach. Results show that our method
obtains significant improvements in the unsupervised setting and is also competitive in the fully supervised setting.

For more details, please refer to [our paper](https://arxiv.org/abs/2305.08685).
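
At a high level, the single-source self-paced adapting loop alternates between adapting the grounding model on the currently selected pseudo-labels and re-scoring all pseudo-labels with the adapted model under an easy-to-hard threshold schedule. The following is only a conceptual sketch; `adapt_clip` and `measure_reliability` are placeholder stand-ins for illustration, not functions from this repository:

```python
# Conceptual sketch of single-source self-paced curriculum adapting (SSA).
# The two helpers are dummy stand-ins that only illustrate the control flow:
# adapt -> re-score -> select more reliable pseudo-labels.

def adapt_clip(model, selected_labels):
    """Placeholder for fine-tuning the CLIP-based grounding model on `selected_labels`."""
    return model

def measure_reliability(model, pseudo_labels):
    """Placeholder for scoring each pseudo-label with the current model (values in (0, 1])."""
    return [1.0 for _ in pseudo_labels]

def self_paced_adapting(model, pseudo_labels, thresholds):
    selected = list(pseudo_labels)              # start from all pseudo-language labels
    for tau in thresholds:                      # reliability threshold schedule (see the paper)
        model = adapt_clip(model, selected)     # adapt on the currently selected labels
        scores = measure_reliability(model, pseudo_labels)
        selected = [x for x, s in zip(pseudo_labels, scores) if s >= tau]
    return model
```

The multi-source variant (MSA) additionally uses cross-source reliability to balance the different pseudo-label sources; please see the paper for the exact formulation.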

## Usage
### Dependencies
- Python 3.9.10
- PyTorch 1.9.0 + cu111 + cp39
- Check [requirements.txt](requirements.txt) for other dependencies. 

Our model is **easy to deploy** in a variety of environments and has been successfully tested on multiple PyTorch versions.
If you are interested in the pseudo-language label generation module, more detailed instructions can be found in the [usage instructions](pseudo_label_generation_module/README.md).


### Image Data Preparation
1. You can download the images from their original sources and place them in a folder on your disk, such as `$/path_to_image_data`:
- [MS COCO 2014](download_mscoco2014.sh) (for the RefCOCO, RefCOCO+, and RefCOCOg datasets, about 13.0 GB) 
- [ReferItGame](https://drive.google.com/drive/folders/1D4shieeoKly6FswpdjSpaOrxJQNKTyTv)
- [Flickr30K Entities](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in)

   We provide a script to download the MS COCO 2014 dataset; just run it in a terminal with the following command:
   ```
   bash download_mscoco2014.sh
   ```
   Or you can also follow the data preparation of TransVG, which can be found in [GETTING_STARTED.md](https://github.com/djiajunustc/TransVG/blob/main/docs/GETTING_STARTED.md).

Only the image data in these datasets is used, and it can easily be found in similar visual grounding repositories, such as [TransVG](https://github.com/linhuixiao/TransVG). 
Finally, the `$/path_to_image_data` folder will have the following structure (a minimal layout check is sketched after the path list below):

```angular2html
|-- image_data
   |-- Flickr30k
      |-- flickr30k-images
   |-- other
      |-- images
        |-- mscoco
            |-- images
                |-- train2014
   |-- referit
      |-- images
```
- ```$/path_to_image_data/image_data/Flickr30k/flickr30k-images/```: Image data for the Flickr30K dataset, please download from this [link](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in). Fill in the form and download the images.
- ```$/path_to_image_data/image_data/other/images/```: Image data for RefCOCO/RefCOCO+/RefCOCOg, i.e., mscoco2014. 
- ```$/path_to_image_data/image_data/referit/images/```: Image data for ReferItGame.
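
The following minimal sketch (a hypothetical helper, not part of this repository) checks that the expected image folders exist; adjust `IMAGE_ROOT` to your own `$/path_to_image_data`:

```python
# check_image_data.py -- hypothetical helper for sanity-checking the image folder layout.
import os

IMAGE_ROOT = "/path_to_image_data/image_data"  # adjust to your own $/path_to_image_data

EXPECTED_DIRS = [
    "Flickr30k/flickr30k-images",             # Flickr30K Entities
    "other/images/mscoco/images/train2014",   # RefCOCO / RefCOCO+ / RefCOCOg (MS COCO 2014)
    "referit/images",                         # ReferItGame
]

for rel_path in EXPECTED_DIRS:
    path = os.path.join(IMAGE_ROOT, rel_path)
    num_files = len(os.listdir(path)) if os.path.isdir(path) else 0
    status = "OK" if num_files > 0 else "MISSING or EMPTY"
    print(f"[{status}] {path} ({num_files} entries)")
```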

## Text-Box Annotations / Pseudo-Label Preparation
The following are the **pseudo-language labels** generated by the pseudo-language label generation module in the unsupervised setting.

The **single-source scenario** includes pseudo-template labels derived from [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q). 

The **multi-source scenario** includes pseudo-template labels, pseudo-relation labels, and pseudo-caption labels. 
If interested, please refer to the [pseudo-language label generation module](pseudo_label_generation_module/README.md) for details 
on how they are generated.

Additionally, we also provide the pseudo-labels selected by our single-source self-paced curriculum adapting (SSA) 
and multi-source self-paced curriculum adapting (MSA) algorithms, which can be conveniently and directly used by other researchers. 

The labels in the fully supervised scenario are consistent with previous works such as [TransVG](https://github.com/linhuixiao/TransVG).
It is worth noting that the test splits in the unsupervised scenario are exactly the same as those used in the fully supervised scenario. 

### Unsupervised setting
#### Single-source scenario
<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> original </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1G5VK8uNbAepyrQiI_DLQaN_02tYyOQq2/view?usp=drive_link">All of six datasets</a>,  36.7MB </th>  <!-- table head -->
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> with curriculum selecting </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ekEWR-gYMMOrWPDB7R8lxZfDJbO8KGQt/view?usp=drive_link">All of six datasets</a>,  31.4MB </th>  <!-- table head -->
    </tr>
</table>


#### Multi-source scenario
<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> original </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1X9F5n7M0Zm4jhOIf1tjHj6bzMh6A1ZkE/view?usp=drive_link">All of six datasets</a>,  144.7MB, each dataset contains 3 sources of pseudo-labels </th>  <!-- table head -->
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> with curriculum selecting </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1IBReTahxkOdKW_fKvplw3PGlI8PdHPUW/view?usp=drive_link">All of six datasets</a>,  87.3MB, each dataset contains 3 sources of pseudo-labels </th>  <!-- table head -->
    </tr>
</table>

### Fully supervised setting
<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> url, size </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ituKSxWU5aXsGnXePd7twv7ImJoFiATc/view?usp=drive_link">All of six datasets</a>,  89.0MB </th>  <!-- table head -->
    </tr>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > with curriculum selecting </th>
    <th style="text-align:center" > - </th>
    <th style="text-align:center" > - </th>
    <th style="text-align:center" > - </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1eSGr-sTqZ6z_Jy7APnJXNxegt2Q-pbqE/view?usp=drive_link">dataset</a> </th>
    <th style="text-align:center" > - </th>
    <th style="text-align:center" > - </th>
    </tr>
</table>

\* Since we observed a relatively clear performance increase on the RefCOCOg-u dataset in the fully supervised setting, 
we provide the data for this dataset after applying our SSA algorithm for curriculum selecting. Using this 
filtered data typically yields an increase of about 1.0 point on both val-u and test-u.

Download the above annotations to a disk directory such as `$/path_to_split`; it will then have a directory structure similar to the following:

```angular2html
|-- /unsup_single_source/unsup_single_source_ssa/
|-- unsup_multi_source/unsup_multi_source_msa/full_sup_data
    ├── flickr
    │   ├── flickr_test.pth
    │   ├── flickr_train_pseudo.pth
    │   └── flickr_val.pth
    ├── gref
    │   ├── gref_train_pseudo.pth
    │   └── gref_val.pth
    ├── gref_umd
    │   ├── gref_umd_test.pth
    │   ├── gref_umd_train_pseudo.pth
    │   └── gref_umd_val.pth
    ├── referit
    │   ├── referit_test.pth
    │   ├── referit_train_pseudo.pth
    │   └── referit_val.pth
    ├── unc
    │   ├── unc_testA.pth
    │   ├── unc_testB.pth
    │   ├── unc_train_pseudo.pth
    │   └── unc_val.pth
    └── unc+
        ├── unc+_testA.pth
        ├── unc+_testB.pth
        ├── unc+_train_pseudo.pth
        └── unc+_val.pth
    In the multi-source setting, there is an additional train_separate directory for further research purposes.
        ├── train_separate
            ├── 1_unc+_train_pseudo_template_0_5.pth
            ├── 2_unc+_train_pseudo_relation_0_5.pth
            └── 3_unc+_train_pseudo_caption_0_5.pth
```
 \* The number at the end of each filename in the train_separate directory represents the reliability threshold as defined in the paper.
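
The split files are standard PyTorch pickle files and can be inspected with `torch.load`. A minimal sketch follows; the exact record layout is an assumption based on TransVG-style annotations, so verify it against your download:

```python
# inspect_split.py -- minimal sketch for peeking at a downloaded annotation split.
import torch

# Hypothetical path; point this at any of the .pth files listed above.
split_file = "/path_to_split/full_sup_data/unc/unc_val.pth"

records = torch.load(split_file, map_location="cpu")
print(f"{split_file}: {len(records)} samples")
# Each record is expected to bundle an image name, a bounding box, and a query text
# (TransVG-style); the exact layout may differ, so print one record to check.
print("first record:", records[0])
```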

## Pre-trained Checkpoints

### Unsupervised setting
#### Single-source scenario
<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> url, size </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/14b-lc7zNniy4EEcJoBdXY9gNv2d20yxU/view?usp=drive_link">All of six models</a>,  3.0GB </th>  <!-- table head -->
    </tr>
</table>

\* Note that the performance of the provided model on the RefCOCOg-val-g split in the unsupervised single-source scenario is approximately 2.0 points higher 
than reported in the paper, i.e., 54.16 --> 56.46.

#### Multi-source scenario
<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> url, size </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1NU35UhAqx2YLehG5ni59rG4sWWaAaXGm/view?usp=drive_link">All of six models</a>,  3.0GB </th>  <!-- table head -->
    </tr>
</table>

### Fully supervised setting

<table>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > Datasets </th>
    <th style="text-align:center" > RefCOCO </th>
    <th style="text-align:center" > RefCOCO+ </th>
    <th style="text-align:center" > RefCOCOg-g </th>
    <th style="text-align:center" > RefCOCOg-u </th>
    <th style="text-align:center" > ReferIt </th>
    <th style="text-align:center" > Flickr </th>
    </tr>
    <tr> <!-- line 3 -->
    <th style="text-align:center" > separate </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1ZyQkPDBG33FPVlyVmzcCf5wD_Ct2hLr8/view?usp=drive_link">model</a> </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/18M-Mmu_TaMLKrpdxksoroe3DIeHmmguN/view?usp=drive_link">model</a> </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1E80T3nz6YETqYU8ZZImCuX76TM1OOxNp/view?usp=drive_link">model</a> </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1bR5WIwaNiu0ShgEafw10BwC3boT-bLRW/view?usp=drive_link">model</a> </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1g8U5Q-KUcGPVq1iKMyFui65lXn9Dwfws/view?usp=drive_link">model</a> </th>
    <th style="text-align:center" > <a href="https://drive.google.com/file/d/1Zm98Bf7ulKxXsi-UhoEGkyUtF8-v-Ohp/view?usp=drive_link">model</a> </th>
    </tr>
    <tr> <!-- line 2 -->
        <th style="text-align:center" rowspan="1"> url, size </th> <!-- table head -->
        <th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1vUC4swZM3ho_5olO--Y3PdKzMBW_iBJG/view?usp=drive_link">All of six models</a>,  3.0GB </th>  <!-- table head -->
    </tr>
</table>

\* Note that the performance of the provided model on the RefCOCO+ dataset in the fully supervised setting is approximately 2.0 points higher 
than reported in the paper, i.e., (69.55, 77.33, 57.62) --> (71.08, 79.17, 59.40).



## Training and Evaluation

You only need to change ```$/path_to_split```, ```$/path_to_image_data```, and ```$/path_to_output``` to your own directories to execute the following commands.
The first time you run a command below, it will take some time to download the CLIP model.

1. Training on RefCOCO in the unsupervised setting.
    ```
    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28887 --use_env train_clip_vg.py --num_workers 2 --epochs 110 --batch_size 64 --lr 0.00025  --lr_scheduler cosine --aug_crop --aug_scale --aug_translate      --imsize 224 --max_query_len 77 --dataset unc      --data_root $/path_to_image_data --split_root $/path_to_split      --output_dir $/path_to_output/output_v01/unc;
    ```
    Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for training commands on other datasets.

2. Training on RefCOCO in the fully supervised setting. 
    The only difference is an additional control flag: ```--sup_type full```
    ```
    CUDA_VISIBLE_DEVICES=3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=5 --master_port 28887 --use_env train_clip_vg.py --num_workers 32 --epochs 120 --batch_size 64 --lr 0.00025  --lr_scheduler cosine --aug_crop --aug_scale --aug_translate    --imsize 224 --max_query_len 77  --sup_type full --dataset unc      --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
    ```
    Please refer to [train_and_eval_script/train_and_eval_full_sup.sh](train_and_eval_script/train_and_eval_full_sup.sh) for training commands on other datasets.

3. Evaluation on RefCOCO. The instructions are the same for the unsupervised and fully supervised settings.
    ```
    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval.py --num_workers 2 --batch_size 128    --dataset unc      --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth      --eval_set val    --output_dir $/path_to_output/output_v01/unc;
    ```
    Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for evaluation commands on other splits or datasets.
    
4. We strongly recommend using the following commands for training or testing on different datasets and splits, 
    which will significantly reduce the manual workload.
    ```
    bash train_and_eval_script/train_and_eval_unsup.sh   
    bash train_and_eval_script/train_and_eval_full_sup.sh
    ```

5. Curriculum reliability measurement or scoring for the pseudo-language labels:

    You only need to change ```eval.py``` to ```eval_for_reliability_distribution.py``` and rename the training pseudo-labels to ```test.pth```
    in the corresponding dataset during evaluation:
    ```
    CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval_for_reliability_distribution.py --num_workers 2 --batch_size 128    --dataset unc      --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth      --eval_set val    --output_dir $/path_to_output/output_v01/unc;
    ```
    Besides, if you need to merge the pseudo training splits for further research, just run the following commands:
    ```
    python ./pseudo_label_generation_module/utils/merge_file.py $/path_to_split/unsup_multi_source/unc/train_separate unc;
    cp $/path_to_split/full_sup_data/unc/unc_val.pth $/path_to_split/unsup_multi_source/unc/train_separate/unc/unc_val.pth
    ```
    Then, you can construct a new pseudo-label training split.
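
    For reference, curriculum selecting itself amounts to keeping only the pseudo-labels whose reliability score reaches a threshold (e.g., the 0.5 in the ```*_0_5.pth``` filenames). Below is a minimal sketch assuming hypothetical file names and a simple per-record score list; the actual output format of the scoring script may differ:
    ```python
    # select_by_reliability.py -- illustrative sketch of threshold-based curriculum selecting.
    # File names and the per-record score list are assumptions for illustration only.
    import torch

    RELIABILITY_THRESHOLD = 0.5  # matches the *_0_5 suffix used in train_separate

    records = torch.load("unc_train_pseudo.pth", map_location="cpu")
    scores = torch.load("unc_train_pseudo_scores.pth", map_location="cpu")  # hypothetical score file

    selected = [rec for rec, s in zip(records, scores) if s >= RELIABILITY_THRESHOLD]
    print(f"kept {len(selected)} / {len(records)} pseudo-labels")
    torch.save(selected, "unc_train_pseudo_selected.pth")
    ```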

## Results

<details open>
<summary><font size="4">
RefCOCO, RefCOCO+, and RefCOCOg datasets
</font></summary>
<img src="docs/refcoco.png" alt="COCO" width="100%">
</details>

<details open>
<summary><font size="4">
ReferIt and Flickr datasets
</font></summary>
<div align=center>
<img src="docs/referit.png" alt="COCO" width="50%"></div>
</details>

<details open>
<summary><font size="4">
Our model also has significant energy efficiency advantages.
</font></summary>
<div align=center>
<img src="docs/efficiency.jpg" alt="COCO" width="85%"></div>
</details>

Compared with QRNet, we update **only 7.7%** of its parameters and achieve impressive training and inference speedups, 
up to **26.84×** and **7.41×**, respectively, while also obtaining competitive results. 


## Methods 
<p align="center"> <img src='docs/algorithm.jpg' align="center" width="100%"> </p>

## Visualization
<p align="center"> <img src='docs/fig5.jpg' align="center" width="100%"> </p>

The figure presents histograms of the Single-Source Reliability (SR) and Cross-Source Reliability (CR) of pseudo-language 
labels over the range (0.0, 1.0] with 1000 bins, where each bin counts the number of samples. The figure illustrates 
that different sources exhibit distinct distributions due to the specific quality and language taxonomy of their pseudo-language 
labels (e.g., Fig. 5-(a1)-(b2)-(c3)), while different reliability measures have varying discrimination abilities on the 
same source (e.g., Fig. 5-(a1)-(b1)-(c1)). 
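
For readers who want to reproduce a Fig. 5-style plot from their own scoring run, here is a minimal matplotlib sketch; the ```reliability_scores.npy``` file name is a hypothetical placeholder for an array of per-sample scores in (0.0, 1.0]:

```python
# plot_reliability_hist.py -- minimal sketch for a Fig. 5-style reliability histogram.
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholder: an array of per-sample reliability scores in (0.0, 1.0].
scores = np.load("reliability_scores.npy")

plt.hist(scores, bins=1000, range=(0.0, 1.0))  # 1000 bins over (0.0, 1.0], as in Fig. 5
plt.xlabel("reliability score")
plt.ylabel("number of samples")
plt.title("Single-source reliability (SR) distribution")
plt.savefig("reliability_hist.png", dpi=200)
```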

<p align="center"> <img src='docs/fig6.jpg' align="center" width="100%"> </p>
Before the execution of MSA, the distributions of the pseudo-language labels and the ground-truth query labels are quite 
different, but after the execution of MSA, the distribution discrepancy becomes significantly smaller. This shows that 
MSA can effectively select pseudo-labels that are more reliable or closer to the distribution of the ground-truth query labels.

<p align="center"> <img src='docs/sample1.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample2.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample3.jpg' align="center" width="100%"> </p>
Among the various types of unreliable pseudo-language labels, referring ambiguity is the most frequent, particularly in 
images containing multiple objects of similar categories. Addressing this ambiguity is a critical issue for future research 
that aims to further enhance model performance.

## Contacts
Email: <xiaolinhui16@mails.ucas.ac.cn>.
Any kind of discussion is welcome!

## Acknowledgement

Our model is related to [CLIP](https://github.com/openai/CLIP), [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q), and [TransVG](https://github.com/linhuixiao/TransVG). Thanks for their great work!

We also thank the great previous work including [DETR](https://github.com/facebookresearch/detr), [QRNet](https://github.com/LukeForeverYoung/QRNet), [M2](https://github.com/aimagelab/meshed-memory-transformer), [CLIPCap](https://github.com/rmokady/CLIP_prefix_caption), [RelTR](https://github.com/yrcong/RelTR), [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention), [ReSC](https://github.com/zyang-ur/ReSC), etc. 

Thanks [OpenAI](https://github.com/openai) for their awesome models.


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=linhuixiao/CLIP-VG&type=Date)](https://star-history.com/#linhuixiao/CLIP-VG&Date)