---
license: apache-2.0
---
<p align="center"> <h1 align="center">CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding</h1>
  <p align="center">
    <b> IEEE Transactions on Multimedia, 2023 </b>
    <br />
    <a href="https://scholar.google.com.hk/citations?user=4rTE4ogAAAAJ&hl=zh-CN&oi=sra"><strong>Linhui Xiao</strong></a>
    ·
    <a href="https://yangxs.ac.cn/home"><strong>Xiaoshan Yang</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=HBZ9plsAAAAJ&hl=zh-CN"><strong>Fang Peng</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=uIUfGxYAAAAJ&hl=zh-CN"><strong>Ming Yan</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=o_DllmIAAAAJ&hl=zh-CN"><strong>Yaowei Wang</strong></a>
    ·
    <a href="https://scholar.google.com.hk/citations?user=hI9NRDkAAAAJ&hl=zh-CN"><strong>Changsheng Xu</strong></a>
  </p>

<p align="center">
  <a href='https://arxiv.org/pdf/2305.08685'>
    <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'>
  </a>
  <a href='https://ieeexplore.ieee.org/abstract/document/10269126'>
    <img src='https://img.shields.io/badge/IEEE TMM-blue' alt='IEEE TMM'>
  </a>

<br />

<p align="center"> <img src='docs/model.jpg' align="center" width="70%"> </p>

**<p align="center"> CLIP for Unsupervised and Fully Supervised Visual Grounding. </p>**

This repository is the official PyTorch implementation of the paper [**CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding**](https://ieeexplore.ieee.org/abstract/document/10269126).

If you have any questions, please feel free to open an issue or contact me by email: <xiaolinhui16@mails.ucas.ac.cn>.

<h3 align="left">
Links: <a href="https://ieeexplore.ieee.org/abstract/document/10269126">IEEE Transactions on Multimedia (2023)</a>,
<a href="https://arxiv.org/abs/2305.08685">ArXiv</a>,
[<a href="https://mp.weixin.qq.com/s/fwbamVr5P5Vcj5XheopQOg">Chinese interpretation</a>]
</h3>

**Please leave a <font color='orange'>STAR ⭐</font> if you like this project!**

## News

- 🔥🔥🔥 **Our grounding survey ([TPAMI](https://doi.org/10.1109/TPAMI.2025.3630635), [Arxiv](https://arxiv.org/abs/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)) was accepted by TPAMI on October 30, 2025!**
- :fire: **Update on 2024/12/28: We conducted a survey of visual grounding over the past decade, entitled "Towards Visual Grounding: A Survey" ([Paper](https://arxiv.org/pdf/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)). Comments are welcome!**
- :fire: **Update on 2024/09/26: Our advanced grounding work OneRef ([Paper](https://openreview.net/pdf?id=siPdcro6uD), [Code](https://github.com/linhuixiao/OneRef)) was accepted by NeurIPS 2024!**
- :fire: **Update on 2024/07/16: Our advanced grounding work HiVG ([Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) was accepted by ACM MM 2024!**
- **Update on 2024/04/20: We released an advanced version of CLIP-VG, namely HiVG ([paper](https://arxiv.org/abs/2404.13400), [github](https://github.com/linhuixiao/HiVG)).**
- **Update on 2023/12/13: All of the code, models, and datasets have been released.**
- **Update on 2023/09/25: Our paper was accepted by IEEE Transactions on Multimedia!**
- Update on 2023/05/18: Released the repository and training code.

## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@article{xiao2023clip,
  title={CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding},
  author={Xiao, Linhui and Yang, Xiaoshan and Peng, Fang and Yan, Ming and Wang, Yaowei and Xu, Changsheng},
  journal={IEEE Transactions on Multimedia},
  year={2023},
  publisher={IEEE}
}
```

## Contents

1. [Introduction](#introduction)
2. [Usage](#usage)
3. [Results](#results)
4. [Contacts](#contacts)
5. [Acknowledgments](#acknowledgments)


## Highlight

- **CLIP for Visual Grounding.** A state-of-the-art baseline for unsupervised and fully supervised visual grounding with the CLIP model.
- **Single-source and multi-source pseudo-language labels.** The generation and usage of multi-source pseudo-labels.
- **Self-paced Curriculum Adapting algorithm.** A plugin-like algorithmic idea that can be applied to any pseudo-label scenario.


## TODO

- [x] Release model code and inference code.
- [x] Release unsupervised and fully supervised checkpoints.
- [x] Release the complete multi-source pseudo-language labels and their generation code.
- [x] Release the reliability measurement code.


## Introduction

In order to utilize vision-and-language pre-trained models to address the grounding problem, and to reasonably take advantage of pseudo-labels, we propose **CLIP-VG**, **a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels.**

We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity of the pseudo-language labels.

Our method outperforms the current state-of-the-art unsupervised method, Pseudo-Q, by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios. Furthermore, our approach even outperforms existing weakly supervised methods. In comparison with the fully supervised SOTA model QRNet, we achieve comparable results with only **7.7%** of its updated parameters, while obtaining significant speedups in both training and inference, up to **26.84×** and **7.41×**, respectively.

In summary, **the contributions of this work are four-fold**:

- As far as we know, **we are the first to adapt CLIP to realize unsupervised visual grounding.** Our method can transfer the cross-modal learning ability of CLIP to visual grounding with only a small training cost.
- **We are the first to introduce self-paced curriculum learning into unsupervised visual grounding.** Our proposed reliability measurement and single-source self-paced adapting can progressively enhance the CLIP-based visual grounding model by utilizing pseudo-labels in an easy-to-hard learning paradigm.
- **We are the first to propose a multi-source self-paced adapting algorithm, extending our method to access multiple sources of pseudo-labels,** which can flexibly improve the diversity of the language taxonomy.
- We conduct extensive experiments to evaluate the effectiveness of our approach. Results show that our method obtains significant improvements in the unsupervised setting and is also competitive in the fully supervised setting.

For more details, please refer to [our paper](https://arxiv.org/abs/2305.08685).

## Usage
### Dependencies
- Python 3.9.10
- PyTorch 1.9.0 + cu111 + cp39
- Check [requirements.txt](requirements.txt) for other dependencies.

Our model is **easy to deploy** in a variety of environments and has been successfully tested on multiple PyTorch versions.
If you are interested in the pseudo-language label generation module, more detailed instructions can be found in its [usage instructions](pseudo_label_generation_module/README.md).

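As a quick sanity check of the environment before training, the following minimal snippet (not part of the repository) verifies the installed PyTorch build and GPU visibility:

```python
# Quick environment sanity check (illustrative, not part of the repository).
import torch

print("PyTorch version:", torch.__version__)         # e.g., 1.9.0+cu111
print("CUDA available:", torch.cuda.is_available())  # should be True for GPU training
print("GPU count:", torch.cuda.device_count())
```
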
### Image Data Preparation
1. You can download the images from the original sources and place them in a folder on your disk, such as `$/path_to_image_data`:
- [MS COCO 2014](download_mscoco2014.sh) (for the RefCOCO, RefCOCO+, and RefCOCOg datasets; about 13.0 GB)
- [ReferItGame](https://drive.google.com/drive/folders/1D4shieeoKly6FswpdjSpaOrxJQNKTyTv)
- [Flickr30K Entities](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in)

We provide a script to download the MSCOCO 2014 dataset; just run it in a terminal with the following command:
```
bash download_mscoco2014.sh
```
Alternatively, you can follow the data preparation of TransVG, described in [GETTING_STARTED.md](https://github.com/djiajunustc/TransVG/blob/main/docs/GETTING_STARTED.md).

Only the image data in these datasets is used, and it is easy to find in similar visual grounding repositories, such as [TransVG](https://github.com/linhuixiao/TransVG).
Finally, the `$/path_to_image_data` folder will have the following structure:

```
|-- image_data
    |-- Flickr30k
        |-- flickr30k-images
    |-- other
        |-- images
    |-- mscoco
        |-- images
            |-- train2014
    |-- referit
        |-- images
```
- ```$/path_to_image_data/image_data/Flickr30k/flickr30k-images/```: Image data for the Flickr30K dataset. Please download it from this [link](http://shannon.cs.illinois.edu/DenotationGraph/#:~:text=make%20face-,Downloads,-Please%20fill%20in); fill in the form to obtain the images.
- ```$/path_to_image_data/image_data/other/images/```: Image data for RefCOCO/RefCOCO+/RefCOCOg, i.e., mscoco2014.
- ```$/path_to_image_data/image_data/referit/images/```: Image data for ReferItGame.

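To confirm that your layout matches before moving on, a minimal hedged check such as the following can be used; the relative paths mirror the tree above, and `path_to_image_data` is a placeholder to adjust:

```python
# Minimal sketch: verify the expected image folders exist under your data root.
import os

path_to_image_data = "/path_to_image_data"  # adjust to your own disk folder
expected = [
    "image_data/Flickr30k/flickr30k-images",
    "image_data/other/images",
    "image_data/mscoco/images/train2014",
    "image_data/referit/images",
]
for rel in expected:
    full = os.path.join(path_to_image_data, rel)
    status = "ok" if os.path.isdir(full) else "MISSING"
    print(f"{status:8s} {full}")
```
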
## Text-Box Annotations / Pseudo-Label Preparation
The following are the **pseudo-language labels** generated by the pseudo-language label generation module in the unsupervised setting.

The **single-source scenario** includes pseudo-template labels derived from [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q).

The **multi-source scenario** includes pseudo-template labels, pseudo-relation labels, and pseudo-caption labels.
If you are interested in how they are generated, please refer to the [pseudo-language label generation module](pseudo_label_generation_module/README.md) for details.

Additionally, we provide the pseudo-labels selected by our single-source self-paced curriculum adapting (SSA) and multi-source self-paced curriculum adapting (MSA) algorithms, which can be used directly and conveniently in follow-up research.

The labels in the fully supervised scenario are consistent with previous works such as [TransVG](https://github.com/linhuixiao/TransVG).
It is worth noting that the test splits in the unsupervised scenario are exactly the same as those used in the fully supervised scenario.

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1G5VK8uNbAepyrQiI_DLQaN_02tYyOQq2/view?usp=drive_link">All six datasets</a>, 36.7 MB </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ekEWR-gYMMOrWPDB7R8lxZfDJbO8KGQt/view?usp=drive_link">All six datasets</a>, 31.4 MB </th>
</tr>
</table>


#### Multi-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> original </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1X9F5n7M0Zm4jhOIf1tjHj6bzMh6A1ZkE/view?usp=drive_link">All six datasets</a>, 144.7 MB; each dataset contains 3 sources of pseudo-labels </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1IBReTahxkOdKW_fKvplw3PGlI8PdHPUW/view?usp=drive_link">All six datasets</a>, 87.3 MB; each dataset contains 3 sources of pseudo-labels </th>
</tr>
</table>

### Fully supervised setting
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1ituKSxWU5aXsGnXePd7twv7ImJoFiATc/view?usp=drive_link">All six datasets</a>, 89.0 MB </th>
</tr>
<tr>
<th style="text-align:center"> with curriculum selecting </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1eSGr-sTqZ6z_Jy7APnJXNxegt2Q-pbqE/view?usp=drive_link">dataset</a> </th>
<th style="text-align:center"> - </th>
<th style="text-align:center"> - </th>
</tr>
</table>

\* Since we observed a relatively clear performance increase on the RefCOCOg-u dataset in the fully supervised setting, we provide the data for this dataset after applying our SSA algorithm for curriculum selecting. Typically, using this filtered data yields an increase of roughly 1.0 in performance on both val-u and test-u.

Download the above annotations to a disk directory such as `$/path_to_split`; it will then have a directory structure similar to the following:

```
|-- /unsup_single_source/unsup_single_source_ssa/
|-- unsup_multi_source/unsup_multi_source_msa/full_sup_data
    ├── flickr
    │   ├── flickr_test.pth
    │   ├── flickr_train_pseudo.pth
    │   └── flickr_val.pth
    ├── gref
    │   ├── gref_train_pseudo.pth
    │   └── gref_val.pth
    ├── gref_umd
    │   ├── gref_umd_test.pth
    │   ├── gref_umd_train_pseudo.pth
    │   └── gref_umd_val.pth
    ├── referit
    │   ├── referit_test.pth
    │   ├── referit_train_pseudo.pth
    │   └── referit_val.pth
    ├── unc
    │   ├── unc_testA.pth
    │   ├── unc_testB.pth
    │   ├── unc_train_pseudo.pth
    │   └── unc_val.pth
    └── unc+
        ├── unc+_testA.pth
        ├── unc+_testB.pth
        ├── unc+_train_pseudo.pth
        └── unc+_val.pth
In the multi-source setting, there is an additional train_separate directory for further research:
    ├── train_separate
        ├── 1_unc+_train_pseudo_template_0_5.pth
        │── 2_unc+_train_pseudo_relation_0_5.pth
        └── 3_unc+_train_pseudo_caption_0_5.pth
```
\* The number at the end of each filename in the train_separate directory represents the reliability threshold defined in the paper.

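These split files are ordinary PyTorch pickles, so they can be inspected with `torch.load`. The sketch below only assumes each file holds a list of samples; the exact per-sample fields (image file, box, expression, etc.) should be read off the loaded data rather than taken from this example:

```python
# Minimal sketch: peek into a downloaded split file with torch.load.
import torch

split_file = "/path_to_split/full_sup_data/unc/unc_val.pth"  # adjust to your path
samples = torch.load(split_file)

print("number of samples:", len(samples))
print("first sample:", samples[0])  # inspect the actual field layout here
```
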
## Pre-trained Checkpoints

### Unsupervised setting
#### Single-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/14b-lc7zNniy4EEcJoBdXY9gNv2d20yxU/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

\* Note that the performance of the provided model on the RefCOCOg val-g split in the unsupervised single-source scenario is approximately 2.0 points higher than reported in the paper, i.e., 54.16 → 56.46.

#### Multi-source scenario
<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1NU35UhAqx2YLehG5ni59rG4sWWaAaXGm/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

### Fully supervised setting

<table>
<tr>
<th style="text-align:center"> Datasets </th>
<th style="text-align:center"> RefCOCO </th>
<th style="text-align:center"> RefCOCO+ </th>
<th style="text-align:center"> RefCOCOg-g </th>
<th style="text-align:center"> RefCOCOg-u </th>
<th style="text-align:center"> ReferIt </th>
<th style="text-align:center"> Flickr </th>
</tr>
<tr>
<th style="text-align:center"> separate </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1ZyQkPDBG33FPVlyVmzcCf5wD_Ct2hLr8/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/18M-Mmu_TaMLKrpdxksoroe3DIeHmmguN/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1E80T3nz6YETqYU8ZZImCuX76TM1OOxNp/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1bR5WIwaNiu0ShgEafw10BwC3boT-bLRW/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1g8U5Q-KUcGPVq1iKMyFui65lXn9Dwfws/view?usp=drive_link">model</a> </th>
<th style="text-align:center"> <a href="https://drive.google.com/file/d/1Zm98Bf7ulKxXsi-UhoEGkyUtF8-v-Ohp/view?usp=drive_link">model</a> </th>
</tr>
<tr>
<th style="text-align:center"> url, size </th>
<th style="text-align:center" colspan="6"> <a href="https://drive.google.com/file/d/1vUC4swZM3ho_5olO--Y3PdKzMBW_iBJG/view?usp=drive_link">All six models</a>, 3.0 GB </th>
</tr>
</table>

\* Note that the performance of the provided model on the RefCOCO+ dataset in the fully supervised setting is approximately 2.0 points higher than reported in the paper, i.e., (69.55, 77.33, 57.62) → (71.08, 79.17, 59.40).


## Training and Evaluation

You only need to change ```$/path_to_split```, ```$/path_to_image_data```, and ```$/path_to_output``` to your own directories to execute the following commands.
The first time you run them, it will take some time to download the CLIP model.

1. Training on RefCOCO in the unsupervised setting.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28887 --use_env train_clip_vg.py --num_workers 2 --epochs 110 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for training commands on other datasets.

2. Training on RefCOCO in the fully supervised setting. The only difference is the additional control flag ```--sup_type full```:
```
CUDA_VISIBLE_DEVICES=3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=5 --master_port 28887 --use_env train_clip_vg.py --num_workers 32 --epochs 120 --batch_size 64 --lr 0.00025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --imsize 224 --max_query_len 77 --sup_type full --dataset unc --data_root $/path_to_image_data --split_root $/path_to_split --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_full_sup.sh](train_and_eval_script/train_and_eval_full_sup.sh) for training commands on other datasets.

3. Evaluation on RefCOCO. The instructions are the same for the unsupervised and fully supervised settings.
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Please refer to [train_and_eval_script/train_and_eval_unsup.sh](train_and_eval_script/train_and_eval_unsup.sh) for evaluation commands on other splits or datasets.

4. We strongly recommend using the following scripts to train or test on the different datasets and splits, which significantly reduces the manual workload:
```
bash train_and_eval_script/train_and_eval_unsup.sh
bash train_and_eval_script/train_and_eval_full_sup.sh
```

5. Curriculum reliability measurement (scoring) for the pseudo-language labels. You only need to change ```eval.py``` to ```eval_for_reliability_distribution.py``` and rename the training pseudo-labels to ```test.pth``` in the corresponding dataset during evaluation:
```
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=6 --master_port 28888 --use_env eval_for_reliability_distribution.py --num_workers 2 --batch_size 128 --dataset unc --imsize 224 --max_query_len 77 --data_root $/path_to_image_data --split_root $/path_to_split --eval_model $/path_to_output/output_v01/unc/best_checkpoint.pth --eval_set val --output_dir $/path_to_output/output_v01/unc;
```
Besides, if you need to merge the pseudo train splits for further research, just run the following commands:
```
python ./pseudo_label_generation_module/utils/merge_file.py $/path_to_split/unsup_multi_source/unc/train_separate unc;
cp $/path_to_split/full_sup_data/unc/unc_val.pth $/path_to_split/unsup_multi_source/unc/train_separate/unc/unc_val.pth
```
Then you can construct a new pseudo-label training split.

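Conceptually, the curriculum selecting step keeps only pseudo-labels whose reliability score exceeds a threshold (the `_0_5` suffix in the train_separate filenames denotes 0.5). The following is a minimal sketch of that idea; the score file, its one-score-per-sample format, and the paths are hypothetical placeholders for illustration, not the repository's exact interface:

```python
# Minimal sketch of reliability-based curriculum selecting (illustrative only).
import torch

samples = torch.load("/path_to_split/unsup_multi_source/unc/train_separate/1_unc_train_pseudo_template_0_5.pth")
scores = torch.load("/path_to_output/reliability_scores.pth")  # hypothetical: one reliability score per sample

threshold = 0.5  # reliability threshold, as encoded in the *_0_5.pth filenames
selected = [s for s, r in zip(samples, scores) if r > threshold]

print(f"kept {len(selected)}/{len(samples)} pseudo-labels")
torch.save(selected, "/path_to_split/unc_train_pseudo_selected.pth")
```
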
## Results

<details open>
<summary><font size="4">
RefCOCO, RefCOCO+, and RefCOCOg datasets
</font></summary>
<img src="docs/refcoco.png" alt="RefCOCO/+/g results" width="100%">
</details>

<details open>
<summary><font size="4">
ReferIt and Flickr datasets
</font></summary>
<div align=center>
<img src="docs/referit.png" alt="ReferIt and Flickr results" width="50%"></div>
</details>

<details open>
<summary><font size="4">
Our model also has significant energy efficiency advantages.
</font></summary>
<div align=center>
<img src="docs/efficiency.jpg" alt="Efficiency comparison" width="85%"></div>
</details>

Compared to QRNet, we update **only 7.7%** of its parameters and achieve impressive training and inference speedups, up to **26.84×** and **7.41×** respectively, while also obtaining competitive results.


## Methods
<p align="center"> <img src='docs/algorithm.jpg' align="center" width="100%"> </p>

## Visualization
<p align="center"> <img src='docs/fig5.jpg' align="center" width="100%"> </p>

The figure presents histograms of single-source reliability (SR) and cross-source reliability (CR) for pseudo-language labels in the range (0.0, 1.0] with 1000 bins, where each bin counts the number of samples. It illustrates that different sources exhibit distinct distributions due to the specific quality and language taxonomy of their pseudo-language labels (e.g., Fig. 5-(a1)-(b2)-(c3)), while different reliability measures have varying discrimination abilities on the same source (e.g., Fig. 5-(a1)-(b1)-(c1)).

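For reference, a histogram like those in Fig. 5 can be reproduced from any list of per-sample reliability scores with a few lines of matplotlib; here `scores` is a synthetic placeholder for the values produced by the reliability measurement step:

```python
# Minimal sketch: histogram of reliability scores in (0.0, 1.0] with 1000 bins.
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.beta(2, 5, size=10_000)  # placeholder for real reliability scores

plt.hist(scores, bins=1000, range=(0.0, 1.0))
plt.xlabel("reliability score")
plt.ylabel("number of samples")
plt.title("Reliability distribution (cf. Fig. 5)")
plt.savefig("reliability_hist.png", dpi=150)
```
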
<p align="center"> <img src='docs/fig6.jpg' align="center" width="100%"> </p>

Before MSA is executed, the distributions of the pseudo-language labels and the ground-truth query labels are quite different; after MSA, the distribution discrepancy becomes significantly smaller. This shows that MSA can effectively select pseudo-labels that are more reliable, i.e., closer to the distribution of the ground-truth query labels.

<p align="center"> <img src='docs/sample1.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample2.jpg' align="center" width="100%"> </p>

<p align="center"> <img src='docs/sample3.jpg' align="center" width="100%"> </p>

Among the various types of unreliable pseudo-language labels, referring ambiguity is the most frequent, particularly in images that contain multiple objects of the same class. If future research aims to further enhance model performance, addressing this ambiguity is a critical issue.

## Contacts
Email: <xiaolinhui16@mails.ucas.ac.cn>.
Any kind of discussion is welcome!

## Acknowledgement

Our model builds on [CLIP](https://github.com/openai/CLIP), [Pseudo-Q](https://github.com/LeapLabTHU/Pseudo-Q), and [TransVG](https://github.com/linhuixiao/TransVG). Thanks for their great work!

We also thank the great previous work, including [DETR](https://github.com/facebookresearch/detr), [QRNet](https://github.com/LukeForeverYoung/QRNet), [M2](https://github.com/aimagelab/meshed-memory-transformer), [CLIPCap](https://github.com/rmokady/CLIP_prefix_caption), [RelTR](https://github.com/yrcong/RelTR), [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention), [ReSC](https://github.com/zyang-ur/ReSC), etc.

Thanks to [OpenAI](https://github.com/openai) for their awesome models.


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=linhuixiao/CLIP-VG&type=Date)](https://star-history.com/#linhuixiao/CLIP-VG&Date)