Commit 3642449 (verified) by linhuixiao · Parent(s): 1c24086

Update README.md

Files changed (1): README.md (+33, -33)

README.md CHANGED
@@ -50,7 +50,7 @@ Code for this model: https://github.com/linhuixiao/OneRef
 This repository is the official PyTorch implementation for the paper [**OneRef: Unified One-tower Expression Grounding
 and Segmentation with Mask Referring Modeling**](https://openreview.net/pdf?id=siPdcro6uD)
 ([Publication](https://proceedings.neurips.cc/paper_files/paper/2024/file/fcd812a51b8f8d05cfea22e3c9c4b369-Paper-Conference.pdf),
- [Github Code](https://github.com/linhuixiao/OneRef), [HuggingFace model](https://huggingface.co/xiaolinhui/OneRef)), which is an advanced version
+ [Github Code](https://github.com/linhuixiao/OneRef), [HuggingFace model](https://huggingface.co/linhuixiao/OneRef)), which is an advanced version
 of our preliminary work **HiVG** ([Publication](https://dl.acm.org/doi/abs/10.1145/3664647.3681071), [Paper](https://openreview.net/pdf?id=NMMyGy1kKZ),
 [Code](https://github.com/linhuixiao/HiVG)) and **CLIP-VG** ([Publication](https://ieeexplore.ieee.org/abstract/document/10269126),
 [Paper](https://arxiv.org/pdf/2305.08685), [Code](https://github.com/linhuixiao/CLIP-VG)).
@@ -67,7 +67,7 @@ Any kind discussions are welcomed!
 :exclamation: During the code tidying process, some bugs may arise due to changes in variable names. If any issues occur, please raise them on the [issue page](https://github.com/linhuixiao/OneRef/issues), and I will try to resolve them promptly.
 
 - :fire: **Update on 2024/12/28: We conducted a survey of visual grounding over the past decade, entitled "Towards Visual Grounding: A Survey" ([Paper](https://arxiv.org/pdf/2412.20206), [Project](https://github.com/linhuixiao/Awesome-Visual-Grounding)). Comments are welcome!**
- - :fire: **Update on 2024/10/10: Our grounding work **OneRef** ([Paper](https://arxiv.org/abs/2410.08021), [Code](https://github.com/linhuixiao/OneRef), [Model](https://huggingface.co/xiaolinhui/OneRef)) has been accepted by the top conference NeurIPS 2024 !**
+ - :fire: **Update on 2024/10/10: Our grounding work OneRef ([Paper](https://arxiv.org/abs/2410.08021), [Code](https://github.com/linhuixiao/OneRef), [Model](https://huggingface.co/linhuixiao/OneRef)) has been accepted by the top conference NeurIPS 2024!**
 - **Update on 2024/07/16: Our grounding work HiVG ([Publication](https://dl.acm.org/doi/abs/10.1145/3664647.3681071), [Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) has been accepted by the top conference ACM MM 2024!**
 - **Update on 2023/9/25: Our grounding work CLIP-VG ([Paper](https://ieeexplore.ieee.org/abstract/document/10269126), [Code](https://github.com/linhuixiao/CLIP-VG)) has been accepted by the top journal IEEE Transactions on Multimedia (2023)!**
@@ -193,7 +193,7 @@ Finally, the `$/path_to_image_data` folder will have the following structure:
 The labels in the fully supervised scenario are consistent with previous works such as [CLIP-VG](https://github.com/linhuixiao/CLIP-VG).
 
 :star: As we need to conduct pre-training with mixed datasets, we have shuffled the order of the datasets and unified
- some of the dataset formats. You need to download our text annotation files from the [HuggingFace homepage](https://huggingface.co/xiaolinhui/OneRef/tree/main/text_box_annotation).
+ some of the dataset formats. You need to download our text annotation files from the [HuggingFace homepage](https://huggingface.co/linhuixiao/OneRef/tree/main/text_box_annotation).
 
 ### Fully supervised setting
 <table>
@@ -210,7 +210,7 @@ some of the dataset formats. You need to download our text annotation files from
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> url, size </th> <!-- table head -->
- <th style="text-align:center" colspan="8"> <a href="https://huggingface.co/xiaolinhui/OneRef/tree/main/text_box_annotation">All of six datasets</a>, ~400.0MB </th> <!-- table head -->
+ <th style="text-align:center" colspan="8"> <a href="https://huggingface.co/linhuixiao/OneRef/tree/main/text_box_annotation">All of six datasets</a>, ~400.0MB </th> <!-- table head -->
 </tr>
 </table>
 
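The annotation files referenced in the hunk above live in the `text_box_annotation` folder of the model repo. A minimal sketch of fetching just that folder with the `huggingface_hub` library follows; the local destination directory is a hypothetical choice, not something the repo prescribes:

```python
# Minimal sketch: fetch only the text_box_annotation/ folder from the
# OneRef model repo (pip install huggingface_hub). "./annotations" is a
# hypothetical destination; point it at your own data root.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="linhuixiao/OneRef",
    allow_patterns=["text_box_annotation/*"],
    local_dir="./annotations",
)
print("Annotation files downloaded to:", local_path)
```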
@@ -274,8 +274,8 @@ the results or encounter errors, please contact us promptly via email or by rais
 We will check and upload the correct models. This might be due to model upload errors or model corruption
 during disk storage. After all, we trained nearly a hundred models over the course of this work.**
 
- <a href="https://huggingface.co/xiaolinhui/OneRef/tree/main"><picture><source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/lobehub/lobe-icons/refs/heads/master/packages/static-png/dark/huggingface-color.png" /><img height="36px" width="36px" src="https://raw.githubusercontent.com/lobehub/lobe-icons/refs/heads/master/packages/static-png/light/huggingface-color.png" /></picture><br/>HuggingFace:
- All the models are publicly available on the [**OneRef Huggingface homepage**](https://huggingface.co/xiaolinhui/OneRef/tree/main). You can freely download the corresponding models on this website.
+ <a href="https://huggingface.co/linhuixiao/OneRef/tree/main"><picture><source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/lobehub/lobe-icons/refs/heads/master/packages/static-png/dark/huggingface-color.png" /><img height="36px" width="36px" src="https://raw.githubusercontent.com/lobehub/lobe-icons/refs/heads/master/packages/static-png/light/huggingface-color.png" /></picture><br/>HuggingFace:
+ All the models are publicly available on the [**OneRef Huggingface homepage**](https://huggingface.co/linhuixiao/OneRef/tree/main). You can freely download the corresponding models from this page.
 
 ### REC task: Single-dataset fine-tuning checkpoints download
 
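For a single checkpoint from the tables below, `hf_hub_download` is one possible route. This is plain `huggingface_hub` usage rather than an official OneRef utility; the filename is taken from the REC fine-tuning table:

```python
# Minimal sketch: download one fine-tuning checkpoint by filename
# (the name comes from the REC table below). Standard huggingface_hub
# usage, not an official OneRef script.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="linhuixiao/OneRef",
    filename="rec_single_dataset_finetuning_large_unc.pth",
)
print("Checkpoint cached at:", ckpt_path)
```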
@@ -294,15 +294,15 @@ All the models are publicly available on the [**OneRef Huggingface homepage**](h
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Base model </th> <!-- table head -->
- <th style="text-align:center" colspan="6"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_base.zip"> Hugging Face, rec_single_dataset_finetuning_base.zip (for all), ~9.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="6"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_base.zip"> Hugging Face, rec_single_dataset_finetuning_base.zip (for all), ~9.0 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Large model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_large_unc.pth">finetuning_large_unc, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_large_unc%2B.pth">finetuning_large_unc+, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_large_gref_umd.pth">finetuning_large_gref_umd, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_large_referit.pth">finetuning_large_referit, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_single_dataset_finetuning_large_flickr.pth">finetuning_large_flickr, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_large_unc.pth">finetuning_large_unc, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_large_unc%2B.pth">finetuning_large_unc+, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_large_gref_umd.pth">finetuning_large_gref_umd, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_large_referit.pth">finetuning_large_referit, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_single_dataset_finetuning_large_flickr.pth">finetuning_large_flickr, ~8.0 GB </a> </th> <!-- table head -->
 </tr>
 </table>
 
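Given the note above about occasional upload errors or disk corruption, a downloaded `.pth` file can be sanity-checked with plain PyTorch before training. The assumption that the file deserializes to a plain dict is ours; consult the OneRef loading code for the authoritative format:

```python
# Minimal sanity check: confirm a downloaded checkpoint deserializes and
# peek at its top-level keys. The dict layout is an assumption, not a
# documented OneRef format.
import torch

ckpt = torch.load(
    "rec_single_dataset_finetuning_large_unc.pth", map_location="cpu"
)
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])
else:
    print("checkpoint object type:", type(ckpt))
```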
@@ -319,14 +319,14 @@ All the models are publicly available on the [**OneRef Huggingface homepage**](h
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> base model </th> <!-- table head -->
- <th style="text-align:center" colspan="3"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_mixup_grounding_pretraining_base.zip">rec_mixup_grounding_pretraining_base.zip, ~6.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="3"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_mixup_grounding_pretraining_base.zip">rec_mixup_grounding_pretraining_base.zip, ~6.0 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > Large model </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_mixup_grounding_pretraining_large_unc%2Bg.pth">mixup_pretraining_large_unc+g, ~8.0 GB</a> </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_mixup_grounding_pretraining_large_referit.pth">mixup_pretraining_large_referit, ~8.0 GB</a> </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_mixup_grounding_pretraining_large_flickr.pth">mixup_pretraining_large_flickr, ~8.0 GB</a> </th>
- </tr>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_mixup_grounding_pretraining_large_unc%2Bg.pth">mixup_pretraining_large_unc+g, ~8.0 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_mixup_grounding_pretraining_large_referit.pth">mixup_pretraining_large_referit, ~8.0 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_mixup_grounding_pretraining_large_flickr.pth">mixup_pretraining_large_flickr, ~8.0 GB</a> </th>
+ </tr>
 </table>
 
 
@@ -339,7 +339,7 @@ All the models are publicly available on the [**OneRef Huggingface homepage**](h
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> base model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/rec_mixup_grounding_ultimate_performance_base_in_the_survey.zip">rec_mixup_grounding_ultimate_performance_base.zip, ~6.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/rec_mixup_grounding_ultimate_performance_base_in_the_survey.zip">rec_mixup_grounding_ultimate_performance_base.zip, ~6.0 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > Large model </th>
@@ -359,13 +359,13 @@ All the models are publicly available on the [**OneRef Huggingface homepage**](h
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> base model </th> <!-- table head -->
- <th style="text-align:center" colspan="3"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_single_dataset_finetuning_base.zip"> res_single_dataset_finetuning_base.zip, ~6.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="3"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_single_dataset_finetuning_base.zip"> res_single_dataset_finetuning_base.zip, ~6.0 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Large model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_single_dataset_finetuning_large_unc.pth">finetuning_large_unc, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_single_dataset_finetuning_large_unc%2B.pth">finetuning_large_unc+, ~8.0 GB </a> </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_single_dataset_finetuning_large_gref_umd.pth">finetuning_large_gref_umd, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_single_dataset_finetuning_large_unc.pth">finetuning_large_unc, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_single_dataset_finetuning_large_unc%2B.pth">finetuning_large_unc+, ~8.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_single_dataset_finetuning_large_gref_umd.pth">finetuning_large_gref_umd, ~8.0 GB </a> </th> <!-- table head -->
 </tr>
 </table>
 
@@ -380,11 +380,11 @@ All the models are publicly available on the [**OneRef Huggingface homepage**](h
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> base model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_mixup_grounding_pretraining_base.zip">res_mixup_pretraining_base.zip, ~1.0 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_mixup_grounding_pretraining_base.zip">res_mixup_pretraining_base.zip, ~1.0 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > Large model </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/res_mixup_grounding_pretraining_large_unc_%2B_g.pth">res_mixup_pretraining_large, ~2.0 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/res_mixup_grounding_pretraining_large_unc_%2B_g.pth">res_mixup_pretraining_large, ~2.0 GB</a> </th>
 </tr>
 </table>
 
@@ -423,7 +423,7 @@ the five datasets at once and just using a single script.
 the MRefM pre-training **for the RES task** is mainly carried out through a mixture of the RefC datasets.
 
 For MRefM pre-training, the base model took 15 hours on 32 NVIDIA A100 GPUs, while the large model took 50 hours on
- the same number of GPUs. We provide the MRefM pre-trained checkpoints at the following: All model are placed in [HuggingFace Page](https://huggingface.co/xiaolinhui/OneRef/tree/main)
+ the same number of GPUs. We provide the MRefM pre-trained checkpoints below; all models are hosted on the [HuggingFace page](https://huggingface.co/linhuixiao/OneRef/tree/main).
 
 
 <table>
@@ -435,12 +435,12 @@ the same number of GPUs. We provide the MRefM pre-trained checkpoints at the fol
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Base model </th> <!-- table head -->
 <th style="text-align:center" rowspan="1"> RefC,ReferIt </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/mrefm_pretrain_patch16_384/rec_mrefm_pretrain_base_patch16_384.pth">rec_mrefm_base_patch16_384, ~2 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/mrefm_pretrain_patch16_384/rec_mrefm_pretrain_base_patch16_384.pth">rec_mrefm_base_patch16_384, ~2 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > Large model </th>
 <th style="text-align:center" rowspan="1"> RefC,ReferIt </th> <!-- table head -->
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/mrefm_pretrain_patch16_384/rec_mrefm_pretrain_large_patch16_384.pth">rec_mrefm_large_patch16_384, ~7 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/mrefm_pretrain_patch16_384/rec_mrefm_pretrain_large_patch16_384.pth">rec_mrefm_large_patch16_384, ~7 GB</a> </th>
 </tr>
 </table>
 
@@ -455,12 +455,12 @@ the same number of GPUs. We provide the MRefM pre-trained checkpoints at the fol
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Base model </th> <!-- table head -->
 <th style="text-align:center" > RefC </th>
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/mrefm_pretrain_patch16_384/res_mrefm_pretrain_base_patch16_384.pth">res_mrefm_base_patch16_384, ~2 GB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/mrefm_pretrain_patch16_384/res_mrefm_pretrain_base_patch16_384.pth">res_mrefm_base_patch16_384, ~2 GB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > Large model </th>
 <th style="text-align:center" > RefC </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/mrefm_pretrain_patch16_384/res_mrefm_pretrain_large_patch16_384.pth">res_mrefm_base_patch16_384, ~7 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/mrefm_pretrain_patch16_384/res_mrefm_pretrain_large_patch16_384.pth">res_mrefm_large_patch16_384, ~7 GB</a> </th>
 </tr>
 </table>
 
@@ -479,19 +479,19 @@ the [BEiT-3 official repository](https://github.com/microsoft/unilm/tree/master/
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> Sentencepiece model (Tokenizer) </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/beit3_checkpoints/beit3.spm">sp3 Sentencepiece model, 1 MB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/beit3_checkpoints/beit3.spm">sp3 Sentencepiece model, 1 MB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> MIM VQKD model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/beit3_checkpoints/vqkd_encoder_base_decoder_3x768x12_clip-d5036aa7.pth">vqkd model, 438 MB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/beit3_checkpoints/vqkd_encoder_base_decoder_3x768x12_clip-d5036aa7.pth">vqkd model, 438 MB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 2 -->
 <th style="text-align:center" rowspan="1"> BEiT-3 Base model </th> <!-- table head -->
- <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/beit3_checkpoints/beit3_base_indomain_patch16_224.pth">beit3_base_indomain_patch16_224, 554 MB </a> </th> <!-- table head -->
+ <th style="text-align:center" colspan="1"> <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/beit3_checkpoints/beit3_base_indomain_patch16_224.pth">beit3_base_indomain_patch16_224, 554 MB </a> </th> <!-- table head -->
 </tr>
 <tr> <!-- line 3 -->
 <th style="text-align:center" > BEiT-3 Large model </th>
- <th style="text-align:center" > <a href="https://huggingface.co/xiaolinhui/OneRef/blob/main/beit3_checkpoints/beit3_large_indomain_patch16_224.pth">beit3_large_indomain_patch16_224, 1.5 GB</a> </th>
+ <th style="text-align:center" > <a href="https://huggingface.co/linhuixiao/OneRef/blob/main/beit3_checkpoints/beit3_large_indomain_patch16_224.pth">beit3_large_indomain_patch16_224, 1.5 GB</a> </th>
 </tr>
 </table>
 
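The `beit3.spm` tokenizer listed in the table above is a standard SentencePiece model and can be exercised directly with the `sentencepiece` package; the input sentence here is an arbitrary example expression, not one from the datasets:

```python
# Minimal sketch: load the BEiT-3 SentencePiece tokenizer from the table
# above (pip install sentencepiece) and round-trip an arbitrary phrase.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="beit3.spm")
ids = sp.encode("the man in the red shirt", out_type=int)
print(ids)             # token ids
print(sp.decode(ids))  # back to text
```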
 