---
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---
This model, OneRef, is presented in the paper [OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling](https://proceedings.neurips.cc/paper_files/paper/2024/file/fcd812a51b8f8d05cfea22e3c9c4b369-Paper-Conference.pdf) and is also discussed in our survey [Towards Visual Grounding: A Survey](https://huggingface.co/papers/2412.20206).
Code for this model: https://github.com/linhuixiao/OneRef
**NeurIPS 2024**

Linhui Xiao · Xiaoshan Yang · Fang Peng · Yaowei Wang · Changsheng Xu
A comparison between the OneRef model and mainstream REC/RES architectures.
This repository is the official PyTorch implementation of the paper [**OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling**](https://openreview.net/pdf?id=siPdcro6uD) ([Publication](https://proceedings.neurips.cc/paper_files/paper/2024/file/fcd812a51b8f8d05cfea22e3c9c4b369-Paper-Conference.pdf), [GitHub Code](https://github.com/linhuixiao/OneRef), [HuggingFace model](https://huggingface.co/linhuixiao/OneRef)), which is an advanced version of our preliminary works **HiVG** ([Publication](https://dl.acm.org/doi/abs/10.1145/3664647.3681071), [Paper](https://openreview.net/pdf?id=NMMyGy1kKZ), [Code](https://github.com/linhuixiao/HiVG)) and **CLIP-VG** ([Publication](https://ieeexplore.ieee.org/abstract/document/10269126), [Paper](https://arxiv.org/pdf/2305.08685), [Code](https://github.com/linhuixiao/CLIP-VG)). If you have any questions, please feel free to open an issue or contact me by email.

## Datasets

| Datasets | RefCOCO | RefCOCO+ | RefCOCOg-g | RefCOCOg-u | ReferIt | Flickr | mixup_with_refc | mixup_with_refc_referit |
|---|---|---|---|---|---|---|---|---|
| url, size | All of six datasets, ~400.0 MB | | | | | | | |


### REC task: Single-dataset fine-tuning

| Datasets | RefCOCO | RefCOCO+ | RefCOCOg-u | ReferIt | Flickr |
|---|---|---|---|---|---|
| Base model | Google Drive, rec_single_dataset_finetuning_base.zip (for all), ~9.0 GB | | | | |
| Base model | Hugging Face, rec_single_dataset_finetuning_base.zip (for all), ~9.0 GB | | | | |
| Large model | finetuning_large_unc, ~8.0 GB | finetuning_large_unc+, ~8.0 GB | finetuning_large_gref_umd, ~8.0 GB | finetuning_large_referit, ~8.0 GB | finetuning_large_flickr, ~8.0 GB |

### REC task: Mixup grounding pre-training

| Datasets | Mixup (RefCOCO/+/g) | ReferIt | Flickr |
|---|---|---|---|
| Base model | rec_mixup_grounding_pretraining_base.zip, ~6.0 GB | | |
| Large model | mixup_pretraining_large_unc+g, ~8.0 GB | mixup_pretraining_large_referit, ~8.0 GB | mixup_pretraining_large_flickr, ~8.0 GB |

### REC task: Ultimate performance prediction in our [Grounding Survey paper](https://arxiv.org/pdf/2412.20206)

| Datasets | Mixup (RefCOCO/+/g) |
|---|---|
| Base model | rec_mixup_grounding_ultimate_performance_base.zip, ~6.0 GB |
| Large model | rec_mixup_grounding_ultimate_performance_large, ~8.0 GB |

### RES task: Single-dataset fine-tuning

| Datasets | RefCOCO | RefCOCO+ | RefCOCOg-u |
|---|---|---|---|
| Base model | res_single_dataset_finetuning_base.zip, ~6.0 GB | | |
| Large model | finetuning_large_unc, ~8.0 GB | finetuning_large_unc+, ~8.0 GB | finetuning_large_gref_umd, ~8.0 GB |

### RES task: Mixup pre-training

| Datasets | Mixup (RefCOCO/+/g) |
|---|---|
| Base model | res_mixup_pretraining_base.zip, ~1.0 GB |
| Large model | res_mixup_pretraining_large, ~2.0 GB |

### MRefM pre-trained models

| MRefM Model for REC | Pretraining dataset | Checkpoints |
|---|---|---|
| Base model | RefC, ReferIt | rec_mrefm_base_patch16_384, ~2 GB |
| Large model | RefC, ReferIt | rec_mrefm_large_patch16_384, ~7 GB |

| MRefM Model for RES | Pretraining dataset | Checkpoints |
|---|---|---|
| Base model | RefC | res_mrefm_base_patch16_384, ~2 GB |
| Large model | RefC | res_mrefm_large_patch16_384, ~7 GB |

### BEiT-3 original models

| BEiT-3 original model | Checkpoints |
|---|---|
| Sentencepiece model (Tokenizer) | sp3 Sentencepiece model, 1 MB |
| MIM VQKD model | vqkd model, 438 MB |
| BEiT-3 Base model | beit3_base_indomain_patch16_224, 554 MB |
| BEiT-3 Large model | beit3_large_indomain_patch16_224, 1.5 GB |
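The checkpoints listed above can also be pulled programmatically with the `huggingface_hub` client. A minimal sketch, assuming the archives live in the `linhuixiao/OneRef` repo under the filenames shown in the tables (the `.zip` extension for entries listed without one is an assumption — verify the exact names in the repo's file listing):

```python
REPO_ID = "linhuixiao/OneRef"  # Hugging Face repo named in this model card


def checkpoint_filename(name: str, ext: str = ".zip") -> str:
    """Normalize a table entry (e.g. 'res_mixup_pretraining_base') to an
    archive filename; appending .zip is an assumption for entries listed
    without an extension."""
    return name if name.endswith(ext) else name + ext


def fetch_checkpoint(name: str) -> str:
    """Download (and locally cache) one checkpoint from the Hub.
    Requires network access and `pip install huggingface_hub`."""
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(name))


# Example (downloads several GB; see the size columns above):
# path = fetch_checkpoint("res_mixup_pretraining_base")
```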

An illustration of our multimodal Mask Referring Modeling (MRefM) paradigm, which includes referring-aware mask image modeling and referring-aware mask language modeling.

An illustration of the referring-based grounding and segmentation transfer.
Illustrations of random masking (MAE) [27], block-wise masking (BEiT) [4], and our referring-aware dynamic masking. α denotes the overall masking ratio, while β and γ denote the masking ratios beyond and within the referred region, respectively.
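The three schemes differ only in how patches are selected for masking. A minimal NumPy sketch of the referring-aware variant, assuming the referred region is given as a box in patch-grid coordinates and that patches inside it are masked at ratio γ and those outside at ratio β (the overall ratio α is then their area-weighted average; the exact ratios and the dynamic per-sample behavior follow the paper, not this sketch):

```python
import numpy as np


def referring_aware_mask(grid_h, grid_w, box, beta, gamma, seed=None):
    """Sketch of referring-aware masking over a patch grid.

    box  : (x0, y0, x1, y1) patch coordinates of the referred region.
    beta : masking ratio beyond (outside) the referred region.
    gamma: masking ratio within the referred region (typically gamma > beta,
           so the referred object is reconstructed more heavily).
    Returns a boolean (grid_h, grid_w) array; True marks a masked patch.
    """
    rng = np.random.default_rng(seed)
    inside = np.zeros((grid_h, grid_w), dtype=bool)
    x0, y0, x1, y1 = box
    inside[y0:y1, x0:x1] = True

    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for region, ratio in ((inside, gamma), (~inside, beta)):
        idx = np.flatnonzero(region)                     # candidate patches
        n_masked = int(round(ratio * idx.size))          # per-region budget
        mask.flat[rng.choice(idx, size=n_masked, replace=False)] = True
    return mask


# The overall ratio alpha is the area-weighted mix of the two ratios:
#   alpha = (n_inside * gamma + n_outside * beta) / (n_inside + n_outside)
```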
## Visualization
Qualitative results on the RefCOCO-val dataset.
Qualitative results on the RefCOCO+-val dataset.
Qualitative results on the RefCOCOg-val dataset.
Each example shows two different query texts. From left to right: the original input image, the ground truth with box and segmentation mask (in green), the RES prediction of OneRef (in cyan), the REC prediction of OneRef (in cyan), and the cross-modal feature.

## Contacts

Email: