File size: 8,818 Bytes
36877d4 bb5408e 36877d4 1d01d95 7df4525 db1831e 240e3a9 b2a6540 1380629 296854f 5f86cd7 e30a42a 5406276 296854f d67a49b 9cdba25 d67a49b 9cdba25 132ea2c 0482ffe 91dd55e 0482ffe 87b58f0 c61a302 1849123 c61a302 0482ffe 91dd55e 1849123 91dd55e 1849123 0482ffe f00bdc5 132ea2c b45f922 132ea2c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 |
---
license: llama2
pipeline_tag: image-text-to-text
---
# UGround (The Initial LLaVA-based Version)
**Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We suggest using them instead for better performance and more convenient training, inference and deployment.**
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).

- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:gou.43@osu.edu)
## Models
- Model-V1:
- [Initial UGround](https://huggingface.co/osunlp/UGround):
- [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
- [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
- [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
- [Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
## Release Plan
- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
- [x] Initial Version (the one used in the paper)
- [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
- [x] Code
- [x] [Inference Code of UGround (Initial & Qwen2-VL-Based)](https://github.com/boyugou/llava_uground/)
- [x] Offline Experiments (Code, Results, and Useful Resources)
- [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
- [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
- [x] [OmniAct](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
- [x] [Android Control](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
- [x] Online Experiments
- [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
- [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
- [ ] Data Synthesis Pipeline (Coming Soon)
- [x] [Training-Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
- [x] Online Demo (HF Spaces)
## Main Results
### GUI Visual Grounding: ScreenSpot (Standard Setting)
| Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **82.8** | **60.3** | **82.5** | **63.6** | **80.4** | **70.4** | **73.3** |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Auguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Auguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | **95.6** | 77.7 | **93.8** | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | **90.9** | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 93.0 | **79.9** | **93.8** | **76.4** | **90.9** | **84.0** | **86.3** |
### GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **93.4** | **76.9** | **92.8** | **67.9** | **88.7** | **68.9** | **81.4** |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | **94.1** | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | **79.9** | 90.2 | 66.4 | **92.6** | **79.1** | 83.7 |
| GPT-4o | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | **79.9** | **93.3** | **73.6** | 89.6 | 73.3 | **84.0** |

## Citation Information
If you find this work useful, please consider citing our papers:
```
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2401.01614},
year={2024},
}
```
|