---
license: llama2
pipeline_tag: image-text-to-text
---

# UGround (The Initial LLaVA-based Version)

**Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We suggest using them instead for better performance and more convenient training, inference and deployment.**

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).



- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:gou.43@osu.edu)

## Models

- Model-V1:
  - [Initial UGround](https://huggingface.co/osunlp/UGround)
  - [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
  - [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
  - [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
- [Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)

## Release Plan

- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
- [x] Code
  - [x] [Inference Code of UGround (Initial & Qwen2-VL-Based)](https://github.com/boyugou/llava_uground/)
  - [x] Offline Experiments (Code, Results, and Useful Resources)
    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
    - [x] [OmniACT](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
    - [x] [AndroidControl](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
  - [x] Online Experiments
    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
- [ ] Data Synthesis Pipeline (Coming Soon)
- [x] [Training-Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
- [x] Online Demo (HF Spaces)

## Main Results

### GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **82.8** | **60.3** | **82.5** | **63.6** | **80.4** | **70.4** | **73.3** |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | **95.6** | 77.7 | **93.8** | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | **90.9** | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 93.0 | **79.9** | **93.8** | **76.4** | **90.9** | **84.0** | **86.3** |
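
This card does not define the Avg column, but the reported values are consistent with an unweighted mean over the six split scores. Below is a minimal Python sketch of that check, using three rows copied from the table above. The helper name `macro_average` is ours for illustration and is not part of any UGround code, and the 0.05 tolerance is an assumption that accounts for the table reporting one decimal place.

```python
# Scores copied from the ScreenSpot (Standard Setting) table, in column order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
# Each entry pairs the six split scores with the Avg value reported in the table.
rows = {
    "SeeClick": ([78.0, 52.0, 72.2, 30.0, 55.7, 32.5], 53.4),
    "UGround-V1": ([82.8, 60.3, 82.5, 63.6, 80.4, 70.4], 73.3),
    "UGround-V1-7B (Qwen2-VL)": ([93.0, 79.9, 93.8, 76.4, 90.9, 84.0], 86.3),
}


def macro_average(scores):
    """Unweighted mean over the splits (our assumed definition of Avg)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    avg = macro_average(scores)
    # Allow rounding slack: the table reports a single decimal place.
    assert abs(avg - reported) <= 0.05 + 1e-9, name
    print(f"{name}: computed {avg:.2f}, reported {reported}")
```

Because the mean is unweighted, each split counts equally regardless of how many examples it contains, so a gain on a small split moves Avg as much as the same gain on a large one.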

### GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | **UGround-V1** | LLaVA-UGround-V1 | UGround-V1 | **93.4** | **76.9** | **92.8** | **67.9** | **88.7** | **68.9** | **81.4** |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | **94.1** | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | **79.9** | 90.2 | 66.4 | **92.6** | **79.1** | 83.7 |
| GPT-4o | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | **94.1** | **79.9** | **93.3** | **73.6** | 89.6 | 73.3 | **84.0** |



## Citation Information

If you find this work useful, please consider citing our papers:

```
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```