	---
	license: llama2
	pipeline_tag: image-text-to-text
	---

	# UGround (The Initial LLaVA-based Version)

	Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We suggest using them instead for better performance and more convenient training, inference and deployment.

	UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).
	![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)
	- Homepage: https://osu-nlp-group.github.io/UGround/
	- Repository: https://github.com/OSU-NLP-Group/UGround
	- Paper: https://arxiv.org/abs/2410.05243
	- Demo: https://huggingface.co/spaces/orby-osu/UGround
	- Point of Contact: [Boyu Gou](mailto:gou.43@osu.edu)


	## Models

	- Model-V1:
	  - [Initial UGround](https://huggingface.co/osunlp/UGround)
	  - [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
	  - [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
	  - [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
	- [Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)

	## Release Plan

	- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
	  - [x] Initial Version (the one used in the paper)
	  - [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
	- [x] Code
	  - [x] [Inference Code of UGround (Initial & Qwen2-VL-Based)](https://github.com/boyugou/llava_uground/)
	  - [x] Offline Experiments (Code, Results, and Useful Resources)
	    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
	    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
	    - [x] [OmniACT](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
	    - [x] [AndroidControl](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
	  - [x] Online Experiments
	    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
	    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
	- [ ] Data Synthesis Pipeline (Coming Soon)
	- [x] [Training-Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
	- [x] Online Demo (HF Spaces)


	## Main Results

	### GUI Visual Grounding: ScreenSpot (Standard Setting)

	| Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
	| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
	| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
	| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
	| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
	| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
	| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
	| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
	| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
	| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
	| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
	| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
	| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
	| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
	| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
	| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
	| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
	| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
	| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
	| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
	| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
	| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
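
For readers who want to reproduce the `Avg` column: for the UGround rows it matches the unweighted mean of the six split scores, rounded to one decimal place. The sketch below checks this for those three rows only, with scores copied from the table. The helper name `macro_average` is illustrative and does not come from the UGround codebase, and the README does not state how `Avg` was actually computed.

```python
from statistics import mean

# Per-split accuracies copied from the table above, in column order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
rows = {
    "UGround-V1": [82.8, 60.3, 82.5, 63.6, 80.4, 70.4],
    "UGround-V1-2B (Qwen2-VL)": [89.4, 72.0, 88.7, 65.7, 81.3, 68.9],
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
}

# Values of the Avg column for the same rows.
reported_avg = {
    "UGround-V1": 73.3,
    "UGround-V1-2B (Qwen2-VL)": 77.7,
    "UGround-V1-7B (Qwen2-VL)": 86.3,
}


def macro_average(scores):
    """Unweighted mean over the splits, rounded to one decimal place."""
    return round(mean(scores), 1)


for name, scores in rows.items():
    computed = macro_average(scores)
    status = "matches" if computed == reported_avg[name] else "differs from"
    print(f"{name}: computed {computed} {status} reported {reported_avg[name]}")
```

Because the mean is unweighted, each split counts equally regardless of how many examples it contains.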

	### GUI Visual Grounding: ScreenSpot (Agent Setting)

	| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
	| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
	| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
	| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
	| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
	| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
	| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
	| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
	| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
	| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
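
To compare the two ScreenSpot settings for a single model, the per-split differences can be read off the two tables. The sketch below does this for the initial LLaVA-based UGround-V1 row, using scores copied from the Standard Setting and Agent Setting tables. `SPLITS` and `per_split_delta` are illustrative names, not part of any UGround code. This README does not define the two settings in detail, so treat the differences as descriptive only and refer to the paper for the protocol.

```python
# Split names in the column order used by both ScreenSpot tables.
SPLITS = [
    "Mobile-Text", "Mobile-Icon",
    "Desktop-Text", "Desktop-Icon",
    "Web-Text", "Web-Icon",
]

# UGround-V1 (LLaVA-based) scores copied from the two tables.
standard_setting = [82.8, 60.3, 82.5, 63.6, 80.4, 70.4]
agent_setting = [93.4, 76.9, 92.8, 67.9, 88.7, 68.9]


def per_split_delta(before, after):
    """Difference (after - before) per split, rounded to one decimal place."""
    return [round(a - b, 1) for b, a in zip(before, after)]


deltas = dict(zip(SPLITS, per_split_delta(standard_setting, agent_setting)))

for split, delta in deltas.items():
    print(f"{split}: {delta:+.1f}")
```

A positive value means the Agent Setting score is higher for that split, and a negative value means it is lower.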





	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)

	## Citation Information

	If you find this work useful, please consider citing our papers:

	```
	@article{gou2024uground,
	title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
	author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
	journal={arXiv preprint arXiv:2410.05243},
	year={2024},
	url={https://arxiv.org/abs/2410.05243},
	}

	@article{zheng2023seeact,
	title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
	author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
	journal={arXiv preprint arXiv:2401.01614},
	year={2024},
	}
	```