---
license: llama2
pipeline_tag: image-text-to-text
---

# UGround (The Initial LLaVA-based Version)

**Update: We have trained [stronger models](https://huggingface.co/osunlp/UGround-V1-7B) based on Qwen2-VL with the same data. We recommend using them instead for better performance and easier training, inference, and deployment.**

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between [OSU NLP Group](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).
![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)
- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:gou.43@osu.edu)


## Models

- Model-V1:
  - [Initial UGround (LLaVA-based)](https://huggingface.co/osunlp/UGround)
  - [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
  - [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
  - [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
  - [Training Data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
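The Qwen2-VL-based checkpoints take a screenshot plus a natural-language element description and return a coordinate point. A minimal sketch of building such a grounding query in the standard Qwen2-VL chat-message format; the instruction wording below is illustrative, not the exact prompt from the paper (see the [inference repo](https://github.com/boyugou/llava_uground/) for that):

```python
def build_grounding_messages(image_path: str, element_description: str):
    """Build a Qwen2-VL-style chat message for a GUI grounding query.

    The instruction text is an illustrative placeholder; the exact prompt
    used for UGround is in the official inference repo.
    """
    return [
        {
            "role": "user",
            "content": [
                # The image entry follows the Qwen2-VL chat template convention.
                {"type": "image", "image": image_path},
                {
                    "type": "text",
                    "text": (
                        f'Locate the element described as: "{element_description}". '
                        "Answer with a coordinate point."
                    ),
                },
            ],
        }
    ]
```

The resulting message list can be passed to the processor's chat template (e.g. `processor.apply_chat_template(...)`) before generation.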

## Release Plan

- [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
- [x] Code
  - [x] [Inference Code of UGround (Initial & Qwen2-VL-Based)](https://github.com/boyugou/llava_uground/)
  - [x] Offline Experiments (Code, Results, and Useful Resources)
    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
    - [x] [OmniACT](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
    - [x] [Android Control](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
  - [x] Online Experiments
    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
  - [ ] Data Synthesis Pipeline (Coming Soon)
- [x] [Training-Data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
- [x] Online Demo (HF Spaces)


## Main Results

### GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model       | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4                        |                  |                  | 22.6        | 24.5        | 20.2         | 11.8         | 9.2      | 8.8      | 16.2     |
| GPT-4o                       |                  |                  | 20.2        | 24.9        | 21.1         | 23.6         | 12.2     | 7.8      | 18.3     |
| MiniGPT-v2                   | MiniGPT-v2       |                  | 8.4         | 6.6         | 6.2          | 2.9          | 6.5      | 3.4      | 5.7      |
| Groma                        | Groma            |                  | 10.3        | 2.6         | 4.6          | 4.3          | 5.7      | 3.4      | 5.2      |
| Fuyu                         | Fuyu             |                  | 41.0        | 1.3         | 33.0         | 3.6          | 33.9     | 4.4      | 19.5     |
| Qwen-VL                      | Qwen-VL          |                  | 9.5         | 4.8         | 5.7          | 5.0          | 3.5      | 2.4      | 5.2      |
| SeeClick                     | Qwen-VL          | SeeClick         | 78.0        | 52.0        | 72.2         | 30.0         | 55.7     | 32.5     | 53.4     |
| Qwen-GUI                     | Qwen-VL          | GUICourse        | 52.4        | 10.9        | 45.9         | 5.7          | 43.0     | 13.6     | 28.6     |
| **UGround-V1**               | LLaVA-UGround-V1 | UGround-V1       | **82.8**        | **60.3**        | **82.5**         | **63.6**         | **80.4**     | **70.4**     | **73.3**     |
| Qwen2-VL                     | Qwen2-VL         |                  | 61.3        | 39.3        | 52.0         | 45.0         | 33.0     | 21.8     | 42.1     |
| Aguvis-G-7B                  | Qwen2-VL         | Aguvis-Stage-1   | 88.3        | 78.2        | 88.1         | 70.7         | 85.7     | 74.8     | 81.0     |
| Aguvis-7B                    | Qwen2-VL         | Aguvis-Stage-1&2 | **95.6**    | 77.7        | **93.8**     | 67.1         | 88.3     | 75.2     | 83.0     |
| OS-Atlas-Base-4B             | InternVL         | OS-Atlas         | 85.7        | 58.5        | 72.2         | 45.7         | 82.6     | 63.1     | 68.0     |
| OS-Atlas-Base-7B             | Qwen2-VL         | OS-Atlas         | 93.0        | 72.9        | 91.8         | 62.9         | **90.9** | 74.3     | 81.0     |
| ShowUI-G                     | ShowUI           | ShowUI           | 91.6        | 69.0        | 81.8         | 59.0         | 83.0     | 65.5     | 75.0     |
| ShowUI                       | ShowUI           | ShowUI           | 92.3        | 75.5        | 76.3         | 61.1         | 81.7     | 63.6     | 75.1     |
| Iris                         | Iris             | SeeClick         | 85.3        | 64.2        | 86.7         | 57.5         | 82.6     | 71.2     | 74.6     |
| Aria-UI                      | Aria             | Aria-UI          | 92.3        | 73.8        | 93.3         | 64.3         | 86.5     | 76.2     | 81.1     |
| **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | 89.4        | 72.0        | 88.7         | 65.7         | 81.3     | 68.9     | 77.7     |
| **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | 93.0        | **79.9**    | **93.8**     | **76.4**     | **90.9** | **84.0** | **86.3** |

### GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model          | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ------- | ------------------------ | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o  | Qwen-VL                  | Qwen-VL          |                  | 21.3        | 21.4        | 18.6         | 10.7         | 9.1      | 5.8      | 14.5     |
| GPT-4o  | SeeClick                 | Qwen-VL          | SeeClick         | 81.0        | 59.8        | 69.6         | 33.6         | 43.9     | 26.2     | 52.4     |
| GPT-4o  | Qwen-GUI                 | Qwen-VL          | GUICourse        | 67.8        | 24.5        | 53.1         | 16.4         | 50.4     | 18.5     | 38.5     |
| GPT-4o  | **UGround-V1**               | LLaVA-UGround-V1 | UGround-V1       | **93.4**        | **76.9**        | **92.8**         | **67.9**         | **88.7**     | **68.9**     | **81.4**     |
| GPT-4o  | OS-Atlas-Base-4B         | InternVL         | OS-Atlas         | **94.1**    | 73.8        | 77.8         | 47.1         | 86.5     | 65.3     | 74.1     |
| GPT-4o  | OS-Atlas-Base-7B         | Qwen2-VL         | OS-Atlas         | 93.8        | **79.9**    | 90.2         | 66.4         | **92.6** | **79.1** | 83.7     |
| GPT-4o  | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | 77.7        | 92.8         | 63.6         | 90.0     | 70.9     | 81.5     |
| GPT-4o  | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | **79.9**    | **93.3**     | **73.6**     | 89.6     | 73.3     | **84.0** |
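In the agent setting above, a planner (e.g. GPT-4o) describes the target element in natural language, and the grounding model maps that description to a point the agent can click. A minimal sketch of the post-processing step, assuming the model emits an `(x, y)` pair in a 0–1000 coordinate space (the exact output format can differ across checkpoints; check the inference repo for the one you use):

```python
import re


def parse_click_point(model_output: str, width: int, height: int, scale: int = 1000):
    """Parse an '(x, y)' pair from grounding-model output and map it from an
    assumed 0..scale coordinate space to pixel coordinates of the screenshot.

    Both the '(x, y)' output shape and the 0..1000 convention are assumptions
    for this sketch; verify them against the checkpoint you are running.
    """
    m = re.search(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)", model_output)
    if m is None:
        raise ValueError(f"no coordinate pair found in: {model_output!r}")
    x, y = float(m.group(1)), float(m.group(2))
    # Rescale to the actual screenshot resolution and round to whole pixels.
    return round(x / scale * width), round(y / scale * height)
```

For example, `parse_click_point("(500, 250)", 1920, 1080)` maps the model-space point to the center-left region of a 1080p screenshot.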





![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)

## Citation Information

If you find this work useful, please consider citing our papers: 

```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243}
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024}
}
```