---

license: mit
library_name: ultralytics
tags:
  - object-detection
  - yolo
  - gui
  - ui-detection
  - omniparser
pipeline_tag: object-detection
---


# GPA-GUI-Detector

A YOLO-based GUI element detection model for detecting interactive UI elements (icons, buttons, etc.) on screen for GUI Process Automation. This model is finetuned from the [OmniParser](https://github.com/microsoft/OmniParser) ecosystem.

## Model

The model weight file is `model.pt`. It is a YOLO model trained with the [Ultralytics](https://github.com/ultralytics/ultralytics) framework.

## Installation

```bash
pip install ultralytics
```

## Usage

### Basic Inference

```python
from ultralytics import YOLO

model = YOLO("model.pt")
results = model("screenshot.png")
```

### Detection with Custom Parameters

```python
from ultralytics import YOLO

# Load the model
model = YOLO("model.pt")

# Run inference with custom confidence and image size
results = model.predict(
    source="screenshot.png",
    conf=0.05,   # confidence threshold
    imgsz=640,   # input image size
    iou=0.7,     # NMS IoU threshold
)

# Parse results
boxes = results[0].boxes.xyxy.cpu().numpy()   # bounding boxes as [x1, y1, x2, y2]
scores = results[0].boxes.conf.cpu().numpy()  # confidence scores

# Print each detection
for box, score in zip(boxes, scores):
    x1, y1, x2, y2 = box
    print(f"Detected UI element at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}] (conf: {score:.2f})")

# Or save the annotated image directly
results[0].save("result.png")
```
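For GUI Process Automation, the usual next step after detection is clicking the element; the natural click target is the center of its bounding box. Below is a minimal, hypothetical helper (`box_center` is not part of Ultralytics) that converts an `[x1, y1, x2, y2]` box into integer pixel coordinates:

```python
def box_center(box):
    """Return the integer pixel center of an [x1, y1, x2, y2] box.

    The center is a natural click target for GUI automation tools.
    """
    x1, y1, x2, y2 = box
    return (int(round((x1 + x2) / 2)), int(round((y1 + y2) / 2)))


# Example: a detected button spanning x 100-200, y 40-80
print(box_center([100, 40, 200, 80]))  # -> (150, 60)
```

The resulting coordinates can be passed to an automation library of your choice (e.g. `pyautogui.click(x, y)`).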

### Integration with OmniParser

```python
import sys
sys.path.append("/path/to/OmniParser")

from util.utils import get_yolo_model, predict_yolo
from PIL import Image

model = get_yolo_model("model.pt")
image = Image.open("screenshot.png")

boxes, confidences, phrases = predict_yolo(
    model=model,
    image=image,
    box_threshold=0.05,
    imgsz=640,
    scale_img=False,
    iou_threshold=0.7,
)

for i, (box, conf) in enumerate(zip(boxes, confidences)):
    print(f"Element {i}: box={box.tolist()}, confidence={conf:.2f}")
```
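A low `box_threshold` such as 0.05 can return many weak detections. If you only need the most reliable elements, a small helper (hypothetical, not part of OmniParser) can rank the detections by confidence and keep the top few:

```python
def top_k_detections(boxes, confidences, k=5):
    """Return the k highest-confidence (box, confidence) pairs, best first."""
    ranked = sorted(zip(boxes, confidences), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Example with plain lists standing in for the model output
boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 2, 2]]
confs = [0.3, 0.9, 0.6]
print(top_k_detections(boxes, confs, k=2))
# -> [([5, 5, 20, 20], 0.9), ([1, 1, 2, 2], 0.6)]
```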

## Example

Detection results on a sample screenshot (1920x1080) from the [ScreenSpot-Pro](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) benchmark (`conf=0.05`, `iou=0.1`, `imgsz=1280`).

**Input Screenshot**

<p align="center">
  <img src="images/example_input.png" width="80%" alt="Input Screenshot"/>
</p>

<table>
  <tr>
    <th align="center">OmniParser V2</th>
    <th align="center">GPA-GUI-Detector</th>
  </tr>
  <tr>
    <td align="center"><img src="images/example_omniparser.png" width="92%" alt="OmniParser V2"/></td>
    <td align="center"><img src="images/example_gpa.png" width="99%" alt="GPA-GUI-Detector"/></td>
  </tr>
</table>


## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).