
	---
	license: mit
	library_name: ultralytics
	tags:
	- object-detection
	- yolo
	- gui
	- ui-detection
	- omniparser
	pipeline_tag: object-detection
	---

	# GPA-GUI-Detector

	A YOLO-based model that detects interactive UI elements (icons, buttons, etc.) in screenshots for GUI Process Automation (GPA). It is fine-tuned from the [OmniParser](https://github.com/microsoft/OmniParser) ecosystem.

	## Model

	The model weight file is `model.pt`. It is a YOLO model trained with the [Ultralytics](https://github.com/ultralytics/ultralytics) framework.

	## Installation

	```bash
	pip install ultralytics
	```

	## Usage

	### Basic Inference

	```python
	from ultralytics import YOLO

	model = YOLO("model.pt")
	results = model("screenshot.png")
	```
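The snippet above assumes `model.pt` is already in the working directory. A minimal sketch of one way to fetch it on demand is shown below. It assumes the weights are hosted on the Hugging Face Hub under the repo id `Salesforce/GPA-GUI-Detector` and that `huggingface_hub` is installed separately (`pip install huggingface_hub`). The helper name `resolve_weights` is hypothetical.

```python
from pathlib import Path

# Assumption: the Hub repo id for this model card.
REPO_ID = "Salesforce/GPA-GUI-Detector"
WEIGHTS_NAME = "model.pt"


def resolve_weights(local_dir="."):
    """Return a path to model.pt, downloading it only if no local copy exists."""
    local_path = Path(local_dir) / WEIGHTS_NAME
    if local_path.is_file():
        return str(local_path)
    # Imported lazily so a local copy works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=WEIGHTS_NAME)
```

The returned path can then be passed to `YOLO(...)` in place of the literal `"model.pt"`.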

	### Detection with Custom Parameters

	```python
	from ultralytics import YOLO
	from PIL import Image, ImageDraw

	# Load the model
	model = YOLO("model.pt")

	# Run inference with custom confidence and image size
	results = model.predict(
	    source="screenshot.png",
	    conf=0.05,  # confidence threshold
	    imgsz=640,  # input image size
	    iou=0.7,  # NMS IoU threshold
	)

	# Parse results
	boxes = results[0].boxes.xyxy.cpu().numpy()  # bounding boxes in [x1, y1, x2, y2]
	scores = results[0].boxes.conf.cpu().numpy()  # confidence scores

	# Draw results on image
	img = Image.open("screenshot.png").convert("RGB")
	draw = ImageDraw.Draw(img)
	for box, score in zip(boxes, scores):
	    x1, y1, x2, y2 = box.tolist()  # plain Python floats for Pillow
	    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
	    print(f"Detected UI element at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}] (conf: {score:.2f})")
	img.save("result_manual.png")  # output filename is an arbitrary choice

	# Or save the annotated image directly
	results[0].save("result.png")
	```
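For GUI automation, the `[x1, y1, x2, y2]` boxes parsed above usually need to be turned into click targets. The following is a minimal sketch of that step, not part of the model or of Ultralytics. The helper name `box_centers` and its `min_conf` parameter are hypothetical, and it accepts either NumPy arrays or plain lists.

```python
def box_centers(boxes, scores, min_conf=0.05):
    """Return (cx, cy, score) tuples for boxes scoring at least min_conf,
    sorted from highest to lowest confidence."""
    targets = []
    for box, score in zip(boxes, scores):
        if score < min_conf:
            continue
        x1, y1, x2, y2 = (float(v) for v in box)
        # The centre of the box is the midpoint of its two corners.
        targets.append(((x1 + x2) / 2, (y1 + y2) / 2, float(score)))
    targets.sort(key=lambda t: t[2], reverse=True)
    return targets
```

Calling `box_centers(boxes, scores)` with the arrays from the snippet above yields pixel coordinates in the original screenshot, since `boxes.xyxy` is already expressed in the input image's coordinate space.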

	### Integration with OmniParser

	```python
	import sys
	sys.path.append("/path/to/OmniParser")

	from util.utils import get_yolo_model, predict_yolo
	from PIL import Image

	model = get_yolo_model("model.pt")
	image = Image.open("screenshot.png")

	boxes, confidences, phrases = predict_yolo(
	    model=model,
	    image=image,
	    box_threshold=0.05,
	    imgsz=640,
	    scale_img=False,
	    iou_threshold=0.7,
	)

	for i, (box, conf) in enumerate(zip(boxes, confidences)):
	    print(f"Element {i}: box={box.tolist()}, confidence={conf:.2f}")
	```

	## Example

	Detection results on a sample screenshot (1920x1080) from the [ScreenSpot-Pro](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) benchmark (`conf=0.05`, `iou=0.1`, `imgsz=1280`).

	Input Screenshot

	<p align="center">
	<img src="images/example_input.png" width="80%" alt="Input Screenshot"/>
	</p>

	<table>
	<tr>
	<th align="center">OmniParser V2</th>
	<th align="center">GPA-GUI-Detector</th>
	</tr>
	<tr>
	<td align="center"><img src="images/example_omniparser.png" width="92%" alt="OmniParser V2"/></td>
	<td align="center"><img src="images/example_gpa.png" width="99%" alt="GPA-GUI-Detector"/></td>
	</tr>
	</table>
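The example uses `iou=0.1`, a much stricter NMS threshold than the `iou=0.7` in the usage snippets: any box that overlaps a higher-scoring box by more than 10% IoU is dropped. The plain-Python sketch below illustrates the effect with a simple greedy NMS. It is an illustration only, not the Ultralytics implementation, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two 100x100 boxes offset by 50 px horizontally: IoU = 5000 / 15000 = 1/3.
boxes = [(0, 0, 100, 100), (50, 0, 150, 100)]
scores = [0.9, 0.6]
print(greedy_nms(boxes, scores, iou_threshold=0.1))  # [0]
print(greedy_nms(boxes, scores, iou_threshold=0.7))  # [0, 1]
```

With the low threshold, the second box is suppressed as a duplicate. With `0.7`, both survive.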

	## License

	This model is released under the [MIT License](https://opensource.org/licenses/MIT).