---
license: mit
tags:
- object-detection
- yolo11
- ui-elements
- windows
- ultralytics
datasets:
- ui_synth_v2
pipeline_tag: object-detection
---
# Windows UI Element Detector — YOLO11s for Windows UI Elements
## Model Summary
A YOLO11s (small) model fine-tuned on 3 000 synthetic Windows-style UI screenshots to detect interactive UI elements. Designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.
## Classes
| ID | Class |
|----|------------|
| 0 | button |
| 1 | textbox |
| 2 | checkbox |
| 3 | dropdown |
| 4 | icon |
| 5 | tab |
| 6 | menu_item |
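
When post-processing raw predictions it can help to keep the table above as a lookup. The following is a minimal sketch; the names `CLASS_NAMES`, `CLASS_IDS` and `class_name` are illustrative and are not part of the published package.

```python
# Class IDs copied from the table above. The helper names here are
# illustrative assumptions, not part of the model's own tooling.
CLASS_NAMES = {
    0: "button",
    1: "textbox",
    2: "checkbox",
    3: "dropdown",
    4: "icon",
    5: "tab",
    6: "menu_item",
}

# Reverse lookup: label -> class ID
CLASS_IDS = {name: idx for idx, name in CLASS_NAMES.items()}


def class_name(class_id: int) -> str:
    """Return the label for a class ID, or 'unknown' for IDs outside the table."""
    return CLASS_NAMES.get(class_id, "unknown")
```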
## Training Data
Trained on `ui_synth_v2`, a synthetic dataset of 3 000 Windows-style UI screenshots generated from HTML/CSS templates rendered with Playwright. The generator applies domain randomization (themes, fonts, scaling, noise).
## Metrics
| Metric | Value |
|--------------|--------|
| mAP50 | 0.9886 |
| mAP50-95 | 0.9543 |
| Precision | 0.9959 |
| Recall | 0.9730 |
### Per-Class AP@50
| Class | AP@50 |
|------------|--------|
| button | 0.9919 |
| textbox | 0.9771 |
| checkbox | 0.9864 |
| dropdown | 0.9829 |
| icon | 0.9950 |
| tab | 0.9950 |
| menu_item | 0.9915 |
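
As a quick consistency check, the per-class AP@50 values should average to roughly the reported mAP50. The sketch below reproduces that arithmetic from the numbers in the two tables. It also derives an F1 score from the reported precision and recall; that F1 value is a derived figure, not one reported by the training run.

```python
from statistics import mean

# Values copied from the tables above.
PER_CLASS_AP50 = {
    "button": 0.9919,
    "textbox": 0.9771,
    "checkbox": 0.9864,
    "dropdown": 0.9829,
    "icon": 0.9950,
    "tab": 0.9950,
    "menu_item": 0.9915,
}
REPORTED_MAP50 = 0.9886
PRECISION, RECALL = 0.9959, 0.9730

# mAP50 is the unweighted mean of the per-class AP@50 values.
macro_ap50 = mean(PER_CLASS_AP50.values())

# Harmonic mean of precision and recall (derived here, not reported above).
f1 = 2 * PRECISION * RECALL / (PRECISION + RECALL)

print(f"mean per-class AP@50 = {macro_ap50:.4f} (reported mAP50 = {REPORTED_MAP50:.4f})")
print(f"derived F1 = {f1:.4f}")
```

The small difference between the computed mean and the reported mAP50 is within the rounding of the per-class values to four decimals.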
## Usage
```python
from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")

# Find element by text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
```
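
The implementation of `safe_click_point` is not shown in this card. As a rough illustration of the idea only, a click point can be taken as the centre of a bounding box and clamped to the screenshot bounds. The sketch below assumes the box is `(x1, y1, x2, y2)` in pixels; the function `center_click_point` and that box layout are assumptions, not the package's actual behaviour.

```python
from typing import Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels. The real package may differ.
BBox = Tuple[float, float, float, float]


def center_click_point(bbox: BBox, image_size: Tuple[int, int]) -> Tuple[int, int]:
    """Return the box centre as integer pixel coordinates, clamped to the image.

    Hypothetical helper for illustration; it is not the package's
    `safe_click_point`.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    cx = round((x1 + x2) / 2)
    cy = round((y1 + y2) / 2)
    # Clamp so a box that spills past the edge still yields an on-screen point.
    return (min(max(cx, 0), width - 1), min(max(cy, 0), height - 1))
```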
### Direct Ultralytics usage
```python
from ultralytics import YOLO
model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
```
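
The `results` list returned by `predict` holds one `Results` object per image. A minimal sketch of flattening one of them into plain dictionaries follows; it relies on the `boxes.xyxy`, `boxes.cls`, `boxes.conf` and `names` attributes of Ultralytics results, while the helper name `result_to_dicts` and the output layout are illustrative choices.

```python
from typing import Any, Dict, List


def result_to_dicts(result: Any) -> List[Dict[str, Any]]:
    """Flatten one Ultralytics Results object into a list of plain dicts.

    Illustrative helper: each dict holds the class label, the box as
    [x1, y1, x2, y2] in pixels, and the confidence score.
    """
    boxes = result.boxes
    rows = []
    for xyxy, cls_id, conf in zip(
        boxes.xyxy.tolist(), boxes.cls.tolist(), boxes.conf.tolist()
    ):
        rows.append(
            {
                "label": result.names[int(cls_id)],
                "bbox": [float(v) for v in xyxy],
                "score": float(conf),
            }
        )
    return rows


# Usage with the snippet above (needs ultralytics and the weights file):
#   for row in result_to_dicts(results[0]):
#       print(row)
```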
## Architecture
- **Base model:** YOLO11s (Ultralytics)
- **Input size:** 640px
- **Parameters:** ~9.4M
- **GFLOPs:** ~21.3
- **Inference speed:** ~44-80ms on CPU (M2 Pro), ~2-5ms on GPU (RTX 5060)
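
For planning a capture loop, the latency ranges above convert to approximate single-image throughput. This is plain arithmetic on the quoted figures and ignores screenshot capture and any pre- or post-processing time; `fps_range` is an illustrative helper.

```python
from typing import Tuple


def fps_range(latency_ms_low: float, latency_ms_high: float) -> Tuple[float, float]:
    """Convert a latency range in milliseconds to (min_fps, max_fps)."""
    return (1000.0 / latency_ms_high, 1000.0 / latency_ms_low)


cpu_fps = fps_range(44, 80)  # CPU figures quoted above
gpu_fps = fps_range(2, 5)    # GPU figures quoted above

print(f"CPU: {cpu_fps[0]:.1f} to {cpu_fps[1]:.1f} images/s")
print(f"GPU: {gpu_fps[0]:.0f} to {gpu_fps[1]:.0f} images/s")
```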
## Training
- **GPU:** NVIDIA RTX 5060 8GB (Blackwell)
- **Dataset:** 3 000 synthetic images (2 400 train / 300 val / 300 test)
- **Epochs:** 120 (early stopping with patience=25)
- **Batch size:** 16
- **Image size:** 640px
- **Optimizer:** SGD with cosine LR scheduler
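
The settings above map onto Ultralytics training arguments roughly as follows. This is a sketch, not the exact command used for this model: the dataset config path `ui_synth_v2.yaml` and the starting weights `yolo11s.pt` are assumptions, and any hyperparameter not listed above is left at the library default.

```python
# Training settings from the list above, expressed as Ultralytics train() arguments.
TRAIN_ARGS = {
    "data": "ui_synth_v2.yaml",  # hypothetical dataset config path
    "epochs": 120,
    "patience": 25,              # early-stopping patience
    "batch": 16,
    "imgsz": 640,
    "optimizer": "SGD",
    "cos_lr": True,              # cosine learning-rate schedule
}

# Dataset split from the list above.
SPLIT = {"train": 2400, "val": 300, "test": 300}
SPLIT_FRACTIONS = {k: v / sum(SPLIT.values()) for k, v in SPLIT.items()}


def train(weights: str = "yolo11s.pt"):
    """Fine-tune from pretrained weights (assumed starting point)."""
    # Imported lazily so the config can be inspected without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```

Calling `train()` starts a run with these arguments; the split works out to 80% train, 10% validation and 10% test.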
## Limitations
- Trained on synthetic data only — real-world Windows UI may show domain gap
- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
- Does not detect text content (use OCR for that)
- 7 classes only; complex widget types are not supported
## License
MIT