---
license: mit
tags:
  - object-detection
  - yolo11
  - ui-elements
  - windows
  - ultralytics
datasets:
  - ui_synth_v2
pipeline_tag: object-detection
---

# Windows UI Element Detector (YOLO11s)

## Model Summary

A YOLO11s (small) model fine-tuned on 3,000 synthetic Windows-style UI screenshots to detect interactive UI elements. Designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.

## Classes

| ID | Class      |
|----|------------|
| 0  | button     |
| 1  | textbox    |
| 2  | checkbox   |
| 3  | dropdown   |
| 4  | icon       |
| 5  | tab        |
| 6  | menu_item  |
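
When post-processing raw model output, the numeric class IDs can be mapped back to names with a simple lookup. This is a minimal sketch mirroring the table above; the helper name is illustrative, not part of the published API:

```python
# Class-ID to name mapping, mirroring the class table above.
CLASS_NAMES = {
    0: "button",
    1: "textbox",
    2: "checkbox",
    3: "dropdown",
    4: "icon",
    5: "tab",
    6: "menu_item",
}

def id_to_name(class_id: int) -> str:
    """Return the class name for a detection ID, or 'unknown' if out of range."""
    return CLASS_NAMES.get(class_id, "unknown")
```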

## Training Data

Trained on `ui_synth_v2`, a synthetic dataset of 3,000 Windows-style UI screenshots generated via HTML/CSS templates rendered with Playwright. Includes domain randomization (themes, fonts, scaling, noise).

## Metrics

| Metric       | Value  |
|--------------|--------|
| mAP50        | 0.9886 |
| mAP50-95     | 0.9543 |
| Precision    | 0.9959 |
| Recall       | 0.9730 |

### Per-Class AP@50

| Class      | AP@50  |
|------------|--------|
| button     | 0.9919 |
| textbox    | 0.9771 |
| checkbox   | 0.9864 |
| dropdown   | 0.9829 |
| icon       | 0.9950 |
| tab        | 0.9950 |
| menu_item  | 0.9915 |

## Usage

```python
from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")

# Find element by text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
```
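
The `safe_click_point` helper presumably returns a coordinate inside the matched box. A center-of-bbox computation is a plausible stand-in (a hypothetical sketch; the real helper may apply additional padding or safety checks):

```python
def bbox_center(bbox):
    """Center point of an (x1, y1, x2, y2) box in pixel coordinates.
    A hypothetical stand-in for safe_click_point; the actual helper
    may differ (e.g. inset from edges, DPI scaling)."""
    x1, y1, x2, y2 = bbox
    return (x1 + x2) // 2, (y1 + y2) // 2
```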

### Direct Ultralytics usage

```python
from ultralytics import YOLO

model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
```
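
Each Ultralytics `Results` object exposes boxes with class IDs, confidences, and xyxy coordinates. A confidence filter over that raw output might look like the following (a sketch over plain `(class_id, conf, bbox)` tuples, so it runs without the model loaded):

```python
def filter_detections(raw, conf_threshold=0.3):
    """Keep (class_id, conf, bbox) tuples at or above the confidence
    threshold, sorted by confidence descending."""
    kept = [d for d in raw if d[1] >= conf_threshold]
    return sorted(kept, key=lambda d: d[1], reverse=True)
```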

## Architecture

- **Base model:** YOLO11s (Ultralytics)
- **Input size:** 640px
- **Parameters:** ~9.4M
- **GFLOPs:** ~21.3
- **Inference speed:** ~44-80ms on CPU (M2 Pro), ~2-5ms on GPU (RTX 5060)

## Training

- **GPU:** NVIDIA RTX 5060 8GB (Blackwell)
- **Dataset:** 3,000 synthetic images (2,400 train / 300 val / 300 test)
- **Epochs:** 120 (early stopping with patience=25)
- **Batch size:** 16
- **Image size:** 640px
- **Optimizer:** SGD with cosine LR scheduler
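
The hyperparameters above correspond roughly to this Ultralytics training call (a sketch; the dataset YAML path is a hypothetical placeholder, and augmentation settings are not documented here):

```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # pretrained base weights
model.train(
    data="ui_synth_v2.yaml",  # hypothetical dataset config path
    epochs=120,
    patience=25,      # early stopping
    batch=16,
    imgsz=640,
    optimizer="SGD",
    cos_lr=True,      # cosine LR scheduler
)
```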

## Limitations

- Trained on synthetic data only — real-world Windows UIs may exhibit a domain gap
- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
- Does not detect text content (use OCR for that)
- 7 classes only; complex widget types are not supported

## License

MIT