Windows UI Element Detector: YOLO11s for Windows UI Elements
Model Summary
A YOLO11s (small) model fine-tuned on 3 000 synthetic Windows-style UI screenshots to detect interactive UI elements. Designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.
Classes
| ID | Class |
|---|---|
| 0 | button |
| 1 | textbox |
| 2 | checkbox |
| 3 | dropdown |
| 4 | icon |
| 5 | tab |
| 6 | menu_item |
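For downstream code, the table above can be mirrored as a simple lookup. This is a minimal sketch: the names `CLASS_NAMES` and `filter_by_class` are illustrative and are not part of any published package, and detections are represented here as plain `(class_id, score)` pairs purely for the example.

```python
# Class IDs copied from the table above. CLASS_NAMES and filter_by_class
# are illustrative names, not part of any published API.
CLASS_NAMES = {
    0: "button",
    1: "textbox",
    2: "checkbox",
    3: "dropdown",
    4: "icon",
    5: "tab",
    6: "menu_item",
}


def filter_by_class(detections, wanted):
    """Keep (class_id, score) pairs whose class name is in `wanted`."""
    wanted = set(wanted)
    return [
        (class_id, score)
        for class_id, score in detections
        if CLASS_NAMES.get(class_id) in wanted
    ]


# Example input: three made-up detections as (class_id, score) pairs.
sample = [(0, 0.91), (4, 0.55), (1, 0.80)]
print(filter_by_class(sample, {"button", "textbox"}))
```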
Training Data
Trained on ui_synth_v2, a synthetic dataset of 3 000 Windows-style UI screenshots generated via HTML/CSS templates rendered with Playwright. Includes domain randomization (themes, fonts, scaling, noise).
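The generator for ui_synth_v2 is not shown here, so the following is only a sketch of how a template-plus-Playwright pipeline of this kind could produce images and YOLO labels. Everything specific is an assumption: the `data-cls` attribute, the parameter values in `sample_render_params`, the viewport size, and all function names. Theme and font would be substituted into the HTML/CSS template before rendering, and noise augmentation is left out.

```python
import random


def sample_render_params(rng):
    """Draw one random rendering configuration (values are illustrative)."""
    return {
        "theme": rng.choice(["light", "dark"]),
        "font": rng.choice(["Segoe UI", "Arial", "Tahoma"]),
        "scale": rng.choice([1.0, 1.25, 1.5]),
    }


def to_yolo_label(class_id, box, page_w, page_h):
    """Convert a pixel box {x, y, width, height} to a normalized YOLO label line."""
    cx = (box["x"] + box["width"] / 2) / page_w
    cy = (box["y"] + box["height"] / 2) / page_h
    w = box["width"] / page_w
    h = box["height"] / page_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def render(html, out_png, params, width=1280, height=720):
    """Render `html` with Playwright, save a screenshot, return YOLO label lines.

    Assumes each labeled element carries a hypothetical `data-cls`
    attribute holding its integer class ID.
    """
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(
            viewport={"width": width, "height": height},
            device_scale_factor=params["scale"],
        )
        page.set_content(html)
        labels = []
        for el in page.query_selector_all("[data-cls]"):
            box = el.bounding_box()  # CSS pixels, or None if not visible
            if box is not None:
                class_id = int(el.get_attribute("data-cls"))
                labels.append(to_yolo_label(class_id, box, width, height))
        page.screenshot(path=out_png)
        browser.close()
    return labels


params = sample_render_params(random.Random(0))
print(params)
print(to_yolo_label(0, {"x": 100, "y": 50, "width": 200, "height": 40}, 1280, 720))
```

Because `bounding_box()` reports CSS pixels, dividing by the viewport size gives normalized coordinates that stay valid whatever `device_scale_factor` is used for the screenshot.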
Metrics
| Metric | Value |
|---|---|
| mAP50 | 0.9886 |
| mAP50-95 | 0.9543 |
| Precision | 0.9959 |
| Recall | 0.9730 |
Per-Class AP@50
| Class | AP@50 |
|---|---|
| button | 0.9919 |
| textbox | 0.9771 |
| checkbox | 0.9864 |
| dropdown | 0.9829 |
| icon | 0.9950 |
| tab | 0.9950 |
| menu_item | 0.9915 |
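As a quick consistency check, mAP50 is the unweighted mean of the per-class AP@50 values. The sketch below recomputes it from the rounded figures in the table; the small difference from the reported 0.9886 is what one would expect from averaging values already rounded to four decimals.

```python
# Per-class AP@50 values copied from the table above.
ap50 = {
    "button": 0.9919,
    "textbox": 0.9771,
    "checkbox": 0.9864,
    "dropdown": 0.9829,
    "icon": 0.9950,
    "tab": 0.9950,
    "menu_item": 0.9915,
}

# mAP50 is the plain average over classes.
mean_ap50 = sum(ap50.values()) / len(ap50)
print(f"mean AP@50 over {len(ap50)} classes: {mean_ap50:.4f}")
```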
Usage
from local_ui_locator import detect_elements, find_by_text, safe_click_point
# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")
# Find element by text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
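How `safe_click_point` works is not documented here. As an illustration only, a click-point helper of this kind might take the center of the box and clamp it to the screen. The sketch below assumes an `(x1, y1, x2, y2)` pixel box, and the name `center_click_point` is hypothetical.

```python
def center_click_point(bbox, screen_w, screen_h):
    """Return the integer center of an (x1, y1, x2, y2) box, clamped to the screen.

    Hypothetical helper: the bbox layout and the clamping rule are assumptions.
    """
    x1, y1, x2, y2 = bbox
    cx = int(round((x1 + x2) / 2))
    cy = int(round((y1 + y2) / 2))
    # Keep the point inside the valid pixel range [0, size - 1].
    cx = min(max(cx, 0), screen_w - 1)
    cy = min(max(cy, 0), screen_h - 1)
    return cx, cy


print(center_click_point((100, 40, 220, 80), 1920, 1080))  # (160, 60)
```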
Direct Ultralytics usage
from ultralytics import YOLO
model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
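The `predict` call returns a list of Ultralytics `Results` objects, one per image. A minimal sketch of flattening one of them into plain dictionaries follows; the helper names `detect` and `boxes_to_dicts` and the output keys are illustrative choices, while `boxes.cls`, `boxes.conf`, `boxes.xyxy`, and `names` are attributes of the Ultralytics result objects.

```python
def boxes_to_dicts(result):
    """Flatten one Ultralytics `Results` object into plain Python dicts."""
    boxes = result.boxes
    rows = zip(boxes.cls.tolist(), boxes.conf.tolist(), boxes.xyxy.tolist())
    return [
        {"type": result.names[int(cls_id)], "score": float(score), "bbox": tuple(xyxy)}
        for cls_id, score, xyxy in rows
    ]


def detect(image_path, weights="best.pt", conf=0.3):
    """Run the model on one image and return a list of detection dicts."""
    from ultralytics import YOLO  # imported lazily so boxes_to_dicts has no hard dependency

    result = YOLO(weights).predict(image_path, conf=conf)[0]
    return boxes_to_dicts(result)
```

Calling `detect("screenshot.png")` would then yield entries with a `type` (class name), a `score`, and a `bbox` in `(x1, y1, x2, y2)` pixel coordinates.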
Architecture
- Base model: YOLO11s (Ultralytics)
- Input size: 640px
- Parameters: ~9.4M
- GFLOPs: ~21.3
- Inference speed: ~44-80ms on CPU (M2 Pro), ~2-5ms on GPU (RTX 5060)
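The measurement procedure behind the latency figures above is not described. To reproduce comparable numbers on other hardware, one option is a small timing harness such as the sketch below; the warm-up and run counts are arbitrary choices, and `median_latency_ms` is an illustrative name.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Call `fn` repeatedly and return the median wall-clock time in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; with a loaded model one could pass
# `lambda: model.predict("screenshot.png", verbose=False)` instead.
print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```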
Training
- GPU: NVIDIA RTX 5060 8GB (Blackwell)
- Dataset: 3 000 synthetic images (2 400 train / 300 val / 300 test)
- Epochs: 120 (early stopping with patience=25)
- Batch size: 16
- Image size: 640px
- Optimizer: SGD with cosine LR scheduler
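The settings listed above map onto Ultralytics training arguments roughly as sketched here. The dataset config path `ui_synth_v2.yaml` and the starting checkpoint `yolo11s.pt` are assumptions, and any hyperparameter not listed above is left at the library default, which may differ from the original run.

```python
# Hyperparameters taken from the list above; the "data" path is hypothetical.
TRAIN_ARGS = {
    "data": "ui_synth_v2.yaml",
    "epochs": 120,
    "patience": 25,  # early-stopping patience
    "batch": 16,
    "imgsz": 640,
    "optimizer": "SGD",
    "cos_lr": True,  # cosine learning-rate schedule
}


def train(weights="yolo11s.pt"):
    """Fine-tune from `weights` (assumed starting checkpoint) with TRAIN_ARGS."""
    from ultralytics import YOLO  # imported lazily

    return YOLO(weights).train(**TRAIN_ARGS)


# Split sizes from the list above, expressed as fractions of the full dataset.
split = {"train": 2400, "val": 300, "test": 300}
total = sum(split.values())
print({name: count / total for name, count in split.items()})
```

The split works out to 80% train, 10% validation, and 10% test.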
Limitations
- Trained on synthetic data only; real-world Windows UI may show a domain gap
- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
- Does not detect text content (use OCR for that)
- 7 classes only; complex widget types are not supported
License
MIT