license: mit
tags:
  - object-detection
  - yolo11
  - ui-elements
  - windows
  - ultralytics
datasets:
  - ui_synth_v2
pipeline_tag: object-detection

Windows UI Element Detector — YOLO11s for Windows UI Elements

Model Summary

A YOLO11s (small) model fine-tuned on 3 000 synthetic Windows-style UI screenshots to detect seven classes of interactive UI elements. It is designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.

Classes

ID  Class
0   button
1   textbox
2   checkbox
3   dropdown
4   icon
5   tab
6   menu_item

Training Data

Trained on ui_synth_v2, a synthetic dataset of 3 000 Windows-style UI screenshots generated from HTML/CSS templates rendered with Playwright. Generation applies domain randomization over themes, fonts, scaling, and noise.
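
Because the screenshots are rendered from templates, element boxes can presumably be exported as labels at render time. The sketch below shows that conversion, assuming the standard Ultralytics YOLO detection label format (one `class cx cy w h` line per object, normalized to the image size). The function name and the example box are hypothetical and are not taken from the ui_synth_v2 generator.

```python
def to_yolo_label(class_id, box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO label line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # normalized box center, x
    cy = (y1 + y2) / 2 / img_h   # normalized box center, y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical example: a button (class 0) at (100, 200)-(300, 260)
# in a 1280x720 render.
line = to_yolo_label(0, (100, 200, 300, 260), 1280, 720)
print(line)  # 0 0.156250 0.319444 0.156250 0.083333
```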

Metrics

Metric     Value
mAP50      0.9886
mAP50-95   0.9543
Precision  0.9959
Recall     0.9730

Per-Class AP@50

Class      AP@50
button     0.9919
textbox    0.9771
checkbox   0.9864
dropdown   0.9829
icon       0.9950
tab        0.9950
menu_item  0.9915
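
As a quick consistency check, the overall mAP50 should match the unweighted mean of the per-class AP@50 values, assuming the usual definition of mAP as the mean over classes. The sketch below recomputes it from the figures in this card:

```python
# Per-class AP@50 values copied from the table in this card
per_class_ap50 = {
    "button": 0.9919,
    "textbox": 0.9771,
    "checkbox": 0.9864,
    "dropdown": 0.9829,
    "icon": 0.9950,
    "tab": 0.9950,
    "menu_item": 0.9915,
}

# Unweighted mean over the seven classes
mean_ap50 = sum(per_class_ap50.values()) / len(per_class_ap50)
print(f"mean of per-class AP@50: {mean_ap50:.4f}")
```

The mean comes out at about 0.9885, which agrees with the reported mAP50 of 0.9886 to within the rounding of the per-class values.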

Usage

from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")

# Find element by text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
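
The internals of `local_ui_locator` are not shown in this card. As a rough illustration of turning a detected box into a click target, the sketch below takes the box center and clamps it inside the screenshot. The name `center_click_point`, the `(x1, y1, x2, y2)` box layout, and the clamping step are all assumptions made for illustration; this is not the implementation of `safe_click_point`.

```python
def center_click_point(bbox, img_w, img_h):
    """Return the integer center of a pixel box, clamped to the image bounds."""
    x1, y1, x2, y2 = bbox
    cx = int(round((x1 + x2) / 2))
    cy = int(round((y1 + y2) / 2))
    # Keep the point on-screen even if the box spills past the image edge
    cx = min(max(cx, 0), img_w - 1)
    cy = min(max(cy, 0), img_h - 1)
    return cx, cy


# Hypothetical box in a 1280x720 screenshot
print(center_click_point((100, 200, 300, 260), 1280, 720))  # (200, 230)
```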

Direct Ultralytics usage

from ultralytics import YOLO

# Load the fine-tuned weights and run inference on a single screenshot
model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
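
The `predict` call returns Ultralytics result objects, not plain records. Below is a minimal sketch of flattening the first result into dictionaries. The helper names `boxes_to_records` and `detect` are hypothetical, and the record keys (`type`, `bbox`, `score`) are chosen only to mirror the attribute names in the Usage example above. The `ultralytics` import is kept inside `detect` so the conversion helper can be tried without the model installed.

```python
# Class IDs as listed in the Classes table of this card
CLASS_NAMES = {
    0: "button", 1: "textbox", 2: "checkbox", 3: "dropdown",
    4: "icon", 5: "tab", 6: "menu_item",
}


def boxes_to_records(xyxy, conf, cls, names=CLASS_NAMES):
    """Zip parallel lists of boxes, scores, and class IDs into dict records."""
    records = []
    for box, score, class_id in zip(xyxy, conf, cls):
        records.append({
            "type": names[int(class_id)],       # class IDs may arrive as floats
            "bbox": [float(v) for v in box],    # x1, y1, x2, y2 in pixels
            "score": float(score),
        })
    return records


def detect(image_path, weights="best.pt", conf=0.3):
    """Run the detector on one image and return flat records."""
    from ultralytics import YOLO  # lazy import; requires ultralytics

    model = YOLO(weights)
    result = model.predict(image_path, conf=conf)[0]
    boxes = result.boxes
    return boxes_to_records(
        boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist(),
        names=result.names,
    )


# Offline demo with hand-written values (no model needed)
demo = boxes_to_records([[10, 20, 110, 60]], [0.91], [0.0])
print(demo)
```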

Architecture

  • Base model: YOLO11s (Ultralytics)
  • Input size: 640px
  • Parameters: ~9.4M
  • GFLOPs: ~21.3
  • Inference speed: ~44-80ms on CPU (M2 Pro), ~2-5ms on GPU (RTX 5060)

Training

  • GPU: NVIDIA RTX 5060 8GB (Blackwell)
  • Dataset: 3 000 synthetic images (2 400 train / 300 val / 300 test)
  • Epochs: 120 (early stopping with patience=25)
  • Batch size: 16
  • Image size: 640px
  • Optimizer: SGD with cosine LR scheduler
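
The settings above map onto Ultralytics training arguments roughly as sketched below. The dataset config path `ui_synth_v2.yaml` and the starting checkpoint `yolo11s.pt` are assumptions, since the card does not state them. Any argument not listed would fall back to the Ultralytics default, which may differ from what was used for this model.

```python
# Training settings taken from the list above
TRAIN_ARGS = {
    "data": "ui_synth_v2.yaml",  # hypothetical dataset config path
    "epochs": 120,
    "patience": 25,              # early stopping patience
    "batch": 16,
    "imgsz": 640,
    "optimizer": "SGD",
    "cos_lr": True,              # cosine learning-rate schedule
}


def train(weights="yolo11s.pt"):
    """Fine-tune from a pretrained checkpoint (assumed starting point)."""
    from ultralytics import YOLO  # lazy import; requires ultralytics

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```

Calling `train()` starts a full training run, so it is left uncalled here.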

Limitations

  • Trained on synthetic data only, so accuracy on real-world Windows UI may drop because of the domain gap
  • Best on standard Windows 10/11 UI; custom-styled applications may perform worse
  • Does not detect text content (use OCR for that)
  • 7 classes only; complex widget types are not supported

License

MIT