---
license: mit
tags:
- object-detection
- yolo11
- ui-elements
- windows
- ultralytics
datasets:
- ui_synth_v2
pipeline_tag: object-detection
---

# Windows UI Element Detector: YOLO11s for Windows UI Elements

## Model Summary

A YOLO11s (small) model fine-tuned on 3,000 synthetic Windows-style UI screenshots to detect interactive UI elements. It is designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.

## Classes

| ID | Class     |
|----|-----------|
| 0  | button    |
| 1  | textbox   |
| 2  | checkbox  |
| 3  | dropdown  |
| 4  | icon      |
| 5  | tab       |
| 6  | menu_item |

## Training Data

Trained on `ui_synth_v2`, a synthetic dataset of 3,000 Windows-style UI screenshots generated from HTML/CSS templates rendered with Playwright. Rendering applies domain randomization (themes, fonts, scaling, noise).
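
As a rough illustration of what domain randomization over themes, fonts, scaling, and noise can look like, the sketch below samples one set of rendering parameters per screenshot. The `RenderParams` type, the option pools, and the value ranges are all hypothetical placeholders; the actual `ui_synth_v2` generator and its settings are not described in this card.

```python
import random
from dataclasses import dataclass

# Hypothetical option pools; the real generator's values are not documented here.
THEMES = ("light", "dark", "high-contrast")
FONTS = ("Segoe UI", "Tahoma", "Arial")
SCALES = (1.0, 1.25, 1.5)  # illustrative display scaling factors


@dataclass(frozen=True)
class RenderParams:
    theme: str
    font_family: str
    scale: float
    noise_sigma: float  # std-dev of added pixel noise, illustrative units


def sample_render_params(rng: random.Random) -> RenderParams:
    """Draw one random combination of rendering settings."""
    return RenderParams(
        theme=rng.choice(THEMES),
        font_family=rng.choice(FONTS),
        scale=rng.choice(SCALES),
        noise_sigma=rng.uniform(0.0, 8.0),
    )


rng = random.Random(0)  # seeded so a dataset build can be reproduced
params = [sample_render_params(rng) for _ in range(3)]
for p in params:
    print(p)
```

Each sampled `RenderParams` would then drive one template render, so that no single theme or font dominates the training set.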

## Metrics

| Metric    | Value  |
|-----------|--------|
| mAP50     | 0.9886 |
| mAP50-95  | 0.9543 |
| Precision | 0.9959 |
| Recall    | 0.9730 |
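
Precision and recall can be combined into a single F1 score, F1 = 2PR / (P + R). F1 is not one of the reported metrics; the snippet below only derives it from the two published values.

```python
precision = 0.9959
recall = 0.9730

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

With these inputs the result is roughly 0.984, slightly closer to the recall figure because the harmonic mean is pulled toward the smaller value.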

### Per-Class AP@50

| Class     | AP@50  |
|-----------|--------|
| button    | 0.9919 |
| textbox   | 0.9771 |
| checkbox  | 0.9864 |
| dropdown  | 0.9829 |
| icon      | 0.9950 |
| tab       | 0.9950 |
| menu_item | 0.9915 |
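
Since mAP50 is the unweighted mean of the per-class AP@50 values, the per-class table can be checked against the overall figure. A quick sanity check using only the numbers reported in this card:

```python
per_class_ap50 = {
    "button": 0.9919,
    "textbox": 0.9771,
    "checkbox": 0.9864,
    "dropdown": 0.9829,
    "icon": 0.9950,
    "tab": 0.9950,
    "menu_item": 0.9915,
}

# Unweighted mean over the 7 classes
mean_ap50 = sum(per_class_ap50.values()) / len(per_class_ap50)
print(f"mean AP@50 = {mean_ap50:.4f}")
```

The mean lands within about 0.0001 of the reported mAP50 of 0.9886, a gap small enough to be explained by rounding of the per-class values.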

## Usage

```python
from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")

# Find an element by its text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
```
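
The internals of `safe_click_point` are not shown in this card. Purely to illustrate the general idea of turning a box into a click target, the hypothetical helper below takes the center of a box and clamps it to the image bounds. It assumes boxes are `(x1, y1, x2, y2)` in pixels, which may not match the format `local_ui_locator` actually uses.

```python
def center_click_point(bbox, image_size=None):
    """Return an integer (x, y) at the center of an (x1, y1, x2, y2) box.

    Hypothetical helper, not the package's safe_click_point.
    If image_size=(width, height) is given, the point is clamped to the image.
    """
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    if image_size is not None:
        width, height = image_size
        cx = min(max(cx, 0), width - 1)
        cy = min(max(cy, 0), height - 1)
    return int(round(cx)), int(round(cy))


print(center_click_point((100, 200, 200, 240)))
```

Clamping matters mainly for boxes that touch the screenshot edge, where a raw center could otherwise fall outside the clickable area.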

### Direct Ultralytics usage

```python
from ultralytics import YOLO

model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
```
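
`predict` returns a list with one `Results` object per input image. The sketch below converts the first result into plain dictionaries using `boxes.xyxy`, `boxes.conf`, `boxes.cls`, and `names`. The `to_records` and `detect` helpers are illustrative names, not part of Ultralytics or of this model's package.

```python
def to_records(boxes_xyxy, scores, class_ids, names):
    """Pair each box with its class name and confidence score."""
    return [
        {
            "label": names[int(cls_id)],
            "score": float(score),
            "bbox": [float(v) for v in box],  # x1, y1, x2, y2 in pixels
        }
        for box, score, cls_id in zip(boxes_xyxy, scores, class_ids)
    ]


def detect(image_path, weights="best.pt", conf=0.3):
    """Run the detector on one image and return a list of plain dicts."""
    from ultralytics import YOLO  # imported lazily so to_records works without it

    result = YOLO(weights).predict(image_path, conf=conf)[0]
    boxes = result.boxes
    return to_records(
        boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist(), result.names
    )
```

Calling `detect("screenshot.png")` would then yield one dictionary per detection, with `label` drawn from the seven classes listed above.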

## Architecture

- **Base model:** YOLO11s (Ultralytics)
- **Input size:** 640 px
- **Parameters:** ~9.4M
- **GFLOPs:** ~21.3
- **Inference speed:** ~44-80 ms on CPU (Apple M2 Pro), ~2-5 ms on GPU (RTX 5060)

## Training

- **GPU:** NVIDIA RTX 5060 8 GB (Blackwell)
- **Dataset:** 3,000 synthetic images (2,400 train / 300 val / 300 test)
- **Epochs:** 120 (early stopping with patience=25)
- **Batch size:** 16
- **Image size:** 640 px
- **Optimizer:** SGD with cosine LR scheduler
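
The hyperparameters listed above map onto Ultralytics `train` arguments roughly as follows. This is a sketch rather than the original training script: the dataset config path `ui_synth_v2.yaml` is a hypothetical name, `yolo11s.pt` is assumed as the starting checkpoint, and every setting not listed above is left at its default.

```python
TRAIN_ARGS = {
    "data": "ui_synth_v2.yaml",  # hypothetical dataset config path
    "epochs": 120,
    "patience": 25,       # early stopping after 25 epochs without improvement
    "batch": 16,
    "imgsz": 640,
    "optimizer": "SGD",
    "cos_lr": True,       # cosine learning-rate schedule
}


def train(weights="yolo11s.pt"):
    """Fine-tune from pretrained weights (assumed checkpoint name)."""
    from ultralytics import YOLO  # imported lazily; only needed when training

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```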

## Limitations

- Trained on synthetic data only; real-world Windows UIs may show a domain gap
- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
- Does not detect text content (use OCR for that)
- 7 classes only; complex widget types are not supported

## License

MIT