IndextDataLab/windows-ui-locator

+---
+license: mit
+tags:
+  - object-detection
+  - yolo11
+  - ui-elements
+  - windows
+  - ultralytics
+datasets:
+  - ui_synth_v2
+pipeline_tag: object-detection
+---
+# Local UI Locator — YOLO11s for Windows UI Elements
+## Model Summary
+A YOLO11s (small) model fine-tuned on 3 000 synthetic Windows-style UI screenshots to detect interactive UI elements. It is designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.
+## Classes
+| ID | Class      |
+|----|------------|
+| 0  | button     |
+| 1  | textbox    |
+| 2  | checkbox   |
+| 3  | dropdown   |
+| 4  | icon       |
+| 5  | tab        |
+| 6  | menu_item  |
+## Training Data
+Trained on `ui_synth_v2`, a synthetic dataset of 3 000 Windows-style UI screenshots generated from HTML/CSS templates rendered with Playwright. Generation includes domain randomization over themes, fonts, scaling, and noise.
+## Metrics
+| Metric       | Value  |
+|--------------|--------|
+| mAP50        | 0.9886 |
+| mAP50-95     | 0.9543 |
+| Precision    | 0.9959 |
+| Recall       | 0.9730 |
+### Per-Class AP@50
+| Class      | AP@50  |
+|------------|--------|
+| button     | 0.9919 |
+| textbox    | 0.9771 |
+| checkbox   | 0.9864 |
+| dropdown   | 0.9829 |
+| icon       | 0.9950 |
+| tab        | 0.9950 |
+| menu_item  | 0.9915 |
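+The overall mAP50 is the unweighted mean of the per-class AP@50 values, so the two tables can be cross-checked. A minimal sketch of that check, using only the numbers reported above (the variable names are illustrative):
+```python
+# Per-class AP@50 values copied from the table above
+per_class_ap50 = {
+    "button": 0.9919,
+    "textbox": 0.9771,
+    "checkbox": 0.9864,
+    "dropdown": 0.9829,
+    "icon": 0.9950,
+    "tab": 0.9950,
+    "menu_item": 0.9915,
+}
+# mAP50 is the unweighted mean over the seven classes
+map50 = sum(per_class_ap50.values()) / len(per_class_ap50)
+print(f"mean AP@50 = {map50:.4f}")  # prints "mean AP@50 = 0.9885"
+```
+The result agrees with the reported mAP50 of 0.9886 to within the rounding of the per-class values.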
+## Usage
+```python
+from local_ui_locator import detect_elements, find_by_text, safe_click_point
+# Detect all UI elements in a screenshot
+detections = detect_elements("screenshot.png", conf=0.3)
+for det in detections:
+    print(f"{det.type}: {det.bbox} score={det.score:.2f}")
+# Find element by text
+match = find_by_text("screenshot.png", query="Submit")
+if match:
+    x, y = safe_click_point(match.bbox)
+    print(f"Click at ({x}, {y})")
+```
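+The internals of `safe_click_point` are not documented here. As a rough illustration of the idea only, the sketch below clicks the centre of a bounding box rather than an edge. It assumes `bbox` is an `(x1, y1, x2, y2)` tuple in pixels, and `center_click_point` is a hypothetical helper, not part of `local_ui_locator`.
+```python
+def center_click_point(bbox):
+    """Return integer (x, y) pixel coordinates at the centre of a box.
+
+    Assumes bbox is (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
+    """
+    x1, y1, x2, y2 = bbox
+    if x2 <= x1 or y2 <= y1:
+        raise ValueError(f"degenerate bounding box: {bbox}")
+    # Floor division keeps the point inside the box for odd sizes
+    return int((x1 + x2) // 2), int((y1 + y2) // 2)
+
+
+# Hypothetical box for a "Submit" button
+x, y = center_click_point((100, 40, 180, 70))
+print(f"Click at ({x}, {y})")  # prints "Click at (140, 55)"
+```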
+### Direct Ultralytics usage
+```python
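+from ultralytics import YOLO
+
+# Load the fine-tuned weights
+model = YOLO("best.pt")
+# Run inference; returns a list with one Results object per input image
+results = model.predict("screenshot.png", conf=0.3)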
+```
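+The returned `Results` object exposes boxes, class IDs, and confidences as tensors. One possible way to turn them into plain records is sketched below. `to_records` and `detect` are hypothetical helper names, `CLASS_NAMES` mirrors the class table in this card, and `ultralytics` is imported lazily so the conversion logic can run without it.
+```python
+# Class IDs as listed in the Classes table
+CLASS_NAMES = {0: "button", 1: "textbox", 2: "checkbox", 3: "dropdown",
+               4: "icon", 5: "tab", 6: "menu_item"}
+
+
+def to_records(xyxy, cls_ids, scores, names=CLASS_NAMES):
+    """Convert parallel lists of boxes, class IDs, and scores into dicts,
+    sorted by descending confidence."""
+    records = [
+        {
+            "type": names[int(cls_id)],
+            "bbox": tuple(float(v) for v in box),
+            "score": float(score),
+        }
+        for box, cls_id, score in zip(xyxy, cls_ids, scores)
+    ]
+    return sorted(records, key=lambda r: r["score"], reverse=True)
+
+
+def detect(image_path, weights="best.pt", conf=0.3):
+    """Run the model on one image and return a list of record dicts."""
+    from ultralytics import YOLO  # lazy import; requires ultralytics
+
+    result = YOLO(weights).predict(image_path, conf=conf)[0]
+    boxes = result.boxes
+    return to_records(boxes.xyxy.tolist(), boxes.cls.tolist(),
+                      boxes.conf.tolist(), result.names)
+```
+Calling `detect("screenshot.png")` would then yield dictionaries with `type`, `bbox`, and `score` keys, ordered from most to least confident.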
+## Architecture
+- **Base model:** YOLO11s (Ultralytics)
+- **Input size:** 640px
+- **Parameters:** ~9.4M
+- **GFLOPs:** ~21.3
+- **Inference speed:** ~44-80 ms on CPU (Apple M2 Pro), ~2-5 ms on GPU (RTX 5060)
+## Training
+- **GPU:** NVIDIA RTX 5060 8GB (Blackwell)
+- **Dataset:** 3 000 synthetic images (2 400 train / 300 val / 300 test)
+- **Epochs:** 120 (early stopping with patience=25)
+- **Batch size:** 16
+- **Image size:** 640px
+- **Optimizer:** SGD with cosine LR scheduler
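+The settings above map onto Ultralytics training arguments roughly as in the sketch below. This is not the original training script: the dataset YAML path `ui_synth_v2.yaml` and the `train` wrapper are assumptions, and any hyperparameter not listed in this card is left at the library default.
+```python
+# Hyperparameters taken from the Training list in this card
+TRAIN_ARGS = {
+    "data": "ui_synth_v2.yaml",  # hypothetical path to the dataset config
+    "epochs": 120,               # maximum number of epochs
+    "patience": 25,              # early-stopping patience
+    "batch": 16,
+    "imgsz": 640,
+    "optimizer": "SGD",
+    "cos_lr": True,              # cosine learning-rate schedule
+}
+
+
+def train(weights="yolo11s.pt", **overrides):
+    """Fine-tune from pretrained YOLO11s weights with the settings above."""
+    from ultralytics import YOLO  # lazy import; requires ultralytics
+
+    model = YOLO(weights)
+    # Keyword overrides take precedence over the defaults in TRAIN_ARGS
+    return model.train(**{**TRAIN_ARGS, **overrides})
+```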
+## Limitations
+- Trained on synthetic data only; real-world Windows UIs may show a domain gap
+- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
+- Does not detect text content (use OCR for that)
+- 7 classes only; complex widget types are not supported
+## License
+MIT