---
license: mit
tags:
- object-detection
- yolo11
- ui-elements
- windows
- ultralytics
datasets:
- ui_synth_v2
pipeline_tag: object-detection
---

# Windows UI Element Detector: YOLO11s for Windows UI Elements

## Model Summary

A YOLO11s (small) model fine-tuned on 3 000 synthetic Windows-style UI screenshots to detect interactive UI elements. Designed as a lightweight computer-vision fallback for Windows UI automation agents when native UI Automation APIs fail.

## Classes

| ID | Class      |
|----|------------|
| 0  | button     |
| 1  | textbox    |
| 2  | checkbox   |
| 3  | dropdown   |
| 4  | icon       |
| 5  | tab        |
| 6  | menu_item  |

## Training Data

Trained on `ui_synth_v2`, a synthetic dataset of 3 000 Windows-style UI screenshots generated via HTML/CSS templates rendered with Playwright. Includes domain randomization (themes, fonts, scaling, noise).

## Metrics

| Metric       | Value  |
|--------------|--------|
| mAP50        | 0.9886 |
| mAP50-95     | 0.9543 |
| Precision    | 0.9959 |
| Recall       | 0.9730 |

### Per-Class AP@50

| Class      | AP@50  |
|------------|--------|
| button     | 0.9919 |
| textbox    | 0.9771 |
| checkbox   | 0.9864 |
| dropdown   | 0.9829 |
| icon       | 0.9950 |
| tab        | 0.9950 |
| menu_item  | 0.9915 |

## Usage

```python
from local_ui_locator import detect_elements, find_by_text, safe_click_point

# Detect all UI elements in a screenshot
detections = detect_elements("screenshot.png", conf=0.3)
for det in detections:
    print(f"{det.type}: {det.bbox} score={det.score:.2f}")

# Find element by text
match = find_by_text("screenshot.png", query="Submit")
if match:
    x, y = safe_click_point(match.bbox)
    print(f"Click at ({x}, {y})")
```

### Direct Ultralytics usage

```python
from ultralytics import YOLO

model = YOLO("best.pt")
results = model.predict("screenshot.png", conf=0.3)
```

## Architecture

- **Base model:** YOLO11s (Ultralytics)
- **Input size:** 640px
- **Parameters:** ~9.4M
- **GFLOPs:** ~21.3
- **Inference speed:** ~44-80ms on CPU (M2 Pro), ~2-5ms on GPU (RTX 5060)

## Training

- **GPU:** NVIDIA RTX 5060 8GB (Blackwell)
- **Dataset:** 3 000 synthetic images (2 400 train / 300 val / 300 test)
- **Epochs:** 120 (early stopping with patience=25)
- **Batch size:** 16
- **Image size:** 640px
- **Optimizer:** SGD with cosine LR scheduler

## Limitations

- Trained on synthetic data only; real-world Windows UI may show domain gap
- Best on standard Windows 10/11 UI; custom-styled applications may perform worse
- Does not detect text content (use OCR for that)
- 7 classes only; complex widget types are not supported

## License

MIT
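
## Appendix: Post-processing Raw Detections

The direct Ultralytics snippet in the Usage section stops at `model.predict`. The following is a minimal sketch of how the raw boxes, class IDs, and scores might be turned into labelled detections and a click point. The `Detection` dataclass, `to_detections`, and `center_point` are hypothetical names introduced here for illustration; they are not the `local_ui_locator` API, and `center_point` is a plain bounding-box centre, not necessarily what `safe_click_point` computes. The class mapping is copied from the Classes table, and the field names mirror the `det.type`, `det.bbox`, and `det.score` attributes used in the usage example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

# Class IDs as listed in the Classes table of this card.
CLASS_NAMES = {
    0: "button",
    1: "textbox",
    2: "checkbox",
    3: "dropdown",
    4: "icon",
    5: "tab",
    6: "menu_item",
}


@dataclass
class Detection:
    """Hypothetical container; field names follow the usage example."""
    type: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float


def to_detections(
    boxes_xyxy: Iterable[Sequence[float]],
    class_ids: Iterable[float],
    scores: Iterable[float],
    conf: float = 0.3,
) -> List[Detection]:
    """Label raw outputs and drop anything below the confidence threshold."""
    detections = []
    for box, class_id, score in zip(boxes_xyxy, class_ids, scores):
        if score < conf:
            continue
        x1, y1, x2, y2 = (float(v) for v in box)
        detections.append(
            Detection(CLASS_NAMES[int(class_id)], (x1, y1, x2, y2), float(score))
        )
    return detections


def center_point(bbox: Sequence[float]) -> Tuple[int, int]:
    """Return the integer pixel at the centre of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return round((x1 + x2) / 2), round((y1 + y2) / 2)


# Demo with made-up values so the sketch runs without a model file.
demo = to_detections(
    boxes_xyxy=[[10, 20, 110, 60], [0, 0, 5, 5]],
    class_ids=[0, 4],
    scores=[0.9, 0.1],
)
for det in demo:
    print(det.type, det.bbox, center_point(det.bbox))

# With Ultralytics, the inputs could come from the first result object:
#   r = model.predict("screenshot.png", conf=0.3)[0]
#   dets = to_detections(r.boxes.xyxy.tolist(), r.boxes.cls.tolist(), r.boxes.conf.tolist())
```

Keeping the conversion free of any Ultralytics import means the same helper can be unit-tested with plain lists, while the commented lines show where the real tensors would plug in.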