---
license: apache-2.0
datasets:
- docling-project/screenparse
tags:
- object-detection
- yolo
- ui-understanding
- screen-parsing
- grounding
- web
- ultralytics
language:
- en
pipeline_tag: object-detection
library_name: ultralytics
---

# ScreenParser

**ScreenParser** is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/docling-project/screenparse), a large-scale dataset of 771K web page screenshots with dense annotations across **55 UI element classes**. Given a screenshot, it detects and classifies every visible UI component with bounding boxes and confidence scores.

- **Developed by**: IBM Research - ETH Zurich
- **Model type**: Object detection (YOLO11-L)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing](TODO)
- **Code**: [GitHub](TODO)
- **Dataset**: [docling-project/screenparse](https://huggingface.co/docling-project/screenparse)

## Model Summary

ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model (25.4M parameters) fine-tuned at 1280px resolution on ScreenParse.

### Supported Classes (55)

Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo

## Usage

### Single Image Inference

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)

for r in results:
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        # Boxes come back as (x1, y1, x2, y2) corner coordinates in pixels.
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        # Printed as (x, y, width, height).
        print(f"{label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
```

### Batch Inference

```python
import os
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")

IMAGE_DIR = "screenshots/"
# Collect image files in a stable (sorted) order.
images = sorted(
    os.path.join(IMAGE_DIR, f)
    for f in os.listdir(IMAGE_DIR)
    if f.lower().endswith((".png", ".jpg", ".jpeg"))
)

results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)

for path, r in zip(images, results):
    print(f"--- {os.path.basename(path)} ({len(r.boxes)} elements) ---")
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        # Printed as (x, y, width, height).
        print(f"  {label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
```

### Save Visualizations

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
# Annotated image saved under runs/detect/predict/
```

**Training data**: [ScreenParse](https://huggingface.co/docling-project/screenparse), 771K web page screenshots with dense annotations across 55 UI element classes. Annotations were generated through automated DOM extraction, IoU-based filtering, and VLM-based refinement.

## Limitations

- Does not produce text content for detected elements (bounding boxes and labels only). Pair it with an OCR model or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) for text extraction.

## Citation

```bibtex
@misc{gurbuz2026movingsparsegroundingcomplete,
  title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
  author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
  year={2026},
  eprint={2602.14276},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.14276},
}
```
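
## Example: Preparing Crops for OCR

As noted under Limitations, the detector returns only boxes and labels, so text has to come from a separate OCR step. The sketch below shows one possible way to turn detections into padded, image-clamped crop regions for such a step. It is not part of ScreenParser: the `Detection` tuple, the `ocr_crop_boxes` helper, the `TEXT_LABELS` selection, and the padding value are all illustrative assumptions, and it works on plain Python tuples so it runs without a model.

```python
import math
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical container for one detection (not an Ultralytics type)."""
    label: str
    conf: float
    xyxy: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


# Assumption: these classes from the supported list usually carry readable text.
TEXT_LABELS = {"Text", "Heading", "Link", "Button", "Code snippet"}


def ocr_crop_boxes(
    detections: List[Detection],
    image_size: Tuple[int, int],
    pad: int = 2,
    min_conf: float = 0.10,
) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Return (label, (left, top, right, bottom)) integer crop boxes.

    Keeps only text-bearing labels at or above min_conf, expands each box by
    `pad` pixels, and clamps it to the image bounds given as (width, height).
    """
    width, height = image_size
    crops = []
    for det in detections:
        if det.label not in TEXT_LABELS or det.conf < min_conf:
            continue
        x1, y1, x2, y2 = det.xyxy
        left = max(0, math.floor(x1) - pad)
        top = max(0, math.floor(y1) - pad)
        right = min(width, math.ceil(x2) + pad)
        bottom = min(height, math.ceil(y2) + pad)
        # Skip boxes that collapse to zero area after clamping.
        if right > left and bottom > top:
            crops.append((det.label, (left, top, right, bottom)))
    return crops
```

Each returned box uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so a crop can be cut with `image.crop(box)` and passed to whichever OCR engine is in use. The label set and padding are starting points to tune per application.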