---
license: apache-2.0
datasets:
- docling-project/screenparse
tags:
- object-detection
- yolo
- ui-understanding
- screen-parsing
- grounding
- web
- ultralytics
language:
- en
pipeline_tag: object-detection
library_name: ultralytics
---
# ScreenParser
**ScreenParser** is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/docling-project/screenparse), a large-scale dataset of 771K web page screenshots with dense annotations across **55 UI element classes**. Given a screenshot, it detects and classifies every visible UI component with bounding boxes and confidence scores.
- **Developed by**: IBM Research - ETH Zurich
- **Model type**: Object detection (YOLO11-L)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision](https://arxiv.org/abs/2602.14276)
- **Code**: [GitHub](TODO)
- **Dataset**: [docling-project/screenparse](https://huggingface.co/docling-project/screenparse)
## Model Summary
ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model (25.4M parameters) fine-tuned at 1280px resolution on ScreenParse.
### Supported Classes (55)
Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo
## Usage
### Single Image Inference
```python
from ultralytics import YOLO
from PIL import Image
model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)
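for r in results:
    # Each detection has corner coordinates (xyxy), a class id, and a confidence score.
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        # bbox is printed as (x, y, width, height)
        print(f"{label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")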
```
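The loop above prints each box as (x, y, width, height), derived from the corner coordinates in `r.boxes.xyxy`. A minimal sketch of that conversion as a reusable helper is shown below. `xyxy_to_xywh` is a hypothetical name, not part of ultralytics, and the coordinates in the example are made up so the snippet can be run without loading the model.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to integer (x, y, width, height).

    Hypothetical helper for illustration; mirrors the arithmetic used in
    the print statement of the inference example.
    """
    x1, y1, x2, y2 = box
    return int(x1), int(y1), int(x2 - x1), int(y2 - y1)


# Made-up coordinates, not real model output.
print(xyxy_to_xywh([10.5, 20.5, 110.5, 70.5]))
```

With real results, the argument would be `box.tolist()` for each `box` in `r.boxes.xyxy`.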
### Batch Inference
```python
import os
from ultralytics import YOLO
model = YOLO("docling-project/ScreenParser")
IMAGE_DIR = "screenshots/"
# Collect image paths in a stable order.
images = sorted(
    os.path.join(IMAGE_DIR, f) for f in os.listdir(IMAGE_DIR)
    if f.lower().endswith((".png", ".jpg", ".jpeg"))
)
results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)
# Results are returned in the same order as the input paths.
for path, r in zip(images, results):
    print(f"--- {os.path.basename(path)} ({len(r.boxes)} elements) ---")
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        print(f"  {label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
```
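When processing many screenshots, a per-class tally is often more useful than the raw listing. The sketch below counts detections per label from plain `(label, confidence)` pairs. `count_by_label` and the sample data are hypothetical and only illustrate the idea; with real results, the pairs could be built as `(model.names[int(c)], float(p))` for `c, p` in `zip(r.boxes.cls, r.boxes.conf)`.

```python
from collections import Counter


def count_by_label(detections, min_conf=0.10):
    """Count detections per label, skipping those below min_conf.

    `detections` is an iterable of (label, confidence) pairs.
    Hypothetical helper, not part of ultralytics.
    """
    return Counter(label for label, conf in detections if conf >= min_conf)


# Made-up pairs standing in for model output.
sample = [("Button", 0.91), ("Button", 0.42), ("Link", 0.08), ("Text", 0.77)]
for label, n in count_by_label(sample).most_common():
    print(f"{label:20s} {n}")
```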
### Save Visualizations
```python
from ultralytics import YOLO
model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
# Annotated image saved under runs/detect/predict/
```
**Training data**: [ScreenParse](https://huggingface.co/docling-project/screenparse) — 771K web page screenshots with dense annotations across 55 UI element classes. Annotations were generated through automated DOM extraction, IoU-based filtering, and VLM-based refinement.
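For readers unfamiliar with the metric behind "IoU-based filtering", here is a minimal sketch of intersection-over-union for two axis-aligned boxes in (x1, y1, x2, y2) form. It illustrates the metric only: the actual filtering rules and thresholds used to build ScreenParse are not described in this card, and `box_iou` is a hypothetical helper name.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Returns a value in [0, 1]; 0.0 when the boxes do not overlap
    or when both boxes have zero area.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping by half: intersection 50, union 150.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

The same metric underlies the `iou` argument passed to `model.predict` in the usage examples, where it acts as the overlap threshold for non-maximum suppression.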
## Limitations
- Does not produce text content for detected elements (bounding boxes and labels only) — pair with an OCR model or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) for text extraction
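
One way to pair the detector with a text-extraction step is to crop each detected region and pass the crop on. The sketch below only computes the crop rectangle: it pads a detected box slightly and clamps it to the image bounds. `padded_crop_box` and the default padding of 4 pixels are assumptions for illustration, not part of ScreenParser or ultralytics. The returned tuple follows the (left, upper, right, lower) order expected by Pillow's `Image.crop`.

```python
import math


def padded_crop_box(box, img_w, img_h, pad=4):
    """Expand an (x1, y1, x2, y2) box by `pad` pixels, clamped to the image.

    Hypothetical helper; returns integer (left, upper, right, lower).
    """
    x1, y1, x2, y2 = box
    left = max(0, math.floor(x1) - pad)
    upper = max(0, math.floor(y1) - pad)
    right = min(img_w, math.ceil(x2) + pad)
    lower = min(img_h, math.ceil(y2) + pad)
    return left, upper, right, lower


# Made-up box near the left and right edges of a 52x100 image.
print(padded_crop_box((2.5, 10.0, 50.2, 30.0), img_w=52, img_h=100))
```

The resulting region could then be cropped with `Image.open("screenshot.png").crop(region)` and handed to whichever OCR tool is in use.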
## Citation
```bibtex
@misc{gurbuz2026movingsparsegroundingcomplete,
title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
year={2026},
eprint={2602.14276},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.14276},
}
```