---
license: apache-2.0
datasets:
- docling-project/screenparse
tags:
- object-detection
- yolo
- ui-understanding
- screen-parsing
- grounding
- web
- ultralytics
language:
- en
pipeline_tag: object-detection
library_name: ultralytics
---

# ScreenParser

**ScreenParser** is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/docling-project/screenparse), a large-scale dataset of 771K web page screenshots with dense annotations across **55 UI element classes**. Given a screenshot, it detects and classifies every visible UI component with bounding boxes and confidence scores.

- **Developed by**: IBM Research - ETH Zurich
- **Model type**: Object detection (YOLO11-L)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision](https://arxiv.org/abs/2602.14276)
- **Code**: [GitHub](TODO)
- **Dataset**: [docling-project/screenparse](https://huggingface.co/docling-project/screenparse)

## Model Summary

ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model (25.4M parameters) fine-tuned at 1280px resolution on ScreenParse.

### Supported Classes (55)

Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo
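If only a few of these classes matter for a given task, one option is to look up their numeric ids and pass them to the `classes` argument of `predict`. The sketch below assumes a `names` mapping shaped like `model.names` (class id to class name); the helper `class_ids_for` and the three-entry `example_names` mapping are hypothetical and do not reflect the model's real id order.

```python
def class_ids_for(names, wanted):
    """Return the sorted class ids whose names match `wanted` (case-insensitive).

    `names` is a dict of class id -> class name, shaped like `model.names`.
    """
    wanted_lower = {w.lower() for w in wanted}
    found = {cid: name for cid, name in names.items() if name.lower() in wanted_lower}
    missing = wanted_lower - {name.lower() for name in found.values()}
    if missing:
        # Fail loudly on typos instead of silently detecting nothing.
        raise KeyError(f"unknown class names: {sorted(missing)}")
    return sorted(found)


# Hypothetical stand-in for model.names; the real mapping has 55 entries.
example_names = {0: "Button", 1: "Link", 2: "Text Input"}
ids = class_ids_for(example_names, ["button", "Text Input"])
print(ids)
```

With a loaded model, the result could be used as `model.predict("screenshot.png", classes=class_ids_for(model.names, ["Button", "Link"]))`.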

## Usage

### Single Image Inference

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")

# conf is the minimum confidence to keep a detection; iou is the NMS overlap threshold.
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)

# Print each detection as label, confidence, and (x, y, width, height).
for r in results:
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        print(f"{label:20s}  conf={conf:.2f}  bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
```
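The loop above prints boxes as `(x, y, width, height)` even though the model returns corner coordinates. For downstream tools it can be handier to keep that conversion in one place and emit JSON. This is a minimal sketch operating on plain Python lists; the helper names `xyxy_to_xywh` and `to_records`, and the sample values, are made up for illustration.

```python
import json


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to integer (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (int(x1), int(y1), int(x2 - x1), int(y2 - y1))


def to_records(boxes, class_ids, confidences, names):
    """Build JSON-serializable dicts from parallel lists of detections."""
    records = []
    for box, cls_id, conf in zip(boxes, class_ids, confidences):
        records.append({
            "label": names[int(cls_id)],
            "confidence": round(float(conf), 4),
            "bbox_xywh": list(xyxy_to_xywh(box)),
        })
    return records


# Made-up sample values standing in for one result's boxes, classes, and scores.
sample = to_records(
    boxes=[(10.0, 20.0, 110.0, 70.0)],
    class_ids=[0.0],
    confidences=[0.875],
    names={0: "Button"},
)
print(json.dumps(sample, indent=2))
```

With real results, the inputs could come from `r.boxes.xyxy.tolist()`, `r.boxes.cls.tolist()`, and `r.boxes.conf.tolist()`, with `model.names` as the mapping.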

### Batch Inference

```python
import os
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
IMAGE_DIR = "screenshots/"

images = sorted(
    os.path.join(IMAGE_DIR, f) for f in os.listdir(IMAGE_DIR)
    if f.lower().endswith((".png", ".jpg", ".jpeg"))
)

results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)

for path, r in zip(images, results):
    print(f"--- {os.path.basename(path)} ({len(r.boxes)} elements) ---")
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        print(f"  {label:20s}  conf={conf:.2f}  bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
```

### Save Visualizations

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
# The annotated image is saved under runs/detect/predict/ by default
# (later runs go to predict2/, predict3/, ...).
```

**Training data**: [ScreenParse](https://huggingface.co/docling-project/screenparse) — 771K web page screenshots with dense annotations across 55 UI element classes. Annotations were generated through automated DOM extraction, IoU-based filtering, and VLM-based refinement.
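Intersection over union (IoU) appears twice in this card: as the `iou` threshold passed to `predict`, and in the IoU-based filtering step of the annotation pipeline. The exact filtering rules used to build ScreenParse are not described here, so the following is only a generic sketch of how IoU between two `(x1, y1, x2, y2)` boxes is computed; the function name `iou` is illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that share half their area: overlap 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A low threshold such as `iou=0.10` means that two boxes overlapping by more than 10% IoU are treated as duplicates during non-maximum suppression, so only the higher-scoring one is kept.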

## Limitations

- Does not produce text content for detected elements; the output is bounding boxes and class labels only. Pair it with an OCR model or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) for text extraction.
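When pairing detections with an OCR model, a common first step is to cut each detected region out of the screenshot. The sketch below only computes the crop rectangle: it adds a small margin around the box and clamps it to the image bounds. The helper name `padded_crop_box` and the default 2-pixel margin are assumptions for illustration, not part of ScreenParser.

```python
import math


def padded_crop_box(box, image_width, image_height, pad=2):
    """Expand an (x1, y1, x2, y2) box by `pad` pixels and clamp it to the image.

    Returns integer (left, upper, right, lower) coordinates.
    """
    x1, y1, x2, y2 = box
    left = max(0, math.floor(x1) - pad)
    upper = max(0, math.floor(y1) - pad)
    right = min(image_width, math.ceil(x2) + pad)
    lower = min(image_height, math.ceil(y2) + pad)
    return (left, upper, right, lower)


# A box near the left and right edges of a 100x60 image gets clamped.
print(padded_crop_box((1.0, 5.0, 99.5, 50.0), image_width=100, image_height=60))
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so each crop can be handed to whichever OCR engine is in use.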

## Citation

```bibtex
@misc{gurbuz2026movingsparsegroundingcomplete,
      title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
      author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
      year={2026},
      eprint={2602.14276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.14276},
}
```