Saidgurbuz committed
Commit 74dbff2 (verified) · 1 Parent(s): f3fce1e

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +106 -0
  2. best.pt +3 -0
README.md ADDED
@@ -0,0 +1,106 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - docling-project/screenparse
+ tags:
+ - object-detection
+ - yolo
+ - ui-understanding
+ - screen-parsing
+ - grounding
+ - web
+ - ultralytics
+ language:
+ - en
+ pipeline_tag: object-detection
+ library_name: ultralytics
+ ---
+
+ # ScreenParser
+
+ **ScreenParser** is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/datasets/docling-project/screenparse), a large-scale dataset of 771K web page screenshots with dense annotations across **55 UI element classes**. Given a screenshot, it detects and classifies every visible UI component, returning a bounding box and a confidence score for each.
+
+ - **Developed by**: IBM Research - ETH Zurich
+ - **Model type**: Object detection (YOLO11-L)
+ - **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+ - **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing](TODO)
+ - **Code**: [GitHub](TODO)
+ - **Dataset**: [docling-project/screenparse](https://huggingface.co/datasets/docling-project/screenparse)
+
+ ## Model Summary
+
+ ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model (25.4M parameters) fine-tuned at 1280px resolution on ScreenParse.
+
+ ### Supported Classes (55)
+
+ Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo
+
+ ## Usage
+
+ ### Single Image Inference
+
+ ```python
+ from ultralytics import YOLO
+ from PIL import Image
+
+ model = YOLO("docling-project/ScreenParser")
+
+ results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)  # confidence and NMS IoU thresholds
+
+ for r in results:
+     for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
+         x1, y1, x2, y2 = box.tolist()  # corner coordinates in pixels
+         label = model.names[int(cls_id)]  # class index -> class name
+         print(f"{label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")  # bbox as (x, y, width, height)
+ ```
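Whether `YOLO(...)` can resolve a Hugging Face repository id directly may depend on the installed Ultralytics version. As a hedged alternative, the sketch below downloads the checkpoint explicitly and passes the local path to `YOLO`. It assumes the `huggingface_hub` package is installed and that the weights are stored as `best.pt` at the repository root, as uploaded in this commit. The helper name `load_screenparser` is hypothetical.

```python
def load_screenparser(repo_id="docling-project/ScreenParser", filename="best.pt"):
    """Download the checkpoint from the Hub (cached after the first call) and load it."""
    # Imported lazily so that defining the helper does not require either package.
    from huggingface_hub import hf_hub_download
    from ultralytics import YOLO

    # hf_hub_download returns the local path of the cached file.
    weights_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return YOLO(weights_path)
```

Calling `model = load_screenparser()` would then yield an object that can be used with `model.predict(...)` exactly as in the snippet above.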
+
+ ### Batch Inference
+
+ ```python
+ import os
+ from ultralytics import YOLO
+
+ model = YOLO("docling-project/ScreenParser")
+ IMAGE_DIR = "screenshots/"
+
+ images = sorted(
+     os.path.join(IMAGE_DIR, f) for f in os.listdir(IMAGE_DIR)
+     if f.lower().endswith((".png", ".jpg", ".jpeg"))
+ )
+
+ results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)  # one result per input image
+
+ for path, r in zip(images, results):
+     print(f"--- {os.path.basename(path)} ({len(r.boxes)} elements) ---")
+     for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
+         x1, y1, x2, y2 = box.tolist()  # corner coordinates in pixels
+         label = model.names[int(cls_id)]  # class index -> class name
+         print(f" {label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")
+ ```
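Printed output is awkward to post-process, so a follow-up step might convert detections into plain records. The sketch below is one possible shape for that, not part of the released code: the function names and the record layout are assumptions, and the sample detections are made up for illustration.

```python
import json
from collections import Counter


def detections_to_records(labels, confidences, boxes_xyxy):
    """Turn parallel lists of labels, scores and xyxy boxes into JSON-ready dicts."""
    records = []
    for label, conf, (x1, y1, x2, y2) in zip(labels, confidences, boxes_xyxy):
        records.append({
            "label": label,
            "confidence": round(float(conf), 4),
            # Same (x, y, width, height) layout as the print statements above.
            "bbox_xywh": [int(x1), int(y1), int(x2 - x1), int(y2 - y1)],
        })
    return records


def count_by_label(records):
    """Count how many detections were found for each class label."""
    return dict(Counter(r["label"] for r in records))


# Hypothetical detections standing in for one entry of `results`.
records = detections_to_records(
    labels=["Button", "Link", "Button"],
    confidences=[0.91, 0.47, 0.33],
    boxes_xyxy=[(10.0, 20.0, 110.0, 60.0), (5.0, 100.0, 85.0, 118.0), (200.0, 20.0, 260.0, 60.0)],
)
print(json.dumps(count_by_label(records)))
```

With a real result `r`, the three inputs could be built as `[model.names[int(c)] for c in r.boxes.cls]`, `r.boxes.conf.tolist()`, and `r.boxes.xyxy.tolist()`.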
+
+ ### Save Visualizations
+
+ ```python
+ from ultralytics import YOLO
+
+ model = YOLO("docling-project/ScreenParser")
+ results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
+ # Annotated image saved under runs/detect/predict/ by default (later runs may use an incremented directory name)
+ ```
+
+ **Training data**: [ScreenParse](https://huggingface.co/datasets/docling-project/screenparse), a dataset of 771K web page screenshots with dense annotations across 55 UI element classes. Annotations were generated through automated DOM extraction, IoU-based filtering, and VLM-based refinement.
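The card does not describe how IoU is used during annotation filtering. For readers unfamiliar with the metric, the sketch below computes intersection-over-union for two `(x1, y1, x2, y2)` boxes and applies it to one plausible use, dropping near-duplicate boxes. The duplicate rule and the `0.9` threshold are assumptions for illustration, not the dataset's actual procedure.

```python
def iou_xyxy(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def drop_near_duplicates(boxes, threshold=0.9):
    """Keep a box only if its IoU with every previously kept box is below the threshold."""
    kept = []
    for box in boxes:
        if all(iou_xyxy(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


# Two half-overlapping 10x10 boxes: intersection 50, union 150.
print(iou_xyxy((0, 0, 10, 10), (5, 0, 15, 10)))
```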
+
+ ## Limitations
+
+ - The model outputs bounding boxes, class labels, and confidence scores only, not the text content of detected elements. Pair it with an OCR model or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) for text extraction.
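One way to hand detections to an OCR model is to crop each detected region first. The sketch below only computes the crop rectangle: it pads a float `(x1, y1, x2, y2)` box and clamps it to the image bounds. The helper name `padded_crop_box` and the 4-pixel default padding are assumptions, and the OCR step itself is left out.

```python
import math


def padded_crop_box(box_xyxy, image_size, pad=4):
    """Expand an xyxy box by `pad` pixels, clamp it to the image, and return ints."""
    x1, y1, x2, y2 = box_xyxy
    width, height = image_size
    # Round outwards so the crop never cuts into the detected element.
    left = max(0, math.floor(x1) - pad)
    upper = max(0, math.floor(y1) - pad)
    right = min(width, math.ceil(x2) + pad)
    lower = min(height, math.ceil(y2) + pad)
    return (left, upper, right, lower)


# Hypothetical detection near the left and bottom edges of a 60x32 image.
print(padded_crop_box((2.4, 10.0, 50.6, 30.2), image_size=(60, 32)))
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so a region could be extracted with `Image.open(path).crop(padded_crop_box(box, image.size))` before being passed to an OCR engine.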
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{screenparse2026,
+   title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing},
+   author={TODO},
+   booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+   year={2026}
+ }
+ ```
best.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d1d16bb335e6f38280dafbb3d2f2937975b62d8ef68ab3cf474b15b145b73286
+ size 51358361