How to use olragon/ScreenVLM-MLX-4bit with MLX:
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model, processor = load("olragon/ScreenVLM-MLX-4bit")
config = load_config("olragon/ScreenVLM-MLX-4bit")
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=1
)
# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

MLX 4-bit quantized version of docling-project/ScreenVLM for fast inference on Apple Silicon.
| Metric | Value |
|---|---|
| Prompt processing | 747–1382 tok/s |
| Generation speed | 432–462 tok/s |
| Inference time | ~1.7s (172 tokens) |
| Peak memory | 1.1–1.2 GB |
| Model load | 1.1s |
~500× faster than PyTorch CPU on the same hardware.
pip install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("olragon/ScreenVLM-MLX-4bit")
config = load_config("olragon/ScreenVLM-MLX-4bit")
prompt = apply_chat_template(processor, config, "<screentag>", num_images=1)
output = generate(model, processor, prompt, image="screenshot.png", max_tokens=2048)
print(output)
The model emits one tag per UI element, drawn from 55 element classes, each with a bounding box normalized to a 0–500 grid:
<button><loc_391><loc_46><loc_451><loc_49>Get started</button>
<tab><loc_582><loc_170><loc_633><loc_174>Tables</tab>
<logo><loc_62><loc_19><loc_182><loc_42>filament</logo>
<text><loc_73><loc_171><loc_427><loc_175>A cohesive set of building blocks</text>
Element types include: Button, Navigation Bar, Text Input, Link, Tab, Image, Video, Table, List, Card, Badge, Avatar, Alert, Search Bar, Logo, Heading, Code snippet, Checkbox, and more.
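The tagged output above can be turned into structured records with a small parser. The following is a minimal sketch, not part of the model's tooling: the names `UIElement`, `parse_elements`, and `to_pixels` are hypothetical, and the code assumes the four `<loc_…>` tokens are ordered as x_min, y_min, x_max, y_max, which the examples suggest but this card does not state. The default grid size of 500 is taken from the description above.

```python
import re
from dataclasses import dataclass

# One element looks like: <tag><loc_a><loc_b><loc_c><loc_d>text</tag>
ELEMENT_RE = re.compile(
    r"<(?P<tag>[A-Za-z_][\w-]*)>"
    r"<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)><loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


@dataclass
class UIElement:
    tag: str                        # element class, e.g. "button"
    box: tuple[int, int, int, int]  # grid coordinates (assumed x0, y0, x1, y1)
    text: str                       # text content between the tags


def parse_elements(output: str) -> list[UIElement]:
    """Extract every well-formed element tag from the model output."""
    elements = []
    for match in ELEMENT_RE.finditer(output):
        box = tuple(int(match.group(key)) for key in ("x0", "y0", "x1", "y1"))
        elements.append(UIElement(match.group("tag"), box, match.group("text").strip()))
    return elements


def to_pixels(box, width, height, grid=500):
    """Scale a grid-space box to pixel coordinates for an image of the given size."""
    x0, y0, x1, y1 = box
    return (x0 * width / grid, y0 * height / grid,
            x1 * width / grid, y1 * height / grid)


if __name__ == "__main__":
    sample = "<button><loc_391><loc_46><loc_451><loc_49>Get started</button>"
    for element in parse_elements(sample):
        print(element.tag, element.box, element.text)
```

The sketch does not clamp coordinates, so any `<loc_…>` value larger than the grid size maps to a point outside the image; callers may want to check for that before drawing boxes.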
python -m mlx_vlm.convert \
--model docling-project/ScreenVLM \
--quantize --q-bits 4 \
--mlx-path ./ScreenVLM-MLX-4bit
Conversion requires mlx-vlm >= 0.1.12, plus torch and torchvision (needed for image processor conversion).
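Before running the conversion, it can help to confirm that the listed packages are present. The following is a rough sketch using only the standard library; `version_tuple` and `missing_requirements` are hypothetical helper names, and the version parsing is deliberately simple (it reads leading digits of each dotted part and ignores suffixes such as `rc1`).

```python
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '0.1.12' into (0, 1, 12), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def missing_requirements(min_mlx_vlm: str = "0.1.12") -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for name in ("mlx-vlm", "torch", "torchvision"):
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        # Only mlx-vlm has a minimum version stated in this card.
        if name == "mlx-vlm" and version_tuple(installed) < version_tuple(min_mlx_vlm):
            problems.append(f"mlx-vlm {installed} is older than {min_mlx_vlm}")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.1.9" sorts after "0.1.12", while as tuples (0, 1, 9) is correctly less than (0, 1, 12).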
@inproceedings{gurbuz2026screenparse,
title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
author={Gurbuz, A. Said and Hong, Sunghwan and Nassar, Ahmed and Pollefeys, Marc and Staar, Peter},
booktitle={ICML},
year={2026}
}
Original model by IBM Research & ETH Zurich. MLX conversion by olragon.