ScreenVLM MLX 4-bit

MLX 4-bit quantized version of docling-project/ScreenVLM for fast inference on Apple Silicon.

Model Details

Base model: ScreenVLM (316M params, Idefics3 = SigLIP2-base-patch16-512 + Granite 165M)
Quantization: 4-bit affine (7.654 bits/weight avg, vision encoder at higher precision)
Size: 288 MB (vs 721 MB original float32)
License: Apache 2.0
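
As a rough sanity check, the size figure can be reproduced from the parameter count and the average bits per weight listed above. This is a minimal sketch, assuming the 316M figure is a rounded parameter count and that "MB" in the size line is read as binary megabytes (MiB). The function name is illustrative only.

```python
# Back-of-envelope check of the quantized model size.
# Assumptions: 316M is a rounded parameter count; 7.654 is the average
# bits per weight across all tensors (quantized and higher-precision).
PARAMS = 316_000_000
BITS_PER_WEIGHT = 7.654


def quantized_size_bytes(params: int, bits_per_weight: float) -> float:
    """Total weight storage in bytes for a given average bit width."""
    return params * bits_per_weight / 8


size = quantized_size_bytes(PARAMS, BITS_PER_WEIGHT)
print(f"{size / 1e6:.0f} MB (decimal), {size / 2**20:.0f} MiB (binary)")
# prints: 302 MB (decimal), 288 MiB (binary)
```

The binary figure lines up with the 288 MB listed above, which suggests the average bit width already accounts for the tensors kept at higher precision.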

Performance (Apple M4, 64GB)

Metric	Value
Prompt processing	747–1382 tok/s
Generation speed	432–462 tok/s
Inference time	~1.7s (172 tokens)
Peak memory	1.1–1.2 GB
Model load	1.1s

~500× faster than PyTorch CPU on the same hardware.
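
The table's figures can be related to each other with simple arithmetic. The sketch below assumes the 172 tokens are generated tokens and that the generation speed applies uniformly to them; the source does not break the ~1.7s down further, so what the remainder covers (for example image preprocessing, vision encoding, and prompt processing) is a guess rather than a measurement.

```python
# Rough decomposition of the reported ~1.7 s inference time.
# Assumption: all 172 tokens are decoded at the reported generation speed.
TOKENS = 172
TOTAL_SECONDS = 1.7


def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time spent decoding `tokens` at a constant generation speed."""
    return tokens / tokens_per_second


for tps in (432, 462):  # low and high end of the reported range
    decode = decode_seconds(TOKENS, tps)
    rest = TOTAL_SECONDS - decode
    print(f"{tps} tok/s: decode {decode:.2f} s, everything else {rest:.2f} s")
```

Under these assumptions decoding accounts for roughly 0.4 s of the total, so most of the wall-clock time is spent before the first generated token.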

Usage

pip install mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the quantized weights, the processor, and the model config.
model, processor = load("olragon/ScreenVLM-MLX-4bit")
config = load_config("olragon/ScreenVLM-MLX-4bit")

# "<screentag>" is the prompt text; num_images=1 reserves one image slot in the chat template.
prompt = apply_chat_template(processor, config, "<screentag>", num_images=1)
output = generate(model, processor, prompt, image="screenshot.png", max_tokens=2048)
print(output)
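
Depending on the installed mlx-vlm version, `generate` may return a plain string or a result object that carries the text in a `.text` attribute. This is an assumption about version differences, not something stated in this card. The hypothetical helper below normalizes both cases so downstream code can always work with a string.

```python
def as_text(output) -> str:
    """Return the generated text from either a str or an object with `.text`.

    `as_text` is a hypothetical helper, not part of mlx-vlm.
    """
    if isinstance(output, str):
        return output
    text = getattr(output, "text", None)
    if not isinstance(text, str):
        raise TypeError(f"cannot extract text from {type(output).__name__}")
    return text
```

With this in place, `as_text(output)` can be passed to any parser that expects the raw ScreenTag string.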

Output Format (ScreenTag)

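The model emits one tag per detected element, drawn from 55 UI element classes. Each tag wraps four <loc_N> tokens that give the bounding box on a normalized grid (0–500), followed by the element's text: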

<button><loc_391><loc_46><loc_451><loc_49>Get started</button>
<tab><loc_582><loc_170><loc_633><loc_174>Tables</tab>
<logo><loc_62><loc_19><loc_182><loc_42>filament</logo>
<text><loc_73><loc_171><loc_427><loc_175>A cohesive set of building blocks</text>

Element types include: Button, Navigation Bar, Text Input, Link, Tab, Image, Video, Table, List, Card, Badge, Avatar, Alert, Search Bar, Logo, Heading, Code snippet, Checkbox, and more.
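
The examples above suggest a simple way to turn the raw output into structured records. This is a minimal sketch, not the official parser: it assumes elements are flat (not nested), that the four location tokens are ordered x1, y1, x2, y2, and that coordinates scale linearly from the grid to pixels. `Element`, `parse_screentag`, and `to_pixels` are hypothetical names, and the grid size is left as a parameter.

```python
import re
from typing import NamedTuple


class Element(NamedTuple):
    label: str                       # tag name, e.g. "button"
    box: tuple[int, int, int, int]   # raw grid coordinates, assumed (x1, y1, x2, y2)
    text: str                        # text between the location tokens and the closing tag


# One element: <label><loc_a><loc_b><loc_c><loc_d>text</label>
ELEMENT_RE = re.compile(
    r"<(?P<label>[\w-]+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<text>.*?)"
    r"</(?P=label)>",
    re.DOTALL,
)


def parse_screentag(output: str) -> list[Element]:
    """Extract every flat element found in a ScreenTag string."""
    return [
        Element(
            label=m.group("label"),
            box=tuple(int(m.group(k)) for k in ("x1", "y1", "x2", "y2")),
            text=m.group("text").strip(),
        )
        for m in ELEMENT_RE.finditer(output)
    ]


def to_pixels(box, width: int, height: int, grid: int = 500):
    """Scale a grid box to pixel coordinates (grid size is an assumption)."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / grid),
        round(y1 * height / grid),
        round(x2 * width / grid),
        round(y2 * height / grid),
    )


sample = (
    "<button><loc_391><loc_46><loc_451><loc_49>Get started</button>\n"
    "<logo><loc_62><loc_19><loc_182><loc_42>filament</logo>"
)
for element in parse_screentag(sample):
    print(element.label, element.box, element.text)
```

Keeping the raw grid coordinates in `Element` and converting to pixels only on demand makes it easy to reuse the same parse result for screenshots of different sizes.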

Conversion

python -m mlx_vlm.convert \
  --model docling-project/ScreenVLM \
  --quantize --q-bits 4 \
  --mlx-path ./ScreenVLM-MLX-4bit

Conversion requires mlx-vlm >= 0.1.12, plus torch and torchvision (used for the image processor conversion).

Citation

@inproceedings{gurbuz2026screenparse,
  title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
  author={Gurbuz, A. Said and Hong, Sunghwan and Nassar, Ahmed and Pollefeys, Marc and Staar, Peter},
  booktitle={ICML},
  year={2026}
}

Acknowledgments

Original model by IBM Research & ETH Zurich. MLX conversion by olragon.
