OmniParser for Pure Vision Based GUI Agent

Paper: *OmniParser for Pure Vision Based GUI Agent* (arXiv:2408.00203)
ONNX conversion of microsoft/OmniParser-v2.0 for fast CPU/GPU inference.
| Model | Description | Size |
|---|---|---|
| `detector.onnx` | YOLO-based UI element detector | 80 MB |
| `caption.onnx` + `caption.onnx.data` | Icon/element captioning model | 350 MB |
| `florence2_onnx/` | Florence-2 vision-language model (encoder, decoder, vision) | 1.1 GB |
| `paddleocr_onnx/` | PaddleOCR text detection and recognition | 10 MB |
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load detector (falls back to CPU if CUDA is unavailable)
session = ort.InferenceSession(
    "detector.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Preprocess image: force RGB (screenshots are often RGBA), resize to
# 640x640, convert HWC -> CHW, normalize to [0, 1], add batch dim
img = Image.open("screenshot.png").convert("RGB").resize((640, 640))
input_array = np.array(img).transpose(2, 0, 1).astype(np.float32) / 255.0
input_array = np.expand_dims(input_array, 0)

# Run inference
outputs = session.run(None, {"images": input_array})
```
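The raw detector output still needs post-processing: filter low-confidence detections and map box coordinates from the 640×640 input back to the original screenshot resolution. A minimal NumPy sketch is below; it assumes each detection row is `(x1, y1, x2, y2, score)` in input-image pixels, which you should verify against the actual output layout of this export (the `postprocess` helper and the threshold are illustrative, not part of the model card).

```python
import numpy as np

def postprocess(dets, orig_w, orig_h, conf_thres=0.3, input_size=640):
    """Filter detections by confidence and rescale boxes from the
    640x640 detector input to the original screenshot resolution.

    Assumes rows of (x1, y1, x2, y2, score) in input-image pixels;
    check your export's actual output format before relying on this.
    """
    dets = dets[dets[:, 4] >= conf_thres]
    scale = np.array([orig_w, orig_h, orig_w, orig_h], dtype=np.float32) / input_size
    boxes = dets[:, :4] * scale
    return boxes, dets[:, 4]

# Example with dummy detections on a 1920x1080 screenshot
dummy = np.array([
    [ 64.0,  64.0, 128.0, 128.0, 0.90],
    [320.0, 320.0, 400.0, 400.0, 0.10],  # below threshold, dropped
], dtype=np.float32)
boxes, scores = postprocess(dummy, 1920, 1080)
```

Since the resize is non-uniform (screenshots are rarely square), the x and y axes get different scale factors; the per-axis `scale` vector handles that.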
Converted to ONNX by @maxiboch. See the discussion at microsoft/OmniParser-v2.0#5.

Base model: microsoft/OmniParser-v2.0