# Pix2Act ONNX

ONNX export of google/pix2struct-base for GUI automation via Transformers.js/WebGPU.

## Model Details

| Property | Value |
|----------|-------|
| Original Model | google/pix2struct-base |
| Architecture | Pix2Struct (Vision2Seq) |
| Parameters | ~282M |
| Export Format | ONNX (opset 17) |
| Quantization | INT8 dynamic |
| Browser Eligible | Yes |

## Description

Pix2Struct is a pretrained image-to-text model designed for understanding visually situated language. Its pixel-only approach (it consumes raw screenshots rather than OCR output or HTML) makes it well suited for:

- Screenshot understanding
- GUI automation
- Document parsing
- Web interaction

## Usage

### With ONNX Runtime (Python)

```python
import onnxruntime as ort
from transformers import AutoProcessor
from PIL import Image

# Load the processor (converts images into flattened patches)
processor = AutoProcessor.from_pretrained("google/pix2struct-base")

# Load the exported encoder; note that generating text also requires
# running the decoder graph autoregressively
session = ort.InferenceSession("encoder_model.onnx")

# Process image
image = Image.open("screenshot.png")
inputs = processor(images=image, return_tensors="np")

# Run inference (returns the encoder hidden states)
outputs = session.run(None, dict(inputs))
```
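Under the hood, the processor renders the screenshot into "flattened patches": each fixed-size patch is flattened into a vector with its row/column position prepended. A simplified NumPy sketch of that input format (illustrative only — the real processor also rescales the image to a variable resolution and normalizes pixel values, and the helper name here is hypothetical):

```python
import numpy as np

def flatten_patches(image, patch_size=16):
    """Split an H x W x C image into patch_size x patch_size patches and
    flatten each one, prepending its (row, col) position — a simplified
    sketch of Pix2Struct's variable-resolution input format."""
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    patches = []
    for r in range(rows):
        for col in range(cols):
            patch = image[r * patch_size:(r + 1) * patch_size,
                          col * patch_size:(col + 1) * patch_size]
            # Position indices (1-based) come first, then the pixel values.
            patches.append(np.concatenate(([r + 1, col + 1], patch.reshape(-1))))
    return np.stack(patches)

img = np.zeros((32, 48, 3), dtype=np.float32)
feats = flatten_patches(img)
print(feats.shape)  # (6, 770): 2*3 patches, 2 position ids + 16*16*3 pixels
```

The per-patch dimension of 770 matches the hidden size of the `flattened_patches` tensor the real processor emits for a 16×16 patch over 3 channels.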

### With Transformers.js (Browser)

```javascript
import { pipeline } from "@xenova/transformers";

// Load the ONNX model
const pipe = await pipeline("image-to-text", "bollscoasts/pix2act-onnx");

// Run inference on a screenshot (`image` may be a URL or image path)
const result = await pipe(image);
console.log(result);
```

## Export Details

Exported using HotelBench ONNX export utilities.

```shell
hotelbench export-onnx pix2act -o models/pix2act-onnx --quantize
```

## License

Apache 2.0
