---
library_name: transformers
tags:
  - onnx
  - vlm
  - gui-automation
  - pix2struct
  - transformers.js
  - webgpu
license: apache-2.0
base_model: google/pix2struct-base
---

# Pix2Act ONNX

ONNX export of [google/pix2struct-base](https://huggingface.co/google/pix2struct-base) for GUI automation in the browser via Transformers.js and WebGPU.

## Model Details

| Property | Value |
| --- | --- |
| Original Model | [google/pix2struct-base](https://huggingface.co/google/pix2struct-base) |
| Architecture | Pix2Struct (Vision2Seq) |
| Parameters | ~282M |
| Export Format | ONNX (opset 17) |
| Quantization | INT8 dynamic |
| Browser Eligible | Yes |

## Description

Pix2Struct is a pretrained image-to-text model for understanding visually situated language. Because it operates on pixels alone (no DOM, OCR, or accessibility tree), it is well suited to:

- Screenshot understanding
- GUI automation
- Document parsing
- Web interaction

## Usage

### With ONNX Runtime (Python)

```python
import onnxruntime as ort
from transformers import AutoProcessor
from PIL import Image

# Load the processor from the original checkpoint
processor = AutoProcessor.from_pretrained("google/pix2struct-base")

# Load the exported ONNX encoder
session = ort.InferenceSession("encoder_model.onnx")

# Render the screenshot into flattened patches
image = Image.open("screenshot.png")
inputs = processor(images=image, return_tensors="np")

# Run the encoder; the first output holds the encoder hidden states
outputs = session.run(None, dict(inputs))
```
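The encoder alone only produces hidden states; generating text requires looping over the exported decoder. A hedged sketch of the greedy loop — `step_fn` stands in for a call to the decoder ONNX session, and the real input/output names (`input_ids`, `encoder_hidden_states`, `logits`) depend on the export layout and are assumptions here:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=32):
    """Greedy autoregressive decoding.

    step_fn maps an int64 array of shape (1, seq_len) to logits of shape
    (1, seq_len, vocab_size); returns the generated token id list.
    """
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(np.array([ids], dtype=np.int64))
        # Take the highest-scoring token at the last position
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

With the exported decoder, `step_fn` would wrap `decoder_session.run`, feeding the growing `input_ids` together with the encoder hidden states from the snippet above; a production loop would also reuse past key/value caches rather than re-running the full prefix each step.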

### With Transformers.js (Browser)

```javascript
import { pipeline } from "@xenova/transformers";

// Load the ONNX model
const pipe = await pipeline("image-to-text", "bollscoasts/pix2act-onnx");

// Run inference on a screenshot
const result = await pipe(image);
console.log(result);
```

## Export Details

Exported using HotelBench ONNX export utilities.

```shell
hotelbench export-onnx pix2act -o models/pix2act-onnx --quantize
```

## License

Apache 2.0