---
library_name: transformers
tags:
  - onnx
  - vlm
  - gui-automation
  - pix2struct
  - transformers.js
  - webgpu
license: apache-2.0
base_model: google/pix2struct-base
---

# Pix2Act ONNX

ONNX export of [google/pix2struct-base](https://huggingface.co/google/pix2struct-base) for GUI automation in the browser via Transformers.js and WebGPU.

## Model Details

| Property | Value |
| --- | --- |
| Original Model | [google/pix2struct-base](https://huggingface.co/google/pix2struct-base) |
| Architecture | Pix2Struct (Vision2Seq) |
| Parameters | ~282M |
| Export Format | ONNX (opset 17) |
| Quantization | INT8 dynamic |
| Browser Eligible | Yes |

## Description

Pix2Struct is a pretrained image-to-text model for understanding visually situated language. Because it operates on pixels alone (no DOM, OCR, or accessibility tree), it is well suited to:

- Screenshot understanding
- GUI automation
- Document parsing
- Web interaction

## Usage

### With ONNX Runtime (Python)

```python
import onnxruntime as ort
from transformers import AutoProcessor
from PIL import Image

# Load the processor from the original checkpoint
processor = AutoProcessor.from_pretrained("google/pix2struct-base")

# Load the exported ONNX encoder
session = ort.InferenceSession("encoder_model.onnx")

# Render the screenshot into flattened patches
image = Image.open("screenshot.png")
inputs = processor(images=image, return_tensors="np")

# Run the encoder; the first output holds the encoder hidden states
outputs = session.run(None, dict(inputs))
```
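The encoder alone only produces hidden states; generating text requires looping over the exported decoder. A hedged sketch of the greedy loop — `step_fn` stands in for a call to the decoder ONNX session, and the real input/output names (`input_ids`, `encoder_hidden_states`, `logits`) depend on the export layout and are assumptions here:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=32):
    """Greedy autoregressive decoding.

    step_fn maps an int64 array of shape (1, seq_len) to logits of shape
    (1, seq_len, vocab_size); returns the generated token id list.
    """
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(np.array([ids], dtype=np.int64))
        # Take the highest-scoring token at the last position
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

With the exported decoder, `step_fn` would wrap `decoder_session.run`, feeding the growing `input_ids` together with the encoder hidden states from the snippet above; a production loop would also reuse past key/value caches rather than re-running the full prefix each step.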

### With Transformers.js (Browser)

```javascript
import { pipeline } from "@xenova/transformers";

// Load the ONNX model
const pipe = await pipeline("image-to-text", "bollscoasts/pix2act-onnx");

// Run inference on a screenshot
const result = await pipe(image);
console.log(result);
```

## Export Details

Exported using HotelBench ONNX export utilities.

```shell
hotelbench export-onnx pix2act -o models/pix2act-onnx --quantize
```

## License

Apache 2.0