Pix2Act-WebLINX-Base ONNX

ONNX export of McGill-NLP/pix2act-base-weblinx optimized for browser and server-side inference.

Model Details

| Property | Value |
|---|---|
| Original Model | McGill-NLP/pix2act-base-weblinx |
| Architecture | Pix2Struct (Vision2Seq) |
| Parameters | ~282M |
| Export Format | ONNX (opset 17) |
| Quantization | INT8 dynamic |
| Model Size | ~1.17 GB (quantized) |
| Browser Eligible | Borderline (works with WebGPU) |

Description

Pix2Act-WebLINX is fine-tuned on the WebLINX dataset for real-world web navigation tasks. It uses a pixel-only approach (no HTML/DOM access), which makes it well suited for:

  • Web automation tasks
  • GUI interaction from screenshots
  • Cross-platform browser control
  • Accessibility applications

Files

| File | Size | Description |
|---|---|---|
| encoder_model.onnx | 88 MB | Vision encoder |
| decoder_model.onnx | 183 MB | Text decoder |
| decoder_with_past.onnx | 170 MB | Decoder with KV-cache |
| decoder_model_merged.onnx | 728 MB | Merged decoder (for Transformers.js) |

Usage

With ONNX Runtime (Python)

```python
import onnxruntime as ort
from transformers import AutoProcessor
from PIL import Image

# Load processor
processor = AutoProcessor.from_pretrained("bollscoasts/pix2act-weblinx-base-onnx")

# Load the ONNX encoder
session = ort.InferenceSession("encoder_model.onnx")

# Process the screenshot into flattened patches
image = Image.open("screenshot.png")
inputs = processor(images=image, return_tensors="np")

# Run inference (returns the encoder hidden states)
outputs = session.run(None, dict(inputs))
```
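
The snippet above only runs the encoder; generating an action string also requires an autoregressive loop around `decoder_model.onnx`. A minimal greedy-decoding sketch is shown below. `run_decoder` and the default start/eos token ids are placeholders for illustration, not the model's actual configuration — in practice `run_decoder` would wrap `session.run` on the decoder, or on `decoder_with_past.onnx` to reuse the KV-cache.

```python
import numpy as np

def greedy_decode(run_decoder, encoder_hidden, start_id=0, eos_id=1, max_len=64):
    """Greedy loop: repeatedly pick the argmax token until EOS or max_len.

    run_decoder(ids, encoder_hidden) stands in for a session.run call on
    decoder_model.onnx and must return logits of shape (1, len(ids), vocab).
    """
    ids = [start_id]
    for _ in range(max_len):
        logits = run_decoder(np.array([ids], dtype=np.int64), encoder_hidden)
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```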

With Transformers.js (Browser)

```js
import { pipeline } from "@xenova/transformers";

// Load the ONNX model
const pipe = await pipeline("image-to-text", "bollscoasts/pix2act-weblinx-base-onnx");

// Run inference on a screenshot
const result = await pipe(image);
console.log(result);
```

Export Details

Exported using HotelBench ONNX export utilities.

The original model was missing the image processor config, which was restored from google/pix2struct-base to enable ONNX export.

Performance

  • Browser (WebGPU): ~500-1000ms inference time
  • Server (CUDA): ~100-200ms inference time
  • Server (CPU): ~1-2s inference time
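
These numbers depend heavily on hardware and input size, so treat them as rough guides. A simple way to measure latency on your own machine is a warm-up-then-average timer around `session.run`; the helper below is a generic sketch, not part of the export tooling.

```python
import time

def mean_latency_ms(fn, warmup=3, runs=10):
    """Average wall-clock latency of fn() in milliseconds, after warm-up."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb lazy initialization costs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs

# Example: mean_latency_ms(lambda: session.run(None, dict(inputs)))
```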

Citation

```bibtex
@inproceedings{pix2act,
  title={From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces},
  author={Shaw, Peter and Joshi, Mandar and Cohan, James and Berant, Jonathan and Pasupat, Panupong and Hu, Hexiang and Khandelwal, Urvashi and Lee, Kenton and Toutanova, Kristina},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023}
}
```

```bibtex
@inproceedings{weblinx,
  title={WebLINX: Real-World Website Navigation with Multi-Turn Dialogue},
  author={L{\`u}, Xing Han and Kasner, Zden{\v{e}}k and Reddy, Siva},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2024}
}
```

License

Apache 2.0 - See original model for full license details.
