---
license: apache-2.0
datasets:
- Hcompany/WebClick
base_model:
- google/siglip2-base-patch16-224
language:
- en
pipeline_tag: image-classification
library_name: transformers
tags:
- agentbrowse
- calendars
- humanbrowse
- SigLIP2
---

# **WebClick-AgentBrowse-SigLIP2**

> **WebClick-AgentBrowse-SigLIP2** is a vision-language encoder model fine-tuned from [`google/siglip2-base-patch16-224`](https://huggingface.co/google/siglip2-base-patch16-224) for **multi-class image classification**. It is trained to classify web UI click regions into three classes: `agentbrowse`, `calendars`, and `humanbrowse`. The model uses the `SiglipForImageClassification` architecture.

> [!note]
> **SigLIP 2**: *Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features*
> [https://arxiv.org/pdf/2502.14786](https://arxiv.org/pdf/2502.14786)

> [!note]
> agent-browse / calendars / human-browse

---

```
Classification Report:
              precision    recall  f1-score   support

 agentbrowse     0.9556    0.8763    0.9142       590
   calendars     0.9707    0.9413    0.9558       528
 humanbrowse     0.8481    0.9539    0.8979       521

    accuracy                         0.9219      1639
   macro avg     0.9248    0.9238    0.9226      1639
weighted avg     0.9263    0.9219    0.9224      1639
```
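
As a quick sanity check on the report, the macro and weighted averages can be recomputed from the per-class rows. The sketch below copies the per-class values from the report by hand; the names `report` and `averages` are illustrative and not part of any library. Under these assumptions the recomputed averages agree with the reported ones to four decimal places.

```python
# Per-class (precision, recall, f1, support), copied from the classification report.
report = {
    "agentbrowse": (0.9556, 0.8763, 0.9142, 590),
    "calendars":   (0.9707, 0.9413, 0.9558, 528),
    "humanbrowse": (0.8481, 0.9539, 0.8979, 521),
}

def averages(rows):
    """Return (macro, weighted) averages of precision, recall, and F1."""
    total = sum(support for *_, support in rows.values())
    n = len(rows)
    # Macro average: unweighted mean over classes.
    macro = tuple(sum(r[k] for r in rows.values()) / n for k in range(3))
    # Weighted average: mean over classes weighted by support.
    weighted = tuple(sum(r[k] * r[3] for r in rows.values()) / total for k in range(3))
    return macro, weighted

macro, weighted = averages(report)
print("macro avg   ", [round(v, 4) for v in macro])
print("weighted avg", [round(v, 4) for v in weighted])
```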

---

## Label Space: 3 Classes

```
Class 0: agentbrowse
Class 1: calendars
Class 2: humanbrowse
```

---

## Install Dependencies

```bash
pip install -q transformers torch pillow gradio hf_xet
```

---

## Inference Code

```python
import gradio as gr
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

# Load model and processor
model_name = "prithivMLmods/WebClick-AgentBrowse-SigLIP2"  # Replace with actual HF model repo
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "agentbrowse",
    "1": "calendars",
    "2": "humanbrowse"
}

def classify_image(image):
    # Gradio passes a NumPy array; convert it to an RGB PIL image
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    # Map each class index to its label and rounded probability
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }

    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(num_top_classes=3, label="Click Type Classification"),
    title="WebClick AgentBrowse Classifier",
    description="Upload a web UI screenshot to classify regions: agentbrowse, calendars, or humanbrowse."
)

if __name__ == "__main__":
    iface.launch()
```
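
`classify_image` returns a dictionary that maps each label to a probability. Outside the Gradio demo, for example inside an agent pipeline, a caller usually wants a single decision. The sketch below shows one possible way to reduce that dictionary to a top label with a confidence cutoff. The helper name `pick_label`, the `0.5` threshold, and the example scores are all illustrative assumptions; none of them come from the model or from `transformers`.

```python
def pick_label(prediction, threshold=0.5):
    """Return (label, score) for the highest-scoring class,
    or (None, score) when that score is below `threshold`."""
    label, score = max(prediction.items(), key=lambda item: item[1])
    return (label, score) if score >= threshold else (None, score)

# Made-up prediction in the same shape that classify_image returns
example = {"agentbrowse": 0.12, "calendars": 0.81, "humanbrowse": 0.07}
print(pick_label(example))  # ('calendars', 0.81)
```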

---

## ID2Label Testing

```py
%%capture
!pip install datasets==3.2.0
```

```py
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Hcompany/WebClick")

# Extract the unique "bucket" values (assuming it's a string field)
labels = sorted(set(example["bucket"] for example in dataset["test"]))

# Create id2label mapping
id2label = {str(i): label for i, label in enumerate(labels)}

# Print the mapping
print(id2label)
```

```
{'0': 'agentbrowse', '1': 'calendars', '2': 'humanbrowse'}
```
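
Model configurations in `transformers` conventionally carry the inverse mapping, `label2id`, alongside `id2label`. A minimal sketch of deriving it from the printed mapping follows. It assumes integer class indices are wanted on the `label2id` side, and the mapping is hard-coded here so the snippet runs without downloading the dataset.

```python
# Mapping as printed above (hard-coded for a self-contained example)
id2label = {"0": "agentbrowse", "1": "calendars", "2": "humanbrowse"}

# Invert the mapping: label string -> integer class index
label2id = {label: int(idx) for idx, label in id2label.items()}

# Round-trip check: every index maps to a label and back to the same index
assert all(label2id[id2label[k]] == int(k) for k in id2label)

print(label2id)  # {'agentbrowse': 0, 'calendars': 1, 'humanbrowse': 2}
```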

---

## Intended Use

**WebClick-AgentBrowse-SigLIP2** is intended for:

* **UI Understanding** – Classify user interaction zones in web interface screenshots.
* **Multimodal Agents** – Enhance visual perception for agent planning or RPA systems.
* **Interface Automation** – Facilitate click zone detection for automated agents.
* **Web Analytics** – Analyze user behavior patterns based on layout interaction predictions.