# Model Card for Florence-2-Wave-UI
This is the model card of a 🤗 Transformers model that has been pushed to the Hub. The model was fine-tuned from microsoft/Florence-2-base-ft to specialize in user interface (UI) understanding and element description.
## Model Details

### Model Description
Florence-2-Wave-UI is a Vision-Language Model (VLM) fine-tuned to understand and describe UI elements within screenshots. Given an image and the bounding box coordinates of a specific UI element, the model generates a descriptive caption and categorizes the element type (e.g., button, text input, dropdown).
The model was trained with parameter-efficient fine-tuning (PEFT/LoRA), and the adapter weights were merged back into the base model for easy deployment.
- Developed by: Minhnv4
- Model type: Vision-Language Model (AutoModelForCausalLM)
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: microsoft/Florence-2-base-ft
### Model Sources
- Dataset Repository: agentsea/wave-ui-25k
## Uses

### Direct Use
The model is primarily intended to be used for:
- Automated UI testing and QA.
- Generating accessibility tags/descriptions for UI components.
- Parsing UI screenshots into structured hierarchical data.
- Converting bounding box coordinates into semantic descriptions.
### Out-of-Scope Use
The model is focused exclusively on digital user interface elements. It is not designed for:
- General OCR on physical documents or handwriting.
- Describing natural real-world photos or landscapes.
- Safely detecting or redacting PII (Personally Identifiable Information) or other sensitive data within screenshots.
## Bias, Risks, and Limitations
Because the training dataset (agentsea/wave-ui-25k) consists largely of modern web and mobile application screenshots, the model's performance is biased toward those UI paradigms. It may struggle with legacy software interfaces, highly customized desktop applications, or non-standard UI controls.
### Recommendations
Users should be aware that bounding box coordinates must be normalized relative to the image size (scaled by 1000 and clamped to the 0–999 location-token range) and expressed in the `<loc_X>` format for the model to interpret them correctly.
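For illustration, the normalization described above can be sketched as a small helper. The function name is illustrative; the clamping to 999 mirrors Florence-2's 0–999 location-token range:

```python
def to_loc_tokens(bbox, img_width, img_height):
    """Convert a pixel-space [x1, y1, x2, y2] box into <loc_X> tokens."""
    x1, y1, x2, y2 = bbox

    def norm(v, dim):
        # Scale to 1000 bins, then clamp to the valid token range 0-999.
        return min(999, max(0, int((v / dim) * 1000)))

    return (f"<loc_{norm(x1, img_width)}><loc_{norm(y1, img_height)}>"
            f"<loc_{norm(x2, img_width)}><loc_{norm(y2, img_height)}>")

# e.g. a box from (100, 200) to (300, 400) in a 1920x1080 screenshot:
print(to_loc_tokens([100, 200, 300, 400], 1920, 1080))
# -> <loc_52><loc_185><loc_156><loc_370>
```

Note that the same box yields different tokens at different resolutions, so the normalization must always use the dimensions of the image actually passed to the processor.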
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "minhvn4/florence2-wave-ui-lora"

# Load processor and model
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()

# Helper function to normalize bounding boxes into <loc_X> tokens
def normalize_bbox(bbox, img_width: int, img_height: int) -> str:
    x1, y1, x2, y2 = bbox

    def _norm(v, dim):
        return min(999, max(0, int((v / dim) * 1000)))

    return (f"<loc_{_norm(x1, img_width)}><loc_{_norm(y1, img_height)}>"
            f"<loc_{_norm(x2, img_width)}><loc_{_norm(y2, img_height)}>")

# Prepare image & bounding box
image = Image.open("your_screenshot.png").convert("RGB")
img_w, img_h = image.size
bbox = [100, 200, 300, 400]  # [x1, y1, x2, y2] in pixels

# Format prompt
task_prompt = "<UI_DESCRIBE_REGION>"
prefix = f"{task_prompt}{normalize_bbox(bbox, img_w, img_h)}"

# Inference
inputs = processor(text=prefix, images=image, return_tensors="pt").to(device)
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
        repetition_penalty=1.3,
        early_stopping=True,
    )

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Strip the prompt and special tokens from the decoded output
generated_text = generated_text.replace(prefix, "").replace("</s>", "").replace("<s>", "").strip()
print(generated_text)
# Expected output format: <caption>Description of element</caption><type>UI_Type</type>
```
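The raw output can then be parsed into structured fields. A minimal sketch using the tag names from the expected output format above (the helper name is illustrative):

```python
import re

def parse_ui_description(text: str) -> dict:
    """Extract the caption and UI type from the model's raw output."""
    caption = re.search(r"<caption>(.*?)</caption>", text, re.S)
    ui_type = re.search(r"<type>(.*?)</type>", text, re.S)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "type": ui_type.group(1).strip() if ui_type else None,
    }

print(parse_ui_description("<caption>Blue submit button</caption><type>button</type>"))
# -> {'caption': 'Blue submit button', 'type': 'button'}
```

Missing tags yield `None` rather than raising, which is convenient when batching over many regions where generation occasionally deviates from the expected format.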