Model Card for Florence-2-Wave-UI

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model was fine-tuned from the Microsoft Florence-2-base-ft model to specialize in User Interface (UI) understanding and element description.

Model Details

Model Description

Florence-2-Wave-UI is a Vision-Language Model (VLM) fine-tuned to accurately understand and describe UI elements within screenshots. Given an image and the bounding box coordinates of a specific UI element, the model generates a descriptive caption and categorizes the UI element type (e.g., button, text input, dropdown).

The model was trained using parameter-efficient fine-tuning (PEFT/LoRA) and the weights have been merged back into the base model for easy deployment.

  • Developed by: Minhnv4
  • Model type: Vision-Language Model (AutoModelForCausalLM)
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: microsoft/Florence-2-base-ft

Uses

Direct Use

The model is primarily intended to be used for:

  • Automated UI testing and QA.
  • Generating accessibility tags/descriptions for UI components.
  • Parsing UI screenshots into structured hierarchical data.
  • Converting bounding box coordinates into semantic descriptions.

Out-of-Scope Use

The model is focused exclusively on digital user interface elements. It is not designed for:

  • General OCR on physical documents or handwriting.
  • Describing natural real-world photos or landscapes.
  • Safely identifying or redacting PII (Personally Identifiable Information) or other sensitive data within screenshots.

Bias, Risks, and Limitations

Due to the nature of the training dataset (agentsea/wave-ui-25k), the model's performance will be biased towards modern web and mobile application UI paradigms. It may struggle with legacy software interfaces, highly customized desktop applications, or non-standard UI controls.

Recommendations

Users should be aware that bounding box coordinates must be normalized relative to the image dimensions (scaled to 0-1000 and clamped to the 0-999 token range) and expressed in the <loc_X> format for the model to interpret them correctly.
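As a worked example of this scheme (the to_loc_token helper below is illustrative, not part of the model's API): an x-coordinate of 100 px in a 1920 px wide image maps to <loc_52>, and a coordinate at the full image extent clamps to <loc_999>.

```python
def to_loc_token(value: float, dimension: int) -> str:
    # Scale the pixel coordinate to a 0-1000 range, then clamp to the
    # valid token range 0-999 used by the <loc_X> vocabulary.
    scaled = min(999, max(0, int((value / dimension) * 1000)))
    return f"<loc_{scaled}>"

print(to_loc_token(100, 1920))   # x = 100 px, width = 1920 px -> <loc_52>
print(to_loc_token(1080, 1080))  # bottom edge clamps -> <loc_999>
```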

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "minhvn4/florence2-wave-ui-lora"

# Load Processor and Model
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to(device)
model.eval()

# Helper: convert a pixel-space bbox into Florence-2 <loc_X> tokens (0-999 scale)
def normalize_bbox(bbox, img_width: int, img_height: int) -> str:
    x1, y1, x2, y2 = bbox
    def _norm(v, dim):
        # Scale to 0-1000 and clamp to the valid token range 0-999
        return min(999, max(0, int((v / dim) * 1000)))
    return (f"<loc_{_norm(x1, img_width)}><loc_{_norm(y1, img_height)}>"
            f"<loc_{_norm(x2, img_width)}><loc_{_norm(y2, img_height)}>")

# Prepare Image & Bounding Box
image = Image.open("your_screenshot.png").convert("RGB")
img_w, img_h = image.size
bbox = [100, 200, 300, 400] # [x1, y1, x2, y2]

# Format prompt
task_prompt = "<UI_DESCRIBE_REGION>"
prefix = f"{task_prompt}{normalize_bbox(bbox, img_w, img_h)}"

# Inference
inputs = processor(text=prefix, images=image, return_tensors="pt").to(device)
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
        repetition_penalty=1.3,
        early_stopping=True
    )
    
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Format cleanup
generated_text = generated_text.replace(prefix, "").replace("</s>", "").replace("<s>", "").strip()

print(generated_text)
# Expected Output format: <caption>Description of element</caption><type>UI_Type</type>
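The tagged output above can be turned into structured data for downstream use (e.g., UI test automation). A minimal parsing sketch; the parse_ui_description helper and the sample string are illustrative assumptions, not part of the model's API:

```python
import re

def parse_ui_description(text: str) -> dict:
    # Pull the <caption>...</caption> and <type>...</type> spans out of the
    # model's raw output; missing tags yield None.
    caption = re.search(r"<caption>(.*?)</caption>", text, re.DOTALL)
    ui_type = re.search(r"<type>(.*?)</type>", text, re.DOTALL)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "type": ui_type.group(1).strip() if ui_type else None,
    }

sample = "<caption>Blue submit button at the bottom of the form</caption><type>button</type>"
print(parse_ui_description(sample))
# {'caption': 'Blue submit button at the bottom of the form', 'type': 'button'}
```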