Florence-2 Icon Caption (Fine-tuned)

Fine-tuned Florence-2 model for UI icon captioning with application-context awareness.

Based on OmniParser-v2.0 icon_caption weights, further fine-tuned on 12k+ icon samples from 101 desktop applications.

Key Features

  • App-context aware: Pass the application name to get app-specific icon descriptions
  • Custom <ICON_CAPTION> task token: "<ICON_CAPTION> Adobe Photoshop" → "Describe the icon in Adobe Photoshop"
  • 21% exact match on validation set (vs 0% for OmniParser baseline), with many more semantically correct predictions
  • Trained on icons from: Figma, Photoshop, VS Code, Slack, Chrome, Excel, and 95+ more apps
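The custom task token works like Florence-2's built-in tasks: the token plus an input string expands into a natural-language prompt before tokenization. A minimal sketch of that expansion, using the mapping registered in the Usage section below (the helper name `build_prompt` is illustrative, not part of the model's API):

```python
# Stand-in for the processor's task-prompt table; the <ICON_CAPTION> entry
# matches the one registered in the Usage section.
TASK_PROMPTS_WITH_INPUT = {
    "<ICON_CAPTION>": "Describe the icon in{input}",
}

def build_prompt(text: str) -> str:
    """Expand a known task token into its natural-language prompt."""
    for token, template in TASK_PROMPTS_WITH_INPUT.items():
        if text.startswith(token):
            # Everything after the token (e.g. " Adobe Photoshop") is the input.
            return template.format(input=text[len(token):])
    return text  # no task token: pass the text through unchanged

print(build_prompt("<ICON_CAPTION> Adobe Photoshop"))
# "Describe the icon in Adobe Photoshop"
```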

Performance

Model                   Val Loss   Exact Match
OmniParser (baseline)   -          0%
This model              1.194      21%
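Exact match counts predictions that reproduce the reference label verbatim. The normalization used during evaluation is not documented here, so the lowercasing and whitespace stripping below are assumptions; the metric itself is a standard sketch:

```python
def exact_match(predictions, labels):
    """Fraction of predictions identical to their labels after light
    normalization (lowercase + strip; assumed, not documented)."""
    assert len(predictions) == len(labels)
    hits = sum(
        p.strip().lower() == l.strip().lower()
        for p, l in zip(predictions, labels)
    )
    return hits / len(labels)

print(exact_match(["Brush Tool", "zoom"], ["brush tool", "pan"]))  # 0.5
```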

Training improvements applied:

  • Label standardization (676 synonymous labels normalized)
  • Noise filtering (URLs, overly specific content, and solid-color images removed)
  • Frequency filtering (labels appearing < 3 times removed)
  • Vision tower unfrozen for better small-icon recognition
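The label-cleanup steps above can be sketched as a small pipeline. The actual 676-entry synonym table is not published, so the mapping below is a stand-in; the minimum frequency of 3 follows the description:

```python
from collections import Counter

# Illustrative synonym table; the real one normalizes 676 labels.
SYNONYMS = {"magnifier": "zoom", "magnifying glass": "zoom"}

def clean_labels(labels, min_count=3):
    # 1) Standardize: map synonymous labels to one canonical form.
    normalized = [SYNONYMS.get(l.strip().lower(), l.strip().lower()) for l in labels]
    # 2) Frequency filter: drop labels appearing fewer than `min_count` times.
    counts = Counter(normalized)
    return [l for l in normalized if counts[l] >= min_count]

labels = ["zoom", "magnifier", "zoom", "brush tool"]
print(clean_labels(labels))  # ['zoom', 'zoom', 'zoom']
```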

Usage

from pathlib import Path

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor

# Load processor from Florence-2-base
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# Register custom task token
processor.task_prompts_with_input["<ICON_CAPTION>"] = "Describe the icon in{input}"

# Load model structure from OmniParser config
config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
config._attn_implementation = "eager"
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load fine-tuned weights
weights_path = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
model.load_state_dict(load_file(weights_path, device="cpu"), strict=False)
model.eval()

# Inference with app context
image = Image.open("icon.png").convert("RGB")
inputs = processor(text="<ICON_CAPTION> Adobe Photoshop", images=image, return_tensors="pt")
generated = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20, num_beams=1
)
caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
# Output: "brush tool"

Training Details

  • Base weights: microsoft/OmniParser-v2.0 (icon_caption)
  • Training data: 10,885 samples from 101 apps, Claude-annotated + cleaned
  • Validation: 1,210 samples
  • Best val_loss: 1.194 (epoch 8)
  • Config: batch=16 (8×2), lr=3e-6, fp16, vision tower unfrozen
  • Labels: Standardized with synonym normalization, frequency filtered (≥3 occurrences)
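Unfreezing the vision tower (last bullet) means re-enabling gradients for the image encoder instead of training only the language head. A sketch with a toy stand-in for the real model; `vision_tower` is the submodule name in the Florence-2 architecture, and the toy layers are placeholders:

```python
import torch
from torch import nn

class ToyFlorence(nn.Module):
    """Placeholder with the same top-level submodule names as Florence-2."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(4, 4)
        self.language_model = nn.Linear(4, 4)

def set_vision_tower_trainable(model, trainable=True):
    # Toggle gradients only for the image encoder's parameters.
    for param in model.vision_tower.parameters():
        param.requires_grad = trainable

model = ToyFlorence()
set_vision_tower_trainable(model, trainable=False)  # common default: frozen
set_vision_tower_trainable(model, trainable=True)   # unfrozen, as in this run
print(all(p.requires_grad for p in model.vision_tower.parameters()))  # True
```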

Intended Use

Designed for a screen-analysis icon-captioning pipeline. Replaces OmniParser's default icon_caption model to provide app-aware icon descriptions.
