---
license: mit
base_model: microsoft/Florence-2-base
datasets:
  - custom
tags:
  - florence-2
  - icon-caption
  - ui-detection
  - omniparser
language:
  - en
pipeline_tag: image-to-text
---

# Florence-2 Icon Caption (Fine-tuned)

A fine-tuned Florence-2 model for UI icon recognition in desktop applications.

Based on the OmniParser-v2.0 `icon_caption` weights, further fine-tuned on 12.8k curated icon samples from 163 desktop applications (WeChat, Photoshop, VS Code, Figma, and more).

## Key Features

- **Functional icon recognition**: trained only on learnable UI elements (buttons, tools, nav icons), excluding avatars, thumbnails, and decorative elements
- **Clean, standardized labels**: 2-5 word functional descriptions such as `search button`, `settings gear`, `chats nav icon`
- **163-app coverage**: Adobe suite, Microsoft Office, WeChat, Slack, Chrome, and 150+ more
- **Chinese app support**: WeChat, DingTalk, Feishu, QQ, Bilibili, etc.

## Performance

| Model | Val Loss | Exact Match | Output Quality |
|---|---|---|---|
| OmniParser (baseline) | - | 0% | Verbose, generic ("a loading or buffering indicator") |
| This model | 1.329 | 18.8% | Concise, functional ("settings gear", "search button") |
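Exact match here presumably means the generated caption equals the reference label string-for-string. A minimal sketch of such a metric (the lowercase/whitespace normalization is an assumption; the actual evaluation script may differ):

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference label,
    after lowercasing and collapsing whitespace (assumed normalization)."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Settings gear", "a loading indicator", "search button"]
refs = ["settings gear", "refresh button", "search button"]
print(exact_match(preds, refs))  # 2 of 3 match -> 0.666...
```

Under this metric, the baseline's verbose outputs ("a loading or buffering indicator") essentially never match the concise reference labels, which explains its 0% score.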

## Training Pipeline

1. **YOLO detection** → crop icons from 750+ screenshots across 163 apps
2. **Claude annotation** → send the original screenshot plus an icon grid to Claude for context-aware labeling
3. **Smart filtering** → skip avatars, thumbnails, video frames, and line numbers (unlearnable elements)
4. **Label standardization** → normalize synonyms (`close window button` → `close button`)
5. **Frequency filtering** → remove labels appearing fewer than 3 times
6. **Full-parameter training** → vision tower unfrozen, lr=3e-6, 15 epochs
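Steps 4 and 5 above can be sketched as follows (the synonym map and the `(image, label)` sample shape are illustrative assumptions; the actual mapping table is much larger):

```python
from collections import Counter

# Hypothetical synonym map -- the real standardization table is an assumption
SYNONYMS = {
    "close window button": "close button",
    "magnifier icon": "search button",
}
MIN_FREQ = 3  # labels seen fewer than 3 times are dropped

def standardize(label: str) -> str:
    """Lowercase, collapse whitespace, and map known synonyms to a canonical label."""
    label = " ".join(label.lower().split())
    return SYNONYMS.get(label, label)

def filter_by_frequency(samples):
    """samples: list of (image_path, label) pairs.
    Standardize labels first, then keep only labels meeting the frequency floor."""
    normalized = [(img, standardize(lbl)) for img, lbl in samples]
    counts = Counter(lbl for _, lbl in normalized)
    return [(img, lbl) for img, lbl in normalized if counts[lbl] >= MIN_FREQ]
```

Standardizing before counting matters: three occurrences split across synonyms ("close button", "close window button") would otherwise each fall below the frequency floor and be lost.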

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from pathlib import Path
from PIL import Image
import torch

# Load processor
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# Load model structure from OmniParser config
config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
config._attn_implementation = "eager"
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load fine-tuned weights
weights = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
model.load_state_dict(load_file(weights, device="cpu"), strict=False)
model = model.to("cuda", dtype=torch.float16).eval()

# Inference
image = Image.open("icon.png").convert("RGB")
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)
gen = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
                     max_new_tokens=20, num_beams=1, use_cache=False)
print(processor.batch_decode(gen, skip_special_tokens=True)[0])
# Output: "search button"
```

## Training Details

- **Base weights**: microsoft/OmniParser-v2.0 (`icon_caption`)
- **Training data**: 12,789 curated samples from 163 apps
- **Validation**: 1,422 samples
- **Best epoch**: 9 (val_loss = 1.329)
- **Config**: batch size 16 (8 × 2 grad accumulation), lr=3e-6, fp16, vision tower unfrozen, 231M params
- **Annotation**: Claude-powered, with original-screenshot context and a smart-filtering pipeline
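The effective batch size of 16 comes from a per-step batch of 8 with 2 gradient-accumulation steps. A minimal sketch of that pattern (the model, loader, and optimizer here are placeholders, not the actual training code):

```python
def train_epoch(model, loader, optimizer, grad_accum=2):
    """Full-parameter training with gradient accumulation:
    per-device batch 8 x grad_accum 2 = effective batch 16."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        # Scale the loss so accumulated gradients average over the
        # effective batch rather than summing across micro-batches.
        loss = model(**batch).loss / grad_accum
        loss.backward()
        if (step + 1) % grad_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Leaving the vision tower unfrozen means all 231M parameters receive gradients, which is why the low learning rate (3e-6) is used to avoid destroying the pretrained visual features.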