# Florence-2 Icon Caption (Fine-tuned)

Fine-tuned Florence-2 model for UI icon captioning with application-context awareness. Based on the OmniParser-v2.0 `icon_caption` weights, further fine-tuned on 12k+ icon samples from 101 desktop applications.
## Key Features

- App-context aware: pass the application name to get app-specific icon descriptions
- Custom `<ICON_CAPTION>` task token: `"<ICON_CAPTION> Adobe Photoshop"` expands to the prompt `"Describe the icon in Adobe Photoshop"`
- 21% exact match on the validation set (vs. 0% for the OmniParser baseline), with many more predictions that are semantically correct but not exact matches
- Trained on icons from Figma, Photoshop, VS Code, Slack, Chrome, Excel, and 95+ more apps
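The custom task token works through the processor's prompt-template table: the registered template `"Describe the icon in{input}"` is filled with whatever text follows the token. A minimal standalone sketch of that expansion (a simplified stand-in, not the real processor code):

```python
# Sketch of how a registered task token expands into a text prompt.
# The template string matches the one registered in the Usage section;
# the lookup/format logic here is a simplified illustration.
TASK_PROMPTS_WITH_INPUT = {
    "<ICON_CAPTION>": "Describe the icon in{input}",
}

def expand_prompt(text: str) -> str:
    """Replace a leading task token with its template, passing the rest as {input}."""
    for token, template in TASK_PROMPTS_WITH_INPUT.items():
        if text.startswith(token):
            return template.format(input=text[len(token):])
    return text

print(expand_prompt("<ICON_CAPTION> Adobe Photoshop"))
# -> "Describe the icon in Adobe Photoshop"
```

Note that the app name keeps its leading space, so the template needs no space before `{input}`.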
## Performance

| Model | Val Loss | Exact Match |
|---|---|---|
| OmniParser (baseline) | - | 0% |
| This model | 1.194 | 21% |
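Exact match means the generated caption string equals the reference label exactly, which is why the headline number understates quality: near-miss synonyms count as failures. A minimal sketch of the metric; the strip/lowercase normalization is an assumption, not necessarily the evaluation script behind the reported 21%:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after trimming and lowercasing.
    The normalization here is assumed for illustration."""
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

print(exact_match(["brush tool", "save"], ["Brush Tool", "undo"]))  # -> 0.5
```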
Training improvements applied:
- Label standardization (676 synonymous labels normalized)
- Noise filtering (URLs, overly specific content, and solid-color images removed)
- Frequency filtering (labels appearing < 3 times removed)
- Vision tower unfrozen for better small-icon recognition
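The label-cleaning steps above can be sketched as a small pipeline: map synonymous labels onto a canonical form, then drop labels occurring fewer than 3 times. The synonym table and helper below are illustrative, not the actual cleaning code (the real run normalized 676 synonymous labels):

```python
from collections import Counter

# Illustrative synonym table; the actual mapping covered 676 synonymous labels.
SYNONYMS = {"magnifier": "zoom", "magnifying glass": "zoom", "search icon": "search"}
MIN_FREQ = 3  # frequency filter: labels appearing < 3 times are removed

def clean_labels(labels: list[str]) -> list[str]:
    """Standardize labels via the synonym table, then frequency-filter."""
    normalized = [SYNONYMS.get(l.strip().lower(), l.strip().lower()) for l in labels]
    counts = Counter(normalized)
    return [l for l in normalized if counts[l] >= MIN_FREQ]
```

Samples whose labels fall below the frequency threshold would be dropped from the training set, alongside the URL and solid-color noise filtering.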
## Usage

```python
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor

# Load the processor from Florence-2-base
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# Register the custom task token
processor.task_prompts_with_input["<ICON_CAPTION>"] = "Describe the icon in{input}"

# Build the model structure from the OmniParser config
config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
config._attn_implementation = "eager"
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load the fine-tuned weights
weights_path = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
model.load_state_dict(load_file(weights_path, device="cpu"), strict=False)
model.eval()

# Inference with app context
image = Image.open("icon.png").convert("RGB")
inputs = processor(text="<ICON_CAPTION> Adobe Photoshop", images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=20,
        num_beams=1,
    )
caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(caption)  # e.g. "brush tool"
```
## Training Details

- Base weights: microsoft/OmniParser-v2.0 (icon_caption)
- Training data: 10,885 samples from 101 apps, Claude-annotated + cleaned
- Validation: 1,210 samples
- Best val_loss: 1.194 (epoch 8)
- Config: batch=16 (8×2), lr=3e-6, fp16, vision tower unfrozen
- Labels: Standardized with synonym normalization, frequency filtered (≥3 occurrences)
## Intended Use

Designed for the screen-analyze icon-captioning pipeline. It replaces OmniParser's default icon_caption model with one that produces app-aware descriptions.