Florence-2 Icon Captioning

minhvn4/florence2-icon is a fine-tuned version of Microsoft's Florence-2-base, tailored for icon captioning. Given an image of a UI icon, symbol, or pictogram, it generates a descriptive caption.

Because the model relies on custom code from the original Florence-2 implementation, you must pass trust_remote_code=True when loading both the processor and the model.

Model Details

Architecture: Florence-2 (AutoModelForCausalLM)
Base Model: microsoft/Florence-2-base
Task: Image to Text (Icon Captioning)
License: MIT
Format: Safetensors

Usage

The following Python code loads the model and captions a single icon. Install the required dependencies first: transformers, torch, Pillow, einops, and timm.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Set model ID
model_id = "minhvn4/florence2-icon"

# Load the processor and model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

# Move model to target device
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def generate_caption(image_path, prompt="<CAPTION>"):
    image = Image.open(image_path).convert("RGB")
    
    # Process inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    
    # Generate text
    with torch.inference_mode():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=20,
            num_beams=1,
            do_sample=False
        )
        
    # Decode output
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return caption.strip()

# Run inference on an icon
image_path = "path/to/your/icon.png"
caption = generate_caption(image_path)
print(f"Generated caption: {caption}")
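
Many icons ship as PNG files with a transparent background. In generate_caption, convert("RGB") simply drops the alpha channel, so fully transparent pixels take whatever color is stored underneath (often black), and a dark glyph can disappear into the background. The helper below is a minimal sketch of one way to handle this by compositing the icon onto a solid color first. The name flatten_icon and the white default background are assumptions for illustration; the model card does not state what background the training icons used.

```python
from PIL import Image


def flatten_icon(image, background=(255, 255, 255)):
    """Composite an icon onto a solid background and return an RGB image.

    `background` is an (R, G, B) tuple. White is an assumed default,
    not a documented requirement of the model.
    """
    rgba = image.convert("RGBA")
    # Build an opaque canvas of the same size, then blend the icon over it
    # using the icon's own alpha channel.
    canvas = Image.new("RGBA", rgba.size, tuple(background) + (255,))
    return Image.alpha_composite(canvas, rgba).convert("RGB")
```

To try it, replace the Image.open(image_path).convert("RGB") line in generate_caption with flatten_icon(Image.open(image_path)) and compare the captions on a few transparent icons.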

Intended Use

This model is intended for generating descriptive text for single UI icons. It can be integrated into UI parsing tools (such as OmniParser), accessibility tools, or web and mobile development workflows to provide text descriptions for graphical elements automatically.
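
As a rough illustration of the accessibility and development workflows mentioned above, the sketch below captions every icon in a folder and returns a filename-to-caption mapping that could feed alt text or a design-system inventory. The function name caption_directory, the ICON_SUFFIXES set, and the flat (non-recursive) folder layout are all assumptions; the captioning function is passed in as a parameter so the helper does not depend on how the model is loaded.

```python
from pathlib import Path

# Hypothetical set of file extensions to treat as icons.
ICON_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def caption_directory(icon_dir, caption_fn):
    """Return {filename: caption} for each icon file directly inside icon_dir.

    `caption_fn` is any callable that takes a file path string and returns
    a caption string, for example generate_caption from the Usage section.
    """
    captions = {}
    for path in sorted(Path(icon_dir).iterdir()):
        # Skip subdirectories and files that do not look like images.
        if path.is_file() and path.suffix.lower() in ICON_SUFFIXES:
            captions[path.name] = caption_fn(str(path))
    return captions
```

With the model loaded as in the Usage section, a call would look like caption_directory("path/to/icons", generate_caption).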

Troubleshooting

If you encounter an error like ValueError: The model class you are passing is not supported, make sure you are passing trust_remote_code=True to both AutoProcessor.from_pretrained and AutoModelForCausalLM.from_pretrained. You may also need to install einops and timm, which the Florence-2 architecture requires.
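
Before debugging further, it can help to confirm that every dependency is importable in the active environment. The snippet below is a small sketch using only the standard library. The names REQUIRED_MODULES and missing_packages are made up for this example, and the mapping reflects the dependency list from the Usage section (note that Pillow is imported as PIL).

```python
import importlib.util

# Maps each pip package name to the module name it is imported under.
REQUIRED_MODULES = {
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
    "einops": "einops",
    "timm": "timm",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the sorted package names whose modules cannot be found."""
    return sorted(
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the modules can be located; it does not verify that the installed versions are compatible with the Florence-2 remote code.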

Model size: 0.2B params
Tensor type: F32
