josley/florence-2-icon-caption

@@ -17,20 +17,35 @@ pipeline_tag: image-to-text
 Fine-tuned Florence-2 model for **UI icon captioning with application context awareness**.
-Based on [OmniParser-v2.0](https://huggingface.co/microsoft/OmniParser-v2.0) icon_caption weights, further fine-tuned on 23k+ icon samples from 101 desktop applications.
 ## Key Features
-- **App-context aware**: Pass the application name to get more accurate, app-specific icon descriptions
 - Custom `<ICON_CAPTION>` task token: `"<ICON_CAPTION> Adobe Photoshop"` → `"Describe the icon in Adobe Photoshop"`
 - Trained on icons from: Figma, Photoshop, VS Code, Slack, Chrome, Excel, and 95+ more apps
 ## Usage
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
 from safetensors.torch import load_file
 from PIL import Image
 # Load processor from Florence-2-base
 processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
@@ -38,18 +53,27 @@ processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_rem
 # Register custom task token
 processor.task_prompts_with_input["<ICON_CAPTION>"] = "Describe the icon in{input}"
-# Load model with fine-tuned weights
-config = AutoConfig.from_pretrained("josley/florence-2-icon-caption", trust_remote_code=True)
 config._attn_implementation = "eager"
 model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-state_dict = load_file("josley/florence-2-icon-caption/model.safetensors")
-model.load_state_dict(state_dict, strict=False)
 model.eval()
 # Inference with app context
 image = Image.open("icon.png").convert("RGB")
 inputs = processor(text="<ICON_CAPTION> Adobe Photoshop", images=image, return_tensors="pt")
-generated = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], max_new_tokens=64)
 caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
 # Output: "brush tool"
 ```
@@ -57,7 +81,12 @@ caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
 ## Training Details
 - **Base weights**: microsoft/OmniParser-v2.0 (icon_caption)
-- **Training data**: 23,009 samples, 101 apps, Claude-annotated
-- **Validation**: 2,557 samples
-- **Best val_loss**: 2.037 (epoch 10)
-- **Config**: batch=16 (8×2), lr=5e-6, fp16, vision_tower frozen

 Fine-tuned Florence-2 model for **UI icon captioning with application context awareness**.
+Based on [OmniParser-v2.0](https://huggingface.co/microsoft/OmniParser-v2.0) icon_caption weights, further fine-tuned on 12k+ icon samples from 101 desktop applications.
 ## Key Features
+- **App-context aware**: Pass the application name to get app-specific icon descriptions
 - Custom `<ICON_CAPTION>` task token: `"<ICON_CAPTION> Adobe Photoshop"` → `"Describe the icon in Adobe Photoshop"`
+- **21% exact match** on the validation set (vs. 0% for the OmniParser baseline); many other predictions are semantically correct without matching exactly
 - Trained on icons from: Figma, Photoshop, VS Code, Slack, Chrome, Excel, and 95+ more apps
+## Performance
+| Model | Val Loss | Exact Match |
+|-------|----------|-------------|
+| OmniParser (baseline) | - | 0% |
+| **This model** | **1.194** | **21%** |
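The card does not say how exact match was computed. As a minimal sketch, assuming predictions and reference labels are compared after lowercasing and collapsing whitespace (both the normalization and the function name `exact_match_rate` are assumptions, not the author's evaluation code):

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference label.

    Assumption: strings are compared after lowercasing and collapsing
    whitespace. The model card does not specify any normalization.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data only, not taken from the validation set.
preds = ["brush tool", "Zoom in", "save", "eraser"]
refs = ["brush tool", "zoom in", "save file", "pencil"]
rate = exact_match_rate(preds, refs)
print(f"{rate:.0%}")
```

A strict metric like this counts "save" against "save file" as a miss, which is one reason an exact-match score can understate how many captions are usable.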
+Training improvements applied:
+- Label standardization (676 synonymous labels normalized)
+- Noise filtering (URLs, overly specific content, and solid-color images removed)
+- Frequency filtering (labels appearing < 3 times removed)
+- Vision tower unfrozen for better small-icon recognition
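The label standardization and frequency filtering steps listed above might look roughly like the following. This is a sketch under stated assumptions: the synonym map is a hypothetical three-entry stand-in for the 676 normalized labels, and `clean_labels` is an illustrative name, not the author's preprocessing code.

```python
from collections import Counter

# Hypothetical synonym map; the real mapping is not included in the card.
SYNONYMS = {
    "close button": "close",
    "x button": "close",
    "magnifier": "search",
}


def clean_labels(samples, synonyms=SYNONYMS, min_count=3):
    """Normalize labels, then drop samples whose label is too rare.

    samples: list of (image_id, label) pairs.
    min_count: labels appearing fewer than this many times are removed.
    """
    normalized = []
    for image_id, label in samples:
        key = label.strip().lower()
        normalized.append((image_id, synonyms.get(key, key)))

    # Count after normalization so merged synonyms pool their frequency.
    counts = Counter(label for _, label in normalized)
    return [(image_id, label) for image_id, label in normalized
            if counts[label] >= min_count]


samples = [
    ("a.png", "Close button"),
    ("b.png", "x button"),
    ("c.png", "close"),
    ("d.png", "magnifier"),
    ("e.png", "search"),
]
cleaned = clean_labels(samples)
print(cleaned)
```

Counting after normalization matters: in the toy data, "close" survives only because its three synonymous spellings are merged before the threshold is applied, while "search" (two occurrences) is dropped.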
 ## Usage
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
 from safetensors.torch import load_file
 from PIL import Image
+import torch
 # Load processor from Florence-2-base
 processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
 # Register custom task token
 processor.task_prompts_with_input["<ICON_CAPTION>"] = "Describe the icon in{input}"
+# Load model structure from OmniParser config
+from huggingface_hub import hf_hub_download
+config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
+from pathlib import Path
+config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
 config._attn_implementation = "eager"
 model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
+# Load fine-tuned weights
+weights_path = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
+model.load_state_dict(load_file(weights_path, device="cpu"), strict=False)
 model.eval()
 # Inference with app context
 image = Image.open("icon.png").convert("RGB")
 inputs = processor(text="<ICON_CAPTION> Adobe Photoshop", images=image, return_tensors="pt")
+generated = model.generate(
+    input_ids=inputs["input_ids"],
+    pixel_values=inputs["pixel_values"],
+    max_new_tokens=20, num_beams=1
+)
 caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
 # Output: "brush tool"
 ```
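The template `"Describe the icon in{input}"` has no space before `{input}`, which can look like a typo. The sketch below illustrates why it is likely intentional, assuming the processor splits the text on the task token and formats the template with whatever follows it, leading space included. `expand_prompt` is a hypothetical stand-in for that behavior, not the processor's actual method.

```python
# Mirrors the task token registered in the usage example.
TASK_PROMPTS_WITH_INPUT = {"<ICON_CAPTION>": "Describe the icon in{input}"}


def expand_prompt(text, templates=TASK_PROMPTS_WITH_INPUT):
    """Replace a leading task token with its prompt template.

    Assumption: everything after the token, including the leading
    space, is substituted for {input}.
    """
    for token, template in templates.items():
        if token in text:
            _, rest = text.split(token, 1)
            return template.format(input=rest)
    return text  # No known task token: pass the text through unchanged.


prompt = expand_prompt("<ICON_CAPTION> Adobe Photoshop")
print(prompt)  # Describe the icon in Adobe Photoshop
```

Under that assumption, the space in `"<ICON_CAPTION> Adobe Photoshop"` is what separates "in" from the app name, so writing the app name without a leading space would produce "Describe the icon inAdobe Photoshop".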
 ## Training Details
 - **Base weights**: microsoft/OmniParser-v2.0 (icon_caption)
+- **Training data**: 10,885 samples from 101 apps, Claude-annotated + cleaned
+- **Validation**: 1,210 samples
+- **Best val_loss**: 1.194 (epoch 8)
+- **Config**: batch=16 (8×2), lr=3e-6, fp16, vision tower unfrozen
+- **Labels**: Standardized with synonym normalization, frequency filtered (≥3 occurrences)
+## Intended Use
+Designed for the [screen-analyze](https://github.com/anthropics/screen-analyze) icon captioning pipeline. It replaces OmniParser's default icon_caption model and produces app-aware descriptions.