josley committed · Commit f428491 · verified · 1 Parent(s): 62ec05c

Upload README.md with huggingface_hub

Files changed (1): README.md (+40 −11)
README.md CHANGED
@@ -17,20 +17,35 @@ pipeline_tag: image-to-text
 
 Fine-tuned Florence-2 model for **UI icon captioning with application context awareness**.
 
-Based on [OmniParser-v2.0](https://huggingface.co/microsoft/OmniParser-v2.0) icon_caption weights, further fine-tuned on 23k+ icon samples from 101 desktop applications.
+Based on [OmniParser-v2.0](https://huggingface.co/microsoft/OmniParser-v2.0) icon_caption weights, further fine-tuned on 12k+ icon samples from 101 desktop applications.
 
 ## Key Features
 
-- **App-context aware**: Pass the application name to get more accurate, app-specific icon descriptions
+- **App-context aware**: Pass the application name to get app-specific icon descriptions
 - Custom `<ICON_CAPTION>` task token: `"<ICON_CAPTION> Adobe Photoshop"` → `"Describe the icon in Adobe Photoshop"`
+- **21% exact match** on the validation set (vs. 0% for the OmniParser baseline), with many more semantically correct predictions
 - Trained on icons from: Figma, Photoshop, VS Code, Slack, Chrome, Excel, and 95+ more apps
 
+## Performance
+
+| Model | Val Loss | Exact Match |
+|-------|----------|-------------|
+| OmniParser (baseline) | - | 0% |
+| **This model** | **1.194** | **21%** |
+
+Training improvements applied:
+- Label standardization (676 synonymous labels normalized)
+- Noise filtering (URLs, overly specific content, and solid-color images removed)
+- Frequency filtering (labels appearing < 3 times removed)
+- Vision tower unfrozen for better small-icon recognition
+
 ## Usage
 
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
 from safetensors.torch import load_file
 from PIL import Image
+import torch
 
 # Load processor from Florence-2-base
 processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
@@ -38,18 +53,27 @@ processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_rem
 # Register custom task token
 processor.task_prompts_with_input["<ICON_CAPTION>"] = "Describe the icon in{input}"
 
-# Load model with fine-tuned weights
-config = AutoConfig.from_pretrained("josley/florence-2-icon-caption", trust_remote_code=True)
+# Load model structure from the OmniParser config
+from huggingface_hub import hf_hub_download
+config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
+from pathlib import Path
+config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
 config._attn_implementation = "eager"
 model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-state_dict = load_file("josley/florence-2-icon-caption/model.safetensors")
-model.load_state_dict(state_dict, strict=False)
+
+# Load fine-tuned weights
+weights_path = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
+model.load_state_dict(load_file(weights_path, device="cpu"), strict=False)
 model.eval()
 
 # Inference with app context
 image = Image.open("icon.png").convert("RGB")
 inputs = processor(text="<ICON_CAPTION> Adobe Photoshop", images=image, return_tensors="pt")
-generated = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], max_new_tokens=64)
+generated = model.generate(
+    input_ids=inputs["input_ids"],
+    pixel_values=inputs["pixel_values"],
+    max_new_tokens=20, num_beams=1
+)
 caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
 # Output: "brush tool"
 ```
@@ -57,7 +81,12 @@ caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
 ## Training Details
 
 - **Base weights**: microsoft/OmniParser-v2.0 (icon_caption)
-- **Training data**: 23,009 samples, 101 apps, Claude-annotated
-- **Validation**: 2,557 samples
-- **Best val_loss**: 2.037 (epoch 10)
-- **Config**: batch=16 (8×2), lr=5e-6, fp16, vision_tower frozen
+- **Training data**: 10,885 samples from 101 apps, Claude-annotated and cleaned
+- **Validation**: 1,210 samples
+- **Best val_loss**: 1.194 (epoch 8)
+- **Config**: batch=16 (8×2), lr=3e-6, fp16, vision tower unfrozen
+- **Labels**: Standardized with synonym normalization, frequency filtered (≥3 occurrences)
+
+## Intended Use
+
+Designed for the [screen-analyze](https://github.com/anthropics/screen-analyze) icon captioning pipeline. Replaces OmniParser's default icon_caption model with app-aware descriptions.
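
The `<ICON_CAPTION>` task-token mapping registered in the usage example can be illustrated with a small standalone sketch. This mimics the processor's prompt expansion and is not the actual Florence-2 processor code; `construct_prompt` is a hypothetical helper:

```python
# Hypothetical sketch of how a registered task token expands into a full
# text prompt, mirroring the mapping registered in the README's usage code.
task_prompts_with_input = {"<ICON_CAPTION>": "Describe the icon in{input}"}

def construct_prompt(text: str) -> str:
    """Expand a task token into its prompt template.

    Everything after the token (including its leading space) fills {input},
    which is why the template has no space before {input}.
    """
    for token, template in task_prompts_with_input.items():
        if text.startswith(token):
            return template.format(input=text[len(token):])
    return text  # plain-text prompts pass through unchanged

print(construct_prompt("<ICON_CAPTION> Adobe Photoshop"))
# → Describe the icon in Adobe Photoshop
```

This is also why the registered template reads `"Describe the icon in{input}"` with no space: the app name arrives with its own leading space from the input text.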
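
The frequency-filtering step listed under training improvements can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code; the `(image, label)` sample shape and `filter_rare_labels` helper are assumptions:

```python
from collections import Counter

def filter_rare_labels(samples, min_count=3):
    """Drop samples whose caption label appears fewer than min_count times."""
    counts = Counter(label for _, label in samples)
    return [(img, label) for img, label in samples if counts[label] >= min_count]

# Toy dataset: "obscure widget" occurs once (< 3), so it is removed.
samples = [("a.png", "brush tool"), ("b.png", "brush tool"),
           ("c.png", "brush tool"), ("d.png", "obscure widget")]
kept = filter_rare_labels(samples)
print(len(kept))  # → 3
```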
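
The exact-match figure in the performance table can be computed along these lines. This is a plain sketch; the normalization (lowercasing, whitespace collapsing) is an assumption, not the project's documented evaluation code:

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions matching the reference caption exactly,
    after lowercasing and collapsing whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Toy example: 2 of 4 captions match exactly.
preds = ["brush tool", "eraser", "zoom in", "layer panel"]
refs  = ["brush tool", "eraser tool", "Zoom In", "layers panel"]
print(exact_match_rate(preds, refs))  # → 0.5
```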