---
license: mit
base_model: microsoft/Florence-2-base-ft
tags:
- florence2
- icon-captioning
- omniparser
- ui-understanding
- fine-tuned
datasets:
- FortAwesome/Font-Awesome
pipeline_tag: image-to-text
---
# OmniParser Florence-2 Fine-tuned Icon Captioner
Fine-tuned [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft) for UI icon captioning, used as the caption model in [OmniParser v2](https://github.com/microsoft/OmniParser).
This model extends the original OmniParser icon captioning weights with:
- **1,970 Font Awesome icons**, each trained under multiple rotating synonym labels with diverse augmentations
- **Custom Google Maps icons**: Street View pegman, Street View rotation controls
- **Hard negative training** to prevent false positives on similar-looking icons
## What changed from the base OmniParser weights
### Training data
- **Font Awesome Free**: 1,970 icons across solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., sleigh icon trained with labels "sleigh", "christmas", "sled", "santa", "reindeer")
- **70/30 weighted sampling**: the primary icon name receives 70% of training steps; the alternate synonyms share the remaining 30%
- **Label smoothing** (0.1): Prevents overconfidence on any single synonym
- **Screenshot anchors**: 91 icons from real Google Maps screenshots with the original model's own captions (prevents vocabulary drift)
- **Hard negatives**: 3 specific crops that were false positives in earlier training rounds
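The 70/30 synonym sampling above can be sketched as follows (the `random.choices` weighting scheme and the helper name are illustrative assumptions, not the exact training code):

```python
import random

def sample_label(primary, synonyms, primary_weight=0.7):
    """Pick a caption label for one training step: the primary icon name
    with probability 0.7, otherwise one of the alternate synonyms."""
    if not synonyms:
        return primary
    alt_weight = (1.0 - primary_weight) / len(synonyms)
    labels = [primary] + synonyms
    weights = [primary_weight] + [alt_weight] * len(synonyms)
    return random.choices(labels, weights=weights, k=1)[0]

# Example: the sleigh icon with its FA metadata synonyms
random.seed(0)
counts = {}
for _ in range(10_000):
    lbl = sample_label("sleigh", ["christmas", "sled", "santa", "reindeer"])
    counts[lbl] = counts.get(lbl, 0) + 1
```

Over many steps the primary name dominates while every synonym still appears, which is what the label smoothing then keeps from collapsing onto a single caption.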
### Augmentations (training-time)
- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, tint
- Random rescale (downscale then upscale for aliasing artifacts)
- JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)
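Two of these augmentations (random rescale for aliasing, and JPEG round-trip compression) can be sketched like this, assuming Pillow is available; the quality range follows the SVG-icon values above:

```python
import io
import random

from PIL import Image

def random_rescale(img, min_scale=0.4):
    """Downscale then upscale back to the original size to introduce aliasing artifacts."""
    scale = random.uniform(min_scale, 0.9)
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def jpeg_roundtrip(img, quality_range=(15, 60)):
    """Re-encode through JPEG at a random low quality (15-60 for SVG-derived icons)."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(*quality_range))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

random.seed(0)
icon = Image.new("RGB", (64, 64), "white")
aug = jpeg_roundtrip(random_rescale(icon))
```

Both transforms preserve the crop's output size, so augmented icons feed into training unchanged in shape.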
### Architecture
- Vision encoder: **frozen** (90.4M params) β€” preserves general icon feature extraction
- Language decoder: **trained** (141.0M params) β€” learns new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8
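The freeze/train split follows a standard pattern: disable gradients on every vision-encoder parameter and leave the decoder trainable. Sketched here with a toy parameter list standing in for `model.named_parameters()` (the `vision_tower`/`language_model` prefixes are assumptions about Florence-2's module names):

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Stand-in for one entry from model.named_parameters()."""
    name: str
    numel: int
    requires_grad: bool = True

def freeze_by_prefix(params, frozen_prefix="vision_tower"):
    """Freeze parameters under the vision encoder; everything else stays trainable."""
    for p in params:
        if p.name.startswith(frozen_prefix):
            p.requires_grad = False
    return params

params = freeze_by_prefix([
    Param("vision_tower.blocks.0.attn.weight", 90_400_000),       # encoder: frozen
    Param("language_model.decoder.layers.0.weight", 141_000_000), # decoder: trained
])
trainable = sum(p.numel for p in params if p.requires_grad)
```

With the real model, the same loop runs over `model.named_parameters()` before constructing the AdamW optimizer (LR 2e-6, batch size 8) so that only decoder parameters receive updates.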
## Benchmark: Google Maps screenshot (2570x2002, H100)
| Metric | Before | After |
|--------|--------|-------|
| Florence-2 latency | 342ms | 148ms |
| Total elements | 316 | 316 |
| Icons to Florence | 71 | 71 |
### Key caption improvements
| Icon | Before | After |
|------|--------|-------|
| Street View pegman | "A notification or alert." | **"pegman"** |
| Rotation control (bottom-left) | "Refresh or reload." | **"street view rotation"** |
| Rotation control (bottom-right) | "A painting or painting tool." | **"street view rotation"** |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |
In total, 47 icon captions changed. Most changes are minor wording improvements (shorter, more precise), and 3 new custom icon types were learned.
## Usage
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop
icon_crop = Image.open("icon.png").convert("RGB")
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
## Related
- [OmniParser v2](https://github.com/microsoft/OmniParser) β€” full UI parsing pipeline
- [omniparser-fast](https://github.com/proteus-computer-use/omniparser-fast) β€” low-latency GPU server with this model
- [Florence-2](https://huggingface.co/microsoft/Florence-2-base-ft) β€” base model