---
license: mit
base_model: microsoft/Florence-2-base-ft
tags:
- florence2
- icon-captioning
- omniparser
- ui-understanding
- fine-tuned
datasets:
- FortAwesome/Font-Awesome
pipeline_tag: image-to-text
---

# OmniParser Florence-2 Fine-tuned Icon Captioner

Fine-tuned [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft) for UI icon captioning, used as the caption model in [OmniParser v2](https://github.com/microsoft/OmniParser).

This model extends the original OmniParser icon-captioning weights with:
- **1,970 Font Awesome icons** with synonym labels rotated across training steps and diverse augmentations
- **Custom Google Maps icons**: Street View pegman and Street View rotation controls
- **Hard-negative training** to prevent false positives on similar-looking icons

## What changed from the base OmniParser weights

### Training data

- **Font Awesome Free**: 1,970 icons across the solid/regular/brands styles, each with multiple synonym labels from the Font Awesome metadata (e.g., the sleigh icon is trained with the labels "sleigh", "christmas", "sled", "santa", and "reindeer")
- **70/30 weighted sampling**: the primary icon name gets 70% of training steps; alternate synonyms share the remaining 30%
- **Label smoothing** (0.1): prevents overconfidence in any single synonym
- **Screenshot anchors**: 91 icons from real Google Maps screenshots, paired with the original model's own captions to prevent vocabulary drift
- **Hard negatives**: 3 specific crops that were false positives in earlier training rounds
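
The 70/30 weighted sampling above can be sketched as follows. This is a minimal illustration, not the released training code; the function name and the training-loop context are assumptions:

```python
import random

def sample_label(primary, synonyms, primary_weight=0.7, rng=random):
    """Pick the caption label for one training step: the primary icon
    name with probability 0.7, otherwise a uniformly chosen synonym."""
    if not synonyms or rng.random() < primary_weight:
        return primary
    return rng.choice(synonyms)

# Example: the Font Awesome "sleigh" icon and its metadata synonyms.
rng = random.Random(0)
labels = [
    sample_label("sleigh", ["christmas", "sled", "santa", "reindeer"], rng=rng)
    for _ in range(10_000)
]
print(labels.count("sleigh") / len(labels))  # ≈ 0.70
```

Over many steps the primary name dominates, while each synonym still appears often enough to be learned.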

### Augmentations (training-time)

- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, and tint jitter
- Random rescale (downscale then upscale to introduce aliasing artifacts)
- JPEG compression (quality 15–60 for SVG icons, 50–85 for photo crops)

### Architecture

- Vision encoder: **frozen** (90.4M params), preserving the base model's general icon feature extraction
- Language decoder: **trained** (141.0M params), learning the new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8
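
The freeze/train split follows the standard `requires_grad` pattern, shown here on a toy stand-in module (the real model's submodule names, e.g. `vision_tower`, are an assumption and may differ):

```python
import torch
import torch.nn as nn

# Toy stand-in for Florence-2: a "vision" encoder plus a "language" decoder.
model = nn.ModuleDict({
    "vision_tower": nn.Linear(32, 32),    # frozen, like the 90.4M-param encoder
    "language_model": nn.Linear(32, 32),  # trained, like the 141.0M-param decoder
})

# Freeze the vision side; only decoder params reach the optimizer.
for p in model["vision_tower"].parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-6)
print(sum(p.numel() for p in trainable))  # 1056: the language_model weight + bias
```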

## Benchmark: Google Maps screenshot (2570×2002 px, H100)

| Metric | Before | After |
|--------|--------|-------|
| Florence-2 latency | 342 ms | 148 ms |
| Total elements | 316 | 316 |
| Icons sent to Florence | 71 | 71 |

### Key caption improvements

| Icon | Before | After |
|------|--------|-------|
| Street View pegman | "A notification or alert." | **"pegman"** |
| Rotation control (bottom left) | "Refresh or reload." | **"street view rotation"** |
| Rotation control (bottom right) | "A painting or painting tool." | **"street view rotation"** |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |

47 icon captions changed in total. Most changes are minor wording improvements (shorter, more precise captions), and 3 new custom icon types were learned.

## Usage

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop
icon_crop = Image.open("icon_crop.png").convert("RGB")
# Move inputs to the GPU; only floating-point tensors (pixel_values)
# are cast to float16 to match the model weights.
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to(
    "cuda", torch.float16
)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

## Related

- [OmniParser v2](https://github.com/microsoft/OmniParser) – full UI parsing pipeline
- [omniparser-fast](https://github.com/proteus-computer-use/omniparser-fast) – low-latency GPU server that serves this model
- [Florence-2](https://huggingface.co/microsoft/Florence-2-base-ft) – base model