---
license: mit
base_model: microsoft/Florence-2-base-ft
tags:
- florence2
- icon-captioning
- omniparser
- ui-understanding
- fine-tuned
datasets:
- FortAwesome/Font-Awesome
pipeline_tag: image-to-text
---
# OmniParser Florence-2 Fine-tuned Icon Captioner
Fine-tuned [Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft) for UI icon captioning, used as the caption model in [OmniParser v2](https://github.com/microsoft/OmniParser).
This model extends the original OmniParser icon captioning weights with:
- **1,970 Font Awesome icons**, each trained under multiple rotating synonym labels with diverse augmentations
- **Custom Google Maps icons**: Street View pegman, Street View rotation controls
- **Hard negative training** to prevent false positives on similar-looking icons
## What changed from the base OmniParser weights
### Training data
- **Font Awesome Free**: 1,970 icons across solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., sleigh icon trained with labels "sleigh", "christmas", "sled", "santa", "reindeer")
- **70/30 weighted sampling**: the primary icon name receives 70% of training steps; the alternate synonyms share the remaining 30%
- **Label smoothing** (0.1): Prevents overconfidence on any single synonym
- **Screenshot anchors**: 91 icons from real Google Maps screenshots with the original model's own captions (prevents vocabulary drift)
- **Hard negatives**: 3 specific crops that were false positives in earlier training rounds
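The 70/30 synonym sampling above can be sketched as follows (the `random.choices` weighting scheme and the helper name are illustrative assumptions, not the exact training code):

```python
import random

def sample_label(primary, synonyms, primary_weight=0.7):
    """Pick a caption label for one training step: the primary icon name
    with probability 0.7, otherwise one of the alternate synonyms."""
    if not synonyms:
        return primary
    alt_weight = (1.0 - primary_weight) / len(synonyms)
    labels = [primary] + synonyms
    weights = [primary_weight] + [alt_weight] * len(synonyms)
    return random.choices(labels, weights=weights, k=1)[0]

# Example: the sleigh icon with its FA metadata synonyms
random.seed(0)
counts = {}
for _ in range(10_000):
    lbl = sample_label("sleigh", ["christmas", "sled", "santa", "reindeer"])
    counts[lbl] = counts.get(lbl, 0) + 1
```

Over many steps the primary name dominates while every synonym still appears, which is what the label smoothing then keeps from collapsing onto a single caption.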
### Augmentations (training-time)
- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, tint
- Random rescale (downscale then upscale for aliasing artifacts)
- JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)
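Two of these augmentations (random rescale for aliasing, and JPEG round-trip compression) can be sketched like this, assuming Pillow is available; the quality range follows the SVG-icon values above:

```python
import io
import random

from PIL import Image

def random_rescale(img, min_scale=0.4):
    """Downscale then upscale back to the original size to introduce aliasing artifacts."""
    scale = random.uniform(min_scale, 0.9)
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def jpeg_roundtrip(img, quality_range=(15, 60)):
    """Re-encode through JPEG at a random low quality (15-60 for SVG-derived icons)."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(*quality_range))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

random.seed(0)
icon = Image.new("RGB", (64, 64), "white")
aug = jpeg_roundtrip(random_rescale(icon))
```

Both transforms preserve the crop's output size, so augmented icons feed into training unchanged in shape.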
### Architecture
- Vision encoder: **frozen** (90.4M params) β€” preserves general icon feature extraction
- Language decoder: **trained** (141.0M params) β€” learns new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8
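The freeze/train split follows a standard pattern: disable gradients on every vision-encoder parameter and leave the decoder trainable. Sketched here with a toy parameter list standing in for `model.named_parameters()` (the `vision_tower`/`language_model` prefixes are assumptions about Florence-2's module names):

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Stand-in for one entry from model.named_parameters()."""
    name: str
    numel: int
    requires_grad: bool = True

def freeze_by_prefix(params, frozen_prefix="vision_tower"):
    """Freeze parameters under the vision encoder; everything else stays trainable."""
    for p in params:
        if p.name.startswith(frozen_prefix):
            p.requires_grad = False
    return params

params = freeze_by_prefix([
    Param("vision_tower.blocks.0.attn.weight", 90_400_000),       # encoder: frozen
    Param("language_model.decoder.layers.0.weight", 141_000_000), # decoder: trained
])
trainable = sum(p.numel for p in params if p.requires_grad)
```

With the real model, the same loop runs over `model.named_parameters()` before constructing the AdamW optimizer (LR 2e-6, batch size 8) so that only decoder parameters receive updates.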
## Benchmark: Google Maps screenshot (2570x2002, H100)
| Metric | Before | After |
|--------|--------|-------|
| Florence-2 latency | 342ms | 148ms |
| Total elements | 316 | 316 |
| Icons to Florence | 71 | 71 |
### Key caption improvements
| Icon | Before | After |
|------|--------|-------|
| Street View pegman | "A notification or alert." | **"pegman"** |
| Rotation control (bottom-left) | "Refresh or reload." | **"street view rotation"** |
| Rotation control (bottom-right) | "A painting or painting tool." | **"street view rotation"** |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |
In total, 47 icon captions changed. Most changes are minor wording improvements (shorter, more precise), and 3 new custom icon types were learned.
## Usage
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop
icon_crop = Image.open("icon.png").convert("RGB")
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
## Related
- [OmniParser v2](https://github.com/microsoft/OmniParser) β€” full UI parsing pipeline
- [omniparser-fast](https://github.com/proteus-computer-use/omniparser-fast) β€” low-latency GPU server with this model
- [Florence-2](https://huggingface.co/microsoft/Florence-2-base-ft) β€” base model