---
language:
- en
base_model:
- Salesforce/blip-image-captioning-base
pipeline_tag: image-to-text
tags:
- blip
- icon-description
- image-captioning
license: mit
library_name: transformers
---

# BLIP – UI Elements Captioning

This model is a fine-tuned version of [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base), adapted for **captioning UI elements** from macOS application screenshots.

It is part of the **Screen2AX** research project focused on improving accessibility using vision-based deep learning.

---

## Use Case

The model takes an image of a **UI icon or element** and generates a **natural language description** (e.g., `"Settings icon"`, `"Play button"`, `"Search field"`).

This helps build assistive technologies such as screen readers by providing textual labels for unlabeled visual components.
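
In practice, the input is usually a single element cropped out of a full screenshot. The sketch below illustrates that flow; the bounding-box coordinates and file paths are placeholders, not part of the Screen2AX pipeline itself:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("macpaw-research/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("macpaw-research/blip-icon-captioning")

# Crop a single UI element out of a full screenshot.
# The box (left, top, right, bottom) is illustrative; in practice it would
# come from a UI-element detector or accessibility metadata.
screenshot = Image.open("path/to/screenshot.png").convert("RGB")
element = screenshot.crop((100, 50, 164, 114))

inputs = processor(images=element, return_tensors="pt")
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
```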

---

## Model Architecture

- Base model: [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base)
- Architecture: **BLIP** (Bootstrapping Language-Image Pre-training)
- Task: `image-to-text`
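
Since the model is published with `pipeline_tag: image-to-text`, it can also be loaded through the high-level `transformers` pipeline API. A minimal sketch (the file path is a placeholder):

```python
from transformers import pipeline

# "image-to-text" matches the pipeline_tag declared in the model card metadata
captioner = pipeline("image-to-text", model="macpaw-research/blip-icon-captioning")

# Accepts a local path, URL, or PIL.Image; returns a list of generated captions
result = captioner("path/to/ui_icon.png")
print(result[0]["generated_text"])  # e.g. "Settings icon"
```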

---

## Example

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the fine-tuned processor and model
processor = BlipProcessor.from_pretrained("macpaw-research/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("macpaw-research/blip-icon-captioning")

# BLIP expects RGB input
image = Image.open("path/to/ui_icon.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)

print(caption)
# Example: "Settings icon"
```
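
The card does not specify generation settings, so the example above uses the library defaults. Standard decoding arguments can be tuned if captions come out too long or unstable; the values below are illustrative, not the authors' configuration:

```python
# Beam search with a cap on new tokens tends to produce short, stable labels
output = model.generate(**inputs, num_beams=3, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
```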

---

## License

This model is released under the **MIT License**.

---

## Related Projects

- [Screen2AX Project](https://github.com/MacPaw/Screen2AX)
- [Screen2AX Hugging Face Collection](https://huggingface.co/collections/macpaw-research/screen2ax)

---

## Citation

If you use this model in your research, please cite the Screen2AX paper:

```bibtex
@misc{muryn2025screen2axvisionbasedapproachautomatic,
  title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation},
  author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
  year={2025},
  eprint={2507.16704},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.16704},
}
```

---

## MacPaw Research

Learn more at [https://research.macpaw.com](https://research.macpaw.com).