# BLIP UI Captioning Model
This model was fine-tuned on the RICO Screen2Words dataset to generate natural language descriptions of mobile user interface screens.
The goal of the model is to learn multimodal representations by mapping visual UI layouts to textual descriptions. The model takes an image of a mobile app screen as input and produces a caption describing the UI content.
This project demonstrates multimodal fine-tuning using a Small Language Model (SLM) as part of the Orange Problem assignment.
## Model Details

### Model Description
- Developed by: Preksha M (PES1UG23CS450), Rachana R (PES1UG23CS459)
- Model type: Vision-Language Model (Image Captioning)
- Base model: Salesforce/blip-image-captioning-base
- Framework: Hugging Face Transformers
- Language: English
- License: Apache-2.0
- Finetuned from model: Salesforce/blip-image-captioning-base
The BLIP architecture combines a vision encoder and a text decoder to generate captions from images. This model was fine-tuned to specialize in describing mobile user interface screens.
## Dataset
This model was fine-tuned using the RICO Screen2Words dataset.
Dataset link:
https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words
The dataset contains:
- screenshots of mobile application interfaces
- natural language captions describing the UI screen
Each sample contains:
- `image` → mobile UI screenshot
- `caption` → textual description of the UI
A subset of 800 samples was used for fine-tuning to ensure the training process runs efficiently on limited compute resources.
## Training Details

### Base Model
`Salesforce/blip-image-captioning-base`

### Training Configuration
- Dataset subset: 800 samples
- Batch size: 4
- Learning rate: 2e-5
- Optimizer: AdamW
- Epochs: 1
- Training device: NVIDIA T4 GPU
- Platform: Kaggle
The model was trained using a simple PyTorch training loop, with the Hugging Face BLIP processor handling image preprocessing and caption tokenization.
### Preprocessing
For each dataset sample:
- The UI screenshot image is processed using the BLIP processor.
- The caption is tokenized as the target output.
- The caption tokens are used as training labels for the model.
This allows the model to learn the mapping:
UI image → descriptive caption
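The per-sample preprocessing steps above can be sketched as a single function (a minimal illustration; the `image`/`caption` field names follow the dataset description, and `preprocess` is a hypothetical helper name):

```python
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def preprocess(sample):
    # Encode the UI screenshot and tokenize the caption in one call
    enc = processor(images=sample["image"].convert("RGB"),
                    text=sample["caption"],
                    padding="max_length", truncation=True,
                    return_tensors="pt")
    # The caption tokens are reused as the training labels
    enc["labels"] = enc["input_ids"].clone()
    return enc
```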
## How to Use the Model
You can load the model directly from Hugging Face using the following code:
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("preksham2004/rico-blip-ui-caption")
model = BlipForConditionalGeneration.from_pretrained("preksham2004/rico-blip-ui-caption")

# Load a UI screenshot (replace with your own image path)
image = Image.open("screenshot.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a caption
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
```