rico-blip-lora
This model is a fine-tuned version of Salesforce/blip-image-captioning-base on the RICO Screen2Words dataset for the task of generating textual descriptions of mobile UI screenshots.
The model adapts the BLIP vision-language architecture to better understand the layout and semantics of mobile application interfaces.
Model Description
This project performs multimodal fine-tuning of the BLIP image captioning model on mobile UI screenshots.
The base model, Salesforce/blip-image-captioning-base, is a vision-language transformer that generates captions from images. It was trained primarily on natural images, however, so it is not specialized for UI screenshots.
To adapt the model to the domain of mobile user interfaces, it was fine-tuned on the RICO Screen2Words dataset, which contains screenshots of mobile apps paired with textual descriptions.
Fine-tuning was performed using LoRA (Low-Rank Adaptation), a parameter-efficient technique that allows training a small number of additional parameters while keeping the majority of the base model weights frozen.
Because only the adapter parameters receive gradients, training is feasible on limited hardware such as a single T4 GPU.
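To give a rough sense of the savings, the sketch below counts trainable parameters for one square projection matrix under full fine-tuning and under LoRA. The hidden size of 768 and the rank of 8 are illustrative assumptions; the LoRA rank actually used for this model is not stated in this card.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating the d_out x d_in matrix W directly
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Assumed sizes: hidden size 768, LoRA rank 8 (illustrative, not from the card).
full, lora = lora_param_counts(d_out=768, d_in=768, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 589,824  lora: 12,288  ratio: 2.08%
```

Under these assumptions, each adapted matrix trains about 2% of the parameters that full fine-tuning would update.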
Intended Use
This model is intended for:
- Automatic captioning of mobile application screenshots
- UI understanding tasks
- Multimodal research involving vision-language models
- Demonstrating multimodal fine-tuning with small language models (SLMs)
Dataset
The model was fine-tuned on the RICO Screen2Words dataset, which contains mobile application screenshots paired with textual descriptions of the screen content.
Dataset link:
https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words
Dataset characteristics:
- Image: Mobile UI screenshot
- Caption: Natural language description of the interface
The dataset enables training models to understand app layouts, menus, and interface components.
Training Procedure
Fine-tuning was performed using LoRA (Low-Rank Adaptation) via the PEFT library.
Instead of updating all model parameters, LoRA introduces small trainable matrices into attention layers. This significantly reduces the number of trainable parameters and allows efficient training.
The model was trained with Hugging Face Transformers on a single T4 GPU.
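As a minimal sketch of what attaching LoRA adapters with PEFT can look like, the snippet below wraps the base model with a `LoraConfig`. The rank, alpha, dropout, and target module names are hypothetical placeholders, not the settings used to train this model; `"query"` and `"value"` are assumed to match the attention projection names in the Transformers BLIP text decoder. The heavy imports sit inside the function so the settings can be inspected without downloading anything.

```python
# Hypothetical LoRA settings for illustration only; the values actually used
# for PMN23/rico-blip-lora are not documented in this card.
LORA_SETTINGS = {
    "r": 8,                                # rank of the low-rank update
    "lora_alpha": 16,                      # scaling numerator (alpha / r)
    "lora_dropout": 0.05,                  # dropout on the adapter input
    "target_modules": ["query", "value"],  # assumed attention projection names
}


def build_lora_model(base_name: str = "Salesforce/blip-image-captioning-base"):
    """Load the base BLIP model and wrap it with trainable LoRA adapters."""
    # Imported lazily because both libraries are heavy optional dependencies.
    from peft import LoraConfig, get_peft_model
    from transformers import BlipForConditionalGeneration

    base_model = BlipForConditionalGeneration.from_pretrained(base_name)
    peft_model = get_peft_model(base_model, LoraConfig(**LORA_SETTINGS))
    peft_model.print_trainable_parameters()  # reports trainable vs. total counts
    return peft_model
```

Calling `build_lora_model()` downloads the base checkpoint and returns a model in which only the adapter matrices require gradients.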
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (fused)
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP
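For reference, a linear schedule decays the learning rate from its peak to zero over the course of training. The sketch below reproduces that rule in plain Python. It assumes zero warmup steps (none are listed above) and a total of 3,542 optimizer steps, an estimate implied by the step and epoch columns of the Training Results table rather than a reported value.

```python
def linear_lr(step: int, total_steps: int, peak_lr: float = 2e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 3542  # assumption: about 1,771 steps per epoch x 2 epochs
for step in (0, 1771, 3500):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate is 2e-05 at step 0, half of that at the end of the first epoch, and close to zero by the last logged step.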
Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 8.1654 | 0.2823 | 500 | 8.1154 |
| 7.9349 | 0.5647 | 1000 | 7.9267 |
| 7.8727 | 0.8470 | 1500 | 7.8694 |
| 7.8471 | 1.1293 | 2000 | 7.8467 |
| 7.8374 | 1.4116 | 2500 | 7.8362 |
| 7.8324 | 1.6940 | 3000 | 7.8311 |
| 7.8319 | 1.9763 | 3500 | 7.8293 |
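The epoch and step columns also allow a rough estimate of the training-set size. The sketch below assumes one optimizer step per batch of 8 on a single device, with no gradient accumulation; neither detail is stated in this card, so the result is an approximation.

```python
# (epoch, step) pairs copied from the first and last rows of the table above.
log_rows = [(0.2823, 500), (1.9763, 3500)]
train_batch_size = 8  # from the hyperparameter list

# Each row gives an independent estimate of optimizer steps per epoch.
estimates = [step / epoch for epoch, step in log_rows]
steps_per_epoch = round(sum(estimates) / len(estimates))

# Assumption: one batch per step, no gradient accumulation, single device.
approx_examples = steps_per_epoch * train_batch_size
print(f"~{steps_per_epoch} steps/epoch, ~{approx_examples:,} training examples")
# prints: ~1771 steps/epoch, ~14,168 training examples
```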
How to Run Inference
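The model can be loaded directly from the Hugging Face Hub. Since the repository appears to host LoRA adapter weights, this path relies on the PEFT integration in Transformers, so the peft package should be installed.

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

model_name = "PMN23/rico-blip-lora"

# The processor (image preprocessing and tokenizer) comes from the base model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Demo image from the BLIP project; replace it with a mobile UI screenshot.
url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()  # fail early on a bad download
image = Image.open(response.raw).convert("RGB")

# Preprocess the image, generate a caption, and decode it to text.
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```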
Merging LoRA Adapters with Base Model
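The LoRA adapter weights can also be loaded explicitly with PEFT and merged into the base model before inference. The result is a plain BlipForConditionalGeneration model without the PEFT wrapper.

```python
from transformers import BlipForConditionalGeneration
from peft import PeftModel

# Load the base model, then attach the LoRA adapter weights from the Hub.
base_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model = PeftModel.from_pretrained(base_model, "PMN23/rico-blip-lora")

# Fold the adapter weights into the base weights and drop the PEFT wrapper.
model = model.merge_and_unload()
```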
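Conceptually, merging folds the low-rank update into the frozen weight once, so inference no longer needs the separate adapter path. The toy example below is a pure-Python illustration of that idea on a 2x2 matrix. It is not PEFT's implementation, and all matrices and the alpha and rank values are made up for the demonstration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w0, lora_a, lora_b, alpha, rank):
    """Return W0 + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Toy values, chosen only for illustration.
w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
lora_a = [[1.0, 2.0]]           # A: rank 1 x d_in 2
lora_b = [[0.5], [-0.5]]        # B: d_out 2 x rank 1
x = [[1.0], [1.0]]              # input column vector
alpha, rank = 2, 1

# Path 1: merged weight applied once.
y_merged = matmul(merge_lora(w0, lora_a, lora_b, alpha, rank), x)

# Path 2: base output plus the scaled adapter output, as before merging.
base_out = matmul(w0, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
y_adapter = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]

print(y_merged, y_adapter)  # both paths give [[4.0], [-2.0]]
```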
Framework Versions
- PEFT 0.18.1
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
Repository
Full code, training notebooks, and documentation are available in the GitHub repository associated with this project.