rico-blip-lora

This model is a fine-tuned version of Salesforce/blip-image-captioning-base on the RICO Screen2Words dataset for the task of generating textual descriptions of mobile UI screenshots.

The model adapts the BLIP vision-language architecture to better understand the layout and semantics of mobile application interfaces.

Model Description

This project performs multimodal fine-tuning of the BLIP image captioning model on mobile UI screenshots.

The base model, Salesforce/blip-image-captioning-base, is a vision-language transformer that generates captions from images. However, it was trained primarily on natural images.

To adapt the model to the domain of mobile user interfaces, it was fine-tuned on the RICO Screen2Words dataset, which contains screenshots of mobile apps paired with textual descriptions.

Fine-tuning was performed using LoRA (Low-Rank Adaptation), a parameter-efficient technique that allows training a small number of additional parameters while keeping the majority of the base model weights frozen.

Because only the adapter parameters are updated, training is feasible on limited hardware such as a single T4 GPU.

Intended Use

This model is intended for:

Automatic captioning of mobile application screenshots
UI understanding tasks
Multimodal research involving vision-language models
Demonstrating multimodal fine-tuning with small language models (SLMs)

Dataset

The model was fine-tuned on the RICO Screen2Words dataset, which contains mobile application screenshots paired with textual descriptions of the screen content.

Dataset link:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words

Dataset characteristics:

Image: Mobile UI screenshot
Caption: Natural language description of the interface

The dataset enables training models to understand app layouts, menus, and interface components.

Training Procedure

Fine-tuning was performed using LoRA (Low-Rank Adaptation) via the PEFT library.

Instead of updating all model parameters, LoRA introduces small trainable matrices into attention layers. This significantly reduces the number of trainable parameters and allows efficient training.
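To make the parameter saving concrete, here is a minimal arithmetic sketch. For a weight matrix of shape d_out x d_in, LoRA trains two low-rank factors of rank r instead of the full matrix. The width of 768 and the rank of 8 are assumptions chosen for illustration only; the actual rank and target modules used for this model are not stated in this card.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # parameters if W itself were updated
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Assumed values for illustration: a 768-wide projection and rank 8.
full, lora = lora_param_counts(768, 768, 8)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2%}")
```

Under these assumed values the low-rank factors hold about 2% of the parameters of the matrix they adapt.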

The model was trained with Hugging Face Transformers on a single T4 GPU.

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: AdamW (fused)
lr_scheduler_type: linear
num_epochs: 2
mixed_precision_training: Native AMP
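As an illustration of what the linear scheduler does to the learning rate, the sketch below computes the rate at a given step. It assumes no warmup (warmup steps are not listed above) and uses a hypothetical total of 3542 optimizer steps, estimated from the step and epoch columns of the results table. It is a plain-Python approximation, not the training code.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=3542 is an assumption (about 1771 steps per epoch x 2 epochs).
for step in (0, 1771, 3542):
    print(step, linear_lr(step, total_steps=3542))
```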

Training Results

Training Loss	Epoch	Step	Validation Loss
8.1654	0.2823	500	8.1154
7.9349	0.5647	1000	7.9267
7.8727	0.8470	1500	7.8694
7.8471	1.1293	2000	7.8467
7.8374	1.4116	2500	7.8362
7.8324	1.6940	3000	7.8311
7.8319	1.9763	3500	7.8293
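The step and epoch columns can be used to estimate how many optimizer steps make up one epoch, and from that roughly how many training examples were seen per epoch. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in this card, so the example count is an approximation.

```python
# (step, epoch) pairs copied from the results table
log = [(500, 0.2823), (1000, 0.5647), (1500, 0.8470), (2000, 1.1293),
       (2500, 1.4116), (3000, 1.6940), (3500, 1.9763)]

# Each row gives an independent estimate of steps per epoch
estimates = [step / epoch for step, epoch in log]
steps_per_epoch = round(sum(estimates) / len(estimates))

train_batch_size = 8  # from the hyperparameter list
approx_examples = steps_per_epoch * train_batch_size

print(steps_per_epoch, approx_examples)
```

Since the final batch of an epoch may be partial, the true training set size could be slightly below this estimate.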

How to Run Inference

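The model can be loaded directly from the Hugging Face Hub. Because the repository stores LoRA adapter weights, the peft package should be installed so that Transformers can resolve the base model and attach the adapter.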

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests

model_name = "PMN23/rico-blip-lora"

# The processor (image preprocessing and tokenizer) comes from the base model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(model_name)

# BLIP demo photo used as a placeholder; substitute a mobile UI screenshot
url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Convert the image to model-ready tensors
inputs = processor(image, return_tensors="pt")

# Generate caption token IDs, then decode them to text
out = model.generate(**inputs)

print(processor.decode(out[0], skip_special_tokens=True))

Merging LoRA Adapters with Base Model

This repository stores LoRA adapter weights. They can be merged into the base model before inference, which yields a standard BLIP model with no PEFT wrapper at run time.

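from transformers import BlipForConditionalGeneration
from peft import PeftModel

# Load the frozen base model
base_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Attach the LoRA adapter weights from this repository
model = PeftModel.from_pretrained(base_model, "PMN23/rico-blip-lora")

# Fold the adapter into the base weights and drop the PEFT wrapper
model = model.merge_and_unload()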
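The merged model can then be used to caption a local screenshot. The following is a minimal sketch rather than the project's own inference code: the helper name caption_screenshot, the max_new_tokens default of 40, and the example file path are all hypothetical. Heavy imports are placed inside the function so that defining it does not require the libraries to be loaded.

```python
def caption_screenshot(image_path: str, max_new_tokens: int = 40) -> str:
    """Caption one screenshot with the merged BLIP + LoRA model (sketch)."""
    import torch
    from PIL import Image
    from peft import PeftModel
    from transformers import BlipForConditionalGeneration, BlipProcessor

    base_id = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(base_id)
    base_model = BlipForConditionalGeneration.from_pretrained(base_id)

    # Attach the adapter, then fold it into the base weights
    model = PeftModel.from_pretrained(base_model, "PMN23/rico-blip-lora")
    model = model.merge_and_unload()
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)


# Example call (hypothetical path):
# print(caption_screenshot("screenshot.png"))
```

Loading the model inside the function keeps the sketch short. For repeated calls it would be more efficient to load the processor and model once and reuse them.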

Framework Versions

PEFT 0.18.1
Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Repository

Full code, training notebooks, and documentation are available in the GitHub repository associated with this project.