# BLIP UI Captioning Model
This model was fine-tuned on the RICO Screen2Words dataset to generate natural language descriptions of mobile user interface screens.
The goal of the model is to learn multimodal representations by mapping visual UI layouts to textual descriptions. The model takes an image of a mobile app screen as input and produces a caption describing the UI content.
This project demonstrates multimodal fine-tuning using a Small Language Model (SLM) as part of the Orange Problem assignment.
## Model Details

### Model Description
- Developed by: Preksha M (PES1UG23CS450), Rachana R (PES1UG23CS459)
- Model type: Vision-Language Model (Image Captioning)
- Base model: Salesforce/blip-image-captioning-base
- Framework: Hugging Face Transformers
- Language: English
- License: Apache-2.0
- Finetuned from model: Salesforce/blip-image-captioning-base
The BLIP architecture combines a vision encoder and a text decoder to generate captions from images. This model was fine-tuned to specialize in describing mobile user interface screens.
## Dataset
This model was fine-tuned using the RICO Screen2Words dataset.
Dataset link:
https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words
The dataset contains:
- screenshots of mobile application interfaces
- natural language captions describing the UI screen
Each sample contains:
- `image` → mobile UI screenshot
- `caption` → textual description of the UI
A subset of 800 samples was used for fine-tuning to ensure the training process runs efficiently on limited compute resources.
## Training Details

### Base Model
`Salesforce/blip-image-captioning-base`

### Training Configuration
- Dataset subset: 800 samples
- Batch size: 4
- Learning rate: 2e-5
- Optimizer: AdamW
- Epochs: 1
- Training device: NVIDIA T4 GPU
- Platform: Kaggle
The model was trained using a simple PyTorch training loop, with the Hugging Face BLIP processor handling image preprocessing and caption tokenization.
### Preprocessing
For each dataset sample:
- The UI screenshot image is processed using the BLIP processor.
- The caption is tokenized as the target output.
- The caption tokens are used as training labels for the model.
This allows the model to learn the mapping:
UI image → descriptive caption
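The per-sample preprocessing steps above can be sketched as a single function (a minimal illustration; the `image`/`caption` field names follow the dataset description, and `preprocess` is a hypothetical helper name):

```python
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def preprocess(sample):
    # Encode the UI screenshot and tokenize the caption in one call
    enc = processor(images=sample["image"].convert("RGB"),
                    text=sample["caption"],
                    padding="max_length", truncation=True,
                    return_tensors="pt")
    # The caption tokens are reused as the training labels
    enc["labels"] = enc["input_ids"].clone()
    return enc
```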
## How to Use the Model
You can load the model directly from Hugging Face using the following code:
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("preksham2004/rico-blip-ui-caption")
model = BlipForConditionalGeneration.from_pretrained("preksham2004/rico-blip-ui-caption")

# Load a UI screenshot (replace with your own image path)
image = Image.open("screenshot.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a caption
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
```