GIT LoRA Fine-Tuned on RICO-Screen2Words

Model Description

This repository contains LoRA adapters for the GIT (Generative Image-to-Text Transformer) model, fine-tuned for UI screen caption generation.

The model generates natural language descriptions of mobile UI screenshots.

Instead of full fine-tuning, LoRA (Low-Rank Adaptation) is used to efficiently adapt the base model while training only a small number of parameters.
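The idea can be sketched numerically: instead of learning a full weight update ΔW, LoRA learns two small matrices B and A whose product approximates it, scaled by alpha/r. The snippet below is a toy illustration in plain Python (tiny made-up matrices, not the actual GIT weights):

```python
# Toy illustration of the LoRA update: W' = W + (alpha / r) * B @ A.
# Shapes are tiny for clarity; real adapters target attention projections.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d, r = 4, 1                                  # model dim 4, LoRA rank 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.5, 0.25, 0.125, 0.0]]                # r x d matrix, trained
B = [[1.0], [0.0], [0.0], [0.0]]             # d x r matrix, trained
alpha = 2.0

delta = matmul(B, A)                         # d x d update, but rank <= r
W_adapted = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)] for i in range(d)]

# Only d*r*2 = 8 parameters were trained instead of d*d = 16.
print(W_adapted[0])  # -> [2.0, 0.5, 0.25, 0.0]
```

At this toy scale the savings are trivial, but for the large projection matrices in a Transformer the trained parameter count drops by orders of magnitude.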

Base model: microsoft/git-base

Dataset: rootsautomation/RICO-Screen2Words

Intended Use

The model takes a mobile UI screenshot as input and generates a caption describing the interface.

Example use cases:

  • UI documentation
  • Accessibility tools
  • Screen summarization
  • Interface understanding

Training Details

The model was fine-tuned using the RICO-Screen2Words dataset, which contains screenshots of mobile applications paired with human-written captions.

Training method:

  • Parameter-efficient fine-tuning using LoRA
  • Base model: microsoft/git-base
  • Vision encoder: Vision Transformer
  • Text decoder: Transformer language model

Training was designed to run on NVIDIA T4 GPUs using Hugging Face Transformers and PEFT.
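A PEFT setup along these lines could look like the following sketch. The rank, alpha, dropout, and target modules shown are illustrative assumptions; the hyperparameters actually used for this repository are not documented here:

```python
from transformers import GitForCausalLM
from peft import LoraConfig, get_peft_model

base_model = GitForCausalLM.from_pretrained("microsoft/git-base")

# Hypothetical LoRA hyperparameters -- not the recorded training configuration.
lora_config = LoraConfig(
    r=8,                                 # low-rank dimension
    lora_alpha=16,                       # scaling factor (alpha / r applied to the update)
    target_modules=["query", "value"],   # decoder attention projections (assumed)
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

The wrapped model can then be trained with a standard Transformers training loop; saving it writes only the adapter files listed below.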


Files in This Repository

This repository contains the LoRA adapter weights:

adapter_config.json
adapter_model.safetensors

These adapters can be loaded on top of the base GIT model.


Loading the Model

To use the adapters, first load the base model and then load the LoRA adapters.

from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

base_model = GitForCausalLM.from_pretrained("microsoft/git-base")

model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora"
)

processor = AutoProcessor.from_pretrained("microsoft/git-base")

Merging LoRA Adapters

The LoRA adapters can optionally be merged with the base model before inference.

model = model.merge_and_unload()

This folds the adapter weights into the base weights, producing a standalone model that no longer requires PEFT at inference time. Note that the merged model is equivalent to the adapted model, not to full fine-tuning of all parameters.


Example Inference

from PIL import Image

image = Image.open("example_ui.png")

inputs = processor(images=image, return_tensors="pt")

outputs = model.generate(**inputs)

caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]  # batch_decode returns a list

print(caption)

Example output:

"This screen shows a shopping application with product listings and navigation tabs."

Requirements

transformers
peft
torch
Pillow

Install dependencies:

pip install transformers peft torch pillow

Dataset

The model was trained on:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words


Limitations

  • Performance depends on the diversity of UI layouts present in the dataset.
  • The model may struggle with very complex or uncommon UI designs.

Citation

  • RICO: Deka et al., "Rico: A Mobile App Dataset for Building Data-Driven Design Applications" (UIST 2017)
  • Screen2Words: Wang et al., "Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning" (UIST 2021)
  • GIT: Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language" (2022)