---
license: apache-2.0
base_model: microsoft/git-base
tags:
- multimodal
- image-to-text
- lora
- transformers
- ui-captioning
datasets:
- rootsautomation/RICO-Screen2Words
---

# GIT LoRA Fine-Tuned on RICO-Screen2Words

## Model Description

This repository contains **LoRA adapters for the GIT (Generative Image-to-Text Transformer) model**, fine-tuned for **UI screen caption generation**. The model generates **natural language descriptions of mobile UI screenshots**.

Instead of full fine-tuning, **LoRA (Low-Rank Adaptation)** is used to adapt the base model efficiently while training only a small number of parameters.

Base model:

```
microsoft/git-base
```

Dataset used:

```
rootsautomation/RICO-Screen2Words
```

---

# Intended Use

The model takes a **mobile UI screenshot as input** and generates a **caption describing the interface**.

Example use cases:

- UI documentation
- Accessibility tools
- Screen summarization
- Interface understanding

---

# Training Details

The model was fine-tuned on the **RICO-Screen2Words dataset**, which pairs mobile application screenshots with human-written captions.

Training method:

- Parameter-efficient fine-tuning using **LoRA**
- Base model: `microsoft/git-base`
- Vision encoder: Vision Transformer
- Text decoder: Transformer language model

Training was designed to run on **NVIDIA T4 GPUs** using Hugging Face Transformers and PEFT.

---

# Files in This Repository

This repository contains the **LoRA adapter weights**:

```
adapter_config.json
adapter_model.safetensors
```

These adapters are loaded on top of the base GIT model.

---

# Loading the Model

To use the adapters, first load the base model, then apply the LoRA adapters on top of it.
```python
from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

# Load the base GIT model, then attach the LoRA adapters.
base_model = GitForCausalLM.from_pretrained("microsoft/git-base")
model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora"
)

processor = AutoProcessor.from_pretrained("microsoft/git-base")
```

---

# Merging LoRA Adapters

The LoRA adapters can optionally be merged into the base model before inference, which removes the PEFT wrapper overhead:

```python
model = model.merge_and_unload()
```

This produces a standalone model whose weights already include the adapter updates, so it can be used like any fine-tuned GIT checkpoint without the `peft` library.

---

# Example Inference

```python
from PIL import Image

image = Image.open("example_ui.png")

# The processor converts the screenshot to pixel values for the vision encoder.
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)

# batch_decode returns a list of strings, one per generated sequence.
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(caption)
```

Example output:

```
"This screen shows a shopping application with product listings and navigation tabs."
```

---

# Requirements

```
transformers
peft
torch
Pillow
```

Install dependencies:

```
pip install transformers peft torch pillow
```

---

# Dataset

The model was trained on:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words

---

# Limitations

- Performance depends on the diversity of UI layouts present in the dataset.
- The model may struggle with very complex or uncommon UI designs.

---

# Citation

- RICO Dataset
- Microsoft GIT model
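
---

# How LoRA Works (Sketch)

As a self-contained illustration of the Low-Rank Adaptation idea used by this checkpoint, the update can be sketched in plain PyTorch. The dimensions `d`, rank `r`, and scaling `alpha` below are hypothetical, and this is a conceptual sketch rather than the PEFT implementation:

```python
import torch

# LoRA replaces a frozen weight W with W + (alpha / r) * B @ A,
# where only the low-rank factors A and B are trained.
d, r, alpha = 64, 8, 16          # hypothetical dimensions and scaling
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d) * 0.01     # trainable low-rank factor A
B = torch.zeros(d, r)            # trainable factor B, zero-initialized

x = torch.randn(1, d)
y_adapted = x @ (W + (alpha / r) * (B @ A)).T
y_base = x @ W.T

# With B initialized to zero, the adapted layer initially matches the
# base layer exactly, so training starts from the pretrained behavior.
print(torch.allclose(y_adapted, y_base))  # prints True
```

Because only `A` and `B` (roughly `2 * d * r` values per adapted matrix) are trained, the adapter files shipped in this repository stay small compared to the full model weights.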