---
license: apache-2.0
base_model: microsoft/git-base
tags:
- multimodal
- image-to-text
- lora
- transformers
- ui-captioning
datasets:
- rootsautomation/RICO-Screen2Words
---
# GIT LoRA Fine-Tuned on RICO-Screen2Words
## Model Description
This repository contains **LoRA adapters for the GIT (Generative Image-to-Text Transformer) model**, fine-tuned for **UI screen caption generation**.
The model generates **natural language descriptions of mobile UI screenshots**.
Instead of full fine-tuning, **LoRA (Low-Rank Adaptation)** is used to efficiently adapt the base model while training only a small number of parameters.
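As a rough illustration of why LoRA is parameter-efficient (using hypothetical layer sizes, not the actual GIT shapes), the full weight update `dW` is replaced by a product of two low-rank factors:

```python
import numpy as np

# Sketch of the LoRA idea: instead of learning a full d x k weight update dW,
# learn B (d x r) and A (r x k) with rank r << min(d, k).
d, k, r = 768, 768, 8          # hypothetical hidden sizes; r is the LoRA rank
full_params = d * k            # parameters in a full fine-tuning update
lora_params = d * r + r * k    # parameters in the low-rank factors

B = np.zeros((d, r))           # B starts at zero, so dW is initially zero
A = np.random.randn(r, k) * 0.01
delta_W = B @ A                # effective weight update, rank at most r

print(full_params, lora_params)  # -> 589824 12288 (~2% of the full update)
```

Only `A` and `B` are trained; the base weights stay frozen, which is why the adapter files in this repository are small.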
Base model:
```
microsoft/git-base
```
Dataset used:
```
rootsautomation/RICO-Screen2Words
```
---
## Intended Use
The model takes a **mobile UI screenshot as input** and generates a **caption describing the interface**.
Example use cases:
- UI documentation
- Accessibility tools
- Screen summarization
- Interface understanding
---
## Training Details
The model was fine-tuned using the **RICO-Screen2Words dataset**, which contains screenshots of mobile applications paired with human-written captions.
Training method:
- Parameter-efficient fine-tuning using **LoRA**
- Base model: `microsoft/git-base`
- Vision encoder: Vision Transformer
- Text decoder: Transformer language model
Training was designed to run on **NVIDIA T4 GPUs** using Hugging Face Transformers and PEFT.
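A minimal sketch of how such a PEFT setup is typically wired together. The rank, alpha, dropout, and target modules below are placeholders for illustration; the values actually used for this checkpoint are recorded in `adapter_config.json`:

```python
from transformers import GitForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical hyperparameters -- see adapter_config.json for the real ones.
lora_config = LoraConfig(
    r=8,                                # LoRA rank (assumed)
    lora_alpha=16,                      # scaling factor (assumed)
    lora_dropout=0.05,                  # dropout on the LoRA path (assumed)
    target_modules=["query", "value"],  # assumed attention projections
)

# Wrap the frozen base model so that only the LoRA factors are trainable.
base_model = GitForCausalLM.from_pretrained("microsoft/git-base")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

The wrapped `model` can then be passed to a standard Transformers `Trainer` loop; only the small LoRA matrices receive gradients.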
---
## Files in This Repository
This repository contains the **LoRA adapter weights**:
```
adapter_config.json
adapter_model.safetensors
```
These adapters can be loaded on top of the base GIT model.
---
## Loading the Model
To use the adapters, first load the base model and then load the LoRA adapters.
```python
from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the LoRA adapters on top of it.
base_model = GitForCausalLM.from_pretrained("microsoft/git-base")
model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora",
)
processor = AutoProcessor.from_pretrained("microsoft/git-base")
```
---
## Merging LoRA Adapters
The LoRA adapters can optionally be merged with the base model before inference.
```python
model = model.merge_and_unload()
```
This folds the adapter weights into the base weights, producing a standalone model that behaves identically to the adapted model but no longer requires PEFT at inference time.
---
## Example Inference
```python
from PIL import Image

# Load a screenshot and generate a caption for it.
image = Image.open("example_ui.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
# batch_decode returns a list of strings, one per input image.
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(caption)
```
Example output:
```
"This screen shows a shopping application with product listings and navigation tabs."
```
---
## Requirements
```
transformers
peft
torch
Pillow
```
Install dependencies:
```
pip install transformers peft torch pillow
```
---
## Dataset
The model was trained on:
https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words
---
## Limitations
- Performance depends on the diversity of UI layouts present in the dataset.
- The model may struggle with very complex or uncommon UI designs.
---
## Citation
- RICO Dataset
- Microsoft GIT model