---
license: apache-2.0
base_model: microsoft/git-base
tags:
- multimodal
- image-to-text
- lora
- transformers
- ui-captioning
datasets:
- rootsautomation/RICO-Screen2Words
---

# GIT LoRA Fine-Tuned on RICO-Screen2Words
## Model Description

This repository contains **LoRA adapters for the GIT (Generative Image-to-Text Transformer) model**, fine-tuned for **UI screen caption generation**. Given a mobile UI screenshot, the model generates a natural-language description of the screen.

Instead of full fine-tuning, **LoRA (Low-Rank Adaptation)** is used to adapt the base model efficiently: the original weights stay frozen, and only a small set of low-rank update matrices is trained.
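The parameter savings are easy to see with a back-of-the-envelope calculation (a sketch only, assuming a rank-8 adapter on a single 768×768 projection matrix of `git-base`; the rank actually used for this checkpoint is recorded in `adapter_config.json`):

```python
# LoRA replaces the update to a d×d weight matrix with two low-rank
# factors A (r×d) and B (d×r), so only 2*d*r parameters are trained.
d = 768            # hidden size of git-base
r = 8              # assumed LoRA rank (see adapter_config.json)

full_update = d * d          # parameters updated by full fine-tuning
lora_update = d * r + r * d  # parameters in the low-rank factors

print(f"trainable fraction per matrix: {lora_update / full_update:.2%}")
# → trainable fraction per matrix: 2.08%
```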

Base model:

```
microsoft/git-base
```

Dataset used:

```
rootsautomation/RICO-Screen2Words
```

---

## Intended Use

The model takes a **mobile UI screenshot** as input and generates a **caption describing the interface**.

Example use cases:

- UI documentation
- Accessibility tools
- Screen summarization
- Interface understanding

---

## Training Details

The model was fine-tuned on the **RICO-Screen2Words dataset**, which pairs screenshots of mobile applications with human-written captions.

Training method:

- Parameter-efficient fine-tuning with **LoRA**
- Base model: `microsoft/git-base`
- Vision encoder: Vision Transformer (ViT)
- Text decoder: Transformer language model

Training was designed to run on **NVIDIA T4 GPUs** using Hugging Face Transformers and PEFT.
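With PEFT, a configuration along these lines would be passed to `get_peft_model` (a sketch only: the rank, alpha, dropout, and target module names below are assumptions, not the recorded values — `adapter_config.json` in this repository holds the actual settings):

```python
from peft import LoraConfig

# Hypothetical LoRA settings; the real values live in adapter_config.json.
lora_config = LoraConfig(
    r=8,                                # assumed rank
    lora_alpha=16,                      # assumed scaling factor
    lora_dropout=0.05,                  # assumed dropout
    target_modules=["query", "value"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```

Wrapping the base model with `get_peft_model(base_model, lora_config)` freezes the original weights and trains only the injected low-rank matrices.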

---

## Files in This Repository

This repository contains the **LoRA adapter weights**:

```
adapter_config.json
adapter_model.safetensors
```

These adapters can be loaded on top of the base GIT model.

---

## Loading the Model

To use the adapters, load the base model first, then apply the LoRA adapters on top of it:

```python
from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

base_model = GitForCausalLM.from_pretrained("microsoft/git-base")

model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora",
)

processor = AutoProcessor.from_pretrained("microsoft/git-base")
```

---

## Merging LoRA Adapters

The LoRA adapters can optionally be merged into the base model before inference:

```python
# Folds the low-rank updates into the base weights and removes the PEFT wrappers.
model = model.merge_and_unload()
```

This produces a standalone model that carries the adapted weights and no longer requires PEFT at inference time.

---

## Example Inference

```python
from PIL import Image

image = Image.open("example_ui.png")
inputs = processor(images=image, return_tensors="pt")

outputs = model.generate(**inputs, max_length=50)

# batch_decode returns one caption per input image
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(caption)
```

Example output:

```
"This screen shows a shopping application with product listings and navigation tabs."
```

---

## Requirements

```
transformers
peft
torch
Pillow
```

Install dependencies:

```
pip install transformers peft torch pillow
```

---

## Dataset

The model was trained on:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words

---

## Limitations

- Performance depends on the diversity of UI layouts present in the training data.
- The model may struggle with very complex or uncommon UI designs.

---

## Citation

- RICO Dataset
- Microsoft GIT model