---
license: apache-2.0
base_model: microsoft/git-base
tags:
- multimodal
- image-to-text
- lora
- transformers
- ui-captioning
datasets:
- rootsautomation/RICO-Screen2Words
---

# GIT LoRA Fine-Tuned on RICO-Screen2Words

## Model Description

This repository contains **LoRA adapters for the GIT (Generative Image-to-Text Transformer) model**, fine-tuned for **UI screen caption generation**. The model generates **natural language descriptions of mobile UI screenshots**.

Instead of full fine-tuning, **LoRA (Low-Rank Adaptation)** is used to adapt the base model efficiently while training only a small number of parameters.

Base model:

```
microsoft/git-base
```

Dataset used:

```
rootsautomation/RICO-Screen2Words
```

---

# Intended Use

The model takes a **mobile UI screenshot as input** and generates a **caption describing the interface**.

Example use cases:

- UI documentation
- Accessibility tools
- Screen summarization
- Interface understanding

---

# Training Details

The model was fine-tuned on the **RICO-Screen2Words dataset**, which pairs mobile application screenshots with human-written captions.

Training method:

- Parameter-efficient fine-tuning using **LoRA**
- Base model: `microsoft/git-base`
- Vision encoder: Vision Transformer
- Text decoder: Transformer language model

Training was designed to run on **NVIDIA T4 GPUs** using Hugging Face Transformers and PEFT.

---

# Files in This Repository

This repository contains the **LoRA adapter weights**:

```
adapter_config.json
adapter_model.safetensors
```

These adapters are loaded on top of the base GIT model.

---

# Loading the Model

To use the adapters, first load the base model, then apply the LoRA adapters on top of it.
```python
from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

# Load the base GIT model, then attach the LoRA adapters.
base_model = GitForCausalLM.from_pretrained("microsoft/git-base")
model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora"
)

processor = AutoProcessor.from_pretrained("microsoft/git-base")
```

---

# Merging LoRA Adapters

The LoRA adapters can optionally be merged into the base model before inference, which removes the PEFT wrapper overhead:

```python
model = model.merge_and_unload()
```

This produces a standalone model whose weights already include the adapter updates, so it can be used like any fine-tuned GIT checkpoint without the `peft` library.

---

# Example Inference

```python
from PIL import Image

image = Image.open("example_ui.png")

# The processor converts the screenshot to pixel values for the vision encoder.
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)

# batch_decode returns a list of strings, one per generated sequence.
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(caption)
```

Example output:

```
"This screen shows a shopping application with product listings and navigation tabs."
```

---

# Requirements

```
transformers
peft
torch
Pillow
```

Install dependencies:

```
pip install transformers peft torch pillow
```

---

# Dataset

The model was trained on:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words

---

# Limitations

- Performance depends on the diversity of UI layouts present in the dataset.
- The model may struggle with very complex or uncommon UI designs.

---

# Citation

- RICO Dataset
- Microsoft GIT model
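
---

# How LoRA Works (Sketch)

As a self-contained illustration of the Low-Rank Adaptation idea used by this checkpoint, the update can be sketched in plain PyTorch. The dimensions `d`, rank `r`, and scaling `alpha` below are hypothetical, and this is a conceptual sketch rather than the PEFT implementation:

```python
import torch

# LoRA replaces a frozen weight W with W + (alpha / r) * B @ A,
# where only the low-rank factors A and B are trained.
d, r, alpha = 64, 8, 16          # hypothetical dimensions and scaling
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d) * 0.01     # trainable low-rank factor A
B = torch.zeros(d, r)            # trainable factor B, zero-initialized

x = torch.randn(1, d)
y_adapted = x @ (W + (alpha / r) * (B @ A)).T
y_base = x @ W.T

# With B initialized to zero, the adapted layer initially matches the
# base layer exactly, so training starts from the pretrained behavior.
print(torch.allclose(y_adapted, y_base))  # prints True
```

Because only `A` and `B` (roughly `2 * d * r` values per adapted matrix) are trained, the adapter files shipped in this repository stay small compared to the full model weights.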