---
license: apache-2.0
base_model: microsoft/git-base
tags:
- multimodal
- image-to-text
- lora
- transformers
- ui-captioning
datasets:
- rootsautomation/RICO-Screen2Words
---

# GIT LoRA Fine-Tuned on RICO-Screen2Words
## Model Description

This repository contains **LoRA adapters for the GIT (Generative Image-to-Text Transformer) model**, fine-tuned for **UI screen caption generation**. Given a mobile UI screenshot, the model generates a natural-language description of the screen.

Instead of full fine-tuning, **LoRA (Low-Rank Adaptation)** is used to adapt the base model efficiently: the original weights stay frozen, and only a small set of low-rank update matrices is trained.
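The parameter savings are easy to see with a back-of-the-envelope calculation (a sketch only, assuming a rank-8 adapter on a single 768×768 projection matrix of `git-base`; the rank actually used for this checkpoint is recorded in `adapter_config.json`):

```python
# LoRA replaces the update to a d×d weight matrix with two low-rank
# factors A (r×d) and B (d×r), so only 2*d*r parameters are trained.
d = 768            # hidden size of git-base
r = 8              # assumed LoRA rank (see adapter_config.json)

full_update = d * d          # parameters updated by full fine-tuning
lora_update = d * r + r * d  # parameters in the low-rank factors

print(f"trainable fraction per matrix: {lora_update / full_update:.2%}")
# → trainable fraction per matrix: 2.08%
```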

Base model:

```
microsoft/git-base
```

Dataset used:

```
rootsautomation/RICO-Screen2Words
```

---

## Intended Use

The model takes a **mobile UI screenshot** as input and generates a **caption describing the interface**.

Example use cases:

- UI documentation
- Accessibility tools
- Screen summarization
- Interface understanding

---

## Training Details

The model was fine-tuned on the **RICO-Screen2Words dataset**, which pairs screenshots of mobile applications with human-written captions.

Training method:

- Parameter-efficient fine-tuning with **LoRA**
- Base model: `microsoft/git-base`
- Vision encoder: Vision Transformer (ViT)
- Text decoder: Transformer language model

Training was designed to run on **NVIDIA T4 GPUs** using Hugging Face Transformers and PEFT.
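With PEFT, a configuration along these lines would be passed to `get_peft_model` (a sketch only: the rank, alpha, dropout, and target module names below are assumptions, not the recorded values — `adapter_config.json` in this repository holds the actual settings):

```python
from peft import LoraConfig

# Hypothetical LoRA settings; the real values live in adapter_config.json.
lora_config = LoraConfig(
    r=8,                                # assumed rank
    lora_alpha=16,                      # assumed scaling factor
    lora_dropout=0.05,                  # assumed dropout
    target_modules=["query", "value"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```

Wrapping the base model with `get_peft_model(base_model, lora_config)` freezes the original weights and trains only the injected low-rank matrices.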

---

## Files in This Repository

This repository contains the **LoRA adapter weights**:

```
adapter_config.json
adapter_model.safetensors
```

These adapters can be loaded on top of the base GIT model.

---

## Loading the Model

To use the adapters, load the base model first, then apply the LoRA adapters on top of it:

```python
from transformers import AutoProcessor, GitForCausalLM
from peft import PeftModel

base_model = GitForCausalLM.from_pretrained("microsoft/git-base")

model = PeftModel.from_pretrained(
    base_model,
    "HarshaDiwakar/orange-problem-git-lora",
)

processor = AutoProcessor.from_pretrained("microsoft/git-base")
```

---

## Merging LoRA Adapters

The LoRA adapters can optionally be merged into the base model before inference:

```python
# Folds the low-rank updates into the base weights and removes the PEFT wrappers.
model = model.merge_and_unload()
```

This produces a standalone model that carries the adapted weights and no longer requires PEFT at inference time.

---

## Example Inference

```python
from PIL import Image

image = Image.open("example_ui.png")
inputs = processor(images=image, return_tensors="pt")

outputs = model.generate(**inputs, max_length=50)

# batch_decode returns one caption per input image
caption = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(caption)
```

Example output:

```
"This screen shows a shopping application with product listings and navigation tabs."
```

---

## Requirements

```
transformers
peft
torch
Pillow
```

Install dependencies:

```
pip install transformers peft torch pillow
```

---

## Dataset

The model was trained on:

https://huggingface.co/datasets/rootsautomation/RICO-Screen2Words

---

## Limitations

- Performance depends on the diversity of UI layouts present in the training data.
- The model may struggle with very complex or uncommon UI designs.

---

## Citation

- RICO Dataset
- Microsoft GIT model