---
library_name: transformers
license: gemma
base_model: google/gemma-3n-E4B-it
tags:
- llama-factory
- lora
- generated_from_trainer
- android-control
- ui-automation
- vision-language-model
datasets:
- OfficerChul/Android-Control-84k
model-index:
- name: gemma-3n-e4b-android-control
  results: []
---

# Gemma-3n-E4B-it Android Control LoRA Fine-tuned Model

## Model Overview

This model is a LoRA fine-tuned version of Google's `gemma-3n-E4B-it` for Android UI control tasks: given a mobile screenshot and a user instruction, it predicts the next UI action as a JSON object.

## Training Data

- **Dataset**: [OfficerChul/Android-Control-84k](https://huggingface.co/datasets/OfficerChul/Android-Control-84k)
- **Data Format**: Mobile UI screenshots paired with user instructions and the corresponding ground-truth actions (click, scroll, input, etc.)

### Training Data Format Example

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```

## Training Method

LoRA fine-tuning was performed with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### 1. Training Configuration (`gemma3n-e4b-it.yaml`)

- **Base Model**: `google/gemma-3n-E4B-it`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Configuration**:
  - Rank: 32
  - Target modules: `q_proj, k_proj, v_proj, o_proj`
- **Training Parameters**:
  - Batch size: 4 (gradient accumulation: 48, i.e. an effective batch size of 192 per device)
  - Learning rate: 2e-5
  - Epochs: 5
  - LR scheduler: cosine
  - Optimizer: AdamW (fused)
  - Precision: bf16
- **Additional Settings**:
  - Gradient checkpointing enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-2
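
For reference, the settings above map onto a LLaMA-Factory SFT config roughly like the following. This is a sketch reconstructed from this card, not the actual `gemma3n-e4b-it.yaml`; the dataset name and DeepSpeed config path are placeholders:

```yaml
# Sketch of the SFT config described above (hyperparameters from this card;
# dataset name and file paths are assumptions)
model_name_or_path: google/gemma-3n-E4B-it
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_target: q_proj,k_proj,v_proj,o_proj
freeze_vision_tower: false          # vision tower, projector, and LM all trainable
dataset: android_control_84k        # as registered in dataset_info.json
template: gemma
per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 2.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
optim: adamw_torch_fused
bf16: true
gradient_checkpointing: true
deepspeed: examples/deepspeed/ds_z2_config.json
```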

### 2. Model Merging (`gemma3n-e4b-it_lora_sft_merge.yaml`)

The trained LoRA adapter was merged back into the base model:

- **Base Model**: `google/gemma-3n-E4B-it`
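
A LLaMA-Factory merge/export config typically looks like the sketch below; the adapter and output paths here are placeholders, not the values from the actual `gemma3n-e4b-it_lora_sft_merge.yaml`:

```yaml
# Sketch of the LoRA merge step (paths are placeholders)
model_name_or_path: google/gemma-3n-E4B-it
adapter_name_or_path: saves/gemma-3n-e4b-android-control/lora/sft
template: gemma
finetuning_type: lora
export_dir: output/gemma-3n-E4B-it-Android-Control-84k
export_size: 4
export_legacy_format: false
```

The merge itself is run with `llamafactory-cli export <config>.yaml`, which writes the standalone merged weights to `export_dir`.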

## Training Results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2226 | 2.4288 | 500 | 0.2229 |
| 0.1658 | 4.8577 | 1000 | 0.2125 |

Final validation loss: **0.2124**

## Supported Action Types

The model emits one JSON action per turn, identified by its `action_type` field:

- `click`: Tap at specific coordinates
- `long_press`: Long press at specific coordinates
- `scroll`: Scroll (up/down)
- `input_text`: Type the given text
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to the home screen
- `open_app`: Open an application
- `wait`: Wait before the next action
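
Because each response is a single JSON object (as in the training example above), downstream automation can parse and validate it before dispatching to a device. A minimal sketch, with the schema inferred from this card's examples rather than any official spec:

```python
import json

# Action types this model was trained to emit (from the list above)
ACTION_TYPES = {
    "click", "long_press", "scroll", "input_text",
    "navigate_back", "navigate_home", "open_app", "wait",
}

def parse_action(raw: str) -> dict:
    """Parse and validate one model response into an action dict.

    Raises ValueError for malformed JSON or unknown action types.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed JSON: {e}") from e
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # Coordinate-based actions carry integer x/y fields in the training data
    if action_type in {"click", "long_press"}:
        if not all(isinstance(action.get(k), int) for k in ("x", "y")):
            raise ValueError("click/long_press requires integer x and y")
    return action

# Example: the assistant output from the training sample above
print(parse_action('{"action_type": "click", "x": 561, "y": 535}'))
```

Rejecting unknown action types up front is what the "Malformed JSON" column in the evaluation below effectively measures from the model's side.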

## Usage

The merged model can be loaded directly with the Hugging Face Transformers library. Since this is a vision-language model, load the processor (which handles both images and text) rather than a text-only tokenizer:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "OfficerChul/gemma-3n-E4B-it-Android-Control-84k"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

## Evaluation Results

Performance comparison on the Android Control benchmark. Higher is better for the accuracy and match columns; lower is better for Click L2 Distance, Malformed JSON, and the timing columns:

| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|-------|---------------------|-------------------|------------------|----------------------|---------------------|----------------|-------------------|-------------------|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
| **OfficerChul/gemma-3n-E4B-it-Android-Control-84k** | **0.5088** | **878.66 (n=124)** | **0.8763 (n=97)** | **0.3689 (n=103)** | **0.5121** | **0 (0.0%)** | **363.23** | **177.11** |

## License

This model is distributed under the license terms of the Google Gemma models.

## Notes

- This model was developed for research on mobile UI automation and accessibility enhancement.
- Validate outputs carefully before using the model in production environments.

## Framework Versions

- PEFT 0.17.1
- Transformers 4.57.0
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1