---
library_name: transformers
license: gemma
base_model: google/gemma-3n-E4B-it
tags:
- llama-factory
- lora
- generated_from_trainer
- android-control
- ui-automation
- vision-language-model
datasets:
- OfficerChul/Android-Control-84k
model-index:
- name: gemma-3n-e4b-android-control
results: []
---
# Gemma-3n-E4B-it Android Control LoRA Fine-tuned Model
## Model Overview
This model is a fine-tuned version of Google's `gemma-3n-E4B-it` base model with LoRA adaptation for Android UI control tasks.
## Training Data
- **Dataset**: [OfficerChul/Android-Control-84k](https://huggingface.co/datasets/OfficerChul/Android-Control-84k)
- **Data Format**: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)
### Training Data Format Example
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```
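As a quick sanity check, a record in this format can be parsed with a few lines of Python. This is an illustrative sketch using the example record above; the assistant turn stores the target action as a JSON-encoded string, so it must be decoded separately:

```python
import json

# One training record in the format shown above.
record = {
    "messages": [
        {"role": "user", "content": "<image>Click on the Recording 2"},
        {"role": "assistant",
         "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"},
    ],
    "images": ["and_ctrl/out_episode_18557_step_001.png"],
}

# The assistant content is a JSON string, not a nested object.
action = json.loads(record["messages"][-1]["content"])
print(action["action_type"], action["x"], action["y"])
# → click 561 535
```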
## Training Method
LoRA fine-tuning was performed with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.
### 1. Training Configuration (`gemma3n-e4b-it.yaml`)
- **Base Model**: `google/gemma-3n-E4B-it`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Configuration**:
- Rank: 32
- Target modules: `q_proj, k_proj, v_proj, o_proj`
- **Training Parameters**:
- Batch size: 4 (gradient accumulation: 48)
- Learning rate: 2e-5
- Epochs: 5
- LR scheduler: Cosine
- Optimizer: AdamW (fused)
- Precision: bf16
- **Additional Settings**:
- Gradient checkpointing enabled
- Vision tower, multi-modal projector, and language model all trainable
- DeepSpeed ZeRO-2 utilized
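The settings above correspond to a LLaMA-Factory SFT config roughly like the following. This is a sketch, not the exact `gemma3n-e4b-it.yaml`; field names follow LLaMA-Factory's YAML schema, and the dataset name, DeepSpeed config path, and output directory are placeholders:

```yaml
model_name_or_path: google/gemma-3n-E4B-it
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_target: q_proj,k_proj,v_proj,o_proj
dataset: android_control_84k          # placeholder dataset name
template: gemma
per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 2.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
optim: adamw_torch_fused
bf16: true
gradient_checkpointing: true
freeze_vision_tower: false
freeze_multi_modal_projector: false
deepspeed: ds_z2_config.json          # placeholder path
```

With gradient accumulation of 48 on top of a per-device batch size of 4, the effective batch size per device is 192.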
### 2. Model Merging (`gemma3n-e4b-it_lora_sft_merge.yaml`)
The trained LoRA adapter was merged back into the base model:
- **Base Model**: `google/gemma-3n-E4B-it`
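In LLaMA-Factory, merging is done with `llamafactory-cli export` and a small config. A plausible sketch of `gemma3n-e4b-it_lora_sft_merge.yaml` (the adapter and output paths are placeholders, not the actual files used):

```yaml
model_name_or_path: google/gemma-3n-E4B-it
adapter_name_or_path: saves/gemma-3n-e4b-it/lora/sft   # placeholder path
template: gemma
finetuning_type: lora
export_dir: output/gemma-3n-e4b-it-merged              # placeholder path
```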
## Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2226 | 2.4288 | 500 | 0.2229 |
| 0.1658 | 4.8577 | 1000 | 0.2125 |
Final validation loss: **0.2124**
## Supported Action Types
- `click`: Click on specific coordinates
- `long_press`: Long press action
- `scroll`: Scroll (up/down)
- `input_text`: Text input
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to home screen
- `open_app`: Open application
- `wait`: Wait action
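The predicted action dicts can be executed on a device in several ways; one common option is `adb shell input`. The mapping below is an illustrative sketch (it is an assumption, not part of this model card, and omits `scroll`, `open_app`, and `wait` for brevity):

```python
def action_to_adb(action: dict) -> str:
    """Translate a predicted action dict into an adb shell command.

    Illustrative only: the use of `adb shell input` and this exact
    mapping are assumptions, not part of the training setup.
    """
    t = action["action_type"]
    if t == "click":
        return f"adb shell input tap {action['x']} {action['y']}"
    if t == "long_press":
        # A swipe with identical endpoints and a long duration acts
        # as a long press.
        return (f"adb shell input swipe {action['x']} {action['y']} "
                f"{action['x']} {action['y']} 800")
    if t == "input_text":
        return f"adb shell input text '{action['text']}'"
    if t == "navigate_back":
        return "adb shell input keyevent KEYCODE_BACK"
    if t == "navigate_home":
        return "adb shell input keyevent KEYCODE_HOME"
    raise ValueError(f"unhandled action type: {t}")

print(action_to_adb({"action_type": "click", "x": 561, "y": 535}))
# → adb shell input tap 561 535
```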
## Usage
The merged model can be directly loaded using the Hugging Face Transformers library.
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "OfficerChul/gemma-3n-E4B-it-Android-Control-84k"

# Gemma 3n is a vision-language model, so load the processor
# (tokenizer + image processor) rather than a tokenizer alone.
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # matches the bf16 training precision
    device_map="auto",
)
```
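Since the model was fine-tuned on the message format shown earlier, inference prompts should mirror it. A minimal helper for building that chat structure (the system prompt is taken from the training data example; passing the screenshot image to the processor is assumed to happen alongside this text):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that can identify what action to perform "
    "on mobile UI Screenshot given the user instruction."
)

def build_messages(instruction: str) -> list:
    """Build a chat in the same format the model was fine-tuned on.

    The `<image>` placeholder marks where the screenshot goes; the
    actual image is supplied to the processor separately.
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<image>{instruction}"},
    ]

msgs = build_messages("Click on the Recording 2")
```

The resulting list can then be fed to the processor's chat template before generation.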
## Evaluation Results
Performance comparison with other models on the Android Control benchmark:
| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|-------|---------------------|-------------------|------------------|----------------------|---------------------|----------------|-------------------|-------------------|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
| **OfficerChul/gemma-3n-E4B-it-Android-Control-84k** | **0.5088** | **878.66 (n=124)** | **0.8763 (n=97)** | **0.3689 (n=103)** | **0.5121** | **0 (0.0%)** | **363.23** | **177.11** |
## License
This model inherits the license terms of the Google Gemma models.
## Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement
- Proper validation is required when using in production environments
## Framework Versions
- PEFT 0.17.1
- Transformers 4.57.0
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1