---
library_name: transformers
license: gemma
base_model: google/gemma-3n-E4B-it
tags:
- llama-factory
- lora
- generated_from_trainer
- android-control
- ui-automation
- vision-language-model
datasets:
- OfficerChul/Android-Control-84k
model-index:
- name: gemma-3n-e4b-android-control
  results: []
---

# Gemma-3n-E4B-it Android Control LoRA Fine-tuned Model

## Model Overview
This model is a fine-tuned version of Google's `gemma-3n-E4B-it` base model with LoRA adaptation for Android UI control tasks.

## Training Data
- **Dataset**: [OfficerChul/Android-Control-84k](https://huggingface.co/datasets/OfficerChul/Android-Control-84k)
- **Data Format**: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)

### Training Data Format Example
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```
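Since the assistant's reply is itself a JSON string, a downstream agent has to parse and validate it before acting on it. A minimal sketch of such a step (the `parse_action` helper is illustrative and not part of the dataset or training code):

```python
import json

# Action types the model is trained to emit (see "Supported Action Types")
ALLOWED_ACTIONS = {
    "click", "long_press", "scroll", "input_text",
    "navigate_back", "navigate_home", "open_app", "wait",
}

def parse_action(reply: str) -> dict:
    """Parse and sanity-check the model's JSON action string."""
    action = json.loads(reply)
    if action.get("action_type") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action_type: {action!r}")
    if action["action_type"] in ("click", "long_press"):
        # Coordinate actions must carry integer pixel positions
        action["x"], action["y"] = int(action["x"]), int(action["y"])
    return action

print(parse_action('{"action_type": "click", "x": 561, "y": 535}'))
```

Replies that fail to parse are what the "Malformed JSON" column in the evaluation table below counts.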

## Training Method
LoRA fine-tuning was performed with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### 1. Training Configuration (`gemma3n-e4b-it.yaml`)
- **Base Model**: `google/gemma-3n-E4B-it`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Configuration**:
  - Rank: 32
  - Target modules: `q_proj, k_proj, v_proj, o_proj`
- **Training Parameters**:
  - Batch size: 4 (gradient accumulation: 48, i.e. an effective batch size of 192 per device)
  - Learning rate: 2e-5
  - Epochs: 5
  - LR scheduler: Cosine
  - Optimizer: AdamW (fused)
  - Precision: bf16
- **Additional Settings**:
  - Gradient checkpointing enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-2 utilized

### 2. Model Merging (`gemma3n-e4b-it_lora_sft_merge.yaml`)
The trained LoRA adapter was merged into the base model:
- **Base Model**: `google/gemma-3n-E4B-it`

## Training Results
| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2226        | 2.4288 | 500  | 0.2229          |
| 0.1658        | 4.8577 | 1000 | 0.2125          |

Final validation loss: **0.2124**

## Supported Action Types
- `click`: Click on specific coordinates
- `long_press`: Long press action
- `scroll`: Scroll (up/down)
- `input_text`: Text input
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to home screen
- `open_app`: Open application
- `wait`: Wait action

## Usage
The merged model can be directly loaded using the Hugging Face Transformers library.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "OfficerChul/gemma-3n-E4B-it-Android-Control-84k"
# Gemma 3n is a vision-language model, so load a processor (image + text)
# rather than a plain tokenizer, and use the image-text-to-text auto class.
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
```

## Evaluation Results

Performance comparison on the Android UI control benchmark:

| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|-------|---------------------|-------------------|------------------|----------------------|---------------------|----------------|-------------------|-------------------|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
| **OfficerChul/gemma-3n-E4B-it-Android-Control-84k** | **0.5088** | **878.66 (n=124)** | **0.8763 (n=97)** | **0.3689 (n=103)** | **0.5121** | **0 (0.0%)** | **363.23** | **177.11** |

## License
This model follows the license terms of Google's Gemma models.

## Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement
- Proper validation is required before deployment in production environments

## Framework Versions
- PEFT 0.17.1
- Transformers 4.57.0
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1