---
library_name: transformers
license: gemma
base_model: google/gemma-3n-E4B-it
tags:
- llama-factory
- lora
- generated_from_trainer
- android-control
- ui-automation
- vision-language-model
datasets:
- OfficerChul/Android-Control-84k
model-index:
- name: gemma-3n-e4b-android-control
  results: []
---

# Gemma-3n-E4B-it Android Control LoRA Fine-tuned Model

## Model Overview

This model is a LoRA fine-tuned version of Google's `gemma-3n-E4B-it` for Android UI control tasks: given a mobile screenshot and a user instruction, it predicts the next UI action as a JSON object.

## Training Data

- **Dataset**: [OfficerChul/Android-Control-84k](https://huggingface.co/datasets/OfficerChul/Android-Control-84k)
- **Data Format**: Mobile UI screenshots paired with user instructions and the corresponding ground-truth actions (click, scroll, input, etc.)

### Training Data Format Example

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that can identify what action to perform on mobile UI Screenshot given the user instruction."
    },
    {
      "role": "user",
      "content": "<image>Click on the Recording 2"
    },
    {
      "role": "assistant",
      "content": "{\"action_type\": \"click\", \"x\": 561, \"y\": 535}"
    }
  ],
  "images": ["and_ctrl/out_episode_18557_step_001.png"]
}
```

## Training Method

LoRA fine-tuning was performed with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### 1. Training Configuration (`gemma3n-e4b-it.yaml`)

- **Base Model**: `google/gemma-3n-E4B-it`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Configuration**:
  - Rank: 32
  - Target modules: `q_proj, k_proj, v_proj, o_proj`
- **Training Parameters**:
  - Batch size: 4 (gradient accumulation: 48, i.e. an effective batch size of 192 per device)
  - Learning rate: 2e-5
  - Epochs: 5
  - LR scheduler: cosine
  - Optimizer: AdamW (fused)
  - Precision: bf16
- **Additional Settings**:
  - Gradient checkpointing enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-2
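
For reference, the settings above map onto a LLaMA-Factory SFT config roughly like the following. This is a sketch reconstructed from this card, not the actual `gemma3n-e4b-it.yaml`; the dataset name and DeepSpeed config path are placeholders:

```yaml
# Sketch of the SFT config described above (hyperparameters from this card;
# dataset name and file paths are assumptions)
model_name_or_path: google/gemma-3n-E4B-it
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_target: q_proj,k_proj,v_proj,o_proj
freeze_vision_tower: false          # vision tower, projector, and LM all trainable
dataset: android_control_84k        # as registered in dataset_info.json
template: gemma
per_device_train_batch_size: 4
gradient_accumulation_steps: 48
learning_rate: 2.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
optim: adamw_torch_fused
bf16: true
gradient_checkpointing: true
deepspeed: examples/deepspeed/ds_z2_config.json
```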

### 2. Model Merging (`gemma3n-e4b-it_lora_sft_merge.yaml`)

The trained LoRA adapter was merged back into the base model:

- **Base Model**: `google/gemma-3n-E4B-it`
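
A LLaMA-Factory merge/export config typically looks like the sketch below; the adapter and output paths here are placeholders, not the values from the actual `gemma3n-e4b-it_lora_sft_merge.yaml`:

```yaml
# Sketch of the LoRA merge step (paths are placeholders)
model_name_or_path: google/gemma-3n-E4B-it
adapter_name_or_path: saves/gemma-3n-e4b-android-control/lora/sft
template: gemma
finetuning_type: lora
export_dir: output/gemma-3n-E4B-it-Android-Control-84k
export_size: 4
export_legacy_format: false
```

The merge itself is run with `llamafactory-cli export <config>.yaml`, which writes the standalone merged weights to `export_dir`.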

## Training Results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2226 | 2.4288 | 500 | 0.2229 |
| 0.1658 | 4.8577 | 1000 | 0.2125 |

Final validation loss: **0.2124**

## Supported Action Types

The model emits one JSON action per turn, identified by its `action_type` field:

- `click`: Tap at specific coordinates
- `long_press`: Long press at specific coordinates
- `scroll`: Scroll (up/down)
- `input_text`: Type the given text
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to the home screen
- `open_app`: Open an application
- `wait`: Wait before the next action
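
Because each response is a single JSON object (as in the training example above), downstream automation can parse and validate it before dispatching to a device. A minimal sketch, with the schema inferred from this card's examples rather than any official spec:

```python
import json

# Action types this model was trained to emit (from the list above)
ACTION_TYPES = {
    "click", "long_press", "scroll", "input_text",
    "navigate_back", "navigate_home", "open_app", "wait",
}

def parse_action(raw: str) -> dict:
    """Parse and validate one model response into an action dict.

    Raises ValueError for malformed JSON or unknown action types.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed JSON: {e}") from e
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    # Coordinate-based actions carry integer x/y fields in the training data
    if action_type in {"click", "long_press"}:
        if not all(isinstance(action.get(k), int) for k in ("x", "y")):
            raise ValueError("click/long_press requires integer x and y")
    return action

# Example: the assistant output from the training sample above
print(parse_action('{"action_type": "click", "x": 561, "y": 535}'))
```

Rejecting unknown action types up front is what the "Malformed JSON" column in the evaluation below effectively measures from the model's side.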

## Usage

The merged model can be loaded directly with the Hugging Face Transformers library. Since this is a vision-language model, load the processor (which handles both images and text) rather than a text-only tokenizer:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "OfficerChul/gemma-3n-E4B-it-Android-Control-84k"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

## Evaluation Results

Performance comparison on the Android Control benchmark. Higher is better for the accuracy and match columns; lower is better for Click L2 Distance, Malformed JSON, and the timing columns:

| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|-------|---------------------|-------------------|------------------|----------------------|---------------------|----------------|-------------------|-------------------|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
| **OfficerChul/gemma-3n-E4B-it-Android-Control-84k** | **0.5088** | **878.66 (n=124)** | **0.8763 (n=97)** | **0.3689 (n=103)** | **0.5121** | **0 (0.0%)** | **363.23** | **177.11** |

## License

This model is distributed under the license terms of the Google Gemma models.

## Notes

- This model was developed for research on mobile UI automation and accessibility enhancement.
- Validate outputs carefully before using the model in production environments.

## Framework Versions

- PEFT 0.17.1
- Transformers 4.57.0
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1