dmitchelljackson committed
Commit ab974a1 · verified · 1 Parent(s): a7b37e7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +187 -0
README.md ADDED
@@ -0,0 +1,187 @@
---
language: en
license: apache-2.0
base_model: google/gemma-4-E4B-it
tags:
- android
- ui-automation
- accessibility
- lora
- peft
---

# Cerebellum: Android UI Action Predictor

LoRA adapter on top of `google/gemma-4-E4B-it` that predicts the next Android UI action from a screenshot and its accessibility tree.

**Architecture:** The LLM (or orchestrating agent) issues a high-level intent. Cerebellum executes it locally by grounding that intent to a specific UI element and action, without screenshot round-trips to a remote model.

---

## What It Does

Given a task goal, the current screen (screenshot + accessibility tree), and optional action history, the model outputs a single compact action code indicating what to do next.

---

## Input Format

The model uses a chat-style prompt (Gemma 4 format). The user turn is structured as:

```
Task: {goal}

Step 1 (past): <|image|> -> {action_text}
Step 2 (past): <|image|> -> {action_text}
...
Current screen: <|image|>
{compressed_accessibility_tree}
[n zone]=tap-target(top-to-bottom left-to-right) zone=tl/tc/tr/ml/mc/mr/bl/bc/br ed=text-input sr=scrollable fc=focused(use 'K your_text' to type here)
Actions: T{n}=tap element n, P{n}=long-press element n, K {text}=type text(space required), U/D/L/R=scroll(single token), B=back, H=home, W=wait, F=done, I=impossible
Next action:
```

**Inputs:**
- `goal`: natural language task description (e.g. "Open the settings app and enable dark mode")
- `history`: up to 4 past (screenshot, action) pairs; can be empty
- `current screenshot`: PIL image of the current screen, resized to 896px on the long edge
- `compressed_accessibility_tree`: compact text representation of the UI element tree (see below)
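
The template and inputs above can be assembled with a few lines of string handling. The following is a minimal sketch, not the project's actual prompt code: `build_user_turn`, `LEGEND`, and `ACTIONS` are hypothetical names, and it assumes that history longer than 4 steps is truncated to the most recent 4, that step numbering restarts at 1, and that the blank line after `Task:` is kept even when the history is empty. Images are passed to the processor separately; this only builds the text with `<|image|>` placeholders.

```python
from typing import Sequence

IMAGE_TOKEN = "<|image|>"
MAX_HISTORY = 4  # the card states "up to 4 past (screenshot, action) pairs"

# Legend and action list copied verbatim from the prompt template in the card.
LEGEND = (
    "[n zone]=tap-target(top-to-bottom left-to-right) "
    "zone=tl/tc/tr/ml/mc/mr/bl/bc/br ed=text-input sr=scrollable "
    "fc=focused(use 'K your_text' to type here)"
)
ACTIONS = (
    "Actions: T{n}=tap element n, P{n}=long-press element n, "
    "K {text}=type text(space required), U/D/L/R=scroll(single token), "
    "B=back, H=home, W=wait, F=done, I=impossible"
)


def build_user_turn(goal: str, past_actions: Sequence[str], tree: str) -> str:
    """Build the text of the user turn (hypothetical helper)."""
    # Assumption: only the most recent MAX_HISTORY actions are kept.
    recent = list(past_actions)[-MAX_HISTORY:]
    lines = [f"Task: {goal}", ""]
    for step, action in enumerate(recent, start=1):
        lines.append(f"Step {step} (past): {IMAGE_TOKEN} -> {action}")
    lines.append(f"Current screen: {IMAGE_TOKEN}")
    lines.append(tree)
    lines.append(LEGEND)
    lines.append(ACTIONS)
    lines.append("Next action:")
    return "\n".join(lines)
```

The number of `<|image|>` placeholders equals the number of retained history steps plus one, so the caller would need to pass the same number of images, in the same order, to the processor.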

### Accessibility Tree Format

Each interactive element is one line:

```
[0 btn tl] Settings
[1 ed mc fc=focused] Search...
[2 btn sr tr] More options
```

Fields per element:
- `[n]`: element index (used in action codes)
- type: `btn`=button, `ed`=text-input, `img`=image, `chk`=checkbox, `swt`=switch, etc.
- zone: approximate screen position (tl/tc/tr/ml/mc/mr/bl/bc/br)
- `fc=focused`: this element has keyboard focus (the `K` action types here)
- `sr=scrollable`: this element is scrollable
- label/content text follows
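
For consumers that need to map a predicted index back to an element, the line format above can be parsed with a small regular expression. This is a sketch under assumptions: `Element` and `parse_element` are hypothetical names, the bracket is assumed to hold the index followed by whitespace-separated tokens in any order, and a flag is accepted both bare (`sr`) and in `key=value` form (`fc=focused`), since the example lines show both spellings. Any token that is neither a zone nor a flag is treated as the element type.

```python
import re
from dataclasses import dataclass
from typing import Optional

ZONES = {"tl", "tc", "tr", "ml", "mc", "mr", "bl", "bc", "br"}
FLAG_KEYS = {"fc", "sr"}  # focused / scrollable
# "[<index> <token> <token> ...] <label>"
LINE_RE = re.compile(r"^\[(\d+)((?:\s+[^\]\s]+)*)\]\s*(.*)$")


@dataclass
class Element:
    index: int
    kind: Optional[str]   # e.g. "btn", "ed"
    zone: Optional[str]   # e.g. "tl"
    focused: bool
    scrollable: bool
    label: str


def parse_element(line: str) -> Element:
    """Parse one tree line such as '[1 ed mc fc=focused] Search...'."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an element line: {line!r}")
    tokens = match.group(2).split()
    # Reduce 'fc=focused' to its key 'fc'; bare tokens are unchanged.
    keys = [token.split("=", 1)[0] for token in tokens]
    kind = next((k for k in keys if k not in ZONES and k not in FLAG_KEYS), None)
    zone = next((k for k in keys if k in ZONES), None)
    return Element(
        index=int(match.group(1)),
        kind=kind,
        zone=zone,
        focused="fc" in keys,
        scrollable="sr" in keys,
        label=match.group(3),
    )
```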

---

## Output Format

A single action code, decoded greedily (one forward pass for single-token actions):

| Code | Action | Example |
|---|---|---|
| `T{n}` | Tap element n | `T7` |
| `P{n}` | Long-press element n | `P3` |
| `K {text}` | Type text into focused field | `K hello world` |
| `U` | Scroll up | `U` |
| `D` | Scroll down | `D` |
| `L` | Scroll left | `L` |
| `R` | Scroll right | `R` |
| `B` | System back | `B` |
| `H` | Home button | `H` |
| `W` | Wait (screen loading) | `W` |
| `F` | Done (task complete) | `F` |
| `I` | Impossible (task cannot complete) | `I` |

Single-token actions (U/D/L/R/B/H/W/F/I) self-terminate: no EOS token follows. T/P generate up to 5 tokens (letter + digits + EOS). K generates until EOS.
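
The action grammar in the table is small enough to parse with one lookup and one regular expression. The sketch below is illustrative only: `Action`, `parse_action`, and the action-kind strings are hypothetical names, not identifiers from the project. It returns `None` for anything outside the grammar, and it strips surrounding whitespace, so trailing spaces in typed text would be lost.

```python
import re
from typing import NamedTuple, Optional

# Single-letter codes from the table, mapped to illustrative kind names.
SINGLE = {
    "U": "scroll_up", "D": "scroll_down", "L": "scroll_left", "R": "scroll_right",
    "B": "back", "H": "home", "W": "wait", "F": "done", "I": "impossible",
}


class Action(NamedTuple):
    kind: str
    element: Optional[int] = None  # set for tap / long_press
    text: Optional[str] = None     # set for type


def parse_action(code: str) -> Optional[Action]:
    """Return the parsed action, or None if the code is malformed."""
    code = code.strip()
    if code in SINGLE:
        return Action(SINGLE[code])
    match = re.fullmatch(r"([TP])(\d+)", code)
    if match:
        kind = "tap" if match.group(1) == "T" else "long_press"
        return Action(kind, element=int(match.group(2)))
    # 'K' must be followed by a space and at least one character of text.
    if code.startswith("K ") and len(code) > 2:
        return Action("type", text=code[2:])
    return None
```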

---

## Inference-Time Error Recovery

The model occasionally produces malformed outputs in which an action letter is fused with the wrong content (e.g. `B4`, `W3`, `T some text`). A lightweight validator detects these and retries with a disambiguating correction blurb appended to the prompt:

```
Next action:
'B4' is not valid. Did you mean 'B' (back) or 'T4' (tap element 4)? Try again:
```

This zero-shot correction resolves the majority of format errors without additional training.
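
A validator of this kind might look like the sketch below. Only the wording for the `B4` case is taken from the example above; the function name `correction_blurb`, the generic fallback message for other malformed outputs (such as `T some text`), and the choice to always suggest a tap as the alternative are assumptions. The caller would append the returned string after `Next action:` and decode again.

```python
import re
from typing import Optional

# Full grammar: single-letter action, T/P + digits, or 'K ' + text.
VALID = re.compile(r"[UDLRBHWFI]|[TP]\d+|K .+", re.DOTALL)
NAMES = {
    "U": "scroll up", "D": "scroll down", "L": "scroll left", "R": "scroll right",
    "B": "back", "H": "home", "W": "wait", "F": "done", "I": "impossible",
}


def correction_blurb(output: str) -> Optional[str]:
    """Return None for a valid action code, else a retry message."""
    out = output.strip()
    if VALID.fullmatch(out):
        return None
    # Single-letter action fused with an element index, e.g. 'B4' or 'W3'.
    fused = re.fullmatch(r"([UDLRBHWFI])(\d+)", out)
    if fused:
        letter, n = fused.groups()
        return (
            f"'{out}' is not valid. Did you mean '{letter}' ({NAMES[letter]}) "
            f"or 'T{n}' (tap element {n})? Try again:"
        )
    # Assumption: generic wording for every other malformed output.
    return f"'{out}' is not valid. Reply with exactly one action code. Try again:"
```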

---

## Performance (step 656)

Evaluated on the AndroidControl dataset (accessibility tree format, single-step predictions):

| Metric | Last 20 steps | Last 50 steps | All (102 steps) |
|---|---|---|---|
| Overall accuracy | 95.0% | 92.0% | 88.2% |
| Element index accuracy | 93.3% | 88.6% | 84.6% |

**Action type breakdown (last 20 steps):**

| Action | Accuracy |
|---|---|
| tap (T) | 93% |
| scroll (U/D/L/R) | 100% |
| back (B) | 100% |
| type (K) | 100% |
| wait (W) | 100% |

Remaining errors are primarily off-by-one element indices on tap targets, a known SFT ceiling that the planned RL fine-tuning (see Roadmap) is intended to address.
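
The "last N steps" columns are trailing-window averages over per-step hit/miss outcomes. As a rough illustration of that arithmetic, the sketch below computes accuracy over a trailing window; `windowed_accuracy` is a hypothetical helper, and the card does not state raw counts. The reported overall figures would be consistent with, for example, 19/20, 46/50, and 90/102 correct predictions, but that is an inference from the percentages, not a number taken from the source.

```python
from typing import Optional, Sequence


def windowed_accuracy(hits: Sequence[bool], window: Optional[int] = None) -> float:
    """Percentage of hits over the last `window` outcomes (all if None)."""
    recent = list(hits) if window is None else list(hits)[-window:]
    if not recent:
        raise ValueError("no outcomes to average")
    return 100.0 * sum(recent) / len(recent)


# Synthetic outcome list (NOT real evaluation data), built only so that the
# trailing counts are 19/20, 46/50 and 90/102.
synthetic = (
    [True] * 44 + [False] * 8      # first 52 steps
    + [True] * 27 + [False] * 3    # next 30 steps
    + [True] * 19 + [False] * 1    # last 20 steps
)
last_20 = windowed_accuracy(synthetic, 20)
last_50 = windowed_accuracy(synthetic, 50)
overall = windowed_accuracy(synthetic)
```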

---

## Training Process

**Base model:** `google/gemma-4-E4B-it` (4B MoE, 4-bit quantized during training via bitsandbytes)

**LoRA config:**
- `r=64`, `alpha=32`, `dropout=0.05`
- Target modules: all linear layers in the transformer
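
Expressed as PEFT settings, the hyperparameters above might look like the following sketch. The values `r`, `lora_alpha`, and `lora_dropout` come from the list; `target_modules="all-linear"` and `task_type="CAUSAL_LM"` are assumptions about how "all linear layers in the transformer" was configured, and the training script may select modules differently. With standard LoRA scaling, the adapter update is multiplied by `alpha / r`, which is 0.5 here.

```python
# Values taken from the card: r=64, alpha=32, dropout=0.05.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",  # assumption: PEFT shorthand for linear layers
    "task_type": "CAUSAL_LM",        # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    # PEFT is optional for this illustration.
    lora_config = None
```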

**Training data:** AndroidControl dataset (accessibility tree variant), ~20 shards from GCS. Each sample is a single (screenshot, a11y tree, goal, history) → action step from a real Android interaction trajectory.

**Key training decisions:**
- No label smoothing: removed after finding that it softened action-type gradients
- `accum_steps=1`: every sample is its own gradient update (maximum signal density)
- `lr=5e-5`, cosine schedule
- Grammar-constrained loss: inference-time cap per action type (T/P: 5 tokens max, single-token actions: 1 token). Wrong action type predictions lose access to downstream element-index reward
- Type token weights: tap=4.0, long_press=4.0, type=8.0, scrolls=8.0 (upweighted to prevent collapse)
- Sample weights: rare actions (back/home/wait/done/impossible) upweighted 3× to prevent tap dominance
- Rolling window diversity quota (window=20): ensures each action type appears proportionally in recent batches
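
To make the two weighting schemes in the list concrete, here is a minimal sketch of how they could be combined for one sample. The weight values are from the list; everything else is an assumption: the names `loss_weights` and `weighted_sample_loss`, the default type-token weight of 1.0 for actions the list does not mention, applying the type weight only to the first (action-letter) token, and multiplying the averaged loss by the sample weight. The actual training script may combine them differently.

```python
from typing import Sequence, Tuple

# Type-token weights from the card: tap/long_press=4.0, type/scrolls=8.0.
TYPE_TOKEN_WEIGHT = {
    "T": 4.0, "P": 4.0, "K": 8.0,
    "U": 8.0, "D": 8.0, "L": 8.0, "R": 8.0,
}
# Rare actions (back/home/wait/done/impossible) are upweighted 3x per sample.
RARE_ACTIONS = {"B", "H", "W", "F", "I"}
RARE_SAMPLE_WEIGHT = 3.0


def loss_weights(action_code: str) -> Tuple[float, float]:
    """Return (type_token_weight, sample_weight) for a target action code."""
    letter = action_code[0]
    type_weight = TYPE_TOKEN_WEIGHT.get(letter, 1.0)  # assumption: default 1.0
    sample_weight = RARE_SAMPLE_WEIGHT if letter in RARE_ACTIONS else 1.0
    return type_weight, sample_weight


def weighted_sample_loss(token_losses: Sequence[float], action_code: str) -> float:
    """Combine per-token losses for one sample (assumed scheme)."""
    type_weight, sample_weight = loss_weights(action_code)
    # Assumption: only the first token (the action letter) gets the type weight.
    weighted = [type_weight * token_losses[0]] + list(token_losses[1:])
    return sample_weight * sum(weighted) / len(weighted)
```

For example, under these assumptions a tap sample with per-token losses `[1.0, 1.0]` yields `(4.0 + 1.0) / 2 = 2.5`, while a back sample with loss `[2.0]` yields `3.0 * 2.0 = 6.0`.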

**Training infrastructure:**
- Single RTX 3060 12GB
- ~100s/step (full image + tree encoding + gradient update)
- Milestone checkpoints every ~100 steps via sentinel file

**To replicate from scratch:**
1. Download the AndroidControl dataset (GCS, 20 shards, ~47GB)
2. Preprocess with `scripts/preprocess_a11y.py` to extract accessibility trees
3. Train: `py -3.11 -u scripts/train_autoregressive.py --out checkpoints/autoreg/current`
4. Resume: `py -3.11 -u scripts/train_autoregressive.py --resume checkpoints/autoreg/current/step_XXXXXXX --out checkpoints/autoreg/current`
5. Monitor: tail the log file for HIT/miss lines; ntfy.sh push notifications are sent every 5 steps (topic: Cerebellum-Training)

---

## Loading the Adapter

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

# Load the base model in bfloat16 and let device_map place it on available devices
base = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-E4B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter weights to the base model
model = PeftModel.from_pretrained(base, "dmitchelljackson/cerebellum-e4b-lora")
# The processor (tokenizer + image preprocessing) is loaded from the adapter repo
processor = AutoProcessor.from_pretrained("dmitchelljackson/cerebellum-e4b-lora")
model.eval()  # inference mode (disables dropout)
```

---

## Roadmap

- [x] SFT on AndroidControl (~88-95% single-step accuracy)
- [x] Inference-time error recovery (format validator + correction blurb)
- [ ] RL fine-tuning (GRPO) on AndroidWorld tasks for multi-step accuracy and semantic recovery
- [ ] Error recovery fine-tuning on collected failure cases