renezander030 committed on
Commit ad37a63 · verified · 1 Parent(s): 665881a

v0.2 — Tier 2 LoRA r32, 26k mixed-domain examples incl. browser, ScreenSpot-v2 60.0%

Files changed (4)
  1. README.md +168 -161
  2. adapter_config.json +7 -7
  3. adapter_model.safetensors +2 -2
  4. training_args.bin +1 -1
README.md CHANGED
@@ -1,200 +1,207 @@
1
  ---
2
- license: apache-2.0
3
  library_name: peft
 
4
  tags:
5
- - ui-grounding
6
- - screen-grounding
7
- - browser-agent
8
- - claude-computer-use
9
- - codex
10
- - hybrid-ai
11
- - compound-ai
12
- - specialist-model
13
- - lora
14
- - peft
15
- - mlx
16
- - apple-silicon
17
- - qwen3-vl
18
- - gpt-4v-alternative
19
- - cost-effective-ai
20
- base_model: Qwen/Qwen3-VL-2B-Instruct
21
- pipeline_tag: image-text-to-text
22
- language:
23
- - en
24
- datasets:
25
- - OS-Copilot/OS-Atlas-Data
26
- - agentsea/wave-ui
27
  ---
28
 
29
- <p align="center">
30
- <img src="https://raw.githubusercontent.com/renezander030/browserground/main/assets/logo.svg" alt="browserground logo" width="120" height="120"/>
31
- </p>
32
 
33
- # browserground — Qwen3-VL-2B LoRA for hybrid AI agents (v0.1)
34
 
35
- > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
36
 
 
37
 
38
- ## Why this exists: the hybrid AI argument
39
 
40
- Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
41
 
42
- A general 200B-parameter LLM is overkill for "where is the Submit button?" — that's a narrow vision task. The right shape is a **hybrid one**: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's uniquely good at.
43
 
44
- That's exactly what browserground is: the click-grounding specialist.
45
 
46
- ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
47
 
48
- | | Pure-cloud (status quo) | Hybrid (+ browserground) |
49
- |---|---|---|
50
- | Per-screenshot cost | $0.01–0.05 | **$0** |
51
- | Latency | 800ms–2s round-trip | **~1.8s local** |
52
- | Tokens billed by cloud | 1500+ multimodal | **~40 text** |
53
- | Screenshots leave machine | yes | **no** |
54
- | Rate limits | yes | **no** |
55
 
56
- ## What it does
57
 
58
- Given a screenshot and a target description (`"submit form button"`, `"the red Sign Up link"`, `"the second profile picture from the left"`), this LoRA-fine-tuned Qwen3-VL-2B emits a strict JSON object:
59
 
60
- ```json
61
- {"bbox_2d": [x1, y1, x2, y2]}
62
- ```
63
 
64
- These are the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
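A minimal sketch of consuming this output downstream, assuming the model returns exactly the single-line JSON shown above. The helper name `bbox_center` is hypothetical and not part of browserground; it only converts the box into a point an agent could click.

```python
import json


def bbox_center(raw: str) -> tuple[float, float]:
    """Parse '{"bbox_2d": [x1, y1, x2, y2]}' and return the box center (x, y)."""
    box = json.loads(raw)["bbox_2d"]
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = box
    # Midpoint of the box in pixel coordinates (top-left origin).
    return ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    # Sample output string taken from the README's quick-start section.
    print(bbox_center('{"bbox_2d": [344, 612, 478, 658]}'))  # (411.0, 635.0)
```

Raising on a malformed box, rather than guessing, keeps a bad prediction from turning into a stray click.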
65
 
66
- ## Results on ScreenSpot-v2
67
 
68
- Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
69
 
70
- | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
71
- |---|---:|---:|---:|---:|---:|---:|
72
- | GPT-4o (cloud) | — | 18.3% | — | — | — | — |
73
- | SeeClick (Qwen-VL-Chat) | 9.6B | 55.1% | — | — | — | — |
74
- | ShowUI-2B | 2B | 75.5% | — | — | — | — |
75
- | UI-TARS-2B-SFT (ByteDance) | 2B | 89.5% | — | — | — | — |
76
- | OS-Atlas-Base-7B | 7B | ~91% | — | — | — | — |
77
- | **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
78
- | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
79
 
80
- - Beats **GPT-4o by 2.5×** and zero-shot Qwen3-VL by **7×** on the same benchmark
81
- - **100% strict-JSON format compliance** — no markdown fences, no commentary
82
- - Sits below ShowUI/UI-TARS at this v0.1; v0.2 (Tier 2, target ≥ 60%) on the roadmap
83
 
84
- Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
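The hit rule and the overall figures above can be sketched as follows. The function names `is_hit` and `mean_of_splits` are hypothetical, and treating a center that lies exactly on the box edge as a hit is an assumption; the README does not specify edge handling. Because each split has 100 items, the overall accuracy is the plain mean of the three split accuracies.

```python
def is_hit(pred, gt):
    """True if the center of the predicted bbox falls inside the ground-truth bbox.

    Both boxes are [x1, y1, x2, y2]. Edges count as inside (an assumption).
    """
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def mean_of_splits(split_accuracies):
    """Overall accuracy when every split has the same number of items."""
    return sum(split_accuracies) / len(split_accuracies)


if __name__ == "__main__":
    # Illustrative boxes, not taken from the benchmark.
    print(is_hit([344, 612, 478, 658], [400, 600, 500, 700]))  # True
    print(is_hit([344, 612, 478, 658], [0, 0, 100, 100]))      # False

    # Split accuracies (mobile, desktop, web) copied from the results table.
    print(round(mean_of_splits([64.0, 28.0, 44.0]), 1))  # 45.3
    print(round(mean_of_splits([7.0, 6.0, 6.0]), 1))     # 6.3
```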
85
 
86
- ## Quick start
87
 
88
- ```bash
89
- npm install -g browserground
90
- browserground parse screenshot.png --target "Submit button"
91
- # {"bbox_2d": [344, 612, 478, 658]}
92
- ```
93
 
94
- Full install + agent-stack integration: [github.com/renezander030/browserground](https://github.com/renezander030/browserground).
95
 
96
- ## Use from Python directly
97
 
98
- ```python
99
- from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
100
- from peft import PeftModel
101
- import torch
102
- from PIL import Image
103
 
104
- processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
105
- model = Qwen3VLForConditionalGeneration.from_pretrained(
106
- "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
107
- )
108
- model = PeftModel.from_pretrained(model, "renezander030/browserground")
109
- model = model.merge_and_unload(); model.eval()
110
 
111
- img = Image.open("screenshot.png").convert("RGB")
112
- messages = [
113
- {"role": "system", "content": [{"type": "text", "text":
114
- 'You are a UI-grounding model. Given a screenshot and a target description, '
115
- 'output the bounding box of the SINGLE UI element to click. Output ONLY a JSON '
116
- 'object: {"bbox_2d": [x1, y1, x2, y2]} with pixel coordinates, origin at top-left.'}]},
117
- {"role": "user", "content": [
118
- {"type": "image", "image": img},
119
- {"type": "text", "text": "Locate the element described: Submit button"},
120
- ]},
121
- ]
122
- prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
123
- inputs = processor(text=[prompt], images=[[img]], return_tensors="pt").to(model.device)
124
- out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
125
- print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
126
- ```
127
 
128
- ## Training recipe
129
 
130
- - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
131
- - **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
132
- - **Trainable params**: 17.4 M (0.81% of base)
133
- - **Data mix (12k examples)**:
134
- - OS-Atlas-Data desktop_domain (macOS): 4k
135
- - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
136
- - OS-Atlas-Data mobile_domain (UIBert): 4k
137
- - **Hyperparams**: bf16, LR 1e-4, cosine schedule, batch 1 × grad-accum 8 (effective batch 8), 1 epoch, gradient checkpointing on
138
- - **Hardware**: 1× L40S 48 GB (RunPod Secure Cloud)
139
- - **Compute cost**: ~$2 training + ~$0.50 eval
140
- - **Wall time**: ~2 hr total
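As a rough check on the recipe above, the batch settings imply about 1,500 optimizer steps for one epoch. This sketch assumes the 12k figure is exact and that a final partial batch still triggers a step; `optimizer_steps` is a hypothetical helper, not part of the training scripts.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1):
    """Number of optimizer updates for the given data size and batch settings."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs


if __name__ == "__main__":
    # 12k examples, batch 1 x grad-accum 8, 1 epoch (values from the recipe).
    print(optimizer_steps(12_000, 1, 8))  # 1500
```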
141
 
142
- Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
143
 
144
- ## Output format
145
 
146
- ```json
147
- {"bbox_2d": [x1, y1, x2, y2]}
148
- ```
149
-
150
- — a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no `<ref>` tokens. Verified 100% parseable on the eval set.
151
 
152
- ## Limitations & next
153
 
154
- - **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). v0.2 adds 8k+ web records and ~2× total data.
155
- - **Long-tail icon recognition** is weaker than text grounding.
156
- - **No mouse-action prediction** — this model only locates; doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
157
- - **English-only training data**.
158
 
159
- ## Use cases (what's this drop-in for)
160
-
161
- - **Claude Computer Use / Claude Code** screen-grounding tool calls
162
- - **OpenAI Codex CLI** screen-grounding extension
163
- - **browser-use / Skyvern** click-targeting (Python adapter in the GitHub repo)
164
- - **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
165
- - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
166
-
167
- ## Work with me
168
-
169
- This adapter is a public reference of the recipe I deliver to freelance clients: small, fast, structured-output local specialists that slot into compound-AI agent stacks and cut cloud-LLM bills without losing capability.
170
-
171
- If you need one of these, I can build it:
172
-
173
- - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
174
- - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
175
- - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
176
- - a **structured-output evaluation harness** that tells you when the local model is actually good enough to replace the cloud call in production
177
-
178
- Reach out: <https://renezander.com>
179
 
180
- ## Citation
181
-
182
- ```bibtex
183
- @misc{browserground-2026,
184
- title = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
185
- author = {Zander, René},
186
- year = {2026},
187
- url = {https://huggingface.co/renezander030/browserground}
188
- }
189
- ```
190
-
191
- ## License
192
-
193
- Apache 2.0, same as the base model `Qwen/Qwen3-VL-2B-Instruct`.
194
-
195
- ## Acknowledgements
196
-
197
- - `Qwen/Qwen3-VL-2B-Instruct` base
198
- - `OS-Copilot/OS-Atlas-Data` training data
199
- - `agentsea/wave-ui` (for the upcoming v0.2 web slice)
200
- - `OS-Copilot/ScreenSpot-v2` evaluation set
 
1
  ---
2
+ base_model: Qwen/Qwen3-VL-2B-Instruct
3
  library_name: peft
4
+ pipeline_tag: text-generation
5
  tags:
6
+ - base_model:adapter:Qwen/Qwen3-VL-2B-Instruct
7
+ - lora
8
+ - transformers
9
  ---
10
 
11
+ # Model Card for Model ID
12
+
13
+ <!-- Provide a quick summary of what the model is/does. -->
14
+
15
+
16
+
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+
21
+ <!-- Provide a longer summary of what this model is. -->
22
+
23
+
24
+
25
+ - **Developed by:** [More Information Needed]
26
+ - **Funded by [optional]:** [More Information Needed]
27
+ - **Shared by [optional]:** [More Information Needed]
28
+ - **Model type:** [More Information Needed]
29
+ - **Language(s) (NLP):** [More Information Needed]
30
+ - **License:** [More Information Needed]
31
+ - **Finetuned from model [optional]:** [More Information Needed]
32
+
33
+ ### Model Sources [optional]
34
+
35
+ <!-- Provide the basic links for the model. -->
36
+
37
+ - **Repository:** [More Information Needed]
38
+ - **Paper [optional]:** [More Information Needed]
39
+ - **Demo [optional]:** [More Information Needed]
40
+
41
+ ## Uses
42
+
43
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
+
45
+ ### Direct Use
46
+
47
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
+
49
+ [More Information Needed]
50
+
51
+ ### Downstream Use [optional]
52
+
53
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
+
55
+ [More Information Needed]
56
+
57
+ ### Out-of-Scope Use
58
+
59
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
+
61
+ [More Information Needed]
62
+
63
+ ## Bias, Risks, and Limitations
64
+
65
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
+
67
+ [More Information Needed]
68
+
69
+ ### Recommendations
70
+
71
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
+
73
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
+
75
+ ## How to Get Started with the Model
76
+
77
+ Use the code below to get started with the model.
78
+
79
+ [More Information Needed]
80
+
81
+ ## Training Details
82
+
83
+ ### Training Data
84
+
85
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
+
87
+ [More Information Needed]
88
+
89
+ ### Training Procedure
90
+
91
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
+
93
+ #### Preprocessing [optional]
94
+
95
+ [More Information Needed]
96
+
97
+
98
+ #### Training Hyperparameters
99
+
100
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
+
102
+ #### Speeds, Sizes, Times [optional]
103
+
104
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
+
106
+ [More Information Needed]
107
+
108
+ ## Evaluation
109
+
110
+ <!-- This section describes the evaluation protocols and provides the results. -->
111
+
112
+ ### Testing Data, Factors & Metrics
113
+
114
+ #### Testing Data
115
+
116
+ <!-- This should link to a Dataset Card if possible. -->
117
+
118
+ [More Information Needed]
119
+
120
+ #### Factors
121
+
122
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
+
124
+ [More Information Needed]
125
+
126
+ #### Metrics
127
+
128
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
+
130
+ [More Information Needed]
131
+
132
+ ### Results
133
+
134
+ [More Information Needed]
135
+
136
+ #### Summary
137
 
 
138
 
 
139
 
140
+ ## Model Examination [optional]
141
 
142
+ <!-- Relevant interpretability work for the model goes here -->
143
 
144
+ [More Information Needed]
145
 
146
+ ## Environmental Impact
147
 
148
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
 
150
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
 
152
+ - **Hardware Type:** [More Information Needed]
153
+ - **Hours used:** [More Information Needed]
154
+ - **Cloud Provider:** [More Information Needed]
155
+ - **Compute Region:** [More Information Needed]
156
+ - **Carbon Emitted:** [More Information Needed]
 
 
157
 
158
+ ## Technical Specifications [optional]
159
 
160
+ ### Model Architecture and Objective
161
 
162
+ [More Information Needed]
 
 
163
 
164
+ ### Compute Infrastructure
165
 
166
+ [More Information Needed]
167
 
168
+ #### Hardware
169
 
170
+ [More Information Needed]
171
 
172
+ #### Software
 
 
173
 
174
+ [More Information Needed]
175
 
176
+ ## Citation [optional]
177
 
178
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 
 
 
179
 
180
+ **BibTeX:**
181
 
182
+ [More Information Needed]
183
 
184
+ **APA:**
 
 
 
 
185
 
186
+ [More Information Needed]
 
 
 
 
 
187
 
188
+ ## Glossary [optional]
189
 
190
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
 
192
+ [More Information Needed]
193
 
194
+ ## More Information [optional]
195
 
196
+ [More Information Needed]
197
 
198
+ ## Model Card Authors [optional]
 
 
 
 
199
 
200
+ [More Information Needed]
201
 
202
+ ## Model Card Contact
 
 
 
203
 
204
+ [More Information Needed]
205
+ ### Framework versions
206
 
207
+ - PEFT 0.19.1
adapter_config.json CHANGED
@@ -16,7 +16,7 @@
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
- "lora_alpha": 32,
+ "lora_alpha": 64,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "lora_ga_config": null,
@@ -26,17 +26,17 @@
  "peft_type": "LORA",
  "peft_version": "0.19.1",
  "qalora_group_size": 16,
- "r": 16,
+ "r": 32,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
- "gate_proj",
- "q_proj",
- "up_proj",
  "k_proj",
- "o_proj",
+ "up_proj",
  "v_proj",
- "down_proj"
+ "o_proj",
+ "gate_proj",
+ "down_proj",
+ "q_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f9ea6853b82a088ff41e5a2fbbd7885982c3a6b4be1beececef1487209b34f7d
- size 69788264
+ oid sha256:3e1b9394d00a106cff556f782dc94f4e04f5a14cc9c97c5caa9043779aae6d6d
+ size 139518856
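A quick sanity check on the adapter changes in this commit, as a sketch only. It assumes the standard LoRA scaling of `lora_alpha / r` (rank-stabilized scaling not enabled; the diff does not show that field) and that the safetensors file stores float32 weights, ignoring header overhead. The helper names are hypothetical. Under those assumptions, doubling both `r` and `lora_alpha` leaves the scaling factor unchanged, the set of target modules is the same in both versions (only the order differs), and the file size roughly doubles with the rank.

```python
# Values copied from the adapter_config.json and adapter_model.safetensors diffs.
OLD = {"r": 16, "lora_alpha": 32, "size_bytes": 69_788_264}
NEW = {"r": 32, "lora_alpha": 64, "size_bytes": 139_518_856}

OLD_MODULES = ["gate_proj", "q_proj", "up_proj", "k_proj", "o_proj", "v_proj", "down_proj"]
NEW_MODULES = ["k_proj", "up_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "q_proj"]


def lora_scale(cfg):
    """Standard LoRA scaling factor alpha / r (assumes rsLoRA is off)."""
    return cfg["lora_alpha"] / cfg["r"]


def approx_params_millions(cfg, bytes_per_param=4):
    """Rough parameter count in millions, assuming float32 storage (an assumption)."""
    return cfg["size_bytes"] / bytes_per_param / 1e6


same_modules = set(OLD_MODULES) == set(NEW_MODULES)
size_ratio = NEW["size_bytes"] / OLD["size_bytes"]

if __name__ == "__main__":
    print(lora_scale(OLD), lora_scale(NEW))          # 2.0 2.0
    print(same_modules)                              # True
    print(round(size_ratio, 2))                      # 2.0
    print(round(approx_params_millions(OLD), 1))     # 17.4
    print(round(approx_params_millions(NEW), 1))     # 34.9
```

The 17.4 M estimate for the old adapter matches the trainable-parameter count stated in the removed README, which is consistent with the float32 assumption.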
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e7ab4cc91c472c42681f1d6ed40e046d602559c7da5638f817629965705e2827
+ oid sha256:776307d160109817b6b71575c48bf7ca50decf82d435feb04f35cc3037c58c46
  size 5841