renezander030 committed on
Commit
95a3a3d
·
verified ·
1 Parent(s): ad37a63

v0.3 — packaging: MLX, GGUF, Ollama, PyPI, browser-use/Skyvern adapters

Files changed (1)
  1. README.md +199 -144
README.md CHANGED
@@ -1,207 +1,262 @@
1
  ---
2
- base_model: Qwen/Qwen3-VL-2B-Instruct
3
  library_name: peft
4
- pipeline_tag: text-generation
5
  tags:
6
- - base_model:adapter:Qwen/Qwen3-VL-2B-Instruct
7
- - lora
8
- - transformers
9
  ---
10
 
11
- # Model Card for Model ID
12
-
13
- <!-- Provide a quick summary of what the model is/does. -->
14
-
15
-
16
-
17
- ## Model Details
18
-
19
- ### Model Description
20
-
21
- <!-- Provide a longer summary of what this model is. -->
22
-
23
-
24
-
25
- - **Developed by:** [More Information Needed]
26
- - **Funded by [optional]:** [More Information Needed]
27
- - **Shared by [optional]:** [More Information Needed]
28
- - **Model type:** [More Information Needed]
29
- - **Language(s) (NLP):** [More Information Needed]
30
- - **License:** [More Information Needed]
31
- - **Finetuned from model [optional]:** [More Information Needed]
32
-
33
- ### Model Sources [optional]
34
-
35
- <!-- Provide the basic links for the model. -->
36
-
37
- - **Repository:** [More Information Needed]
38
- - **Paper [optional]:** [More Information Needed]
39
- - **Demo [optional]:** [More Information Needed]
40
-
41
- ## Uses
42
-
43
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
-
45
- ### Direct Use
46
-
47
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
-
49
- [More Information Needed]
50
-
51
- ### Downstream Use [optional]
52
-
53
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
-
55
- [More Information Needed]
56
-
57
- ### Out-of-Scope Use
58
-
59
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
-
61
- [More Information Needed]
62
-
63
- ## Bias, Risks, and Limitations
64
-
65
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
-
67
- [More Information Needed]
68
-
69
- ### Recommendations
70
-
71
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
-
73
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
-
75
- ## How to Get Started with the Model
76
-
77
- Use the code below to get started with the model.
78
-
79
- [More Information Needed]
80
-
81
- ## Training Details
82
-
83
- ### Training Data
84
-
85
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
-
87
- [More Information Needed]
88
-
89
- ### Training Procedure
90
 
91
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
 
93
- #### Preprocessing [optional]
94
 
95
- [More Information Needed]
96
 
 
 
 
 
 
97
 
98
- #### Training Hyperparameters
99
 
100
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
 
102
- #### Speeds, Sizes, Times [optional]
103
 
104
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
 
106
- [More Information Needed]
107
 
108
- ## Evaluation
 
 
 
 
 
 
109
 
110
- <!-- This section describes the evaluation protocols and provides the results. -->
111
 
112
- ### Testing Data, Factors & Metrics
113
 
114
- #### Testing Data
 
 
115
 
116
- <!-- This should link to a Dataset Card if possible. -->
117
 
118
- [More Information Needed]
119
 
120
- #### Factors
121
 
122
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
 
 
 
 
 
 
 
123
 
124
- [More Information Needed]
125
 
126
- #### Metrics
 
 
127
 
128
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
 
130
- [More Information Needed]
131
 
132
- ### Results
 
 
 
 
 
 
 
133
 
134
- [More Information Needed]
135
 
136
- #### Summary
137
 
 
138
 
 
139
 
140
- ## Model Examination [optional]
 
 
 
 
141
 
142
- <!-- Relevant interpretability work for the model goes here -->
143
 
144
- [More Information Needed]
145
 
146
- ## Environmental Impact
 
 
 
147
 
148
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
149
 
150
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
151
 
152
- - **Hardware Type:** [More Information Needed]
153
- - **Hours used:** [More Information Needed]
154
- - **Cloud Provider:** [More Information Needed]
155
- - **Compute Region:** [More Information Needed]
156
- - **Carbon Emitted:** [More Information Needed]
157
 
158
- ## Technical Specifications [optional]
159
 
160
- ### Model Architecture and Objective
 
 
 
161
 
162
- [More Information Needed]
163
 
164
- ### Compute Infrastructure
 
 
 
 
165
 
166
- [More Information Needed]
 
 
 
 
 
167
 
168
- #### Hardware
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
169
 
170
- [More Information Needed]
171
 
172
- #### Software
173
 
174
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
175
 
176
- ## Citation [optional]
177
 
178
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
 
180
- **BibTeX:**
 
 
181
 
182
- [More Information Needed]
183
 
184
- **APA:**
185
 
186
- [More Information Needed]
 
 
187
 
188
- ## Glossary [optional]
189
 
190
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
 
 
 
 
 
191
 
192
- [More Information Needed]
193
 
194
- ## More Information [optional]
 
 
 
195
 
196
- [More Information Needed]
197
 
198
- ## Model Card Authors [optional]
199
 
200
- [More Information Needed]
201
 
202
- ## Model Card Contact
 
 
 
 
 
203
 
204
- [More Information Needed]
205
- ### Framework versions
 
 
 
 
 
 
 
 
 
 
 
 
206
 
207
- - PEFT 0.19.1
 
 
 
 
 
 
1
  ---
2
+ license: apache-2.0
3
  library_name: peft
 
4
  tags:
5
+ - ui-grounding
6
+ - screen-grounding
7
+ - browser-agent
8
+ - claude-computer-use
9
+ - codex
10
+ - browser-use
11
+ - skyvern
12
+ - hybrid-ai
13
+ - compound-ai
14
+ - specialist-model
15
+ - lora
16
+ - peft
17
+ - mlx
18
+ - gguf
19
+ - ollama
20
+ - apple-silicon
21
+ - qwen3-vl
22
+ - gpt-4v-alternative
23
+ - cost-effective-ai
24
+ base_model: Qwen/Qwen3-VL-2B-Instruct
25
+ pipeline_tag: image-text-to-text
26
+ language:
27
+ - en
28
+ datasets:
29
+ - OS-Copilot/OS-Atlas-Data
30
+ - agentsea/wave-ui
31
  ---
32
 
33
+ <p align="center">
34
+ <img src="https://raw.githubusercontent.com/renezander030/browserground/main/assets/logo.svg" alt="browserground logo" width="120" height="120"/>
35
+ </p>
36
 
37
+ # browserground: Qwen3-VL-2B LoRA for hybrid AI agents (v0.3)
38
 
39
+ > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
40
 
41
+ Three packaged builds, one install for every stack:
42
 
43
+ | Build | Use it for | Install |
44
+ |---|---|---|
45
+ | **MLX 4-bit** ([renezander030/browserground-mlx](https://huggingface.co/renezander030/browserground-mlx)) | Apple Silicon, fastest | `npm install -g browserground` (auto) or `pip install "browserground[mlx]"` |
46
+ | **GGUF Q4_K_M + f16 mmproj** ([renezander030/browserground-gguf](https://huggingface.co/renezander030/browserground-gguf)) | Ollama, llama.cpp | `ollama run renezander030/browserground` |
47
+ | **PEFT LoRA** (this repo) | `transformers`, training, fine-tuning | `pip install "browserground[transformers]"` |
48
 
49
+ ## Why this exists — the hybrid AI argument
50
 
51
+ Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
52
 
53
+ A general 200B-parameter LLM is overkill for "where is the Submit button?" — that's a narrow vision task. The right shape is a **hybrid one**: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's uniquely good at.
54
 
55
+ That's exactly what browserground is: the click-grounding specialist.
56
 
57
+ ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
58
 
59
+ | | Pure-cloud (status quo) | Hybrid (+ browserground) |
60
+ |---|---|---|
61
+ | Per-screenshot cost | $0.01–0.05 | **$0** |
62
+ | Latency | 800ms–2s round-trip | **~1.5s MLX / ~1.8s transformers** |
63
+ | Tokens billed by cloud | 1500+ multimodal | **~40 text** |
64
+ | Screenshots leave machine | yes | **no** |
65
+ | Rate limits | yes | **no** |
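To make the compounding cost concrete, here is a minimal sketch that multiplies the per-screenshot price range by the number of grounding calls per agent run, using only the figures quoted above ($0.01 to $0.05 per call, 20 to 50 calls per run). The function name `cloud_cost_per_run_cents` is hypothetical, and the result is a back-of-the-envelope range, not a measured bill.

```python
def cloud_cost_per_run_cents(cents_per_call: int, calls_per_run: int) -> int:
    """Cloud grounding cost for one agent run, in whole cents."""
    return cents_per_call * calls_per_run

# Figures quoted in the README: 1 to 5 cents per call, 20 to 50 calls per run.
low_cents = cloud_cost_per_run_cents(cents_per_call=1, calls_per_run=20)
high_cents = cloud_cost_per_run_cents(cents_per_call=5, calls_per_run=50)

print(f"Pure-cloud grounding: ${low_cents / 100:.2f} to ${high_cents / 100:.2f} per agent run")
```

Integer cents are used to avoid floating-point rounding in the arithmetic.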
66
 
67
+ ## What it does
68
 
69
+ Given a screenshot and a target description (`"submit form button"`, `"the red Sign Up link"`, `"the second profile picture from the left"`), this LoRA-fine-tuned Qwen3-VL-2B emits a strict JSON object:
70
 
71
+ ```json
72
+ {"bbox_2d": [x1, y1, x2, y2]}
73
+ ```
74
 
75
+ These are the pixel coordinates of the element to click, with **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
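A minimal sketch of how a caller might consume this output: parse the JSON, sanity-check the box, and take its center as the click point. The helper names `parse_bbox` and `bbox_center` are hypothetical and are not part of the browserground package; integer rounding of the center is an assumption.

```python
import json

def parse_bbox(raw: str) -> tuple[int, int, int, int]:
    """Parse the strict JSON output into (x1, y1, x2, y2)."""
    obj = json.loads(raw)
    x1, y1, x2, y2 = obj["bbox_2d"]
    if x2 < x1 or y2 < y1:
        raise ValueError(f"degenerate bbox: {obj['bbox_2d']}")
    return x1, y1, x2, y2

def bbox_center(bbox: tuple[int, int, int, int]) -> tuple[int, int]:
    """Return the integer center of the box, a natural click point."""
    x1, y1, x2, y2 = bbox
    return (x1 + x2) // 2, (y1 + y2) // 2

# Sample output string taken from the CLI example in this README.
bbox = parse_bbox('{"bbox_2d": [344, 612, 478, 658]}')
center = bbox_center(bbox)
print(center)
```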
76
 
77
+ ## Results on ScreenSpot-v2
78
 
79
+ Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
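The hit criterion above can be sketched as follows. This is an illustrative reimplementation, not the project's eval script: the function names are hypothetical, and treating points on the ground-truth box edge as hits is an assumption.

```python
def is_hit(pred, gt) -> bool:
    """True if the center of the predicted box lies inside the ground-truth box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    # Assumption: boundary points count as inside.
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]

def accuracy(pairs) -> float:
    """Fraction of (predicted, ground-truth) box pairs that are hits."""
    hits = sum(is_hit(pred, gt) for pred, gt in pairs)
    return hits / len(pairs)

# Toy data: the first prediction is centered inside the target, the second is not.
toy_pairs = [
    ((10, 10, 30, 30), (0, 0, 40, 40)),
    ((100, 100, 120, 120), (0, 0, 40, 40)),
]
toy_accuracy = accuracy(toy_pairs)

# With 100 items per split, the per-split percentages correspond to hit counts,
# and the overall score is the pooled hit rate over all 300 items.
split_hits = {"mobile": 78, "desktop": 44, "web": 58}
overall = sum(split_hits.values()) / 300
```

Because the three splits are the same size, the pooled rate equals the plain average of the per-split accuracies.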
80
 
81
+ | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
82
+ |---|---:|---:|---:|---:|---:|---:|
83
+ | GPT-5.4 (cloud frontier) ¹ | — | 85.4% | — | — | — | — |
84
+ | SeeClick (Qwen-VL-Chat) | 9.6B | 55.1% | — | — | — | — |
85
+ | ShowUI-2B | 2B | 75.5% | — | — | — | — |
86
+ | UI-TARS-2B-SFT (ByteDance) | 2B | 89.5% | — | — | — | — |
87
+ | OS-Atlas-Base-7B | 7B | ~91% | — | — | — | — |
88
+ | **browserground v0.3** | **2B** | **60.0%** | **78.0%** | **44.0%** | **58.0%** | **100%** |
89
+ | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
90
 
91
+ ¹ The GPT-5.4 score is from the harder **ScreenSpot-Pro** benchmark, because no public ScreenSpot-v2 number exists for the 2026 cloud generation, so it is not directly comparable to the other rows. All open-source numbers in the table are ScreenSpot-v2.
92
 
93
+ - **~9.5× the zero-shot baseline** on the same benchmark (6.3% → 60.0%)
94
+ - **Beats SeeClick (9.6B) at 4.8× smaller** — 2B params, +5 pp accuracy
95
+ - **100% strict-JSON format compliance** — no markdown fences, no `<ref>` tokens, parseable every time
96
 
97
+ ### Where browserground beats UI-TARS-2B-SFT
98
 
99
+ UI-TARS-2B-SFT scores higher on overall accuracy (89.5%). That's a different product. Here's where this model is the better fit:
100
 
101
+ | | browserground v0.3 | UI-TARS-2B-SFT |
102
+ |---|---|---|
103
+ | Base model | Qwen3-VL-2B (2025) | Qwen2-VL-2B (2024) |
104
+ | Output format | **Strict `{"bbox_2d": [...]}` — 100% parseable** | Coord strings inside prose — needs regex/parsing |
105
+ | Training mix | Browser + macOS + Android (web-weighted for actual agent workloads) | OS-general; no browser-platform emphasis |
106
+ | Distribution | **CLI + Python + Ollama + MLX**; one install per stack | Server-class; no first-class Mac story |
107
+ | Design intent | A piece of a hybrid AI stack (one specialist among many) | Standalone agent toolkit |
108
+ | License + base lineage | Apache 2.0 on current-gen base | Apache 2.0 on a year-old base |
109
 
110
+ Pick UI-TARS when you want a complete agent toolkit and don't mind the heavier ecosystem. Pick browserground when you're composing your own hybrid AI stack and need a small, fast, strict-JSON grounding specialist that drops into a CLI / npm / pip / Ollama workflow on a laptop.
111
 
112
+ Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard. GPT-5.4 reference is from the BenchLM ScreenSpot-Pro leaderboard (April 2026).
113
 
114
+ ## Quick start
115
 
116
+ ### npm CLI
117
 
118
+ ```bash
119
+ npm install -g browserground
120
+ browserground parse screenshot.png --target "Submit button"
121
+ # {"bbox_2d": [344, 612, 478, 658]}
122
+ ```
123
 
124
+ Daemon, HTTP server, batch, confidence, eval — all in the CLI. See the [GitHub README](https://github.com/renezander030/browserground) for details.
125
 
126
+ ### Python
127
 
128
+ ```bash
129
+ pip install "browserground[mlx]" # Apple Silicon (recommended)
130
+ pip install "browserground[transformers]" # everywhere else
131
+ ```
132
 
133
+ ```python
134
+ from browserground import ground, click_xy
135
 
136
+ res = ground("screenshot.png", "the green Subscribe button")
137
+ print(res["bbox_2d"])
138
 
139
+ x, y = click_xy("screenshot.png", "the back arrow")
140
+ ```
 
 
 
141
 
142
+ ### Ollama
143
 
144
+ ```bash
145
+ ollama pull renezander030/browserground
146
+ ollama run renezander030/browserground "Locate: Submit button" /path/to/screen.png
147
+ ```
148
 
149
+ ### From this LoRA directly (transformers)
150
 
+ ```python
+ from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
+ from peft import PeftModel
+ import torch
+ from PIL import Image
+
+ # Load the base model, then attach the LoRA adapter and merge it into the weights.
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
+ model = Qwen3VLForConditionalGeneration.from_pretrained(
+     "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
+ )
+ model = PeftModel.from_pretrained(model, "renezander030/browserground")
+ model = model.merge_and_unload()
+ model.eval()
+
+ img = Image.open("screenshot.png").convert("RGB")
+ messages = [
+     {"role": "system", "content": [{"type": "text", "text":
+         'You are a UI-grounding model. Given a screenshot and a target description, '
+         'output the bounding box of the SINGLE UI element to click. Output ONLY a JSON '
+         'object: {"bbox_2d": [x1, y1, x2, y2]} with pixel coordinates, origin at top-left.'}]},
+     {"role": "user", "content": [
+         {"type": "image", "image": img},
+         {"type": "text", "text": "Locate the element described: Submit button"},
+     ]},
+ ]
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[prompt], images=[[img]], return_tensors="pt").to(model.device)
+ # Greedy decoding; print only the newly generated tokens (the JSON bbox).
+ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
+ ```
180
 
181
+ ## Training recipe (v0.2 → v0.3)
182
 
183
+ v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging**: MLX 4-bit, GGUF, Ollama, PyPI, browser-use + Skyvern adapters, batch / confidence / HTTP daemon / eval CLI surfaces. Numbers below are the v0.2 training run.
184
 
185
+ - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
186
+ - **Method**: LoRA rank 32, alpha 64, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
187
+ - **Trainable params**: 34.9 M (1.6% of base)
188
+ - **Data mix (26k examples)**:
189
+ - OS-Atlas-Data desktop_domain (macOS): 6k
190
+ - OS-Atlas-Data mobile_domain (aw_mobile, Android): 6k
191
+ - OS-Atlas-Data mobile_domain (UIBert): 6k
192
+ - agentsea/wave-ui (web-platform-filtered): 8k
193
+ - **Hyperparams**: bf16, LR 1e-4, cosine schedule, batch 1 × grad-accum 8 (effective batch 8), 1 epoch, gradient checkpointing on
194
+ - **Hardware**: 1× RTX A6000 48 GB (RunPod Secure Cloud)
195
+ - **Wall time**: ~4.5 hr training + ~5 min eval
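The recipe above implies a few derived quantities that are easy to check by hand. The sketch below restates the listed values as plain Python and derives the total example count, the effective batch size, the LoRA scaling factor (alpha divided by rank, the standard LoRA convention), and an approximate optimizer step count. The step count assumes every example is seen exactly once and none are dropped, which the README does not state. The dictionary keys are descriptive labels, not dataset config names.

```python
# Values copied from the recipe above.
DATA_MIX = {
    "os-atlas desktop_domain (macOS)": 6_000,
    "os-atlas mobile_domain (aw_mobile, Android)": 6_000,
    "os-atlas mobile_domain (UIBert)": 6_000,
    "wave-ui (web-platform-filtered)": 8_000,
}
LORA_RANK = 32
LORA_ALPHA = 64
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 8
EPOCHS = 1

total_examples = sum(DATA_MIX.values())
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
lora_scaling = LORA_ALPHA / LORA_RANK  # multiplier applied to the low-rank update

# Assumption: no examples are filtered or dropped during training.
optimizer_steps = (total_examples * EPOCHS) // effective_batch

print(total_examples, effective_batch, lora_scaling, optimizer_steps)
```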
196
 
197
+ Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
198
 
199
+ ## Output format
200
 
201
+ ```json
202
+ {"bbox_2d": [x1, y1, x2, y2]}
203
+ ```
204
 
205
+ The output is a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no `<ref>` tokens. Verified 100% parseable on the eval set.
206
 
207
+ With `--confidence`, output extends to:
208
 
209
+ ```json
210
+ {"bbox_2d": [x1, y1, x2, y2], "confidence": 0.92, "alternatives": [{"bbox_2d": [...]}]}
211
+ ```
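One way a hybrid stack could use the confidence field is as a routing signal: keep the local result when confidence is high, and fall back to a cloud model otherwise. The sketch below is an assumption about how a caller might do this, not documented browserground behavior. The function name `should_fall_back` and the 0.5 threshold are hypothetical, and output without a `confidence` key is treated as a plain, accepted result.

```python
import json

def should_fall_back(raw: str, threshold: float = 0.5) -> bool:
    """Decide whether to route this grounding request to a fallback model."""
    try:
        obj = json.loads(raw)
        box = obj["bbox_2d"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return True  # unparseable or malformed output: do not trust it
    if not (isinstance(box, list) and len(box) == 4):
        return True
    # Assumption: a missing confidence field means plain output, accepted as-is.
    return obj.get("confidence", 1.0) < threshold

confident = '{"bbox_2d": [344, 612, 478, 658], "confidence": 0.92}'
uncertain = '{"bbox_2d": [344, 612, 478, 658], "confidence": 0.2}'

print(should_fall_back(confident), should_fall_back(uncertain))
```

A sensible threshold depends on the application and would need to be tuned against real screenshots.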
212
 
213
+ ## Use cases
214
 
215
+ - **Claude Computer Use / Claude Code** screen-grounding tool calls
216
+ - **OpenAI Codex CLI** screen-grounding extension
217
+ - **browser-use** click-targeting (drop-in adapter in [GitHub plugins/browser-use/](https://github.com/renezander030/browserground/tree/main/plugins/browser-use))
218
+ - **Skyvern** local-first grounding with cloud fallback (adapter in [GitHub plugins/skyvern/](https://github.com/renezander030/browserground/tree/main/plugins/skyvern))
219
+ - **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
220
+ - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
221
 
222
+ ## Limitations & next
223
 
224
+ - **Web (58.0%) and desktop (44.0%) accuracy** lag mobile (78.0%). v0.4 will add more web/desktop training data.
225
+ - **Icon UI accuracy (~41%) lags text UI (~74%)** — icons need more visual exposure in training; planned for v0.4.
226
+ - **No mouse-action prediction**: this model only locates the element; it does not decide between click, hover, and type. Pair it with an action predictor for full computer-use loops.
227
+ - **English-only training data**.
228
 
229
+ ## Work with me
230
 
231
+ This adapter is a public reference of the recipe I deliver to freelance clients: small, fast, structured-output local specialists that slot into compound-AI agent stacks and cut cloud-LLM bills without losing capability.
232
 
233
+ If you need one of these, I can build it:
234
 
235
+ - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
236
+ - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
237
+ - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
238
+ - a **structured-output evaluation harness** that tells you when the local model is actually good enough to replace the cloud call in production
239
+
240
+ Reach out: <https://renezander.com>
241
 
242
+ ## Citation
243
+
244
+ ```bibtex
245
+ @misc{browserground-2026,
246
+ title = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
247
+ author = {Zander, René},
248
+ year = {2026},
249
+ url = {https://huggingface.co/renezander030/browserground}
250
+ }
251
+ ```
252
+
253
+ ## License
254
+
255
+ Apache 2.0, same as the base model `Qwen/Qwen3-VL-2B-Instruct`.
256
 
257
+ ## Acknowledgements
258
+
259
+ - `Qwen/Qwen3-VL-2B-Instruct` base
260
+ - `OS-Copilot/OS-Atlas-Data` training data
261
+ - `agentsea/wave-ui` web slice
262
+ - `OS-Copilot/ScreenSpot-v2` evaluation set