renezander030 committed on
Commit
f41a1d9
·
verified ·
1 Parent(s): c38817a

hybrid AI framing: logo, diagram, SEO tags, use-cases section

Files changed (1)
  1. README.md +58 -55
README.md CHANGED
@@ -5,10 +5,18 @@ tags:
5
  - ui-grounding
6
  - screen-grounding
7
  - browser-agent
 
 
 
 
 
8
  - lora
 
9
  - mlx
10
  - apple-silicon
11
  - qwen3-vl
 
 
12
  base_model: Qwen/Qwen3-VL-2B-Instruct
13
  pipeline_tag: image-text-to-text
14
  language:
@@ -18,11 +26,31 @@ datasets:
18
  - agentsea/wave-ui
19
  ---
20
 
21
- # browserground — Qwen3-VL-2B LoRA for UI grounding (v0.1)
 
 
22
 
23
- > **Drop-in local grounding for AI agents.**
24
- >
25
- > Qwen3-VL-2B LoRA. Strict-JSON output. Sub-second inference on Apple Silicon. One-command install. Hooks into browser-use, Claude Code, and Codex CLI out of the box.
26
 
27
  ## What it does
28
 
@@ -34,9 +62,9 @@ Given a screenshot and a target description (`"submit form button"`, `"the red S
34
 
35
  — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
36
 
37
- ## Results on ScreenSpot-v2 (point-grounding accuracy)
38
 
39
- Evaluated on 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
40
 
41
  | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
42
  |---|---:|---:|---:|---:|---:|---:|
@@ -48,38 +76,23 @@ Evaluated on 300 held-out items (100 per split: mobile / desktop / web). A hit =
48
  | **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
49
  | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
50
 
51
- Notes:
52
- - Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
53
- - **v0.1 is a one-epoch / 12k-example fine-tune** intended to validate the recipe. A v0.2 with ~50k mixed examples (including a wider web slice) is in progress with target ≥ 60%.
54
 
55
- ### What this version beats / where it sits
56
-
57
- - **Beats** GPT-4o (18.3%) and zero-shot Qwen3-VL (6.3%) by 2.5× and 7× respectively on the same benchmark.
58
- - **Sits below** SeeClick (55.1%), ShowUI-2B (75.5%), and UI-TARS-2B-SFT (89.5%) at this v0.1.
59
- - **Strong on mobile** (64.0%) — competitive with much larger fine-tunes on that split, reflecting the mobile-heavy training mix.
60
- - **Distinctive properties**: only public Qwen3-VL-2B grounding fine-tune known; documented MLX/Apple-Silicon deployment path; emits strict bbox JSON with no markdown fences (verified 100% compliance).
61
 
62
  ## Quick start
63
 
64
- ### Install (one line)
65
-
66
  ```bash
67
  npm install -g browserground
68
- # or: bun install -g browserground
 
69
  ```
70
 
71
- ### Use from the CLI
72
 
73
- ```bash
74
- browserground parse path/to/screenshot.png --target "Submit button"
75
- ```
76
-
77
- Returns:
78
- ```json
79
- {"bbox_2d": [344, 612, 478, 658]}
80
- ```
81
-
82
- ### Use from Python
83
 
84
  ```python
85
  from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
@@ -92,8 +105,7 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
92
  "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
93
  )
94
  model = PeftModel.from_pretrained(model, "renezander030/browserground")
95
- model = model.merge_and_unload()
96
- model.eval()
97
 
98
  img = Image.open("screenshot.png").convert("RGB")
99
  messages = [
@@ -112,26 +124,12 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
112
  print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
113
  ```
114
 
115
- ### Use from Claude Code
116
-
117
- ```
118
- /install-plugin renezander030/browserground
119
- ```
120
-
121
- Then in any Claude Code conversation, mention a screenshot + a target. The plugin routes to the local CLI.
122
-
123
- ### Use from Codex CLI
124
-
125
- ```bash
126
- codex add-extension renezander030/browserground
127
- ```
128
-
129
  ## Training recipe
130
 
131
  - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
132
  - **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
133
  - **Trainable params**: 17.4 M (0.81% of base)
134
- - **Data mix (12k examples for v0.1)**:
135
  - OS-Atlas-Data desktop_domain (macOS): 4k
136
  - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
137
  - OS-Atlas-Data mobile_domain (UIBert): 4k
@@ -140,12 +138,10 @@ codex add-extension renezander030/browserground
140
  - **Compute cost**: ~$2 training + ~$0.50 eval
141
  - **Wall time**: ~2 hr total
142
 
143
- Full training scripts in the [GitHub repo](https://github.com/renezander030/browserground).
144
 
145
  ## Output format
146
 
147
- The model is trained to emit exactly:
148
-
149
  ```json
150
  {"bbox_2d": [x1, y1, x2, y2]}
151
  ```
@@ -154,17 +150,24 @@ The model is trained to emit exactly:
154
 
155
  ## Limitations & next
156
 
157
- - **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). Tier 2 (in progress) adds 8k+ web records and ~2× total data.
158
- - **Long-tail icon recognition** is weaker than text grounding (text is OCR-friendly; icons need more visual exposure).
159
- - **No mouse-action prediction** — this model only locates; it doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
160
- - **English-only training data** — non-English UI may underperform.
161
- - **Single-target per call** — for batch grounding of many elements on one screenshot, see `browserground parse --batch` (v0.2).
 
 
 
 
 
 
 
162
 
163
  ## Citation
164
 
165
  ```bibtex
166
  @misc{browserground-2026,
167
- title = {browserground: Qwen3-VL-2B LoRA for UI grounding},
168
  author = {Zander, René},
169
  year = {2026},
170
  url = {https://huggingface.co/renezander030/browserground}
 
5
  - ui-grounding
6
  - screen-grounding
7
  - browser-agent
8
+ - claude-computer-use
9
+ - codex
10
+ - hybrid-ai
11
+ - compound-ai
12
+ - specialist-model
13
  - lora
14
+ - peft
15
  - mlx
16
  - apple-silicon
17
  - qwen3-vl
18
+ - gpt-4v-alternative
19
+ - cost-effective-ai
20
  base_model: Qwen/Qwen3-VL-2B-Instruct
21
  pipeline_tag: image-text-to-text
22
  language:
 
26
  - agentsea/wave-ui
27
  ---
28
 
29
+ <p align="center">
30
+ <img src="https://raw.githubusercontent.com/renezander030/browserground/main/assets/logo.svg" alt="browserground logo" width="120" height="120"/>
31
+ </p>
32
 
33
+ # browserground: Qwen3-VL-2B LoRA for hybrid AI agents (v0.1)
34
+
35
+ > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
36
+
37
+ ## Why this exists — the hybrid AI argument
38
+
39
+ Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
40
+
41
+ A general 200B-parameter LLM is overkill for "where is the Submit button?", which is a narrow vision task. The right shape is a **hybrid** one: cheap, fast, local specialist models for the narrow tasks they handle better, and the cloud LLM only for the planning and reasoning it is uniquely good at.
42
+
43
+ That's exactly what browserground is — the click-grounding specialist.
44
+
45
+ ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
46
+
47
+ | | Pure-cloud (status quo) | Hybrid (+ browserground) |
48
+ |---|---|---|
49
+ | Per-screenshot cost | $0.01–0.05 | **$0** |
50
+ | Latency | 800ms–2s round-trip | **~1.8s local** |
51
+ | Tokens billed by cloud | 1500+ multimodal | **~40 text** |
52
+ | Screenshots leave machine | yes | **no** |
53
+ | Rate limits | yes | **no** |
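
To make the compounding concrete, the pure-cloud ranges quoted above can be multiplied out per agent run. The sketch below is only arithmetic on the figures in this section (20–50 grounding calls per run, $0.01–0.05 and 800 ms–2 s per cloud call); the constant and function names are illustrative and not part of browserground.

```python
# Back-of-envelope overhead of routing every screenshot to a cloud model.
# All ranges are the (low, high) figures quoted in the README text above.
CALLS_PER_RUN = (20, 50)        # grounding calls in one agent run
CLOUD_COST_USD = (0.01, 0.05)   # USD per multimodal call
CLOUD_LATENCY_S = (0.8, 2.0)    # seconds of round-trip latency per call


def cloud_overhead_per_run():
    """Return ((cost_low, cost_high), (latency_low, latency_high)) per run."""
    lo_calls, hi_calls = CALLS_PER_RUN
    cost = (lo_calls * CLOUD_COST_USD[0], hi_calls * CLOUD_COST_USD[1])
    latency = (lo_calls * CLOUD_LATENCY_S[0], hi_calls * CLOUD_LATENCY_S[1])
    return cost, latency


cost, latency = cloud_overhead_per_run()
print(f"grounding cost per run:    ${cost[0]:.2f} to ${cost[1]:.2f}")
print(f"grounding latency per run: {latency[0]:.0f}s to {latency[1]:.0f}s")
```

Under these assumptions the grounding step alone costs roughly $0.20 to $2.50 and adds 16 to 100 seconds per run before any planning tokens are counted.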
54
 
55
  ## What it does
56
 
 
62
 
63
  — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
64
 
65
+ ## Results on ScreenSpot-v2
66
 
67
+ Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
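
The hit criterion above can be sketched in a few lines. This is a minimal illustration, not the evaluation script used for the table: it assumes both boxes are already in `[x1, y1, x2, y2]` form in the same pixel space and that points on the box edge count as hits, and the helper names are hypothetical.

```python
def bbox_center(bbox):
    """Center (cx, cy) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def is_hit(pred_bbox, gt_bbox):
    """True if the predicted box's center lies inside the ground-truth box.

    Assumption: edges are inclusive and both boxes share one coordinate space.
    """
    cx, cy = bbox_center(pred_bbox)
    gx1, gy1, gx2, gy2 = gt_bbox
    return gx1 <= cx <= gx2 and gy1 <= cy <= gy2


def accuracy(pairs):
    """Fraction of (pred_bbox, gt_bbox) pairs that are hits; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_hit(pred, gt) for pred, gt in pairs) / len(pairs)
```

Note that this metric only scores the center point, so a prediction much larger or smaller than the target still counts as long as its center lands inside the ground-truth box.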
68
 
69
  | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
70
  |---|---:|---:|---:|---:|---:|---:|
 
76
  | **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
77
  | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
78
 
79
+ - Beats **GPT-4o by 2.5×** and zero-shot Qwen3-VL by **7×** on the same benchmark
80
+ - **100% strict-JSON format compliance**: no markdown fences, no commentary
81
+ - Sits below ShowUI/UI-TARS at this v0.1; v0.2 (Tier 2, target ≥ 60%) on the roadmap
82
 
83
+ Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
 
 
 
 
 
84
 
85
  ## Quick start
86
 
 
 
87
  ```bash
88
  npm install -g browserground
89
+ browserground parse screenshot.png --target "Submit button"
90
+ # {"bbox_2d": [344, 612, 478, 658]}
91
  ```
92
 
93
+ Full install + agent-stack integration: [github.com/renezander030/browserground](https://github.com/renezander030/browserground).
94
 
95
+ ## Use from Python directly
96
 
97
  ```python
98
  from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
 
105
  "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
106
  )
107
  model = PeftModel.from_pretrained(model, "renezander030/browserground")
108
+ model = model.merge_and_unload(); model.eval()
 
109
 
110
  img = Image.open("screenshot.png").convert("RGB")
111
  messages = [
 
124
  print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
125
  ```
126
 
127
  ## Training recipe
128
 
129
  - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
130
  - **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
131
  - **Trainable params**: 17.4 M (0.81% of base)
132
+ - **Data mix (12k examples)**:
133
  - OS-Atlas-Data desktop_domain (macOS): 4k
134
  - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
135
  - OS-Atlas-Data mobile_domain (UIBert): 4k
 
138
  - **Compute cost**: ~$2 training + ~$0.50 eval
139
  - **Wall time**: ~2 hr total
140
 
141
+ Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
142
 
143
  ## Output format
144
 
 
 
145
  ```json
146
  {"bbox_2d": [x1, y1, x2, y2]}
147
  ```
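
A caller typically needs a single click point rather than a box. The sketch below parses the output shape shown above and returns the box center; it assumes the coordinates are already in the screenshot's pixel space, as this README states, and the function names `parse_bbox` and `click_point` are illustrative rather than part of the browserground package.

```python
import json


def parse_bbox(raw):
    """Parse '{"bbox_2d": [x1, y1, x2, y2]}' and return (x1, y1, x2, y2).

    Raises ValueError (or a JSON/Key error) if the payload is malformed.
    """
    bbox = json.loads(raw)["bbox_2d"]
    if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
        raise ValueError(f"expected four numbers, got {bbox!r}")
    x1, y1, x2, y2 = bbox
    if x2 < x1 or y2 < y1:
        raise ValueError(f"degenerate box: {bbox!r}")
    return x1, y1, x2, y2


def click_point(raw):
    """Center of the parsed box, floored to whole pixels."""
    x1, y1, x2, y2 = parse_bbox(raw)
    return ((x1 + x2) // 2, (y1 + y2) // 2)


print(click_point('{"bbox_2d": [344, 612, 478, 658]}'))  # (411, 635)
```

Validating the payload before clicking keeps a malformed generation from turning into a click at an arbitrary location.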
 
150
 
151
  ## Limitations & next
152
 
153
+ - **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). v0.2 adds 8k+ web records and ~2× total data.
154
+ - **Long-tail icon recognition** is weaker than text grounding.
155
+ - **No mouse-action prediction**: this model only locates; it does not decide between click, hover, and type. Pair with an action predictor for full computer-use loops.
156
+ - **English-only training data**.
157
+
158
+ ## Use cases (what this is a drop-in for)
159
+
160
+ - **Claude Computer Use / Claude Code** screen-grounding tool calls
161
+ - **OpenAI Codex CLI** screen-grounding extension
162
+ - **browser-use / Skyvern** click-targeting (Python adapter in the GitHub repo)
163
+ - **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
164
+ - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
165
 
166
  ## Citation
167
 
168
  ```bibtex
169
  @misc{browserground-2026,
170
+ title = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
171
  author = {Zander, René},
172
  year = {2026},
173
  url = {https://huggingface.co/renezander030/browserground}