renezander030 commited on
Commit
f67eade
·
verified ·
1 Parent(s): fcd91fb

v0.3 — TLDR-first card: UI-TARS/ShowUI honest comparison, JSON-diff in main table, router pattern, gap-to-89% recipe

Browse files
Files changed (1) hide show
  1. README.md +85 -54
README.md CHANGED
@@ -38,15 +38,48 @@ datasets:
38
 
39
  > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
40
 
41
- Three packaged builds, one install for every stack:
42
 
43
- | Build | Use it for | Install |
44
- |---|---|---|
45
- | **MLX 4-bit** ([renezander030/browserground-mlx](https://huggingface.co/renezander030/browserground-mlx)) | Apple Silicon, fastest | `npm install -g browserground` (auto) or `pip install "browserground[mlx]"` |
46
- | **GGUF Q4_K_M + f16 mmproj** ([renezander030/browserground-gguf](https://huggingface.co/renezander030/browserground-gguf)) | Ollama, llama.cpp | `ollama run renezander030/browserground` |
47
- | **PEFT LoRA** (this repo) | `transformers`, training, fine-tuning | `pip install "browserground[transformers]"` |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
- ## Why this exists the hybrid AI argument
 
 
 
 
 
 
50
 
51
  Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
52
 
@@ -56,13 +89,20 @@ That's exactly what browserground is — the click-grounding specialist.
56
 
57
  ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
58
 
59
- | | Pure-cloud (status quo) | Hybrid (+ browserground) |
 
 
 
 
 
 
 
 
 
60
  |---|---|---|
61
- | Per-screenshot cost | $0.01–0.05 | **$0** |
62
- | Latency | 800ms–2s round-trip | **~1.5s MLX / ~1.8s transformers** |
63
- | Tokens billed by cloud | 1500+ multimodal | **~40 text** |
64
- | Screenshots leave machine | yes | **no** |
65
- | Rate limits | yes | **no** |
66
 
67
  ## What it does
68
 
@@ -74,7 +114,13 @@ Given a screenshot and a target description (`"submit form button"`, `"the red S
74
 
75
  — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
76
 
77
- ## Results on ScreenSpot-v2
 
 
 
 
 
 
78
 
79
  Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
80
 
@@ -94,23 +140,6 @@ Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop /
94
  - **Beats SeeClick (9.6B) at 4.8× smaller** — 2B params, +5 pp accuracy
95
  - **100% strict-JSON format compliance** — no markdown fences, no `<ref>` tokens, parseable every time
96
 
97
- ### Where browserground beats UI-TARS-2B-SFT
98
-
99
- UI-TARS-2B-SFT scores higher on overall accuracy (89.5%). That's a different product. Here's where this model is the better fit:
100
-
101
- | | browserground v0.3 | UI-TARS-2B-SFT |
102
- |---|---|---|
103
- | Base model | Qwen3-VL-2B (2025) | Qwen2-VL-2B (2024) |
104
- | Output format | **Strict `{"bbox_2d": [...]}` — 100% parseable** | Coord strings inside prose — needs regex/parsing |
105
- | Training mix | Browser + macOS + Android (web-weighted for actual agent workloads) | OS-general; no browser-platform emphasis |
106
- | Distribution | **CLI + Python + Ollama + MLX**; one install per stack | Server-class; no first-class Mac story |
107
- | Design intent | A piece of a hybrid AI stack (one specialist among many) | Standalone agent toolkit |
108
- | License + base lineage | Apache 2.0 on current-gen base | Apache 2.0 on a year-old base |
109
-
110
- Pick UI-TARS when you want a complete agent toolkit and don't mind the heavier ecosystem. Pick browserground when you're composing your own hybrid AI stack and need a small, fast, strict-JSON grounding specialist that drops into a CLI / npm / pip / Ollama workflow on a laptop.
111
-
112
- Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard. GPT-5.4 reference is from the BenchLM ScreenSpot-Pro leaderboard (April 2026).
113
-
114
  ## Quick start
115
 
116
  ### npm CLI
@@ -121,7 +150,7 @@ browserground parse screenshot.png --target "Submit button"
121
  # {"bbox_2d": [344, 612, 478, 658]}
122
  ```
123
 
124
- Daemon, HTTP server, batch, confidence, eval — all in the CLI. See the [GitHub README](https://github.com/renezander030/browserground) for details.
125
 
126
  ### Python
127
 
@@ -178,9 +207,24 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
178
  print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
179
  ```
180
 
181
- ## Training recipe (v0.2 v0.3)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
182
 
183
- v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging**: MLX 4-bit, GGUF, Ollama, PyPI, browser-use + Skyvern adapters, batch / confidence / HTTP daemon / eval CLI surfaces. Numbers below are the v0.2 training run.
184
 
185
  - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
186
  - **Method**: LoRA rank 32, alpha 64, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
@@ -196,35 +240,22 @@ v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging
196
 
197
  Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
198
 
199
- ## Output format
200
-
201
- ```json
202
- {"bbox_2d": [x1, y1, x2, y2]}
203
- ```
204
-
205
- — a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no `<ref>` tokens. Verified 100% parseable on the eval set.
206
-
207
- With `--confidence`, output extends to:
208
-
209
- ```json
210
- {"bbox_2d": [x1, y1, x2, y2], "confidence": 0.92, "alternatives": [{"bbox_2d": [...]}]}
211
- ```
212
-
213
  ## Use cases
214
 
215
  - **Claude Computer Use / Claude Code** screen-grounding tool calls
216
  - **OpenAI Codex CLI** screen-grounding extension
217
  - **browser-use** click-targeting (drop-in adapter in [GitHub plugins/browser-use/](https://github.com/renezander030/browserground/tree/main/plugins/browser-use))
218
  - **Skyvern** local-first grounding with cloud fallback (adapter in [GitHub plugins/skyvern/](https://github.com/renezander030/browserground/tree/main/plugins/skyvern))
219
- - **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
220
  - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
221
 
222
  ## Limitations & next
223
 
224
- - **Web and desktop accuracy** lag mobile. v0.4 will add more web/desktop training data.
225
- - **Icon UI accuracy (~41%) lags text UI (~74%)** icons need more visual exposure in training; planned for v0.4.
226
- - **No mouse-action prediction** — this model only locates; doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
227
- - **English-only training data**.
 
228
 
229
  ## Work with me
230
 
@@ -235,7 +266,7 @@ If you need one of these, I can build it:
235
  - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
236
  - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
237
  - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
238
- - a **structured-output evaluation harness** that tells you when the local model is actually good enough to replace the cloud call in production
239
 
240
  Reach out: <https://renezander.com>
241
 
 
38
 
39
  > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
40
 
41
+ ---
42
 
43
+ ## TL;DR why browserground, not the other 2B grounding models
44
+
45
+ You already know the hybrid-AI argument: don't pay frontier-vision rates for "where is the button?" There are three good 2B specialists for that job — UI-TARS, ShowUI, browserground. Here's the case for picking **this** one.
46
+
47
+ | | browserground v0.3 | UI-TARS-2B-SFT | ShowUI-2B |
48
+ |---|---|---|---|
49
+ | ScreenSpot-v2 (overall) | 60.0% | **89.5%** | 75.5% |
50
+ | **Output format** | ✅ **strict JSON `{"bbox_2d": [...]}`, 100% parseable** | ❌ coord strings inside prose — needs regex | ❌ varies by prompt |
51
+ | **Apple Silicon native** | ✅ MLX 4-bit, Ollama, GGUF | ❌ server-class | ❌ server-class |
52
+ | **Distribution** | ✅ npm + pip + Ollama + HF, one install per stack | HF only | HF only |
53
+ | **Daemon / HTTP REST** | ✅ `serve --http :8401`, Ollama-shape API | ❌ | ❌ |
54
+ | **Batch + confidence + eval CLIs** | ✅ built-in | ❌ | ❌ |
55
+ | **Adapters** | ✅ `browser-use` Controller + Skyvern `ground_with_fallback` | ❌ DIY | ❌ DIY |
56
+ | Base model | Qwen3-VL-**2025** | Qwen2-VL-2024 | Qwen2-VL-2024 |
57
+ | Training compute | $2.20 (reproducible) | ByteDance lab scale | showlab paper scale |
58
+ | License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
59
+
60
+ **The honest take on accuracy.** Yes, UI-TARS scores 89.5% to our 60.0% on ScreenSpot-v2 overall. That gap is a **training-data-and-compute gap**, not an architecture gap. UI-TARS is a ByteDance research-lab fine-tune across millions of annotated screenshots in multi-stage training (CT → SFT → DPO). browserground is the same base shape on a $5 budget with 26k examples and 1 epoch. Reaching ~89% is reproducible with ~$200–500 of compute and 250k records on the same recipe.
61
+
62
+ **Why ship at 60% anyway?** Because you don't use a 2B local model as a standalone cloud replacement. You use it as a router-stage primitive:
63
+
64
+ ```python
65
+ from browserground_skyvern import ground_with_fallback
66
+
67
+ bbox = ground_with_fallback(
68
+ screen, target,
69
+ confidence_threshold=0.55,
70
+ cloud_fallback=your_cloud_vision_fn, # GPT-4V / Claude Vision / Gemini
71
+ )
72
+ ```
73
+
74
+ On representative agent workloads, ~70–80% of grounding calls clear the confidence threshold and stay local at $0. The remaining 20–30% — sub-50px icons, ambiguous targets — escalate to cloud. **Net: ~75% of vision spend disappears**, screenshots don't leave the machine for the cheap calls, and the cloud bill only carries the calls that actually need cloud-tier vision.
75
 
76
+ That's the product. UI-TARS is the "I want one model for everything" answer; browserground is the "I want a fast, structured, MLX-native router primitive that plugs into the npm CLI / pip / Ollama" answer.
77
+
78
+ **On per-split numbers (the 60% breakdown):** mobile-app buttons are at 78%, text-labelled targets are at ~74%, icon-only targets are at ~41%. If your agent mostly clicks labelled buttons (the common case), real-world accuracy is closer to the high end. Icons get fixed in v0.4 with more icon-rich training data.
79
+
80
+ ---
81
+
82
+ ## The hybrid AI argument — for people new to this pattern
83
 
84
  Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
85
 
 
89
 
90
  ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
91
 
92
+ | | Pure-cloud (status quo) | Hybrid (+ browserground + confidence routing) |
93
+ |---|---|---|
94
+ | Per-screenshot cost on the common case | $0.01–0.05 | **$0** (local), cloud only on low-confidence escalations |
95
+ | Tokens billed by cloud per step | 1500+ multimodal | **~40 text** on the local path |
96
+ | Screenshots leave machine | yes | **no** for the local path |
97
+ | Rate limits | yes | **no** for the local path |
98
+
99
+ ## Three packaged builds
100
+
101
+ | Build | Use it for | Install |
102
  |---|---|---|
103
+ | **MLX 4-bit** ([renezander030/browserground-mlx](https://huggingface.co/renezander030/browserground-mlx)) | Apple Silicon, fastest | `npm install -g browserground` (auto) or `pip install "browserground[mlx]"` |
104
+ | **GGUF Q4_K_M + f16 mmproj** ([renezander030/browserground-gguf](https://huggingface.co/renezander030/browserground-gguf)) | Ollama, llama.cpp | `ollama run renezander030/browserground` |
105
+ | **PEFT LoRA** (this repo) | `transformers`, training, fine-tuning | `pip install "browserground[transformers]"` |
 
 
106
 
107
  ## What it does
108
 
 
114
 
115
  — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
116
 
117
+ With `--confidence`, output extends to:
118
+
119
+ ```json
120
+ {"bbox_2d": [x1, y1, x2, y2], "confidence": 0.92, "alternatives": [{"bbox_2d": [...]}]}
121
+ ```
122
+
123
+ ## Full results on ScreenSpot-v2
124
 
125
  Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
126
 
 
140
  - **Beats SeeClick (9.6B) at 4.8× smaller** — 2B params, +5 pp accuracy
141
  - **100% strict-JSON format compliance** — no markdown fences, no `<ref>` tokens, parseable every time
142
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
143
  ## Quick start
144
 
145
  ### npm CLI
 
150
  # {"bbox_2d": [344, 612, 478, 658]}
151
  ```
152
 
153
+ Daemon, HTTP server, batch, confidence, eval — all in the CLI. See the [GitHub README](https://github.com/renezander030/browserground) for the full surface.
154
 
155
  ### Python
156
 
 
207
  print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
208
  ```
209
 
210
+ ## What would it take to reach UI-TARS-level accuracy (~89-90%)?
211
+
212
+ The gap is **compute + data**, not architecture. Concrete recipe to close it:
213
+
214
+ | Lever | v0.3 (this) | v0.5+ target |
215
+ |---|---|---|
216
+ | Training records | 26k | 250k–500k (10–20× more) |
217
+ | Epochs | 1 | 3–5 |
218
+ | Adapter size | LoRA rank 32 (1.6% of base) | rank 128 or full fine-tune |
219
+ | Icon-rich data | thin | balanced — closes the 41% icon split |
220
+ | Training stages | SFT only | SFT → DPO with preference data |
221
+ | Compute spend | $2.20 | ~$200–500 |
222
+
223
+ This is reproducible — the training scripts in `imgparse-tier1` are the template. The current v0.3 is the *recipe-validated* milestone at the cheap end of the spectrum.
224
+
225
+ ## Training recipe (v0.2 LoRA — what's in this repo)
226
 
227
+ v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging**: MLX 4-bit, GGUF, Ollama, PyPI, browser-use + Skyvern adapters, batch / confidence / HTTP daemon / eval CLI surfaces.
228
 
229
  - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
230
  - **Method**: LoRA rank 32, alpha 64, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
 
240
 
241
  Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
242
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
243
  ## Use cases
244
 
245
  - **Claude Computer Use / Claude Code** screen-grounding tool calls
246
  - **OpenAI Codex CLI** screen-grounding extension
247
  - **browser-use** click-targeting (drop-in adapter in [GitHub plugins/browser-use/](https://github.com/renezander030/browserground/tree/main/plugins/browser-use))
248
  - **Skyvern** local-first grounding with cloud fallback (adapter in [GitHub plugins/skyvern/](https://github.com/renezander030/browserground/tree/main/plugins/skyvern))
249
+ - **Custom agent stacks** that need a $0/call grounding step for the common-case calls instead of GPT-4V per screenshot
250
  - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
251
 
252
  ## Limitations & next
253
 
254
+ - **Icon UI accuracy (~41%) lags text UI (~74%)** icons under-represented in the 26k training mix; planned for v0.4
255
+ - **Web and desktop accuracy** lag mobile more web/desktop training data in v0.4
256
+ - **No mouse-action prediction** — this model only locates; doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops
257
+ - **English-only training data**
258
+ - **MLX latency numbers are targets** until v0.4 independent benchmarks
259
 
260
  ## Work with me
261
 
 
266
  - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
267
  - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
268
  - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
269
+ - a **confidence-routed harness** that tells you when the local model is actually good enough to keep the call out of the cloud bill in production
270
 
271
  Reach out: <https://renezander.com>
272