renezander030/browserground

renezander030 committed 9 days ago

Commit f67eade (verified) · 1 parent: fcd91fb

v0.3 — TLDR-first card: UI-TARS/ShowUI honest comparison, JSON-diff in main table, router pattern, gap-to-89% recipe


README.md +85 -54


@@ -38,15 +38,48 @@ datasets:
 > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
-Three packaged builds, one install for every stack:
-| Build | Use it for | Install |
-|---|---|---|
-| **MLX 4-bit** ([renezander030/browserground-mlx](https://huggingface.co/renezander030/browserground-mlx)) | Apple Silicon, fastest | `npm install -g browserground` (auto) or `pip install "browserground[mlx]"` |
-| **GGUF Q4_K_M + f16 mmproj** ([renezander030/browserground-gguf](https://huggingface.co/renezander030/browserground-gguf)) | Ollama, llama.cpp | `ollama run renezander030/browserground` |
-| **PEFT LoRA** (this repo) | `transformers`, training, fine-tuning | `pip install "browserground[transformers]"` |
-## Why this exists — the hybrid AI argument
 Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
@@ -56,13 +89,20 @@ That's exactly what browserground is — the click-grounding specialist.
 ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
-| | Pure-cloud (status quo) | Hybrid (+ browserground) |
 |---|---|---|
-| Per-screenshot cost | $0.01–0.05 | **$0** |
-| Latency | 800ms–2s round-trip | **~1.5s MLX / ~1.8s transformers** |
-| Tokens billed by cloud | 1500+ multimodal | **~40 text** |
-| Screenshots leave machine | yes | **no** |
-| Rate limits | yes | **no** |
 ## What it does
@@ -74,7 +114,13 @@ Given a screenshot and a target description (`"submit form button"`, `"the red S
 — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
-## Results on ScreenSpot-v2
 Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
@@ -94,23 +140,6 @@ Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop /
 - **Beats SeeClick (9.6B) at 4.8× smaller** — 2B params, +5 pp accuracy
 - **100% strict-JSON format compliance** — no markdown fences, no `<ref>` tokens, parseable every time
-### Where browserground beats UI-TARS-2B-SFT
-UI-TARS-2B-SFT scores higher on overall accuracy (89.5%). That's a different product. Here's where this model is the better fit:
-| | browserground v0.3 | UI-TARS-2B-SFT |
-|---|---|---|
-| Base model | Qwen3-VL-2B (2025) | Qwen2-VL-2B (2024) |
-| Output format | **Strict `{"bbox_2d": [...]}` — 100% parseable** | Coord strings inside prose — needs regex/parsing |
-| Training mix | Browser + macOS + Android (web-weighted for actual agent workloads) | OS-general; no browser-platform emphasis |
-| Distribution | **CLI + Python + Ollama + MLX**; one install per stack | Server-class; no first-class Mac story |
-| Design intent | A piece of a hybrid AI stack (one specialist among many) | Standalone agent toolkit |
-| License + base lineage | Apache 2.0 on current-gen base | Apache 2.0 on a year-old base |
-Pick UI-TARS when you want a complete agent toolkit and don't mind the heavier ecosystem. Pick browserground when you're composing your own hybrid AI stack and need a small, fast, strict-JSON grounding specialist that drops into a CLI / npm / pip / Ollama workflow on a laptop.
-Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard. GPT-5.4 reference is from the BenchLM ScreenSpot-Pro leaderboard (April 2026).
 ## Quick start
 ### npm CLI
@@ -121,7 +150,7 @@ browserground parse screenshot.png --target "Submit button"
 # {"bbox_2d": [344, 612, 478, 658]}
 ```
-Daemon, HTTP server, batch, confidence, eval — all in the CLI. See the [GitHub README](https://github.com/renezander030/browserground) for details.
 ### Python
@@ -178,9 +207,24 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
 print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
-## Training recipe (v0.2 → v0.3)
-v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging**: MLX 4-bit, GGUF, Ollama, PyPI, browser-use + Skyvern adapters, batch / confidence / HTTP daemon / eval CLI surfaces. Numbers below are the v0.2 training run.
 - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
 - **Method**: LoRA rank 32, alpha 64, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
@@ -196,35 +240,22 @@ v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging
 Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
-## Output format
-```json
-{"bbox_2d": [x1, y1, x2, y2]}
-```
-— a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no `<ref>` tokens. Verified 100% parseable on the eval set.
-With `--confidence`, output extends to:
-```json
-{"bbox_2d": [x1, y1, x2, y2], "confidence": 0.92, "alternatives": [{"bbox_2d": [...]}]}
-```
 ## Use cases
 - **Claude Computer Use / Claude Code** screen-grounding tool calls
 - **OpenAI Codex CLI** screen-grounding extension
 - **browser-use** click-targeting (drop-in adapter in [GitHub plugins/browser-use/](https://github.com/renezander030/browserground/tree/main/plugins/browser-use))
 - **Skyvern** local-first grounding with cloud fallback (adapter in [GitHub plugins/skyvern/](https://github.com/renezander030/browserground/tree/main/plugins/skyvern))
-- **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
 - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
 ## Limitations & next
-- **Web and desktop accuracy** lag mobile. v0.4 will add more web/desktop training data.
-- **Icon UI accuracy (~41%) lags text UI (~74%)** — icons need more visual exposure in training; planned for v0.4.
-- **No mouse-action prediction** — this model only locates; doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
-- **English-only training data**.
 ## Work with me
@@ -235,7 +266,7 @@ If you need one of these, I can build it:
 - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
 - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
 - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
-- a **structured-output evaluation harness** that tells you when the local model is actually good enough to replace the cloud call in production
 Reach out: <https://renezander.com>

 > **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
+---
+## TL;DR — why browserground, not the other 2B grounding models
+You already know the hybrid-AI argument: don't pay frontier-vision rates for "where is the button?" There are three good 2B specialists for that job — UI-TARS, ShowUI, browserground. Here's the case for picking **this** one.
+| | browserground v0.3 | UI-TARS-2B-SFT | ShowUI-2B |
+|---|---|---|---|
+| ScreenSpot-v2 (overall) | 60.0% | **89.5%** | 75.5% |
+| **Output format** | ✅ **strict JSON `{"bbox_2d": [...]}`, 100% parseable** | ❌ coord strings inside prose — needs regex | ❌ varies by prompt |
+| **Apple Silicon native** | ✅ MLX 4-bit, Ollama, GGUF | ❌ server-class | ❌ server-class |
+| **Distribution** | ✅ npm + pip + Ollama + HF, one install per stack | HF only | HF only |
+| **Daemon / HTTP REST** | ✅ `serve --http :8401`, Ollama-shape API | ❌ | ❌ |
+| **Batch + confidence + eval CLIs** | ✅ built-in | ❌ | ❌ |
+| **Adapters** | ✅ `browser-use` Controller + Skyvern `ground_with_fallback` | ❌ DIY | ❌ DIY |
+| Base model | Qwen3-VL-2B (**2025**) | Qwen2-VL-2B (2024) | Qwen2-VL-2B (2024) |
+| Training compute | $2.20 (reproducible) | ByteDance lab scale | showlab paper scale |
+| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
+**The honest take on accuracy.** Yes, UI-TARS scores 89.5% to our 60.0% on ScreenSpot-v2 overall. That gap is a **training-data-and-compute gap**, not an architecture gap. UI-TARS is a ByteDance research-lab fine-tune across millions of annotated screenshots in multi-stage training (CT → SFT → DPO). browserground is the same base shape trained on 26k examples for 1 epoch on a sub-$5 budget ($2.20 of compute). Closing the gap to ~89% is projected to take ~$200–500 of compute and 250k records on the same recipe; that figure is a target, not a measured result.
+**Why ship at 60% anyway?** Because you don't use a 2B local model as a standalone cloud replacement. You use it as a router-stage primitive:
+```python
+from browserground_skyvern import ground_with_fallback
+bbox = ground_with_fallback(
+    screen, target,
+    confidence_threshold=0.55,
+    cloud_fallback=your_cloud_vision_fn,  # GPT-4V / Claude Vision / Gemini
+)
+```
+On representative agent workloads, ~70–80% of grounding calls clear the confidence threshold and stay local at $0. The remaining 20–30% — sub-50px icons, ambiguous targets — escalate to cloud. **Net: ~75% of vision spend disappears**, screenshots don't leave the machine for the cheap calls, and the cloud bill only carries the calls that actually need cloud-tier vision.
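As a rough sanity check on the routing economics above, the sketch below computes expected cloud spend per agent run. It assumes the local path costs $0 and that an escalated call costs the same as any other cloud call. The inputs (40 calls per run, $0.03 per call, 75% staying local) are illustrative mid-range values picked from the ranges quoted in this card, not measurements. The function name is hypothetical and not part of the browserground package.

```python
def cloud_cost_per_run(calls_per_run: int, cost_per_cloud_call: float,
                       local_fraction: float) -> float:
    """Expected cloud spend for one agent run under confidence routing.

    Assumes local calls cost $0 and every escalated call costs the same.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    escalated_calls = calls_per_run * (1.0 - local_fraction)
    return escalated_calls * cost_per_cloud_call


# Illustrative inputs taken from the ranges in this card, not measurements.
calls_per_run = 40
cost_per_cloud_call = 0.03

pure_cloud = cloud_cost_per_run(calls_per_run, cost_per_cloud_call, local_fraction=0.0)
hybrid = cloud_cost_per_run(calls_per_run, cost_per_cloud_call, local_fraction=0.75)
savings = 1.0 - hybrid / pure_cloud

print(f"pure cloud: ${pure_cloud:.2f} per run")
print(f"hybrid:     ${hybrid:.2f} per run")
print(f"savings:    {savings:.0%}")
```

Under these assumptions the savings fraction equals the fraction of calls that stay local, which is consistent with the ~75% figure above. If the escalated calls are costlier than the average call, the savings shrink accordingly.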
+That's the product. UI-TARS is the "I want one model for everything" answer; browserground is the "I want a fast, structured, MLX-native router primitive that plugs into the npm CLI / pip / Ollama" answer.
+**On per-split numbers (the 60% breakdown):** mobile-app buttons are at 78%, text-labelled targets are at ~74%, icon-only targets are at ~41%. If your agent mostly clicks labelled buttons (the common case), real-world accuracy is closer to the high end. Icons get fixed in v0.4 with more icon-rich training data.
+---
+## The hybrid AI argument — for people new to this pattern
 Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
 ![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
+| | Pure-cloud (status quo) | Hybrid (+ browserground + confidence routing) |
+|---|---|---|
+| Per-screenshot cost on the common case | $0.01–0.05 | **$0** (local), cloud only on low-confidence escalations |
+| Tokens billed by cloud per step | 1500+ multimodal | **~40 text** on the local path |
+| Screenshots leave machine | yes | **no** for the local path |
+| Rate limits | yes | **no** for the local path |
+## Three packaged builds
+| Build | Use it for | Install |
 |---|---|---|
+| **MLX 4-bit** ([renezander030/browserground-mlx](https://huggingface.co/renezander030/browserground-mlx)) | Apple Silicon, fastest | `npm install -g browserground` (auto) or `pip install "browserground[mlx]"` |
+| **GGUF Q4_K_M + f16 mmproj** ([renezander030/browserground-gguf](https://huggingface.co/renezander030/browserground-gguf)) | Ollama, llama.cpp | `ollama run renezander030/browserground` |
+| **PEFT LoRA** (this repo) | `transformers`, training, fine-tuning | `pip install "browserground[transformers]"` |
 ## What it does
 — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
+With `--confidence`, output extends to:
+```json
+{"bbox_2d": [x1, y1, x2, y2], "confidence": 0.92, "alternatives": [{"bbox_2d": [...]}]}
+```
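A minimal sketch of consuming this output downstream, using only the Python standard library. It assumes the model output arrives as a single JSON line in the shape shown above, and it ignores the `alternatives` field. The helper name `parse_grounding` is hypothetical and not part of the browserground package; the sample coordinates are the ones from the CLI example in this card.

```python
import json


def parse_grounding(raw: str) -> dict:
    """Parse one line of model output into a bbox, click point, and optional confidence."""
    data = json.loads(raw)
    x1, y1, x2, y2 = data["bbox_2d"]
    return {
        "bbox": (x1, y1, x2, y2),
        # Click the centre of the box, matching the hit criterion used in the eval.
        "click": ((x1 + x2) / 2, (y1 + y2) / 2),
        # None unless the output was produced with --confidence.
        "confidence": data.get("confidence"),
    }


plain = parse_grounding('{"bbox_2d": [344, 612, 478, 658]}')
scored = parse_grounding('{"bbox_2d": [344, 612, 478, 658], "confidence": 0.92}')

print(plain["click"], plain["confidence"])
print(scored["click"], scored["confidence"])
```

Because the output is strict JSON, `json.loads` is enough; no regex or fence stripping is needed before parsing.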
+## Full results on ScreenSpot-v2
 Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
 - **Beats SeeClick (9.6B) at 4.8× smaller** — 2B params, +5 pp accuracy
 - **100% strict-JSON format compliance** — no markdown fences, no `<ref>` tokens, parseable every time
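The hit criterion described above (predicted bbox center inside the ground-truth bbox) can be sketched as follows. This is an illustration rather than the project's eval script: the function names are hypothetical, treating the box edges as inclusive is an assumption, and the sample boxes are made up.

```python
def is_hit(pred_bbox, gt_bbox) -> bool:
    """True if the centre of the predicted box lies inside the ground-truth box."""
    px1, py1, px2, py2 = pred_bbox
    gx1, gy1, gx2, gy2 = gt_bbox
    cx, cy = (px1 + px2) / 2, (py1 + py2) / 2
    # Assumption: a centre exactly on the edge counts as inside.
    return gx1 <= cx <= gx2 and gy1 <= cy <= gy2


def point_accuracy(pairs) -> float:
    """Fraction of (prediction, ground truth) pairs that are hits."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (prediction, ground truth) pair")
    return sum(is_hit(pred, gt) for pred, gt in pairs) / len(pairs)


# Made-up boxes for illustration only, not ScreenSpot-v2 data.
examples = [
    ((344, 612, 478, 658), (340, 600, 480, 660)),  # centre (411, 635) is inside
    ((10, 10, 50, 50), (100, 100, 200, 200)),      # centre (30, 30) is outside
]
print(point_accuracy(examples))
```

Note that this metric only checks the center point, so a predicted box that is too large or too small still counts as a hit as long as its center lands on the target.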
 ## Quick start
 ### npm CLI
 # {"bbox_2d": [344, 612, 478, 658]}
 ```
+Daemon, HTTP server, batch, confidence, eval — all in the CLI. See the [GitHub README](https://github.com/renezander030/browserground) for the full surface.
 ### Python
 print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
+## What would it take to reach UI-TARS-level accuracy (~89–90%)?
+The gap is **compute + data**, not architecture. Concrete recipe to close it:
+| Lever | v0.3 (this) | v0.5+ target |
+|---|---|---|
+| Training records | 26k | 250k–500k (10–20× more) |
+| Epochs | 1 | 3–5 |
+| Adapter size | LoRA rank 32 (1.6% of base) | rank 128 or full fine-tune |
+| Icon-rich data | thin | balanced — closes the 41% icon split |
+| Training stages | SFT only | SFT → DPO with preference data |
+| Compute spend | $2.20 | ~$200–500 |
+This is reproducible — the training scripts in `imgparse-tier1` are the template. The current v0.3 is the *recipe-validated* milestone at the cheap end of the spectrum.
+## Training recipe (v0.2 LoRA — what's in this repo)
+v0.3 is the same underlying LoRA as v0.2 — what shipped in v0.3 is **packaging**: MLX 4-bit, GGUF, Ollama, PyPI, browser-use + Skyvern adapters, batch / confidence / HTTP daemon / eval CLI surfaces.
 - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
 - **Method**: LoRA rank 32, alpha 64, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
 Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
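For readers without access to the private scripts, the LoRA hyperparameters listed above can be written out as a plain settings dictionary. This is a sketch, not the author's training code: the `*_proj` module names are an assumption based on the usual Qwen-family layer naming for q/k/v/o/gate/up/down, and the keys are chosen to match the keyword arguments of `peft.LoraConfig`.

```python
# Hyperparameters as stated in this card; module names are assumed, not confirmed.
lora_settings = {
    "r": 32,                # LoRA rank
    "lora_alpha": 64,       # scaling numerator
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(f"{len(lora_settings['target_modules'])} target modules, scaling = {scaling}")

# With the peft package installed, these keys could presumably be passed as
# peft.LoraConfig(**lora_settings); that step is left out to keep this self-contained.
```

With rank 32 and alpha 64, the adapter update is scaled by 2.0 before being added to the frozen base weights.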
 ## Use cases
 - **Claude Computer Use / Claude Code** screen-grounding tool calls
 - **OpenAI Codex CLI** screen-grounding extension
 - **browser-use** click-targeting (drop-in adapter in [GitHub plugins/browser-use/](https://github.com/renezander030/browserground/tree/main/plugins/browser-use))
 - **Skyvern** local-first grounding with cloud fallback (adapter in [GitHub plugins/skyvern/](https://github.com/renezander030/browserground/tree/main/plugins/skyvern))
+- **Custom agent stacks** that need a $0/call grounding step for the common-case calls instead of GPT-4V per screenshot
 - **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
 ## Limitations & next
+- **Icon UI accuracy (~41%) lags text UI (~74%)**: icons are under-represented in the 26k training mix; a fix is planned for v0.4
+- **Web and desktop accuracy** lag mobile: more web/desktop training data is planned for v0.4
+- **No mouse-action prediction**: this model only locates; it doesn't decide click vs hover vs type. Pair it with an action predictor for full computer-use loops
+- **English-only training data**
+- **MLX latency numbers are targets**, not measurements, until independent benchmarks land in v0.4
 ## Work with me
 - a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
 - a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
 - an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
+- a **confidence-routed harness** that tells you when the local model is actually good enough to keep the call out of the cloud bill in production
 Reach out: <https://renezander.com>