renezander030/browserground

renezander030 committed 6 days ago

Commit f41a1d9 (verified) · 1 parent: c38817a

hybrid AI framing: logo, diagram, SEO tags, use-cases section

Files changed (1)

README.md +58 -55

@@ -5,10 +5,18 @@ tags:
   - ui-grounding
   - screen-grounding
   - browser-agent
   - lora
   - mlx
   - apple-silicon
   - qwen3-vl
 base_model: Qwen/Qwen3-VL-2B-Instruct
 pipeline_tag: image-text-to-text
 language:
@@ -18,11 +26,31 @@ datasets:
   - agentsea/wave-ui
 ---
-# browserground — Qwen3-VL-2B LoRA for UI grounding (v0.1)
-> **Drop-in local grounding for AI agents.**
->
-> Qwen3-VL-2B LoRA. Strict-JSON output. Sub-second inference on Apple Silicon. One-command install. Hooks into browser-use, Claude Code, and Codex CLI out of the box.
 ## What it does
@@ -34,9 +62,9 @@ Given a screenshot and a target description (`"submit form button"`, `"the red S
 — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
-## Results on ScreenSpot-v2 (point-grounding accuracy)
-Evaluated on 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
 | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
 |---|---:|---:|---:|---:|---:|---:|
@@ -48,38 +76,23 @@ Evaluated on 300 held-out items (100 per split: mobile / desktop / web). A hit =
 | **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
 | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
-Notes:
-- Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
-- **v0.1 is a one-epoch / 12k-example fine-tune** intended to validate the recipe. A v0.2 with ~50k mixed examples (including a wider web slice) is in progress with target ≥ 60%.
-### What this version beats / where it sits
-- **Beats** GPT-4o (18.3%) and zero-shot Qwen3-VL (6.3%) by 2.5× and 7× respectively on the same benchmark.
-- **Sits below** SeeClick (55.1%), ShowUI-2B (75.5%), and UI-TARS-2B-SFT (89.5%) at this v0.1.
-- **Strong on mobile** (64.0%) — competitive with much larger fine-tunes on that split, reflecting the mobile-heavy training mix.
-- **Distinctive properties**: only public Qwen3-VL-2B grounding fine-tune known; documented MLX/Apple-Silicon deployment path; emits strict bbox JSON with no markdown fences (verified 100% compliance).
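
Reviewer note: the hit criterion stated above (predicted bbox center falls inside the ground-truth bbox) can be sketched as below. This is a minimal illustration, not the author's evaluation script. The names `bbox_center`, `is_hit`, and `point_accuracy` are hypothetical, and treating the ground-truth box edges as inclusive is an assumption the README does not specify.

```python
from typing import Iterable, Sequence, Tuple

BBox = Sequence[float]  # [x1, y1, x2, y2]


def bbox_center(bbox: BBox) -> Tuple[float, float]:
    """Return the (x, y) midpoint of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def is_hit(pred: BBox, gt: BBox) -> bool:
    """True when the predicted box's center lies inside the ground-truth box.

    Edges are treated as inclusive (an assumption, not stated in the README).
    """
    cx, cy = bbox_center(pred)
    gx1, gy1, gx2, gy2 = gt
    return gx1 <= cx <= gx2 and gy1 <= cy <= gy2


def point_accuracy(pairs: Iterable[Tuple[BBox, BBox]]) -> float:
    """Fraction of (prediction, ground truth) pairs that count as hits."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_hit(pred, gt) for pred, gt in pairs) / len(pairs)
```

Both boxes must be in the same coordinate space before comparison; any rescaling between model output and dataset annotations would have to happen before calling `is_hit`.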
 ## Quick start
-### Install (one line)
 ```bash
 npm install -g browserground
-# or: bun install -g browserground
 ```
-### Use from the CLI
-```bash
-browserground parse path/to/screenshot.png --target "Submit button"
-```
-Returns:
-```json
-{"bbox_2d": [344, 612, 478, 658]}
-```
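
Reviewer note: an agent written in Python could call the CLI shown above through `subprocess`. The sketch below uses only the `parse` subcommand and `--target` flag that appear in the README. It assumes the CLI prints nothing but the JSON object on stdout and exits non-zero on failure, which the README does not confirm. The helper names `build_parse_command` and `ground` are hypothetical.

```python
import json
import subprocess
from typing import List


def build_parse_command(image_path: str, target: str) -> List[str]:
    """Build the argv for the `browserground parse` call shown in the README."""
    return ["browserground", "parse", image_path, "--target", target]


def ground(image_path: str, target: str, timeout_s: float = 30.0) -> List[int]:
    """Run the CLI and return the bbox as [x1, y1, x2, y2].

    Assumes stdout contains only the JSON object (an unverified assumption).
    Raises CalledProcessError on a non-zero exit and TimeoutExpired on timeout.
    """
    result = subprocess.run(
        build_parse_command(image_path, target),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    payload = json.loads(result.stdout)
    return payload["bbox_2d"]
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the target description contains spaces or quotes.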
-### Use from Python
 ```python
 from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
@@ -92,8 +105,7 @@ model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
 )
 model = PeftModel.from_pretrained(model, "renezander030/browserground")
-model = model.merge_and_unload()
-model.eval()
 img = Image.open("screenshot.png").convert("RGB")
 messages = [
@@ -112,26 +124,12 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
 print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
-### Use from Claude Code
-```
-/install-plugin renezander030/browserground
-```
-Then in any Claude Code conversation, mention a screenshot + a target. The plugin routes to the local CLI.
-### Use from Codex CLI
-```bash
-codex add-extension renezander030/browserground
-```
 ## Training recipe
 - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
 - **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
 - **Trainable params**: 17.4 M (0.81% of base)
-- **Data mix (12k examples for v0.1)**:
   - OS-Atlas-Data desktop_domain (macOS): 4k
   - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
   - OS-Atlas-Data mobile_domain (UIBert): 4k
@@ -140,12 +138,10 @@ codex add-extension renezander030/browserground
 - **Compute cost**: ~$2 training + ~$0.50 eval
 - **Wall time**: ~2 hr total
-Full training scripts in the [GitHub repo](https://github.com/renezander030/browserground).
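
Reviewer note: the trainable-parameter figure in the recipe can be sanity-checked with a short calculation. A LoRA adapter of rank r on a linear layer with shape (in, out) adds r × (in + out) parameters. The decoder dimensions below (hidden size 2048, MLP size 6144, 28 layers, 16 query heads, 8 key/value heads, head dimension 128) are assumptions about the base model's text decoder and are not stated in the README; only the rank and the seven target modules come from the recipe.

```python
# Assumed text-decoder dimensions (NOT stated in the README).
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_LAYERS = 28
NUM_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128

RANK = 16  # from the training recipe


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


q_out = NUM_HEADS * HEAD_DIM      # query / output projection width
kv_out = NUM_KV_HEADS * HEAD_DIM  # key / value projection width

# (in_features, out_features) for the seven adapted modules in each layer.
modules = {
    "q_proj": (HIDDEN, q_out),
    "k_proj": (HIDDEN, kv_out),
    "v_proj": (HIDDEN, kv_out),
    "o_proj": (q_out, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o) for i, o in modules.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumed dimensions the total is 17,432,576, which is consistent with the 17.4 M stated in the recipe. Dropout and alpha do not affect the parameter count.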
 ## Output format
-The model is trained to emit exactly:
 ```json
 {"bbox_2d": [x1, y1, x2, y2]}
 ```
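
Reviewer note: a caller will usually want a single click point rather than a box. The sketch below parses the documented `{"bbox_2d": [x1, y1, x2, y2]}` output, validates it, and returns the box center. The function names `parse_bbox` and `click_point` are hypothetical, and the validation rules (exactly four numbers, x2 ≥ x1, y2 ≥ y1) are assumptions about what counts as a well-formed box.

```python
import json
from typing import List, Tuple


def parse_bbox(raw: str) -> List[float]:
    """Parse the model's JSON output and return [x1, y1, x2, y2].

    Raises ValueError if the text is not valid JSON or the box is malformed.
    """
    payload = json.loads(raw.strip())  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict) or "bbox_2d" not in payload:
        raise ValueError("expected a JSON object with a 'bbox_2d' key")
    bbox = payload["bbox_2d"]
    if not isinstance(bbox, list) or len(bbox) != 4:
        raise ValueError("bbox_2d must be a list of four numbers")
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
        raise ValueError("bbox_2d entries must be numeric")
    x1, y1, x2, y2 = bbox
    if x2 < x1 or y2 < y1:
        raise ValueError("bbox_2d corners are out of order")
    return bbox


def click_point(bbox: List[float]) -> Tuple[int, int]:
    """Return the integer center of the box (floored), in the box's own coordinates."""
    x1, y1, x2, y2 = bbox
    return (int((x1 + x2) // 2), int((y1 + y2) // 2))


bbox = parse_bbox('{"bbox_2d": [344, 612, 478, 658]}')
print(click_point(bbox))  # (411, 635)
```

The click point is expressed in the same coordinate space as the returned box; if the screenshot was resized before inference, the caller would need to map it back to screen coordinates.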
@@ -154,17 +150,24 @@ The model is trained to emit exactly:
 ## Limitations & next
-- **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). Tier 2 (in progress) adds 8k+ web records and ~2× total data.
-- **Long-tail icon recognition** is weaker than text grounding (text is OCR-friendly; icons need more visual exposure).
-- **No mouse-action prediction** — this model only locates; it doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
-- **English-only training data** — non-English UI may underperform.
-- **Single-target per call** — for batch grounding of many elements on one screenshot, see `browserground parse --batch` (v0.2).
 ## Citation
 ```bibtex
 @misc{browserground-2026,
-  title  = {browserground: Qwen3-VL-2B LoRA for UI grounding},
   author = {Zander, René},
   year   = {2026},
   url    = {https://huggingface.co/renezander030/browserground}

   - ui-grounding
   - screen-grounding
   - browser-agent
+  - claude-computer-use
+  - codex
+  - hybrid-ai
+  - compound-ai
+  - specialist-model
   - lora
+  - peft
   - mlx
   - apple-silicon
   - qwen3-vl
+  - gpt-4v-alternative
+  - cost-effective-ai
 base_model: Qwen/Qwen3-VL-2B-Instruct
 pipeline_tag: image-text-to-text
 language:
   - agentsea/wave-ui
 ---
+<p align="center">
+  <img src="https://raw.githubusercontent.com/renezander030/browserground/main/assets/logo.svg" alt="browserground logo" width="120" height="120"/>
+</p>
+# browserground — Qwen3-VL-2B LoRA for hybrid AI agents (v0.1)
+> **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
+## Why this exists — the hybrid AI argument
+Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
+A general 200B-parameter LLM is overkill for "where is the Submit button?" — that's a narrow vision task. The right shape is a **hybrid one**: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's uniquely good at.
+That's exactly what browserground is — the click-grounding specialist.
+![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
+| | Pure-cloud (status quo) | Hybrid (+ browserground) |
+|---|---|---|
+| Per-screenshot cost | $0.01–0.05 | **$0** |
+| Latency | 800ms–2s round-trip | **~1.8s local** |
+| Tokens billed by cloud | 1500+ multimodal | **~40 text** |
+| Screenshots leave machine | yes | **no** |
+| Rate limits | yes | **no** |
 ## What it does
 — the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
+## Results on ScreenSpot-v2
+Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
 | Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
 |---|---:|---:|---:|---:|---:|---:|
 | **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
 | Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
+- Beats **GPT-4o by 2.5×** and zero-shot Qwen3-VL by **7×** on the same benchmark
+- **100% strict-JSON format compliance** — no markdown fences, no commentary
+- Sits below ShowUI/UI-TARS at this v0.1; v0.2 (Tier 2, target ≥ 60%) on the roadmap
+Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
 ## Quick start
 ```bash
 npm install -g browserground
+browserground parse screenshot.png --target "Submit button"
+# {"bbox_2d": [344, 612, 478, 658]}
 ```
+Full install + agent-stack integration: [github.com/renezander030/browserground](https://github.com/renezander030/browserground).
+## Use from Python directly
 ```python
 from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
     "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
 )
 model = PeftModel.from_pretrained(model, "renezander030/browserground")
+model = model.merge_and_unload(); model.eval()
 img = Image.open("screenshot.png").convert("RGB")
 messages = [
 print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
 ## Training recipe
 - **Base**: `Qwen/Qwen3-VL-2B-Instruct`
 - **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
 - **Trainable params**: 17.4 M (0.81% of base)
+- **Data mix (12k examples)**:
   - OS-Atlas-Data desktop_domain (macOS): 4k
   - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
   - OS-Atlas-Data mobile_domain (UIBert): 4k
 - **Compute cost**: ~$2 training + ~$0.50 eval
 - **Wall time**: ~2 hr total
+Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
 ## Output format
 ```json
 {"bbox_2d": [x1, y1, x2, y2]}
 ```
 ## Limitations & next
+- **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). v0.2 adds 8k+ web records and ~2× total data.
+- **Long-tail icon recognition** is weaker than text grounding.
+- **No mouse-action prediction**: this model only locates; it doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
+- **English-only training data**.
+## Use cases (what this is a drop-in for)
+- **Claude Computer Use / Claude Code** screen-grounding tool calls
+- **OpenAI Codex CLI** screen-grounding extension
+- **browser-use / Skyvern** click-targeting (Python adapter in the GitHub repo)
+- **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
+- **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
 ## Citation
 ```bibtex
 @misc{browserground-2026,
+  title  = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
   author = {Zander, René},
   year   = {2026},
   url    = {https://huggingface.co/renezander030/browserground}