v0.2 — Tier 2 LoRA r32, 26k mixed-domain examples incl. browser, ScreenSpot-v2 60.0%

Files changed (4)

README.md +168 -161
adapter_config.json +7 -7
adapter_model.safetensors +2 -2
training_args.bin +1 -1

README.md CHANGED

@@ -1,200 +1,207 @@
 ---
-license: apache-2.0
 library_name: peft
 tags:
-  - ui-grounding
-  - screen-grounding
-  - browser-agent
-  - claude-computer-use
-  - codex
-  - hybrid-ai
-  - compound-ai
-  - specialist-model
-  - lora
-  - peft
-  - mlx
-  - apple-silicon
-  - qwen3-vl
-  - gpt-4v-alternative
-  - cost-effective-ai
-base_model: Qwen/Qwen3-VL-2B-Instruct
-pipeline_tag: image-text-to-text
-language:
-  - en
-datasets:
-  - OS-Copilot/OS-Atlas-Data
-  - agentsea/wave-ui
 ---
-<p align="center">
-  <img src="https://raw.githubusercontent.com/renezander030/browserground/main/assets/logo.svg" alt="browserground logo" width="120" height="120"/>
-</p>
-# browserground — Qwen3-VL-2B LoRA for hybrid AI agents (v0.1)
-> **The local UI-grounding specialist for hybrid AI agents.** Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
-## Why this exists — the hybrid AI argument
-Today, most AI agents route **every** screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) just to find click coordinates. That's a $0.01–0.05 multimodal call adding 800ms–2s of latency, repeated 20–50× per agent run. Cost and latency compound. Screenshots full of private UI leave your machine.
-A general 200B-parameter LLM is overkill for "where is the Submit button?" — that's a narrow vision task. The right shape is a **hybrid one**: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's uniquely good at.
-That's exactly what browserground is — the click-grounding specialist.
-![hybrid architecture](https://raw.githubusercontent.com/renezander030/browserground/main/assets/hybrid-architecture.svg)
-| | Pure-cloud (status quo) | Hybrid (+ browserground) |
-|---|---|---|
-| Per-screenshot cost | $0.01–0.05 | **$0** |
-| Latency | 800ms–2s round-trip | **~1.8s local** |
-| Tokens billed by cloud | 1500+ multimodal | **~40 text** |
-| Screenshots leave machine | yes | **no** |
-| Rate limits | yes | **no** |
-## What it does
-Given a screenshot and a target description (`"submit form button"`, `"the red Sign Up link"`, `"the second profile picture from the left"`), this LoRA-fine-tuned Qwen3-VL-2B emits a strict JSON object:
-```json
-{"bbox_2d": [x1, y1, x2, y2]}
-```
-— the pixel coordinates of the element to click. **100% format compliance** on the held-out evaluation. Drop it into any browser-agent / screen-automation pipeline that needs to ground language → click target.
-## Results on ScreenSpot-v2
-Point-grounding accuracy, 300 held-out items (100 per split: mobile / desktop / web). A hit = predicted bbox center falls inside the ground-truth bbox.
-| Model | Params | Overall | Mobile | Desktop | Web | Format-OK |
-|---|---:|---:|---:|---:|---:|---:|
-| GPT-4o (cloud) | — | 18.3% | — | — | — | — |
-| SeeClick (Qwen-VL-Chat) | 9.6B | 55.1% | — | — | — | — |
-| ShowUI-2B | 2B | 75.5% | — | — | — | — |
-| UI-TARS-2B-SFT (ByteDance) | 2B | 89.5% | — | — | — | — |
-| OS-Atlas-Base-7B | 7B | ~91% | — | — | — | — |
-| **browserground v0.1 (this model)** | **2B** | **45.3%** | **64.0%** | **28.0%** | **44.0%** | **100%** |
-| Qwen3-VL-2B-Instruct (zero-shot baseline) | 2B | 6.3% | 7.0% | 6.0% | 6.0% | 100% |
-- Beats **GPT-4o by 2.5×** and zero-shot Qwen3-VL by **7×** on the same benchmark
-- **100% strict-JSON format compliance** — no markdown fences, no commentary
-- Sits below ShowUI/UI-TARS at this v0.1; v0.2 (Tier 2, target ≥ 60%) on the roadmap
-Numbers for SeeClick / ShowUI / UI-TARS / OS-Atlas are from the OS-Atlas paper's reported ScreenSpot-v2 leaderboard.
-## Quick start
-```bash
-npm install -g browserground
-browserground parse screenshot.png --target "Submit button"
-# {"bbox_2d": [344, 612, 478, 658]}
-```
-Full install + agent-stack integration: [github.com/renezander030/browserground](https://github.com/renezander030/browserground).
-## Use from Python directly
-```python
-from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
-from peft import PeftModel
-import torch
-from PIL import Image
-processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
-model = Qwen3VLForConditionalGeneration.from_pretrained(
-    "Qwen/Qwen3-VL-2B-Instruct", dtype=torch.bfloat16, device_map="auto"
-)
-model = PeftModel.from_pretrained(model, "renezander030/browserground")
-model = model.merge_and_unload(); model.eval()
-img = Image.open("screenshot.png").convert("RGB")
-messages = [
-    {"role": "system", "content": [{"type": "text", "text":
-        'You are a UI-grounding model. Given a screenshot and a target description, '
-        'output the bounding box of the SINGLE UI element to click. Output ONLY a JSON '
-        'object: {"bbox_2d": [x1, y1, x2, y2]} with pixel coordinates, origin at top-left.'}]},
-    {"role": "user", "content": [
-        {"type": "image", "image": img},
-        {"type": "text", "text": "Locate the element described: Submit button"},
-    ]},
-]
-prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=[prompt], images=[[img]], return_tensors="pt").to(model.device)
-out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
-print(processor.tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
-```
-## Training recipe
-- **Base**: `Qwen/Qwen3-VL-2B-Instruct`
-- **Method**: LoRA rank 16, alpha 32, dropout 0.05, on all 7 linear modules of the LM (q/k/v/o/gate/up/down)
-- **Trainable params**: 17.4 M (0.81% of base)
-- **Data mix (12k examples)**:
-  - OS-Atlas-Data desktop_domain (macOS): 4k
-  - OS-Atlas-Data mobile_domain (aw_mobile, Android): 4k
-  - OS-Atlas-Data mobile_domain (UIBert): 4k
-- **Hyperparams**: bf16, LR 1e-4, cosine schedule, batch 1 × grad-accum 8 (effective batch 8), 1 epoch, gradient checkpointing on
-- **Hardware**: 1× L40S 48 GB (RunPod Secure Cloud)
-- **Compute cost**: ~$2 training + ~$0.50 eval
-- **Wall time**: ~2 hr total
-Full training scripts (private repo, request access): [renezander030/imgparse-tier1](https://github.com/renezander030/imgparse-tier1).
-## Output format
-```json
-{"bbox_2d": [x1, y1, x2, y2]}
-```
-— a single-line JSON object with pixel coordinates (top-left origin). No markdown fences, no commentary, no `<ref>` tokens. Verified 100% parseable on the eval set.
-## Limitations & next
-- **Web and desktop accuracy** lag mobile (we trained primarily on macOS + mobile UI). v0.2 adds 8k+ web records and ~2× total data.
-- **Long-tail icon recognition** is weaker than text grounding.
-- **No mouse-action prediction** — this model only locates; doesn't decide click vs hover vs type. Pair with an action predictor for full computer-use loops.
-- **English-only training data**.
-## Use cases (what's this drop-in for)
-- **Claude Computer Use / Claude Code** screen-grounding tool calls
-- **OpenAI Codex CLI** screen-grounding extension
-- **browser-use / Skyvern** click-targeting (Python adapter in the GitHub repo)
-- **Custom agent stacks** that need a $0/call grounding step instead of GPT-4V per screenshot
-- **Self-hosted compound-AI systems** with a routing layer (specialist model for grounding, general LLM for planning)
-## Work with me
-This adapter is a public reference of the recipe I deliver to freelance clients: small, fast, structured-output local specialists that slot into compound-AI agent stacks and cut cloud-LLM bills without losing capability.
-If you need one of these, I can build it:
-- a **UI-grounding model trained on your own product's screenshots** — your dashboard, your app, your customer interfaces — for higher recall on the elements your agents actually click
-- a **hybrid agent architecture** that routes narrow tasks (grounding, OCR, classification, embedding, extraction) to local specialist models and reserves cloud frontier LLMs for the reasoning that actually needs them
-- an **on-prem agent deployment** — Apple Silicon (MLX), CUDA box, or your existing K8s — with no screenshots leaving your infrastructure
-- a **structured-output evaluation harness** that tells you when the local model is actually good enough to replace the cloud call in production
-Reach out: <https://renezander.com>
-## Citation
-```bibtex
-@misc{browserground-2026,
-  title  = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
-  author = {Zander, René},
-  year   = {2026},
-  url    = {https://huggingface.co/renezander030/browserground}
-}
-```
-## License
-Apache 2.0, same as the base model `Qwen/Qwen3-VL-2B-Instruct`.
-## Acknowledgements
-- `Qwen/Qwen3-VL-2B-Instruct` base
-- `OS-Copilot/OS-Atlas-Data` training data
-- `agentsea/wave-ui` (for the upcoming v0.2 web slice)
-- `OS-Copilot/ScreenSpot-v2` evaluation set

 ---
+base_model: Qwen/Qwen3-VL-2B-Instruct
 library_name: peft
+pipeline_tag: text-generation
 tags:
+- base_model:adapter:Qwen/Qwen3-VL-2B-Instruct
+- lora
+- transformers
 ---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

adapter_config.json CHANGED

@@ -16,7 +16,7 @@
   "layers_pattern": null,
   "layers_to_transform": null,
   "loftq_config": {},
-  "lora_alpha": 32,
   "lora_bias": false,
   "lora_dropout": 0.05,
   "lora_ga_config": null,
@@ -26,17 +26,17 @@
   "peft_type": "LORA",
   "peft_version": "0.19.1",
   "qalora_group_size": 16,
-  "r": 16,
   "rank_pattern": {},
   "revision": null,
   "target_modules": [
-    "gate_proj",
-    "q_proj",
-    "up_proj",
     "k_proj",
-    "o_proj",
     "v_proj",
-    "down_proj"
   ],
   "target_parameters": null,
   "task_type": "CAUSAL_LM",

   "layers_pattern": null,
   "layers_to_transform": null,
   "loftq_config": {},
+  "lora_alpha": 64,
   "lora_bias": false,
   "lora_dropout": 0.05,
   "lora_ga_config": null,
   "peft_type": "LORA",
   "peft_version": "0.19.1",
   "qalora_group_size": 16,
+  "r": 32,
   "rank_pattern": {},
   "revision": null,
   "target_modules": [
     "k_proj",
+    "up_proj",
     "v_proj",
+    "o_proj",
+    "gate_proj",
+    "down_proj",
+    "q_proj"
   ],
   "target_parameters": null,
   "task_type": "CAUSAL_LM",
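
Reading the two sides of this hunk together: rank and alpha both double, and `target_modules` lists the same seven projections in a different order. A minimal sketch of that comparison follows, with values copied from the diff above. The `lora_scaling` helper is hypothetical, and it assumes the standard `lora_alpha / r` scaling (that is, rsLoRA is not enabled, which this hunk does not show).

```python
# Values copied from the adapter_config.json diff above.
old = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["gate_proj", "q_proj", "up_proj", "k_proj",
                       "o_proj", "v_proj", "down_proj"],
}
new = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["k_proj", "up_proj", "v_proj", "o_proj",
                       "gate_proj", "down_proj", "q_proj"],
}


def lora_scaling(cfg):
    """Hypothetical helper: standard LoRA scaling alpha / r (assumes no rsLoRA)."""
    return cfg["lora_alpha"] / cfg["r"]


old_scale = lora_scaling(old)
new_scale = lora_scaling(new)
# Order of target_modules is not significant, so compare as sets.
same_modules = set(old["target_modules"]) == set(new["target_modules"])

print(f"scaling: {old_scale} -> {new_scale}, same modules: {same_modules}")
```

Under that assumption the update scaling is unchanged between versions, so the functional differences in this file are the doubled rank and the retrained weights.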

adapter_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9ea6853b82a088ff41e5a2fbbd7885982c3a6b4be1beececef1487209b34f7d
-size 69788264

 version https://git-lfs.github.com/spec/v1
+oid sha256:3e1b9394d00a106cff556f782dc94f4e04f5a14cc9c97c5caa9043779aae6d6d
+size 139518856
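
The pointer sizes are consistent with the rank change: LoRA parameter count grows linearly with `r`, so going from r16 to r32 should roughly double the tensor payload. The back-of-envelope check below is a sketch under three assumptions that the diff does not confirm: the file's non-tensor overhead (header) is about the same in both versions, the payload exactly doubles, and weights are stored at 4 bytes per parameter (fp32).

```python
# Sizes in bytes, copied from the Git LFS pointers above.
old_size = 69_788_264    # adapter at r=16
new_size = 139_518_856   # adapter at r=32

ratio = new_size / old_size

# If old = overhead + payload and new = overhead + 2 * payload,
# then overhead = 2 * old - new (assumes identical overhead in both files).
overhead = 2 * old_size - new_size
payload_r16 = old_size - overhead

# Assumption: fp32 storage, i.e. 4 bytes per parameter.
approx_params_r16 = payload_r16 / 4

print(f"size ratio:         {ratio:.4f}")
print(f"implied overhead:   {overhead} bytes")
print(f"approx params, r16: {approx_params_r16 / 1e6:.1f} M")
print(f"approx params, r32: {2 * approx_params_r16 / 1e6:.1f} M")
```

The r16 estimate lands near the 17.4 M trainable parameters quoted in the removed v0.1 README, which suggests the fp32 assumption is reasonable. The overhead figure is only approximate, since the header need not be byte-identical across versions.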

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e7ab4cc91c472c42681f1d6ed40e046d602559c7da5638f817629965705e2827
 size 5841

 version https://git-lfs.github.com/spec/v1
+oid sha256:776307d160109817b6b71575c48bf7ca50decf82d435feb04f35cc3037c58c46
 size 5841