# Moondream2 LoRA v12: UI grounding
LoRA adapter for Moondream2 (revision 2025-06-21), fine-tuned for UI click localization: given a screenshot and a textual description of the target, the model predicts normalized click coordinates in [0, 1].
Used as a self-hosted grounding backend in the Magnitude browser-testing framework.
## Highlights
- 96% pass rate on a 50-test end-to-end Magnitude suite spanning ~25 public sites (Wikipedia, GitHub, Apple, Booking, IMDB, Etsy, LEGO, MDN, NPM, etc.); the best of the 7 grounding-model variants tested.
- 84.4% acc@2% on a 1,869-example held-out grounding benchmark, vs. 69.0% for the Moondream2 base model (+15.4 percentage points); see the metric sketch after this list.
- 100% pass rate (8/8) on Round 9 element-type-targeted tests (HEADING / CHECKBOX / LABEL / BUTTON), where the base model passes only 5/8.
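The card reports acc@2% without spelling out the exact rule; below is a minimal sketch, assuming a prediction counts as correct when its Euclidean distance to the labeled point is at most 0.02 in normalized coordinates (the benchmark's actual tolerance rule may differ):

```python
def acc_at_2pct(preds, golds, tol=0.02):
    """Fraction of predicted points within `tol` (normalized units) of the label.

    `preds` and `golds` are equal-length lists of (x, y) tuples in [0, 1].
    Assumes a Euclidean-distance threshold; the benchmark may use another rule.
    """
    hits = sum(
        ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= tol
        for (px, py), (gx, gy) in zip(preds, golds)
    )
    return hits / len(preds)
```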
## Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Epochs | 2 |
| Learning rate | 1e-4 |
| LR schedule | 10% warmup + cosine decay |
| dtype | bfloat16 (GPU/MPS), float32 (CPU) |
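For reference, a minimal sketch of the schedule row, assuming linear warmup over the first 10% of steps followed by cosine decay to zero (the training script in the repo is the authoritative source):

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.10):
    """Per-step learning rate: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```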
## Usage
The adapter is stored as a custom `adapter.pt` checkpoint with two parts: LoRA weights (applied via Moondream's native `variant_state_dict` mechanism) and an optional `coord_decoder` state dict.
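Before wiring the adapter in, you can sanity-check the checkpoint layout (not required for loading; `coord_decoder` may be absent):

```python
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("Khabner/moondream-lora-v12", "adapter.pt")
ckpt = torch.load(path, map_location="cpu", weights_only=True)
print(list(ckpt.keys()))   # ["lora"] plus, optionally, "coord_decoder"
print(len(ckpt["lora"]))   # number of flat LoRA tensors
```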
```python
import importlib

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
DTYPE = torch.float32 if DEVICE == "cpu" else torch.bfloat16

# Load base Moondream2
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    torch_dtype=DTYPE,
    device_map={"": DEVICE} if DEVICE != "cpu" else None,
)
model.model._setup_caches()

# Download and apply the LoRA adapter
adapter_path = hf_hub_download("Khabner/moondream-lora-v12", "adapter.pt")
ckpt = torch.load(adapter_path, map_location="cpu", weights_only=True)

def _nest(flat):
    """Unflatten {"a.b.c": tensor} into nested dicts {"a": {"b": {"c": tensor}}}."""
    tree = {}
    for k, v in flat.items():
        d = tree
        for p in k.split(".")[:-1]:
            d = d.setdefault(p, {})
        d[k.split(".")[-1]] = v
    return tree

inner = model.model
pkg = inner.__class__.__module__.rsplit(".", 1)[0]
flat = {k: v.to(device=str(inner.device), dtype=DTYPE) for k, v in ckpt["lora"].items()}
lora_dict = _nest(flat)
# Monkey-patch Moondream's variant loader so the "custom" variant resolves to our LoRA weights
importlib.import_module(f"{pkg}.moondream").variant_state_dict = lambda *a, **kw: lora_dict

# Optionally swap in the fine-tuned coordinate decoder head
if "coord_decoder" in ckpt:
    cd = {
        k.removeprefix("coord_decoder."): v.to(device=str(inner.device), dtype=DTYPE)
        for k, v in ckpt["coord_decoder"].items()
    }
    inner.region.coord_decoder.load_state_dict(cd)

model.eval()

# Inference: pass settings={"variant": "custom"} to route through the LoRA weights
img = Image.open("screenshot.png").convert("RGB")
result = model.model.point(img, "Send button", settings={"variant": "custom"})
print(result["points"])  # [{"x": 0.42, "y": 0.71}, ...]
```
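The returned points are normalized to [0, 1], so mapping them to pixel coordinates is a simple scale by the image size. A small helper (hypothetical, not part of the repo), continuing from the snippet above:

```python
def to_pixels(point, image):
    """Convert a normalized {"x", "y"} point to integer pixel coordinates."""
    w, h = image.size  # PIL reports (width, height)
    return round(point["x"] * w), round(point["y"] * h)

x_px, y_px = to_pixels(result["points"][0], img)
print(f"click at ({x_px}, {y_px})")
```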
The full inference helpers and a FastAPI server matching the Moondream Cloud `/v1/point` API contract live in the github.com/VLM-WEBTEST/magnitude_integration repo (`finetune_lora.py`, `serve_moondream.py`).
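For the server path, a minimal client sketch, assuming the endpoint mirrors the Moondream Cloud `/v1/point` request shape (a JSON body with a base64 `image_url` data URL and an `object` description); the URL below is a hypothetical local address, and `serve_moondream.py` is the authoritative contract:

```python
import base64
import requests

with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/point",  # hypothetical local address/port
    json={"image_url": f"data:image/png;base64,{b64}", "object": "Send button"},
)
print(resp.json())  # expected to include {"points": [{"x": ..., "y": ...}]}
```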
## Limitations
- Wikipedia history radio button (#15 in the benchmark): fails for every model. The two radio circles per row sit too close together, and "radio in the 3rd row" is underspecified by the description (a test-design issue, not a model weakness).
- Tiny inline [3] superscript link (#19): also fails for every model; likely needs higher-resolution input or a dedicated dataset.
- Inference latency: ~9–10 s per call on M-series MPS, ~1–3 s on a CUDA GPU.