Moondream2 LoRA v12: UI grounding

LoRA adapter for Moondream2 (revision 2025-06-21) fine-tuned for UI click localization: given a screenshot and a textual target description, the model predicts normalized click coordinates in [0, 1].

Used as a self-hosted grounding backend in the Magnitude browser-testing framework.

Highlights

  • 96% pass rate on a 50-test end-to-end Magnitude suite across ~25 public sites (Wikipedia, GitHub, Apple, Booking, IMDB, Etsy, LEGO, MDN, NPM, etc.), the best result among the 7 grounding-model variants tested.
  • 84.4% acc@2% on an 1,869-example held-out grounding benchmark, vs 69.0% for the Moondream2 base model (+15.4 percentage points); see the metric sketch after this list.
  • 100% pass rate (8/8) on Round 9 element-type-targeted tests (HEADING / CHECKBOX / LABEL / BUTTON), where the base model passes only 5/8.
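
Here acc@2% counts a prediction as a hit when it lands within 2% of the image of the ground-truth point, in normalized coordinates. A minimal sketch of the metric, assuming Euclidean distance and a 0.02 threshold (the benchmark's exact definition could instead use a per-axis tolerance):

import math

def acc_at_2pct(preds, targets, tol=0.02):
    # Fraction of predictions within `tol` of their target; both are lists of
    # {"x": ..., "y": ...} dicts in [0, 1]-normalized coordinates.
    hits = sum(
        math.hypot(p["x"] - t["x"], p["y"] - t["y"]) <= tol
        for p, t in zip(preds, targets)
    )
    return hits / len(targets)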

Hyperparameters

Param          Value
LoRA rank      16
LoRA alpha     32
LoRA dropout   0.05
Epochs         2
Learning rate  1e-4
Schedule       10% warmup + cosine decay
dtype          bfloat16 (GPU/MPS), float32 (CPU)
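
As a rough illustration of how the learning-rate and schedule rows map onto code, here is a sketch built on transformers' get_cosine_schedule_with_warmup; the optimizer choice (AdamW) and step count are assumptions, and the actual training loop lives in finetune_lora.py (see Usage below):

import torch
from transformers import get_cosine_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(16, 32))]  # hypothetical stand-in for the LoRA params
steps = 1_000  # total optimizer steps across the 2 epochs (dataset-dependent)

optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * steps),  # 10% linear warmup...
    num_training_steps=steps,            # ...then cosine decay over the rest
)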

Usage

The adapter is stored as a custom adapter.pt checkpoint with two parts: the LoRA weights (applied via Moondream's native variant_state_dict mechanism) and an optional coord_decoder state dict.
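
As a quick sanity check, the checkpoint structure can be inspected directly (assuming adapter.pt is already downloaded, e.g. via hf_hub_download as in the snippet below):

import torch

ckpt = torch.load("adapter.pt", map_location="cpu", weights_only=True)
print(sorted(ckpt.keys()))     # ['coord_decoder', 'lora'] (coord_decoder is optional)
print(list(ckpt["lora"])[:3])  # flat dotted parameter names

The full loading and patching sequence: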

import importlib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
DTYPE = torch.float32 if DEVICE == "cpu" else torch.bfloat16

# Load base Moondream2
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", revision="2025-06-21",
    trust_remote_code=True, torch_dtype=DTYPE,
    device_map={"": DEVICE} if DEVICE != "cpu" else None,
)
model.model._setup_caches()  # one-time internal cache setup required before inference

# Download and apply LoRA adapter
adapter_path = hf_hub_download("Khabner/moondream-lora-v12", "adapter.pt")
ckpt = torch.load(adapter_path, map_location="cpu", weights_only=True)


def _nest(flat):
    """Rebuild a nested state dict from flat dotted keys ("a.b.c" -> tree)."""
    tree = {}
    for k, v in flat.items():
        d = tree
        for p in k.split(".")[:-1]:
            d = d.setdefault(p, {})
        d[k.split(".")[-1]] = v
    return tree


# Route Moondream's "custom" variant to the LoRA weights: variant_state_dict is
# patched so that settings={"variant": "custom"} resolves to the nested dict below.
inner = model.model
pkg = inner.__class__.__module__.rsplit(".", 1)[0]
flat = {k: v.to(device=str(inner.device), dtype=DTYPE) for k, v in ckpt["lora"].items()}
lora_dict = _nest(flat)
importlib.import_module(f"{pkg}.moondream").variant_state_dict = lambda *a, **kw: lora_dict

if "coord_decoder" in ckpt:
    cd = {k.removeprefix("coord_decoder."): v.to(device=str(inner.device), dtype=DTYPE)
          for k, v in ckpt["coord_decoder"].items()}
    inner.region.coord_decoder.load_state_dict(cd)

model.eval()

# Inference: pass settings={"variant": "custom"} to use the LoRA weights
from PIL import Image
img = Image.open("screenshot.png").convert("RGB")
result = model.model.point(img, "Send button", settings={"variant": "custom"})
print(result["points"])  # [{"x": 0.42, "y": 0.71}, ...]

The full inference helpers and a FastAPI server matching the Moondream Cloud /v1/point API contract live in the github.com/VLM-WEBTEST/magnitude_integration repo (finetune_lora.py, serve_moondream.py).
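
For orientation, here is a minimal sketch of what such a /v1/point endpoint can look like, reusing the model object loaded above. The request field names (image_url, object) are assumptions modeled on the Moondream Cloud API; serve_moondream.py defines the actual contract:

import base64
import io

from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel

app = FastAPI()

class PointRequest(BaseModel):
    image_url: str  # assumed shape: "data:image/png;base64,..." data URL
    object: str     # target description, e.g. "Send button"

@app.post("/v1/point")
def point(req: PointRequest):
    # `model` is the LoRA-patched Moondream instance from the Usage snippet above.
    raw = base64.b64decode(req.image_url.split(",", 1)[1])
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    result = model.model.point(img, req.object, settings={"variant": "custom"})
    return {"points": result["points"]}  # normalized [0, 1] coordinates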

Limitations

  • Wikipedia history radio (#15 in the benchmark) fails for every model: the two radio circles per row sit too close together, and "radio in the 3rd row" is underspecified as a description (a test-design issue, not a model weakness).
  • Tiny inline [3] superscript link (#19) also fails for every model; it likely needs higher-resolution input or a dedicated dataset.
  • Inference latency: ~9-10s per call on M-series MPS, ~1-3s on a CUDA GPU.