Reza2kn's picture
Update: ~2.1GB INT4 payload, verified working
e53d4d5 verified
metadata
title: LocateAnything-3B WebGPU (INT4)
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
models:
  - Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4
  - nvidia/LocateAnything-3B
short_description: In-browser WebGPU open-vocab detection, INT4, no server

LocateAnything-3B — fully in-browser WebGPU (INT4)

Open-vocabulary object detection / visual grounding running 100% client-side in your browser with onnxruntime-web on the WebGPU execution provider. There is no server inference — the quantized model is downloaded once and runs on your own GPU.

Model: Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4 (source nvidia/LocateAnything-3B).

How it works

  1. Preprocess the image the way the original MoonViT processor does (rescale to ≤256 patches, pad to a multiple of 28, normalize, patchify) → pixel_values + grid_hws.
  2. Vision ONNXvisual_features.
  3. Build the prompt and tokenize with the model's tokenizer (transformers.js).
  4. Custom INT4 embedding gather in JS (group-wise (q-8)·scale, nibble-packed) builds inputs_embeds, splicing the visual features at the image-token positions — this is why the embedding table ships as 176 MB INT4 instead of 1.25 GB fp32.
  5. KV-cache autoregressive decode with the INT4 language graph (inputs_embeds in, past/present key-values) until the stop token.
  6. Parse <ref>label</ref><box>…</box> (coords 0–1000) and draw the boxes.

Requirements

  • A browser with WebGPU: Chrome/Edge 121+, or Safari 18+ (WebGPU enabled).
  • A GPU with a few GB free — the model is ~2.1 GB (INT4 language 1.65 GB + INT4 vision 0.25 GB + INT4 embeddings 0.18 GB). First load downloads these (chunked, with retry) and the browser caches them.

Verified end-to-end in headless Chrome (WebGPU): models load, prefill + KV-cache decode run at ~7.5 tok/s on an M2, and detection draws boxes. Detection speed depends on your GPU.