| --- |
| title: LocateAnything-3B WebGPU (INT4) |
| emoji: 🎯 |
| colorFrom: blue |
| colorTo: indigo |
| sdk: static |
| pinned: false |
| license: apache-2.0 |
| models: |
| - Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4 |
| - nvidia/LocateAnything-3B |
| short_description: In-browser WebGPU open-vocab detection, INT4, no server |
| --- |
| |
| # LocateAnything-3B — fully in-browser WebGPU (INT4) |
|
|
| Open-vocabulary object detection / visual grounding running **100% client-side** in your browser |
| with [onnxruntime-web](https://onnxruntime.ai/docs/tutorials/web/) on the **WebGPU** execution |
| provider. There is **no server inference** — the quantized model is downloaded once and runs on |
| your own GPU. |
|
|
| Model: [`Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4`](https://huggingface.co/Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4) |
| (source [`nvidia/LocateAnything-3B`](https://huggingface.co/nvidia/LocateAnything-3B)). |
|
|
| ## How it works |
|
|
| 1. **Preprocess** the image the way the original MoonViT processor does (rescale to ≤256 patches, |
| pad to a multiple of 28, normalize, patchify) → `pixel_values` + `grid_hws`. |
| 2. **Vision ONNX** → `visual_features`. |
| 3. Build the prompt and tokenize with the model's tokenizer (transformers.js). |
| 4. **Custom INT4 embedding gather** in JS (group-wise `(q-8)·scale`, nibble-packed) builds
| `inputs_embeds`, splicing the visual features in at the image-token positions. Doing the gather in
| JS is what lets the embedding table ship as 176 MB INT4 instead of 1.25 GB fp32.
| 5. **KV-cache autoregressive decode** with the INT4 language graph (`inputs_embeds` in, past/present |
| key-values) until the stop token. |
| 6. Parse `<ref>label</ref><box>…</box>` (coords 0–1000) and draw the boxes. |
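
The embedding gather in step 4 can be sketched roughly as follows. This is a minimal illustration, not the Space's actual code: the byte layout (row-major, low nibble first, one scale per group stored as `Float32Array`), the `table` object shape, and the names `gatherInt4Row` and `buildInputsEmbeds` are all assumptions made for the example. Only the `(q-8)·scale` formula and the splice at image-token positions come from the description above.

```javascript
// Sketch only: the layout below (row-major, low nibble first, one scale
// per group) is an assumption, not the shipped file format.

/** Dequantize one embedding row: value = (q - 8) * scale. */
function gatherInt4Row(packed, scales, tokenId, dim, groupSize) {
  const out = new Float32Array(dim);
  const byteBase = tokenId * (dim / 2);          // two 4-bit codes per byte
  const scaleBase = tokenId * (dim / groupSize); // one scale per group
  for (let i = 0; i < dim; i++) {
    const byte = packed[byteBase + (i >> 1)];
    // Even index: low nibble. Odd index: high nibble (assumed order).
    const q = (i & 1) === 0 ? (byte & 0x0f) : (byte >> 4);
    out[i] = (q - 8) * scales[scaleBase + Math.floor(i / groupSize)];
  }
  return out;
}

/** Build inputs_embeds, using visual feature rows at image-token positions. */
function buildInputsEmbeds(tokenIds, imageTokenId, visualFeatures, table) {
  const { packed, scales, dim, groupSize } = table;
  const embeds = new Float32Array(tokenIds.length * dim);
  let visualRow = 0; // next unused row of visualFeatures
  tokenIds.forEach((id, pos) => {
    let row;
    if (id === imageTokenId) {
      row = visualFeatures.subarray(visualRow * dim, (visualRow + 1) * dim);
      visualRow++;
    } else {
      row = gatherInt4Row(packed, scales, id, dim, groupSize);
    }
    embeds.set(row, pos * dim);
  });
  return embeds;
}
```

Because only the rows for tokens that actually appear in the prompt are dequantized, the full table never has to be expanded to fp32 in memory.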
|
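The decode loop in step 5 has roughly the following shape. This is a sketch under stated assumptions rather than the Space's implementation: `runStep` and `embedToken` are hypothetical callbacks standing in for the ONNX session call and the embedding gather, greedy (argmax) sampling is assumed, and `runStep` is assumed to return the logits for the last position only, along with the updated cache.

```javascript
// Sketch only: `runStep` and `embedToken` are hypothetical callbacks that
// stand in for the ONNX session call and the embedding gather.

/** Index of the largest value (greedy token choice). */
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

async function greedyDecode({ promptEmbeds, runStep, embedToken, stopTokenId, maxNewTokens }) {
  const generated = [];
  let inputs = promptEmbeds; // prefill: the whole prompt in one pass
  let past = null;           // no cache exists before the first step
  for (let step = 0; step < maxNewTokens; step++) {
    const { logits, present } = await runStep(inputs, past);
    const next = argmax(logits);
    if (next === stopTokenId) break;
    generated.push(next);
    past = present;            // carry keys/values forward instead of recomputing
    inputs = embedToken(next); // later steps feed only the newest token
  }
  return generated;
}
```

The key point is that after the prefill pass each step sends a single token's embedding plus the cached key-values, so per-token cost does not grow with a re-run of the full prompt.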
|
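Step 6 can be illustrated with a small parser. The exact payload inside `<box>` is not spelled out above, so this sketch assumes it holds four numbers in `x1, y1, x2, y2` order with any separator. That assumption, the function name `parseDetections`, and the sample string in the usage note are invented for illustration. The 0–1000 coordinate range is taken from the description.

```javascript
// Sketch only: assumes each <box> holds four numbers in x1, y1, x2, y2
// order; the real payload format may differ.
function parseDetections(text, width, height) {
  const pattern = /<ref>(.*?)<\/ref>\s*<box>(.*?)<\/box>/gs;
  const detections = [];
  for (const match of text.matchAll(pattern)) {
    const nums = (match[2].match(/\d+(?:\.\d+)?/g) ?? []).map(Number);
    if (nums.length !== 4) continue; // skip anything that does not fit the assumption
    const [x1, y1, x2, y2] = nums;
    detections.push({
      label: match[1].trim(),
      // Map the 0-1000 range onto the image's pixel dimensions.
      x1: (x1 * width) / 1000,
      y1: (y1 * height) / 1000,
      x2: (x2 * width) / 1000,
      y2: (y2 * height) / 1000,
    });
  }
  return detections;
}
```

For a 640×480 image, a made-up output such as `<ref>cat</ref><box>100 200 500 800</box>` would map to a box from (64, 96) to (320, 384) in pixels, ready to pass to a canvas `strokeRect` call as `x1, y1, x2 - x1, y2 - y1`.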
| ## Requirements |
|
|
| - A browser with **WebGPU**: Chrome/Edge 121+, or Safari 18+ (WebGPU enabled). |
| - A GPU with a few GB of free memory. The model is ~**2.1 GB** in total (INT4 language 1.65 GB + INT4 vision 0.25 GB +
| INT4 embeddings 0.18 GB). The first load downloads these files (chunked, with retry), and the browser caches them.
|
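A page can check the WebGPU requirement before starting the download. The sketch below uses the standard `navigator.gpu.requestAdapter()` call. The function name `hasWebGPU` and the injectable `nav` parameter are choices made for this example, not something taken from the Space's code.

```javascript
// `nav` is injectable so the check can also be exercised outside a browser.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // the API is not exposed at all
  try {
    // Resolves to null when no suitable adapter is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false;
  }
}
```

Calling `await hasWebGPU()` before fetching ~2.1 GB of weights lets the page show a clear message on unsupported browsers instead of failing after the download.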
|
| Verified end-to-end in headless Chrome (WebGPU): models load, prefill + KV-cache decode run at |
| ~7.5 tok/s on an M2, and detection draws boxes. Detection speed depends on your GPU. |
|
|