File size: 2,144 Bytes
46f2476
08c93d8
 
 
 
46f2476
 
08c93d8
 
 
 
 
46f2476
 
08c93d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e53d4d5
 
08c93d8
e53d4d5
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
---
title: LocateAnything-3B WebGPU (INT4)
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
models:
- Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4
- nvidia/LocateAnything-3B
short_description: In-browser WebGPU open-vocab detection, INT4, no server
---

# LocateAnything-3B — fully in-browser WebGPU (INT4)

Open-vocabulary object detection / visual grounding running **100% client-side** in your browser
with [onnxruntime-web](https://onnxruntime.ai/docs/tutorials/web/) on the **WebGPU** execution
provider. There is **no server inference** — the quantized model is downloaded once and runs on
your own GPU.

Model: [`Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4`](https://huggingface.co/Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4)
(source [`nvidia/LocateAnything-3B`](https://huggingface.co/nvidia/LocateAnything-3B)).

## How it works

1. **Preprocess** the image the way the original MoonViT processor does (rescale to ≤256 patches,
   pad to a multiple of 28, normalize, patchify) → `pixel_values` + `grid_hws`.
2. **Vision ONNX**`visual_features`.
3. Build the prompt and tokenize with the model's tokenizer (transformers.js).
4. **Custom INT4 embedding gather** in JS (group-wise `(q-8)·scale`, nibble-packed) builds
   `inputs_embeds`, splicing the visual features at the image-token positions — this is why the
   embedding table ships as 176 MB INT4 instead of 1.25 GB fp32.
5. **KV-cache autoregressive decode** with the INT4 language graph (`inputs_embeds` in, past/present
   key-values) until the stop token.
6. Parse `<ref>label</ref><box>…</box>` (coords 0–1000) and draw the boxes.

## Requirements

- A browser with **WebGPU**: Chrome/Edge 121+, or Safari 18+ (WebGPU enabled).
- A GPU with a few GB free — the model is ~**2.1 GB** (INT4 language 1.65 GB + INT4 vision 0.25 GB +
  INT4 embeddings 0.18 GB). First load downloads these (chunked, with retry) and the browser caches them.

Verified end-to-end in headless Chrome (WebGPU): models load, prefill + KV-cache decode run at
~7.5 tok/s on an M2, and detection draws boxes. Detection speed depends on your GPU.