---
title: LocateAnything-3B WebGPU (INT4)
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
models:
  - Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4
  - nvidia/LocateAnything-3B
short_description: In-browser WebGPU open-vocab detection, INT4, no server
---

# LocateAnything-3B: fully in-browser WebGPU (INT4)

Open-vocabulary object detection / visual grounding running **100% client-side** in your browser with [onnxruntime-web](https://onnxruntime.ai/docs/tutorials/web/) on the **WebGPU** execution provider. There is **no server inference**: the quantized model is downloaded once and runs on your own GPU.

Model: [`Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4`](https://huggingface.co/Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4) (source [`nvidia/LocateAnything-3B`](https://huggingface.co/nvidia/LocateAnything-3B)).

## How it works

1. **Preprocess** the image the way the original MoonViT processor does (rescale to ≤256 patches, pad to a multiple of 28, normalize, patchify) → `pixel_values` + `grid_hws`.
2. **Vision ONNX** → `visual_features`.
3. Build the prompt and tokenize with the model's tokenizer (transformers.js).
4. **Custom INT4 embedding gather** in JS (group-wise `(q-8)·scale`, nibble-packed) builds `inputs_embeds`, splicing the visual features at the image-token positions. This is why the embedding table ships as 176 MB INT4 instead of 1.25 GB fp32.
5. **KV-cache autoregressive decode** with the INT4 language graph (`inputs_embeds` in, past/present key-values) until the stop token.
6. Parse `label…` (coords 0–1000) and draw the boxes.

## Requirements

- A browser with **WebGPU**: Chrome/Edge 121+, or Safari 18+ (WebGPU enabled).
- A GPU with a few GB free: the model is ~**2.1 GB** (INT4 language 1.65 GB + INT4 vision 0.25 GB + INT4 embeddings 0.18 GB). First load downloads these (chunked, with retry) and the browser caches them.
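
The WebGPU requirement can be checked before the ~2.1 GB download starts. The following is a minimal sketch that assumes only the standard `navigator.gpu` entry point and its `requestAdapter()` method; `hasWebGPUApi` and `probeWebGPU` are hypothetical helper names and are not taken from this Space's code.

```javascript
// Hypothetical helpers; the names are illustrative, not from this Space's source.

// Synchronous check: does this browser expose the WebGPU entry point at all?
function hasWebGPUApi(nav = globalThis.navigator) {
  return Boolean(nav && nav.gpu);
}

// Asynchronous probe: the API can exist while no usable adapter is available
// (requestAdapter() resolves to null in that case).
async function probeWebGPU(nav = globalThis.navigator) {
  if (!hasWebGPUApi(nav)) {
    return { ok: false, reason: "navigator.gpu is not available in this browser" };
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: "no WebGPU adapter was returned for this device" };
  }
  return { ok: true, adapter };
}
```

A page could call `probeWebGPU()` on load and show `reason` to the user instead of starting the model download when `ok` is `false`.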
Verified end-to-end in headless Chrome (WebGPU): models load, prefill + KV-cache decode run at ~7.5 tok/s on an M2, and detection draws boxes. Detection speed depends on your GPU.
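
The INT4 embedding gather from step 4 of "How it works" can be sketched as below. This is an illustration under stated assumptions, not the Space's actual code: it assumes two 4-bit values per byte with the low nibble first, one float32 scale per group of `groupSize` values, a `hidden` size divisible by both 2 and `groupSize`, and one row of `visualFeatures` consumed per image-token position in order. The function names, parameter names, and memory layout are all hypothetical.

```javascript
// Hypothetical layout: row `tokenId` occupies hidden/2 bytes of `packed`
// (low nibble first) and hidden/groupSize entries of `scales`.
function gatherInt4Row(packed, scales, tokenId, hidden, groupSize, out, outOffset = 0) {
  const rowByteStart = tokenId * (hidden / 2);
  const rowScaleStart = tokenId * (hidden / groupSize);
  for (let i = 0; i < hidden; i++) {
    const byte = packed[rowByteStart + (i >> 1)];
    // Even index reads the low nibble, odd index reads the high nibble.
    const q = (i & 1) === 0 ? byte & 0x0f : byte >> 4;
    const scale = scales[rowScaleStart + Math.floor(i / groupSize)];
    // Group-wise dequantization as described in the text: (q - 8) * scale.
    out[outOffset + i] = (q - 8) * scale;
  }
  return out;
}

// Builds a flat [tokens * hidden] Float32Array, copying visual feature rows
// into image-token positions and dequantized embeddings everywhere else.
function buildInputsEmbeds(tokenIds, packed, scales, hidden, groupSize, imageTokenId, visualFeatures) {
  const embeds = new Float32Array(tokenIds.length * hidden);
  let nextVisualRow = 0;
  tokenIds.forEach((id, pos) => {
    if (id === imageTokenId) {
      const start = nextVisualRow * hidden;
      embeds.set(visualFeatures.subarray(start, start + hidden), pos * hidden);
      nextVisualRow++;
    } else {
      gatherInt4Row(packed, scales, id, hidden, groupSize, embeds, pos * hidden);
    }
  });
  return embeds;
}
```

Only the rows for tokens that actually appear in the prompt are dequantized, so the full table can stay in its packed INT4 form in memory.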