Spaces:

Reza2kn
/

LocateAnything-3B-WebGPU

Running

App Files Files Community

LocateAnything-3B-WebGPU / README.md

Reza2kn

Update: ~2.1GB INT4 payload, verified working

e53d4d5 verified 3 days ago

preview code

raw

history blame contribute delete

2.14 kB

	---
	title: LocateAnything-3B WebGPU (INT4)
	emoji: 🎯
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	license: apache-2.0
	models:
	- Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4
	- nvidia/LocateAnything-3B
	short_description: In-browser WebGPU open-vocab detection, INT4, no server
	---

	# LocateAnything-3B — fully in-browser WebGPU (INT4)

	Open-vocabulary object detection / visual grounding running 100% client-side in your browser
	with [onnxruntime-web](https://onnxruntime.ai/docs/tutorials/web/) on the WebGPU execution
	provider. There is no server inference — the quantized model is downloaded once and runs on
	your own GPU.

	Model: [`Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4`](https://huggingface.co/Reza2kn/LocateAnything-3B-ONNX-WebGPU-INT4)
	(source [`nvidia/LocateAnything-3B`](https://huggingface.co/nvidia/LocateAnything-3B)).

	## How it works

	1. Preprocess the image the way the original MoonViT processor does (rescale to ≤256 patches,
	pad to a multiple of 28, normalize, patchify) → `pixel_values` + `grid_hws`.
	2. Vision ONNX → `visual_features`.
	3. Build the prompt and tokenize with the model's tokenizer (transformers.js).
	4. Custom INT4 embedding gather in JS (group-wise `(q-8)·scale`, nibble-packed) builds
	`inputs_embeds`, splicing the visual features at the image-token positions — this is why the
	embedding table ships as 176 MB INT4 instead of 1.25 GB fp32.
	5. KV-cache autoregressive decode with the INT4 language graph (`inputs_embeds` in, past/present
	key-values) until the stop token.
	6. Parse `<ref>label</ref><box>…</box>` (coords 0–1000) and draw the boxes.

	## Requirements

	- A browser with WebGPU: Chrome/Edge 121+, or Safari 18+ (WebGPU enabled).
	- A GPU with a few GB free — the model is ~2.1 GB (INT4 language 1.65 GB + INT4 vision 0.25 GB +
	INT4 embeddings 0.18 GB). First load downloads these (chunked, with retry) and the browser caches them.

	Verified end-to-end in headless Chrome (WebGPU): models load, prefill + KV-cache decode run at
	~7.5 tok/s on an M2, and detection draws boxes. Detection speed depends on your GPU.