Moondream2 Region ONNX – Browser Object Detection & Pointing
Four lightweight ONNX models that add `/detect` (bounding boxes) and `/point` (coordinate pointing) capabilities to the existing Xenova/moondream2 ONNX models in the browser.
What This Is
Moondream2 is a vision-language model that can caption images, answer questions, detect objects, and point to things. The Xenova/moondream2 repo provides the vision encoder and text decoder as ONNX for use with Transformers.js, but it does not include the region module needed for detection and pointing.
This repo fills that gap with 4 small ONNX files that implement the region coordinate/size encoder-decoder pipeline.
Files in This Repo
| File | Input | Output | Size |
|---|---|---|---|
| `onnx/region_coord_encoder.onnx` | `coord` [1] (float 0–1) | `embed` [2048] | ~2 MB |
| `onnx/region_coord_decoder.onnx` | `hidden` [2048] | `logits` [1024] | ~96 MB |
| `onnx/region_size_encoder.onnx` | `size` [2] (w, h float) | `embed` [2048] | ~4 MB |
| `onnx/region_size_decoder.onnx` | `hidden` [2048] | `logits` [2, 1024] | ~128 MB |
Each `.onnx` file has a companion `.onnx_data` file containing the weights. Both files are required.
How Moondream Detection/Pointing Works
**Important:** Moondream detection is not single-shot like YOLO. It is autoregressive: the text model generates coordinates one token at a time, using the region models to encode/decode each coordinate.
```
┌──────────────────┐     ┌────────────────────┐     ┌─────────────────────────┐
│  Vision Encoder  │ ──▶ │    Text Decoder    │ ──▶ │  Region Coord/Size      │
│    (Xenova)      │     │     (Xenova)       │     │  Encoder/Decoder        │
│                  │     │                    │     │  (THIS REPO)            │
│ image → features │     │ prefill + decode   │     │ hidden → coordinates    │
└──────────────────┘     └────────────────────┘     └─────────────────────────┘
  from Xenova/             from Xenova/               from gatorchopps/
  moondream2               moondream2                 moondream2-region-onnx
```
/detect → Bounding Boxes
Each detected object goes through this loop:
1. `region_coord_decoder(hidden)` → x-coordinate logits → argmax → `x_center`
2. `region_coord_encoder(x_center)` → embedding → feed to text decoder → new hidden
3. `region_coord_decoder(hidden)` → y-coordinate logits → argmax → `y_center`
4. `region_coord_encoder(y_center)` → embedding → feed to text decoder → new hidden
5. `region_size_decoder(hidden)` → (width, height) logits → argmax → `w`, `h`
6. `region_size_encoder(w, h)` → embedding → feed to text decoder → new hidden
7. Text decoder decides: emit another object (token 5) or stop (token 0/EOS)
Output per object: `{ x_min, y_min, x_max, y_max }` (normalised 0–1)
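The centre/size pair produced by the loop above converts to the corner-format output like so (a minimal sketch; `centerSizeToBox` is an illustrative helper name, not part of this repo):

```javascript
// Convert a decoded centre/size tuple (all normalised 0-1) into the
// { x_min, y_min, x_max, y_max } corner format emitted per object.
function centerSizeToBox(x, y, w, h) {
  return {
    x_min: x - w / 2,
    y_min: y - h / 2,
    x_max: x + w / 2,
    y_max: y + h / 2,
  };
}

// A box centred at (0.5, 0.5) with width 0.2 and height 0.4:
const box = centerSizeToBox(0.5, 0.5, 0.2, 0.4);
```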
/point → Coordinate Points
Same loop but with no size step: after decoding (x, y), the y-embedding goes straight to the text decoder for the continue/stop decision:
1. `region_coord_decoder(hidden)` → `x`
2. `region_coord_encoder(x)` → text decoder → new hidden
3. `region_coord_decoder(hidden)` → `y`
4. `region_coord_encoder(y)` → text decoder → new hidden (continue/stop decision)
Output per point: `{ x, y }` (normalised 0–1)
Quick Start
Prerequisites
```bash
npm install @huggingface/transformers onnxruntime-web
```
Models You Need
You need ONNX models from two HuggingFace repos:
| From | What | Used For |
|---|---|---|
| Xenova/moondream2 | `vision_encoder`, `embed_tokens`, `decoder_model_merged` | Image encoding + text generation |
| gatorchopps/moondream2-region-onnx (this repo) | `region_coord_encoder/decoder`, `region_size_encoder/decoder` | Coordinate decoding for detect/point |
Recommended Quantization Choices
For Xenova/moondream2 (pick one variant per component):
| Component | Recommended | Alternatives |
|---|---|---|
| `vision_encoder` | `vision_encoder_fp16.onnx` (879 MB) | `_q4.onnx` (280 MB), `_int8.onnx` (444 MB) |
| `embed_tokens` | `embed_tokens_fp16.onnx` (210 MB) | `_q4.onnx` (419 MB), `_int8.onnx` (105 MB) |
| `decoder_model_merged` | `decoder_model_merged_q4.onnx` (824 MB) | `_q4f16.onnx` (741 MB), `_int8.onnx` (1.32 GB) |
For this repo (region models): all files are float32 and relatively small (~230 MB total), so no quantization variants are needed.
Usage
Step 1: Load Xenova Models with Transformers.js
```js
import {
  AutoProcessor,
  AutoTokenizer,
  Moondream1ForConditionalGeneration,
  RawImage,
} from "@huggingface/transformers";

const model_id = "Xenova/moondream2";
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await Moondream1ForConditionalGeneration.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16",
    vision_encoder: "fp16",
    decoder_model_merged: "q4",
  },
  device: "webgpu",
});
```
Step 2: Load Region ONNX Models
```js
import * as ort from "onnxruntime-web";

const REGION_BASE =
  "https://huggingface.co/gatorchopps/moondream2-region-onnx/resolve/main/onnx";
const opts = { executionProviders: ["webgpu", "wasm"] };

const regionSessions = {
  coordEncoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_coord_encoder.onnx`,
    opts
  ),
  coordDecoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_coord_decoder.onnx`,
    opts
  ),
  sizeEncoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_size_encoder.onnx`,
    opts
  ),
  sizeDecoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_size_decoder.onnx`,
    opts
  ),
};
```
Step 3: Run Detection or Pointing
```js
// ─── Region helper functions ───
const COORD_BINS = 1024;
const SIZE_BINS = 1024;
const HIDDEN_DIM = 2048;

async function decodeCoordinate(hidden) {
  const { logits } = await regionSessions.coordDecoder.run({
    hidden: new ort.Tensor("float32", hidden, [HIDDEN_DIM]),
  });
  let best = 0;
  for (let i = 1; i < logits.data.length; i++)
    if (logits.data[i] > logits.data[best]) best = i;
  return best / COORD_BINS; // normalised 0-1
}

async function encodeCoordinate(coord) {
  const { embed } = await regionSessions.coordEncoder.run({
    coord: new ort.Tensor("float32", new Float32Array([coord]), [1]),
  });
  return embed.data;
}

async function decodeSize(hidden) {
  const { logits } = await regionSessions.sizeDecoder.run({
    hidden: new ort.Tensor("float32", hidden, [HIDDEN_DIM]),
  });
  const d = logits.data; // [2, 1024] flattened: width logits, then height logits
  let wIdx = 0;
  let hIdx = 0;
  for (let i = 1; i < SIZE_BINS; i++) {
    if (d[i] > d[wIdx]) wIdx = i;
    if (d[SIZE_BINS + i] > d[SIZE_BINS + hIdx]) hIdx = i;
  }
  return {
    w: Math.pow(2, (wIdx / 1023) * 10 - 10), // log-scale bins, see "Size Bins" below
    h: Math.pow(2, (hIdx / 1023) * 10 - 10),
  };
}

async function encodeSize(w, h) {
  const { embed } = await regionSessions.sizeEncoder.run({
    size: new ort.Tensor("float32", new Float32Array([w, h]), [2]),
  });
  return embed.data;
}
```
```js
// ─── The autoregressive detection/pointing loop ───
/**
 * @param {object} opts
 * @param {Float32Array} opts.initialHidden - last hidden state from text prefill
 * @param {number} opts.initialToken - first token after prefill (5=coord, 0=eos)
 * @param {function} opts.textModelStep - async(embedding) => {hidden, nextToken}
 * @param {boolean} opts.includeSize - true=/detect, false=/point
 * @param {number} [opts.maxObjects=150]
 */
async function generateRegionObjects({
  initialHidden,
  initialToken,
  textModelStep,
  includeSize,
  maxObjects = 150,
}) {
  const results = [];
  let hidden = initialHidden;
  let nextToken = initialToken;
  const EOS = 0;

  while (nextToken !== EOS && results.length < maxObjects) {
    // Decode x
    const x = await decodeCoordinate(hidden);
    const xEmbed = await encodeCoordinate(x);
    let step = await textModelStep(xEmbed);
    hidden = step.hidden;

    // Decode y
    const y = await decodeCoordinate(hidden);
    const yEmbed = await encodeCoordinate(y);

    if (includeSize) {
      // /detect: decode size after y
      step = await textModelStep(yEmbed);
      hidden = step.hidden;
      const { w, h } = await decodeSize(hidden);
      const sizeEmbed = await encodeSize(w, h);
      results.push({
        x_min: x - w / 2,
        y_min: y - h / 2,
        x_max: x + w / 2,
        y_max: y + h / 2,
      });
      step = await textModelStep(sizeEmbed);
    } else {
      // /point: no size, y-embed goes straight to continue/stop
      results.push({ x, y });
      step = await textModelStep(yEmbed);
    }

    hidden = step.hidden;
    nextToken = step.nextToken;
  }
  return results;
}
```
The Hard Part: `textModelStep`
The region ONNX models handle coordinate encoding/decoding, but the autoregressive loop also needs a `textModelStep` callback: a function that feeds an embedding into the text decoder and returns the next hidden state and next token.
Transformers.js does not natively expose hidden states from `Moondream1ForConditionalGeneration`. To wire this up, you have several options:
Option A: Load the Decoder ONNX Directly (Recommended)
Load decoder_model_merged.onnx from Xenova/moondream2 directly with onnxruntime-web, bypassing Transformers.js for the detection loop. This gives you full control over inputs/outputs including hidden states.
```js
const decoderSession = await ort.InferenceSession.create(
  "https://huggingface.co/Xenova/moondream2/resolve/main/onnx/decoder_model_merged_q4.onnx",
  { executionProviders: ["webgpu", "wasm"] }
);

// Inspect inputs/outputs to understand the decoder interface:
console.log("Inputs:", decoderSession.inputNames);
console.log("Outputs:", decoderSession.outputNames);

// The decoder typically has:
//   Inputs:  input_ids, attention_mask, position_ids,
//            past_key_values.N.key, past_key_values.N.value, ...
//   Outputs: logits, present.N.key, present.N.value, ...
//
// For the region loop, you need to:
//   1. Replace the input_ids embedding with the region-encoded embedding
//   2. Extract the last hidden state (the layer before lm_head),
//      or use the logits + hidden → region decoder
```
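Whichever option you choose, the `textModelStep` contract can be isolated behind a small factory that carries the decoder state (e.g. the KV cache) between calls. The sketch below shows the wiring only; `runDecoder` is a hypothetical stand-in for your actual decoder invocation, whose body depends entirely on the input/output names of the export you load:

```javascript
// Factory producing a textModelStep callback for the region loop.
// runDecoder is a hypothetical function: given one region embedding plus
// opaque decoder state (e.g. KV-cache tensors), it returns
// { hidden, nextToken, state }. Implement it against your decoder's
// actual ONNX input/output names.
function makeTextModelStep(runDecoder, initialState) {
  let state = initialState;
  return async (embedding) => {
    const out = await runDecoder(embedding, state);
    state = out.state; // carry the KV cache forward between steps
    return { hidden: out.hidden, nextToken: out.nextToken };
  };
}
```

The loop in Step 3 only ever sees `async(embedding) => {hidden, nextToken}`, so all session-specific details stay inside `runDecoder`.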
Option B: Fork/Patch Transformers.js
Modify the Moondream1ForConditionalGeneration class to expose hidden_states from the decoder output. The relevant code is in @huggingface/transformers/src/models.js.
Option C: Export Your Own Decoder
Use torch.onnx.export to create a custom decoder ONNX that outputs both logits and the last hidden state. This is the most work but gives cleanest integration.
Prompt Token Format
Detection and pointing use different prompt templates. From the Moondream tokenizer config:
/detect
```
tokens = [1, 7235, 476, 2] + tokenize(" " + object_name) + [3]
```
Where `[1, 7235, 476, 2]` = detect prefix, `[3]` = answer token (triggers generation).
Example for detecting "dog":
```js
const detectPrompt = `<image>\n\nQuestion: Detect dog.\n\nAnswer:`;
// Or construct token IDs directly:
//   prefix = [1, 7235, 476, 2], suffix = [3]
//   Full:   [1, 7235, 476, 2, ...tokenize(" dog"), 3]
```
/point
```
tokens = [1, 2581, 2] + tokenize(" " + object_name) + [3]
```
Example for pointing at "cat":
```js
const pointPrompt = `<image>\n\nQuestion: Point to cat.\n\nAnswer:`;
// Or: [1, 2581, 2, ...tokenize(" cat"), 3]
```
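Since the two templates differ only in their prefix, the token sequences can be assembled generically. A sketch (`buildRegionPrompt` is an illustrative helper; `tokenizeFn` stands in for your tokenizer's encode call, which with Transformers.js would typically wrap `tokenizer.encode` with special tokens disabled — check your tokenizer's options):

```javascript
// Prefixes and the answer token [3] are taken from the templates above.
const DETECT_PREFIX = [1, 7235, 476, 2];
const POINT_PREFIX = [1, 2581, 2];
const ANSWER_TOKEN = 3;

// Build the full token ID sequence for a /detect or /point prompt.
// tokenizeFn encodes the leading-space object name (" dog") to token IDs.
function buildRegionPrompt(mode, objectName, tokenizeFn) {
  const prefix = mode === "detect" ? DETECT_PREFIX : POINT_PREFIX;
  return [...prefix, ...tokenizeFn(" " + objectName), ANSWER_TOKEN];
}
```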
Special Token IDs
| Token | ID | Purpose |
|---|---|---|
| BOS / EOS | 0 | Start/stop generation |
| Answer | 3 | Triggers answer generation |
| Coord | 5 | "Start/continue emitting coordinates" |
| Size | 6 | "Size follows" |
When the text decoder generates token 5 (coord), the loop begins decoding coordinates.
When it generates token 0 (EOS), the loop stops.
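The decision on the first post-prefill token can be sketched as follows (`classifyFirstToken` is an illustrative helper, not part of the worker module):

```javascript
const COORD_TOKEN = 5;
const EOS_TOKEN = 0;

// Decide what to do with the first token generated after prefill.
function classifyFirstToken(token) {
  if (token === COORD_TOKEN) return "decode-coordinates"; // enter the region loop
  if (token === EOS_TOKEN) return "no-objects"; // nothing detected
  return "unexpected";
}
```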
Coordinate System
All coordinates are normalised to 0–1 relative to the image dimensions:
```
(0,0) ──────────────── (1,0)
  │                      │
  │      (x_center,      │
  │       y_center)      │
  │      ┌──────┐        │
  │      │      │ h      │
  │      └──────┘        │
  │          w           │
(0,1) ──────────────── (1,1)
```
Coordinate Bins
Both x and y use 1024 bins. The coordinate decoder outputs 1024 logits; argmax / 1024 gives the normalised coordinate.
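The decode step is just an argmax over the 1024 logits followed by division by the bin count. A self-contained sketch with made-up logits (no ONNX session needed):

```javascript
const COORD_BINS = 1024;

// Index of the largest value in a flat logits array.
function argmax(data) {
  let best = 0;
  for (let i = 1; i < data.length; i++) if (data[i] > data[best]) best = i;
  return best;
}

// Fake logits peaking at bin 512 → normalised coordinate 0.5
const logits = new Float32Array(COORD_BINS);
logits[512] = 10;
const coord = argmax(logits) / COORD_BINS; // → 0.5
```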
Size Bins (for /detect)
Width and height each use 1024 bins with a log-scale mapping:
```
bin → size:  size = 2^((bin / 1023) * 10 - 10)
size → bin:  bin  = (log2(size) + 10) / 10 * 1023
```
This maps bin 0 → size ≈ 0.001 (1/1024), bin 1023 → size = 1.0.
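The two formulas translate directly to code and round-trip as expected (a quick sanity-check sketch; the function names are illustrative):

```javascript
const SIZE_BINS = 1024;

// bin index (0..1023) → normalised size, on a log2 scale
function binToSize(bin) {
  return Math.pow(2, (bin / (SIZE_BINS - 1)) * 10 - 10);
}

// normalised size → nearest bin index
function sizeToBin(size) {
  return Math.round(((Math.log2(size) + 10) / 10) * (SIZE_BINS - 1));
}
```

Endpoints check out: `binToSize(0)` is 2^-10 = 1/1024 and `binToSize(1023)` is 2^0 = 1.0, matching the mapping above.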
Converting to Pixel Coordinates
```js
// For /detect bounding boxes:
const pixelBox = {
  x_min: box.x_min * imageWidth,
  y_min: box.y_min * imageHeight,
  x_max: box.x_max * imageWidth,
  y_max: box.y_max * imageHeight,
};

// For /point coordinates:
const pixelPoint = {
  x: point.x * imageWidth,
  y: point.y * imageHeight,
};
```
End-to-End Pipeline Overview
```
┌────────────────────────────────────────────────────────────────┐
│                    Full Detection Pipeline                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│ 1. LOAD MODELS                                                 │
│    ├─ Xenova/moondream2: vision_encoder, embed_tokens,         │
│    │  decoder_model_merged                                     │
│    └─ gatorchopps/moondream2-region-onnx: 4 region ONNX        │
│                                                                │
│ 2. ENCODE IMAGE                                                │
│    └─ vision_encoder(image) → visual features                  │
│                                                                │
│ 3. PREFILL TEXT DECODER                                        │
│    └─ Feed: [image_embeddings, detect_prompt_tokens]           │
│       Get: initial hidden_state + first token                  │
│                                                                │
│ 4. AUTOREGRESSIVE REGION LOOP (if first token == coord_id)     │
│    ├─ coord_decoder(hidden) → x_center                         │
│    ├─ coord_encoder(x) → text_step → hidden                    │
│    ├─ coord_decoder(hidden) → y_center                         │
│    ├─ coord_encoder(y) → text_step → hidden                    │
│    ├─ size_decoder(hidden) → w, h             ← /detect only   │
│    ├─ size_encoder(w,h) → text_step → hidden  ← /detect only   │
│    └─ text_step decides: more objects or EOS                   │
│                                                                │
│ 5. OUTPUT                                                      │
│    ├─ /detect: [{x_min, y_min, x_max, y_max}, ...]             │
│    └─ /point:  [{x, y}, ...]                                   │
└────────────────────────────────────────────────────────────────┘
```
Using the Provided JS Worker Module
This repo includes a ready-to-use JS module (moondream_region_worker.js) with:
- `loadRegionModels(baseUrl)` – loads all 4 ONNX sessions
- `generateDetections({ initialHidden, initialToken, textModelStep })` – returns `BBox[]`
- `generatePoints({ initialHidden, initialToken, textModelStep })` – returns `Point[]`
```js
import { loadRegionModels, generateDetections, generatePoints } from "./moondream_region_worker.js";

// Load region models
await loadRegionModels(
  "https://huggingface.co/gatorchopps/moondream2-region-onnx/resolve/main/onnx"
);

// After prefilling the text model with image + detect prompt...
const boxes = await generateDetections({
  initialHidden, // Float32Array[2048] from text decoder
  initialToken,  // first generated token (5 = start coords)
  textModelStep, // your callback: async(embed) => {hidden, nextToken}
});

// Or for pointing:
const points = await generatePoints({
  initialHidden,
  initialToken,
  textModelStep,
});
```
Numerical Accuracy
All 4 ONNX models were verified against the original Python region functions:
```
[coord_encoder] max_err < 1.2e-06  ✓
[coord_decoder] max_err < 2.8e-04  ✓
[size_encoder]  max_err < 1.9e-06  ✓
[size_decoder]  max_err < 2.5e-04  ✓
```
Models are exported in float32 for maximum ONNX Runtime compatibility.
Reproducing the Export
If you want to re-export from a different model revision:
```bash
# 1. Clone the Moondream source (needed for moondream.torch.config/weights imports)
git clone https://github.com/vikhyat/moondream.git
cd moondream

# 2. Get the export script from the companion repo
#    (or download export_region_onnx.py manually)
git clone https://github.com/FinickySpider/moondream2-region-onnx.git /tmp/region-onnx
cp /tmp/region-onnx/export_region_onnx.py .

# 3. Install dependencies
pip install torch safetensors onnx onnxruntime onnxscript huggingface_hub numpy

# 4. Export + verify
python export_region_onnx.py --hf-repo vikhyatk/moondream2 --output-dir ./onnx --verify
```
The export script auto-detects the decoder structure (flat linear vs fc1/fc2 MLP) from the checkpoint.
See FinickySpider/moondream2-region-onnx on GitHub for the full source code, detailed step-by-step instructions, and the JS worker module.
Limitations
- **Requires the text decoder** – The region ONNX files alone cannot detect objects. They must be used inside the autoregressive loop driven by the text decoder.
- **Hidden state access** – Transformers.js does not expose hidden states out of the box. You need to load the decoder ONNX directly with `onnxruntime-web` or patch Transformers.js.
- **Version coupling** – These region weights were exported from `vikhyatk/moondream2` (the latest HF revision as of March 2026). If the base model changes its region architecture, re-export may be needed.
- **Float32 only** – No quantized variants of the region models are provided. The total size (~230 MB) is manageable for most browser applications.
License
Apache 2.0 – same as the base vikhyatk/moondream2 model.
Credits
- vikhyatk/moondream2 – base model
- Xenova/moondream2 – ONNX vision encoder + text decoder
- FinickySpider/moondream2-region-onnx – export script, JS worker, and full reproduction instructions