Moondream2 Region ONNX β€” Browser Object Detection & Pointing

4 lightweight ONNX models that add /detect (bounding boxes) and /point (coordinate pointing) capabilities to the existing Xenova/moondream2 ONNX models in the browser.

What This Is

Moondream2 is a vision-language model that can caption images, answer questions, detect objects, and point to things. The Xenova/moondream2 repo provides the vision encoder and text decoder as ONNX for use with Transformers.js β€” but it does not include the region module needed for detection and pointing.

This repo fills that gap with 4 small ONNX files that implement the region coordinate/size encoder-decoder pipeline.

Files in This Repo

| File | Input | Output | Size |
| --- | --- | --- | --- |
| onnx/region_coord_encoder.onnx | coord [1] (float 0–1) | embed [2048] | ~2 MB |
| onnx/region_coord_decoder.onnx | hidden [2048] | logits [1024] | ~96 MB |
| onnx/region_size_encoder.onnx | size [2] (w, h float) | embed [2048] | ~4 MB |
| onnx/region_size_decoder.onnx | hidden [2048] | logits [2, 1024] | ~128 MB |

Each .onnx file has a companion .onnx_data file containing the weights. Both files are required.


How Moondream Detection/Pointing Works

Important: Moondream detection is not single-shot like YOLO. It is autoregressive β€” the text model generates coordinates one token at a time, using the region models to encode/decode each coordinate.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Vision Encoder   β”‚ ──→ β”‚ Text Decoder     β”‚ ──→ β”‚ Region Coord/Size           β”‚
β”‚ (Xenova)         β”‚     β”‚ (Xenova)         β”‚     β”‚ Encoder/Decoder             β”‚
β”‚                  β”‚     β”‚                  β”‚     β”‚ (THIS REPO)                 β”‚
β”‚ image β†’ features β”‚     β”‚ prefill + decode β”‚     β”‚ hidden ↔ coordinates        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    from Xenova/            from Xenova/               from gatorchopps/
    moondream2              moondream2                 moondream2-region-onnx

/detect β€” Bounding Boxes

Each detected object goes through this loop:

  1. region_coord_decoder(hidden) β†’ x-coordinate logits β†’ argmax β†’ x_center
  2. region_coord_encoder(x_center) β†’ embedding β†’ feed to text decoder β†’ new hidden
  3. region_coord_decoder(hidden) β†’ y-coordinate logits β†’ argmax β†’ y_center
  4. region_coord_encoder(y_center) β†’ embedding β†’ feed to text decoder β†’ new hidden
  5. region_size_decoder(hidden) β†’ (width, height) logits β†’ argmax β†’ w, h
  6. region_size_encoder(w, h) β†’ embedding β†’ feed to text decoder β†’ new hidden
  7. Text decoder decides: emit another object (token 5) or stop (token 0/EOS)

Output per object: { x_min, y_min, x_max, y_max } (normalised 0–1)
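
For example (hypothetical argmax results): bin 512 for x, bin 256 for y, and bin 818 for both width and height give

  x_center = 512 / 1024 = 0.5
  y_center = 256 / 1024 = 0.25
  w = h    = 2^((818 / 1023) * 10 - 10) β‰ˆ 0.25

  β†’ { x_min: 0.375, y_min: 0.125, x_max: 0.625, y_max: 0.375 }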

/point β€” Coordinate Points

Same loop but no size step β€” after decoding (x, y), the y-embedding goes straight to the text decoder for the continue/stop decision:

  1. region_coord_decoder(hidden) β†’ x
  2. region_coord_encoder(x) β†’ text decoder β†’ new hidden
  3. region_coord_decoder(hidden) β†’ y
  4. region_coord_encoder(y) β†’ text decoder β†’ new hidden (continue/stop decision)

Output per point: { x, y } (normalised 0–1)


Quick Start

Prerequisites

npm install @huggingface/transformers onnxruntime-web

Models You Need

You need ONNX models from two HuggingFace repos:

| From | What | Used For |
| --- | --- | --- |
| Xenova/moondream2 | vision_encoder, embed_tokens, decoder_model_merged | Image encoding + text generation |
| gatorchopps/moondream2-region-onnx (this repo) | region_coord_encoder/decoder, region_size_encoder/decoder | Coordinate decoding for detect/point |

Recommended Quantization Choices

For Xenova/moondream2 (pick one variant per component):

| Component | Recommended | Alternatives |
| --- | --- | --- |
| vision_encoder | vision_encoder_fp16.onnx (879 MB) | _q4.onnx (280 MB), _int8.onnx (444 MB) |
| embed_tokens | embed_tokens_fp16.onnx (210 MB) | _q4.onnx (419 MB), _int8.onnx (105 MB) |
| decoder_model_merged | decoder_model_merged_q4.onnx (824 MB) | _q4f16.onnx (741 MB), _int8.onnx (1.32 GB) |

For this repo (region models): all files are float32 and relatively small (~230 MB total). No quantization variants are needed.


Usage

Step 1: Load Xenova Models with Transformers.js

import {
  AutoProcessor,
  AutoTokenizer,
  Moondream1ForConditionalGeneration,
  RawImage,
} from "@huggingface/transformers";

const model_id = "Xenova/moondream2";

const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await Moondream1ForConditionalGeneration.from_pretrained(
  model_id,
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  }
);
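
To sanity-check this setup before adding the region models, you can run a plain caption query. This is a sketch following the standard Xenova/moondream2 Transformers.js example; the image URL is a placeholder:

// Hypothetical image URL: substitute your own.
const image = await RawImage.fromURL("https://example.com/photo.jpg");
const text = `<image>\n\nQuestion: Describe this image.\n\nAnswer:`;

const text_inputs = tokenizer(text);
const vision_inputs = await processor(image);

const output = await model.generate({
  ...text_inputs,
  ...vision_inputs,
  do_sample: false,
  max_new_tokens: 64,
});
console.log(tokenizer.batch_decode(output, { skip_special_tokens: true }));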

Step 2: Load Region ONNX Models

import * as ort from "onnxruntime-web";

const REGION_BASE =
  "https://huggingface.co/gatorchopps/moondream2-region-onnx/resolve/main/onnx";
const opts = { executionProviders: ["webgpu", "wasm"] };

const regionSessions = {
  coordEncoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_coord_encoder.onnx`,
    opts
  ),
  coordDecoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_coord_decoder.onnx`,
    opts
  ),
  sizeEncoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_size_encoder.onnx`,
    opts
  ),
  sizeDecoder: await ort.InferenceSession.create(
    `${REGION_BASE}/region_size_decoder.onnx`,
    opts
  ),
};
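
Note on the .onnx_data companions: depending on your onnxruntime-web version, you may need to fetch the external weight file yourself and hand it to the session via the externalData option rather than relying on automatic resolution. A sketch for one session (verify the option shape against your installed ORT version):

const modelUrl = `${REGION_BASE}/region_coord_decoder.onnx`;
// Fetch the companion weight file explicitly.
const dataBuffer = new Uint8Array(
  await (await fetch(`${modelUrl}_data`)).arrayBuffer()
);

const coordDecoder = await ort.InferenceSession.create(modelUrl, {
  ...opts,
  externalData: [
    // `path` must match the filename recorded inside the .onnx model.
    { path: "region_coord_decoder.onnx_data", data: dataBuffer },
  ],
});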

Step 3: Run Detection or Pointing

// ─── Region helper functions ───

const COORD_BINS = 1024;
const SIZE_BINS = 1024;
const HIDDEN_DIM = 2048;

async function decodeCoordinate(hidden) {
  const { logits } = await regionSessions.coordDecoder.run({
    hidden: new ort.Tensor("float32", hidden, [HIDDEN_DIM]),
  });
  let best = 0;
  for (let i = 1; i < logits.data.length; i++)
    if (logits.data[i] > logits.data[best]) best = i;
  return best / logits.data.length; // normalised 0–1
}

async function encodeCoordinate(coord) {
  const { embed } = await regionSessions.coordEncoder.run({
    coord: new ort.Tensor("float32", new Float32Array([coord]), [1]),
  });
  return embed.data;
}

async function decodeSize(hidden) {
  const { logits } = await regionSessions.sizeDecoder.run({
    hidden: new ort.Tensor("float32", hidden, [HIDDEN_DIM]),
  });
  const d = logits.data;
  let wIdx = 0,
    hIdx = 0;
  for (let i = 1; i < SIZE_BINS; i++) {
    if (d[i] > d[wIdx]) wIdx = i;
    if (d[SIZE_BINS + i] > d[SIZE_BINS + hIdx]) hIdx = i;
  }
  return {
    w: Math.pow(2, (wIdx / (SIZE_BINS - 1)) * 10 - 10),
    h: Math.pow(2, (hIdx / (SIZE_BINS - 1)) * 10 - 10),
  };
}

async function encodeSize(w, h) {
  const { embed } = await regionSessions.sizeEncoder.run({
    size: new ort.Tensor("float32", new Float32Array([w, h]), [2]),
  });
  return embed.data;
}

// ─── The autoregressive detection/pointing loop ───

/**
 * @param {object} opts
 * @param {Float32Array} opts.initialHidden  - last hidden state from text prefill
 * @param {number}       opts.initialToken   - first token after prefill (5=coord, 0=eos)
 * @param {function}     opts.textModelStep  - async(embedding) => {hidden, nextToken}
 * @param {boolean}      opts.includeSize    - true=/detect, false=/point
 * @param {number}       [opts.maxObjects=150]
 */
async function generateRegionObjects({
  initialHidden,
  initialToken,
  textModelStep,
  includeSize,
  maxObjects = 150,
}) {
  const results = [];
  let hidden = initialHidden;
  let nextToken = initialToken;
  const EOS = 0;

  while (nextToken !== EOS && results.length < maxObjects) {
    // Decode x
    const x = await decodeCoordinate(hidden);
    const xEmbed = await encodeCoordinate(x);
    let step = await textModelStep(xEmbed);
    hidden = step.hidden;

    // Decode y
    const y = await decodeCoordinate(hidden);
    const yEmbed = await encodeCoordinate(y);

    if (includeSize) {
      // /detect: decode size after y
      step = await textModelStep(yEmbed);
      hidden = step.hidden;
      const { w, h } = await decodeSize(hidden);
      const sizeEmbed = await encodeSize(w, h);

      results.push({
        x_min: x - w / 2,
        y_min: y - h / 2,
        x_max: x + w / 2,
        y_max: y + h / 2,
      });

      step = await textModelStep(sizeEmbed);
    } else {
      // /point: no size, y-embed goes straight to continue/stop
      results.push({ x, y });
      step = await textModelStep(yEmbed);
    }

    hidden = step.hidden;
    nextToken = step.nextToken;
  }

  return results;
}

The Hard Part: textModelStep

The region ONNX models handle coordinate encoding/decoding. But the autoregressive loop also needs a textModelStep callback β€” a function that feeds an embedding into the text decoder and returns the next hidden state and next token.

Transformers.js does not natively expose hidden states from Moondream1ForConditionalGeneration. To wire this up, you have several options:

Option A: Load the Decoder ONNX Directly (Recommended)

Load decoder_model_merged.onnx from Xenova/moondream2 directly with onnxruntime-web, bypassing Transformers.js for the detection loop. This gives you full control over inputs/outputs including hidden states.

const decoderSession = await ort.InferenceSession.create(
  "https://huggingface.co/Xenova/moondream2/resolve/main/onnx/decoder_model_merged_q4.onnx",
  { executionProviders: ["webgpu", "wasm"] }
);

// Inspect inputs/outputs to understand the decoder interface:
console.log("Inputs:", decoderSession.inputNames);
console.log("Outputs:", decoderSession.outputNames);

// The decoder typically has:
//   Inputs:  input_ids, attention_mask, position_ids,
//            past_key_values.N.key, past_key_values.N.value, ...
//   Outputs: logits, present.N.key, present.N.value, ...
//
// For the region loop, you need to:
//   1. Feed the region-encoded embedding in place of the usual
//      input_ids embedding
//   2. Extract the last hidden state (the activation just before
//      lm_head) and pass it to the region coord/size decoders

Option B: Fork/Patch Transformers.js

Modify the Moondream1ForConditionalGeneration class to expose hidden_states from the decoder output. The relevant code is in @huggingface/transformers/src/models.js.

Option C: Export Your Own Decoder

Use torch.onnx.export to create a custom decoder ONNX that outputs both logits and the last hidden state. This is the most work but gives the cleanest integration.
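
Whichever option you choose, the callback ends up with the same shape. Below is a sketch of textModelStep against a custom decoder from Option C, i.e. one exported to accept an inputs_embeds tensor and emit logits plus the last hidden state. Every input/output name here (inputs_embeds, hidden_state, the past_key_values.* / present.* cache tensors) is an assumption about your own export, not a documented interface:

// Assumes: customDecoderSession (your Option C export) and
// initialPastKeyValues (KV-cache tensors carried over from prefill).
let pastKeyValues = initialPastKeyValues;

async function textModelStep(embedding) {
  const out = await customDecoderSession.run({
    inputs_embeds: new ort.Tensor("float32", embedding, [1, 1, 2048]),
    ...pastKeyValues, // past_key_values.N.key / past_key_values.N.value
  });

  // Carry the updated KV cache into the next step.
  pastKeyValues = {};
  for (const name of customDecoderSession.outputNames) {
    if (name.startsWith("present.")) {
      pastKeyValues[name.replace("present.", "past_key_values.")] = out[name];
    }
  }

  // Greedy next-token from the logits.
  const logits = out.logits.data;
  let nextToken = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[nextToken]) nextToken = i;
  }

  return { hidden: out.hidden_state.data, nextToken };
}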


Prompt Token Format

Detection and pointing use different prompt templates. From the Moondream tokenizer config:

/detect

tokens = [1, 7235, 476, 2] + tokenize(" " + object_name) + [3]

Where [1, 7235, 476, 2] = detect prefix, [3] = answer token (triggers generation).

Example for detecting "dog":

const detectPrompt = `<image>\n\nQuestion: Detect dog.\n\nAnswer:`;
// Or construct token IDs directly:
// prefix=[1, 7235, 476, 2], suffix=[3]
// Full: [1, 7235, 476, 2, ...tokenize(" dog"), 3]

/point

tokens = [1, 2581, 2] + tokenize(" " + object_name) + [3]

Example for pointing at "cat":

const pointPrompt = `<image>\n\nQuestion: Point to cat.\n\nAnswer:`;
// Or: [1, 2581, 2, ...tokenize(" cat"), 3]
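
To build the raw token sequences instead of the text prompts, tokenize the object name with the tokenizer from Step 1. A sketch; the add_special_tokens option name is an assumption about the Transformers.js tokenizer API, so verify it against your installed version:

const objectName = "dog";
// The leading space matters: the templates tokenize " " + object_name.
const objectTokens = tokenizer.encode(" " + objectName, { add_special_tokens: false });

const detectTokens = [1, 7235, 476, 2, ...objectTokens, 3]; // /detect
const pointTokens  = [1, 2581, 2, ...objectTokens, 3];      // /point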

Special Token IDs

| Token | ID | Purpose |
| --- | --- | --- |
| BOS / EOS | 0 | Start/stop generation |
| Answer | 3 | Triggers answer generation |
| Coord | 5 | "Start/continue emitting coordinates" |
| Size | 6 | "Size follows" |

When the text decoder generates token 5 (coord), the loop begins decoding coordinates. When it generates token 0 (EOS), the loop stops.
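
In code, the dispatch after prefill reduces to the following (a sketch using generateRegionObjects from the Usage section):

const COORD_TOKEN = 5;
const EOS_TOKEN = 0;

let results = [];
if (initialToken === COORD_TOKEN) {
  results = await generateRegionObjects({
    initialHidden,
    initialToken,
    textModelStep,
    includeSize: true, // false for /point
  });
}
// initialToken === EOS_TOKEN (0) means the model found nothing.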


Coordinate System

All coordinates are normalised to 0–1 relative to the image dimensions:

(0,0) ──────────────── (1,0)
  β”‚                      β”‚
  β”‚    (x_center,        β”‚
  β”‚     y_center)        β”‚
  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”          β”‚
  β”‚    β”‚      β”‚ h        β”‚
  β”‚    β””β”€β”€β”€β”€β”€β”€β”˜          β”‚
  β”‚       w              β”‚
(0,1) ──────────────── (1,1)

Coordinate Bins

Both x and y use 1024 bins. The coordinate decoder outputs 1024 logits; argmax / 1024 gives the normalised coordinate.

Size Bins (for /detect)

Width and height each use 1024 bins with a log-scale mapping:

bin β†’ size:  size = 2^((bin / 1023) * 10 - 10)
size β†’ bin:  bin  = (log2(size) + 10) / 10 * 1023

This maps bin 0 β†’ size β‰ˆ 0.001 (1/1024), bin 1023 β†’ size = 1.0.
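
These two mappings translate directly into JS helpers:

// Log-scale size mapping, exactly as above.
function binToSize(bin) {
  return Math.pow(2, (bin / 1023) * 10 - 10); // bin 0 β†’ β‰ˆ0.001, bin 1023 β†’ 1.0
}

function sizeToBin(size) {
  return Math.round(((Math.log2(size) + 10) / 10) * 1023);
}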

Converting to Pixel Coordinates

// For /detect bounding boxes:
const pixelBox = {
  x_min: box.x_min * imageWidth,
  y_min: box.y_min * imageHeight,
  x_max: box.x_max * imageWidth,
  y_max: box.y_max * imageHeight,
};

// For /point coordinates:
const pixelPoint = {
  x: point.x * imageWidth,
  y: point.y * imageHeight,
};
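
From there, drawing the results on a canvas overlay is straightforward. A sketch assuming pixelBoxes / pixelPoints arrays built with the conversions above and a 2D canvas context ctx:

// Draw /detect bounding boxes.
ctx.strokeStyle = "lime";
ctx.lineWidth = 2;
for (const b of pixelBoxes) {
  ctx.strokeRect(b.x_min, b.y_min, b.x_max - b.x_min, b.y_max - b.y_min);
}

// Draw /point markers.
ctx.fillStyle = "red";
for (const p of pixelPoints) {
  ctx.beginPath();
  ctx.arc(p.x, p.y, 5, 0, 2 * Math.PI);
  ctx.fill();
}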

End-to-End Pipeline Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Full Detection Pipeline                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                β”‚
β”‚  1. LOAD MODELS                                               β”‚
β”‚     β”œβ”€ Xenova/moondream2: vision_encoder, embed_tokens,       β”‚
β”‚     β”‚                     decoder_model_merged                β”‚
β”‚     └─ gatorchopps/moondream2-region-onnx: 4 region ONNX     β”‚
β”‚                                                                β”‚
β”‚  2. ENCODE IMAGE                                              β”‚
β”‚     └─ vision_encoder(image) β†’ visual features                β”‚
β”‚                                                                β”‚
β”‚  3. PREFILL TEXT DECODER                                      β”‚
β”‚     └─ Feed: [image_embeddings, detect_prompt_tokens]         β”‚
β”‚        Get:  initial hidden_state + first token               β”‚
β”‚                                                                β”‚
β”‚  4. AUTOREGRESSIVE REGION LOOP (if first token == coord_id)   β”‚
β”‚     β”œβ”€ coord_decoder(hidden) β†’ x_center                      β”‚
β”‚     β”œβ”€ coord_encoder(x) β†’ text_step β†’ hidden                 β”‚
β”‚     β”œβ”€ coord_decoder(hidden) β†’ y_center                      β”‚
β”‚     β”œβ”€ coord_encoder(y) β†’ text_step β†’ hidden                 β”‚
β”‚     β”œβ”€ size_decoder(hidden) β†’ w, h        ← /detect only     β”‚
β”‚     β”œβ”€ size_encoder(w,h) β†’ text_step β†’ hidden  ← /detect onlyβ”‚
β”‚     └─ text_step decides: more objects or EOS                 β”‚
β”‚                                                                β”‚
β”‚  5. OUTPUT                                                    β”‚
β”‚     β”œβ”€ /detect: [{x_min, y_min, x_max, y_max}, ...]          β”‚
β”‚     └─ /point:  [{x, y}, ...]                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Using the Provided JS Worker Module

This repo includes a ready-to-use JS module (moondream_region_worker.js) with:

  • loadRegionModels(baseUrl) β€” loads all 4 ONNX sessions
  • generateDetections({ initialHidden, initialToken, textModelStep }) β€” returns BBox[]
  • generatePoints({ initialHidden, initialToken, textModelStep }) β€” returns Point[]

import { loadRegionModels, generateDetections, generatePoints } from "./moondream_region_worker.js";

// Load region models
await loadRegionModels(
  "https://huggingface.co/gatorchopps/moondream2-region-onnx/resolve/main/onnx"
);

// After prefilling the text model with image + detect prompt...
const boxes = await generateDetections({
  initialHidden, // Float32Array[2048] from text decoder
  initialToken,  // first generated token (5 = start coords)
  textModelStep, // your callback: async(embed) => {hidden, nextToken}
});

// Or for pointing:
const points = await generatePoints({
  initialHidden,
  initialToken,
  textModelStep,
});

Numerical Accuracy

All 4 ONNX models were verified against the original Python region functions:

[coord_encoder]  max_err < 1.2e-06  βœ“
[coord_decoder]  max_err < 2.8e-04  βœ“
[size_encoder]   max_err < 1.9e-06  βœ“
[size_decoder]   max_err < 2.5e-04  βœ“

Models are exported in float32 for maximum ONNX Runtime compatibility.


Reproducing the Export

If you want to re-export from a different model revision:

# 1. Clone the Moondream source (needed for moondream.torch.config/weights imports)
git clone https://github.com/vikhyat/moondream.git
cd moondream

# 2. Get the export script from the companion repo
#    (or download export_region_onnx.py manually)
git clone https://github.com/FinickySpider/moondream2-region-onnx.git /tmp/region-onnx
cp /tmp/region-onnx/export_region_onnx.py .

# 3. Install dependencies
pip install torch safetensors onnx onnxruntime onnxscript huggingface_hub numpy

# 4. Export + verify
python export_region_onnx.py --hf-repo vikhyatk/moondream2 --output-dir ./onnx --verify

The export script auto-detects the decoder structure (flat linear vs fc1/fc2 MLP) from the checkpoint.

See FinickySpider/moondream2-region-onnx on GitHub for the full source code, detailed step-by-step instructions, and the JS worker module.


Limitations

  • Requires the text decoder β€” The region ONNX files alone cannot detect objects. They must be used inside the autoregressive loop driven by the text decoder.
  • Hidden state access β€” Transformers.js does not expose hidden states out of the box. You need to load the decoder ONNX directly with onnxruntime-web or patch Transformers.js.
  • Version coupling β€” These region weights were exported from vikhyatk/moondream2 (the latest HF revision as of March 2026). If the base model changes its region architecture, re-export may be needed.
  • Float32 only β€” No quantized variants of the region models are provided. The total size (~230 MB) is manageable for most browser applications.

License

Apache 2.0 β€” same as the base vikhyatk/moondream2 model.

Credits

  • vikhyat/moondream β€” the original Moondream2 model and the region weights these files were exported from (vikhyatk/moondream2).
  • Xenova/moondream2 β€” the browser ONNX exports of the vision encoder and text decoder used with Transformers.js.
  • FinickySpider/moondream2-region-onnx β€” the export script and the JS worker module.