EmbeddingGemma-300M — Hexagon NPU bundle (QHexRT, v79 + v81)

On-device text embeddings on the Qualcomm Hexagon NPU. This bundle is google/embeddinggemma-300m (a Gemma-3-270M backbone turned into a bidirectional text encoder that produces a 768-d sentence vector), compiled to run with QHexRT via the qhx_embed tool. It is the first bundle in the embedding family, alongside the LLM/VLM/ASR/TTS *_HNPU repos.

Arch-pinned, multi-arch: v79/ = Hexagon v79 (Snapdragon 8 Elite / SM8750, soc_model 69); v81/ = Hexagon v81 (Snapdragon 8 Elite Gen 5 / SM8850, soc_model 87). A context binary loads only on its target arch — pick the <arch>/ dir matching your device. Both ship the identical file set (same int16-acts route).
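The directory choice can be scripted. The following is a minimal sketch, assuming you already know the device's soc_model number (how to query it is outside this card). `arch_dir_for` and `SOC_MODEL_TO_ARCH` are hypothetical names, and the mapping contains only the two values listed above.

```python
# soc_model -> bundle sub-directory, using only the two values stated in this card.
SOC_MODEL_TO_ARCH = {69: "v79", 87: "v81"}


def arch_dir_for(soc_model: int) -> str:
    """Return the bundle sub-directory for a soc_model, or raise if unsupported."""
    try:
        return SOC_MODEL_TO_ARCH[soc_model]
    except KeyError:
        # A context binary built for another arch will not load, so fail early.
        raise ValueError(
            f"this bundle has no context binary for soc_model {soc_model}"
        ) from None


print(arch_dir_for(69))  # v79
```

Failing loudly on an unknown soc_model is deliberate: pushing the wrong directory would only surface later as a load error on the device.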

What's optimized / how it was built

  • int16 activations + int8 per-channel weights, with the graph I/O boundary kept FLOAT_32. Plain fp16 activations overflow to non-finite on this bidirectional encoder (the upstream ONNX card warns of it), so the bundle is int16-acts quantized (--act_bitwidth 16 --weights_bitwidth 8 --use_per_channel_quantization --preserve_io datatype, calibrated on a dozen query/document samples). int8 weights ⇒ the 108 MB bin.
  • Single encoder graph (no KV cache, no decode loop), O3 + 8 MB VTCM, spill_bytes=0. The pool → 2× Dense → L2 sentence head + the sqrt(768) embed scaling run host-side (embed_gemma_encode); the 262k-row embed table is a host gather.
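The host-side sentence head (pool → 2× Dense → L2) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in embed_gemma_encode: it assumes masked mean pooling, bias-free Dense layers with no activation between them, and weights stored as [out_features][in_features] (matching the [3072, 768] and [768, 3072] shapes in the file list). All function names are hypothetical, and toy sizes replace 768/3072.

```python
import math


def mean_pool(hidden, mask):
    """Average token vectors whose mask entry is 1; padding rows are skipped.
    Assumes at least one real token is present."""
    kept = [h for h, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]


def dense(weight, x):
    """y = W @ x with W stored as [out_features][in_features]; no bias (assumed)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def sentence_head(hidden, mask, w1, w2):
    """pool -> Dense 1 -> Dense 2 -> L2, in the order the card describes."""
    return l2_normalize(dense(w2, dense(w1, mean_pool(hidden, mask))))


# Toy example: 3 tokens (the last is padding), width 2, expansion 4.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # [4, 2]
w2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]       # [2, 4]
vec = sentence_head(hidden, mask, w1, w2)
```

A real implementation would use a vectorized library at 256 × 768, but the data flow is the same: the padding row never contributes to the pooled vector, and the output has unit length.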

Files (v79/)

file                             role
embeddinggemma-300m.json         QHexRT manifest (the declarative run plan)
embgemma_enc_f16.bin             the int16-acts encoder context binary (graph name embgemma_enc_f16)
embeddinggemma_embed_f16.bin     host embed table [262144, 768] fp16, pre-scaled by sqrt(768)
embeddinggemma_dense1_f32.bin    sentence-head Dense 1 [3072, 768] f32
embeddinggemma_dense2_f32.bin    sentence-head Dense 2 [768, 3072] f32
tokenizer.json                   Gemma tokenizer
embedding_gold.json              fp32 reference vectors (query/pos/neg) for the device cosine gate

The QNN runtime libs (libQnnHtp.so / libQnnSystem.so + the v79 HTP skel) come from the QAIRT SDK, not this repo.

Run

hf download runanywhere/embeddinggemma_300m_HNPU --local-dir eg
adb push eg/v79 /data/local/tmp/wq/eg          # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_embed eg/embeddinggemma-300m.json libQnnHtp.so libQnnSystem.so eg \
  'What is the capital of France?' 'Paris is the capital of France.'"

qhx_embed prepends the task prefix (task: search result | query: for queries, title: none | text: for documents), tokenizes (BOS + text + EOS), runs the encoder, and prints the L2-normalized 768-d vector (EMB:) plus, with two texts, their cosine.
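The prefixing step can be illustrated as follows. This is a sketch, not qhx_embed's source: the two prefix strings are the ones quoted above (the trailing space after the colon is an assumption), and `with_task_prefix` and its `is_query` flag are hypothetical. How qhx_embed decides which argument is the query is not stated in this card.

```python
# Prefix strings as quoted in this card; the trailing space is an assumption.
QUERY_PREFIX = "task: search result | query: "
DOC_PREFIX = "title: none | text: "


def with_task_prefix(text: str, is_query: bool) -> str:
    """Prepend the retrieval prefix for a query or a document."""
    return (QUERY_PREFIX if is_query else DOC_PREFIX) + text


query = with_task_prefix("What is the capital of France?", is_query=True)
doc = with_task_prefix("Paris is the capital of France.", is_query=False)
```

The prefixed string is what gets tokenized, so the prefix tokens count against the input window along with BOS and EOS.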

Measured (Samsung S25 / SM8750 / Hexagon v79)

  • Cosine 0.985 vs the fp32 sentence-transformers gold (int16-acts quantization noise; the offline fp32 export gate is 1.000).
  • Retrieval ranking preserved: sim(query, relevant)=0.595 ≫ sim(query, irrelevant)=0.054 — matches the gold. Ranking is what matters for retrieval, and it is intact.
  • ~22 ms end-to-end @ 256 tokens (NPU encoder 14.6 ms; the rest is host tokenize/gather/pool/Dense/L2).
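The two checks above (cosine against a reference, and ranking of relevant over irrelevant) reduce to a few lines. This is a minimal sketch with toy vectors; the function names are hypothetical and the layout of embedding_gold.json is not assumed here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranking_preserved(query, relevant, irrelevant):
    """True when the relevant text scores above the irrelevant one."""
    return cosine(query, relevant) > cosine(query, irrelevant)


# Toy 2-d vectors standing in for 768-d embeddings.
q, pos, neg = [1.0, 0.0], [0.8, 0.6], [0.0, 1.0]
ok = ranking_preserved(q, pos, neg)
```

On device output the vectors are already L2-normalized, so the cosine is just the dot product; the explicit norms keep the helper safe for unnormalized reference vectors.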

Caveats (honest)

  • Quantized, not bit-exact. 0.985 cosine ≠ 1.0, which is expected for a quantized embedder; use it for retrieval/similarity, not as a drop-in fp32 oracle.
  • Fixed 256-token window. Shorter inputs are padded and longer ones are truncated to 256 tokens; chunk long documents to ≤256 tokens, or wait for the 512/1024/2048 buckets.
  • Matryoshka (512/256/128) is supported by the model but this bundle emits the full 768-d vector.
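Both the window limit and the missing Matryoshka output can be worked around on the host. The sketch below shows one way to do it, with hypothetical helper names: `chunk_ids` splits a token-id list into consecutive pieces, and `truncate_embedding` keeps the leading components of the 768-d vector and re-normalizes. The `budget` value is an assumption you must choose yourself, since the 256-token window also has to hold the task prefix, BOS, and EOS.

```python
import math


def chunk_ids(token_ids, budget):
    """Split token ids into consecutive chunks of at most `budget` ids."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [token_ids[i:i + budget] for i in range(0, len(token_ids), budget)]


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result again."""
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head]


chunks = chunk_ids(list(range(10)), budget=4)
small = truncate_embedding([0.6, 0.0, 0.8, 0.0], dim=2)
```

Re-normalizing after truncation matters because the shortened vector no longer has unit length, and downstream dot-product search typically assumes it does. Each chunk yields its own embedding; how to combine chunk embeddings for one document is an application choice this card does not prescribe.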

License

Derived from google/embeddinggemma-300m, under the Gemma Terms of Use (NOT Apache). Modified: converted to Hexagon v79 and v81 QHexRT bundles (int16-activation quantized; host-side pooling/Dense head). Use is subject to the Gemma Terms of Use and the Gemma Prohibited Use Policy.

Measured (SM8850 / Hexagon v81)

  • Cosine 0.9836 vs the fp32 sentence-transformers gold (in line with v79's 0.985; the gap from 1.0 is int16-acts quantization noise, and the offline fp32 export gate is 1.000000).
  • Retrieval ranking preserved: sim(query, relevant "Paris is the capital…")=0.618 ≫ sim(query, irrelevant "Photosynthesis…")=0.056. Device-validated on SM8850 (8977b1dd).
  • NPU encoder ~17 ms @ 256 tokens. Same int16-acts bin route as v79, recompiled for v81 (soc_model 87) with the {O:3,vtcm_mb:8,dlbc:1} graph-config.