Gemma-3n-E4B: Qualcomm Hexagon NPU (QHexRT) bundle

A prebuilt QHexRT bundle that runs Gemma-3n-E4B (3.93B) fully on the Qualcomm Hexagon NPU, with no CPU/GPU fallback in the compute loop. Decode runs W8A16 on the HTP. Validated on a Samsung S25 (SM8850 / Hexagon v81).

Arch-pinned. These context binaries are finalized for v81 (soc_model 87) and will not load on any other Hexagon arch, because dsp_arch and soc_model are baked into them. The single flat v81/ directory is the artifacts root.

What's inside / what's optimized

Gemma-3n is unusually hard to put on an NPU: AltUp 4-stream hidden state, Per-Layer-Embeddings (PLE), Laurel augmented residual, KV-sharing (donor layers 18/19), dual local/global RoPE, per-head QK-norm, gaussian-topk MLP sparsity, tied lm-head, logit softcap. This bundle handles all of them:

  • 3-part AltUp-stream decode split. The 35-layer decode exceeds the ~3.5 GiB HTP per-context serialize limit, so it is partitioned into 3 contiguous layer-range W8 context graphs, A0[0,10) · A1[10,20) · B[20,35), chained by the C++ host-op gemma3n_split_generate (streams are handed off part→part; donor 18/19 K/V is routed A1→B).
  • NPU fp16 lm-head (x @ Eᵀ on HMX): replaces a host-side dot product over the 262 400-row embedding table (101 → 23 ms/tok), bit-identical.
  • W8A16 decode (weight-only int8, fp16 activations) at O3 / VTCM 8 MB β€” the v81 production quant floor.
  • max-abs RMSNorm in the exporter: gemma3n's huge post_laurel_norm weights inflate the residual until mean(x²) overflows f16 (+inf → NaN) on the f16 HTP; scaling by max|x| before squaring is math-exact and is what makes the model run at all on-device.
  • Host does scaled embed + PLE projection (threaded) + dual RoPE + causal mask + the KV ring; prefill is decode-over-prompt.
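
The max-abs RMSNorm point above can be illustrated with a minimal sketch. This is not the exporter's code: it simulates fp16 rounding on the host with Python's struct module, omits the learned norm weight, and the helper names and the eps handling (eps divided by m² so the result stays algebraically identical) are assumptions made for the illustration.

```python
import math
import struct


def f16(x: float) -> float:
    """Round a Python float to IEEE fp16, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def mean_sq_f16(x):
    """Naive mean(x^2) with every intermediate rounded to fp16."""
    squares = [f16(v * v) for v in x]  # v*v above 65504 becomes +inf
    return f16(sum(squares) / len(squares))


def rmsnorm_ref(x, eps=1e-6):
    """Float64 reference: x / sqrt(mean(x^2) + eps), weight omitted."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]


def rmsnorm_maxabs_f16(x, eps=1e-6):
    """Divide by m = max|x| first, so the squared values never exceed 1."""
    m = max(abs(v) for v in x)
    if m == 0.0:
        return [0.0 for _ in x]
    u = [f16(v / m) for v in x]  # |u| <= 1
    ms = f16(sum(f16(t * t) for t in u) / len(u))  # <= 1, cannot overflow
    # x / sqrt(mean(x^2) + eps) == u / sqrt(mean(u^2) + eps / m^2)
    inv = f16(1.0 / math.sqrt(ms + eps / (m * m)))
    return [f16(t * inv) for t in u]


if __name__ == "__main__":
    x = [300.0, -400.0, 500.0, 100.0]  # 300^2 already exceeds the fp16 max
    print("naive fp16 mean(x^2):", mean_sq_f16(x))
    print("max-abs fp16 result: ", rmsnorm_maxabs_f16(x))
    print("float64 reference:   ", rmsnorm_ref(x))
```

The rescaling changes only the dynamic range of the intermediate values, which is why the two formulations agree up to fp16 rounding.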

Files (v81/)

file                              role
gemma-3n-E4B-it.json              QHexRT manifest (the declarative run plan)
gemma3ne4bit_decode_p0_w8.bin     decode part A0: layers 0–9 (W8)
gemma3ne4bit_decode_p1_w8.bin     decode part A1: layers 10–19 (W8, owns donors 18/19)
gemma3ne4bit_decode_p2_w8.bin     decode part B: layers 20–34 (W8)
gemma3ne4bit_lmh_f16.bin          tied lm-head x@Eᵀ (fp16, NPU/HMX)
gemma3ne4bit_embed_f16.bin        input embedding table (fp16, host lookup)
gemma3ne4bit_ple_table_f16.bin    Per-Layer-Embedding table (fp16, host)
gemma3ne4bit_ple_proj_f16.bin     PLE projection weight (fp16, host)
gemma3ne4bit_ple_norm_f16.bin     PLE norm weight (fp16, host)
tokenizer.json                    tokenizer (Gemma BPE)
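
The three decode parts listed above cover contiguous layer ranges, with the donor K/V from layers 18/19 produced in A1 and consumed in B. The sketch below models that hand-off in Python purely for illustration. It is not the C++ host-op gemma3n_split_generate: the run_part callback, its signature, and the helper names are hypothetical stand-ins for executing one context graph.

```python
from typing import Callable, Dict, List, Tuple

# Half-open layer ranges [start, end) as stated in this README.
PARTS: List[Tuple[str, int, int]] = [("A0", 0, 10), ("A1", 10, 20), ("B", 20, 35)]
DONOR_LAYERS = (18, 19)
NUM_LAYERS = 35


def check_partition() -> None:
    """Raise ValueError unless PARTS tile [0, NUM_LAYERS) with no gap or overlap."""
    expected = 0
    for name, start, end in PARTS:
        if start != expected or end <= start:
            raise ValueError(f"part {name} does not start at layer {expected}")
        expected = end
    if expected != NUM_LAYERS:
        raise ValueError(f"parts cover {expected} layers, expected {NUM_LAYERS}")


def part_for_layer(layer: int) -> str:
    """Return the name of the part that owns a given layer index."""
    for name, start, end in PARTS:
        if start <= layer < end:
            return name
    raise ValueError(f"layer {layer} is outside [0, {NUM_LAYERS})")


def run_chain(streams, run_part: Callable):
    """Run the parts in order, handing streams and donor K/V to the next part.

    run_part(name, streams, donor_kv) is a hypothetical callback that returns
    (new_streams, produced_kv), where produced_kv maps layer index to K/V.
    """
    donor_kv: Dict[int, object] = {}
    for name, _start, _end in PARTS:
        streams, produced = run_part(name, streams, dict(donor_kv))
        donor_kv.update(produced)
    return streams


if __name__ == "__main__":
    check_partition()
    print({layer: part_for_layer(layer) for layer in DONOR_LAYERS})
```

Because both donor layers fall inside A1's range, only one cross-part K/V route (A1 to B) is needed in this layout.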

Run (QHexRT qhx_generate)

The QNN runtime libs (libQnnHtp.so, libQnnSystem.so, the v81 HTP skel) come from the QAIRT SDK, not this repo. Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> ….

hf download runanywhere/gemma3n_e4b_HNPU --local-dir gemma3n_e4b_HNPU
adb push gemma3n_e4b_HNPU/v81 /data/local/tmp/wq/gemma_e4b      # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma_e4b/gemma-3n-E4B-it.json libQnnHtp.so libQnnSystem.so gemma_e4b \
  64 'The capital of France is'"
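
Since the argument order is fixed, a launcher script can build the command from one helper. The sketch below mirrors the invocation above. The function name is hypothetical, and reading the trailing 64 as the number of tokens to generate is an assumption; the environment setup (ADSP_LIBRARY_PATH, LD_LIBRARY_PATH) is deliberately left out.

```python
import shlex
from typing import List


def qhx_generate_argv(manifest: str, artifacts_root: str, max_tokens: int,
                      prompt: str, tool: str = "./qhx_generate") -> List[str]:
    """Build the argument vector in the order shown in this README.

    Assumption: the integer after the artifacts root is the number of tokens
    to generate, and the last argument is the prompt.
    """
    return [
        tool,
        manifest,
        "libQnnHtp.so",
        "libQnnSystem.so",
        artifacts_root,
        str(max_tokens),
        prompt,
    ]


if __name__ == "__main__":
    argv = qhx_generate_argv(
        "gemma_e4b/gemma-3n-E4B-it.json", "gemma_e4b", 64,
        "The capital of France is",
    )
    # shlex.join quotes the prompt so the line is safe to paste into a POSIX shell.
    print(shlex.join(argv))
```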

Measured performance (v81 / SM8850, burst-TURBO)

  • Decode: 4.1 tok/s (243 ms/tok = 3-part graph 221 ms + NPU lm-head 22 ms), stable across context length.
  • Prefill: decode-over-prompt, ~221 ms/prompt-token (a batched prefill graph is future work).
  • Memory: ~10 GB of artifacts (3 W8 decode parts + fp16 lm-head + fp16 embed + 4.7 GB fp16 PLE table).
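
The figures above give a rough way to estimate end-to-end latency for a given prompt. The sketch below uses only the measured numbers quoted in this section. The function names are hypothetical, and it assumes each prompt token costs one 221 ms graph pass while each generated token costs the full 243 ms, so treat the result as a ballpark.

```python
GRAPH_MS = 221.0    # 3-part decode graph, per token (measured, from this README)
LM_HEAD_MS = 22.0   # NPU lm-head, per generated token (measured, from this README)


def decode_tok_per_s() -> float:
    """Steady-state decode throughput implied by the per-token latency."""
    return 1000.0 / (GRAPH_MS + LM_HEAD_MS)


def estimated_seconds(prompt_tokens: int, new_tokens: int) -> float:
    """Rough wall-clock estimate: decode-over-prompt prefill plus decode."""
    prefill_ms = prompt_tokens * GRAPH_MS
    decode_ms = new_tokens * (GRAPH_MS + LM_HEAD_MS)
    return (prefill_ms + decode_ms) / 1000.0


if __name__ == "__main__":
    print(f"decode: {decode_tok_per_s():.1f} tok/s")
    print(f"100-token prompt + 64 new tokens: {estimated_seconds(100, 64):.1f} s")
```

Under these assumptions prefill cost grows linearly with prompt length, so a 100-token prompt already spends about 22 s in prefill before the first generated token.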

Validation

Greedy (temp=0) vs the HF fp32 reference (unsloth/gemma-3n-E4B-it), swept over input length up to 70 % of MAXCTX (1024): 6/6 lengths coherent, zero degenerate repetition; 10/10 diverse prompts correct (factual / arithmetic / reasoning / code / translation / story). W8A16 drifts from the exact fp32 greedy chain on near-ties (the gate is coherence, not token-identity; at 103 tokens it matched 12/12).
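
Two of the checks described above, agreement with the reference greedy chain and absence of degenerate repetition, can be approximated with simple token-level helpers. The sketch below is not the validation harness used for this bundle: the function names and the repetition criterion (an n-gram repeated back to back) are hypothetical choices for illustration.

```python
from typing import Sequence


def matching_prefix(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids on which two greedy chains agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def has_degenerate_repetition(tokens: Sequence[int], ngram: int = 4,
                              repeats: int = 3) -> bool:
    """True if some n-gram occurs `repeats` times back to back."""
    tokens = list(tokens)
    span = ngram * repeats
    for i in range(len(tokens) - span + 1):
        first = tokens[i:i + ngram]
        if all(tokens[i + k * ngram:i + (k + 1) * ngram] == first
               for k in range(1, repeats)):
            return True
    return False


if __name__ == "__main__":
    reference = [11, 12, 13, 14, 15]
    candidate = [11, 12, 13, 99, 15]
    print("matching prefix:", matching_prefix(reference, candidate))
    print("looping output: ", has_degenerate_repetition([7, 8, 9, 10] * 3))
```

A prefix count tolerates the near-tie drift expected from int8 weights, whereas a strict equality check on the whole chain would not.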

Caveats

  • v81-only here (re-export for v79/v83). W8A16 decode: greedy is coherent + correct, not bit-identical to fp32 (expected for int8 weights). Long prompts are slow to ingest (linear decode-over-prompt prefill).
  • Built + run with QHexRT + the QAIRT 2.47 SDK. Base model © Google, under the Gemma license; your use must comply with it.