Gemma-3n-E4B: Qualcomm Hexagon NPU (QHexRT) bundle
A prebuilt QHexRT bundle that runs Gemma-3n-E4B (3.93B) fully on the Qualcomm Hexagon NPU, with no CPU/GPU fallback in the compute loop. Decode runs W8A16 on the HTP. Validated on a Samsung S25 (SM8850 / Hexagon v81).
Arch-pinned. These context binaries are finalized for v81 (soc_model 87) and will not load on another Hexagon arch (the dsp_arch/soc_model are baked in). One flat v81/ dir = the artifacts root.
What's inside / what's optimized
Gemma-3n is unusually hard to put on an NPU: AltUp 4-stream hidden, Per-Layer-Embeddings (PLE), Laurel augmented residual, KV-sharing (donor layers 18/19), dual local/global RoPE, per-head QK-norm, gaussian-topk MLP sparsity, tied lm-head, logit softcap. This bundle handles all of it:
- 3-part AltUp-stream decode split. The 35-layer decode exceeds the ~3.5 GiB HTP per-context serialize limit, so it is partitioned into 3 contiguous layer-range W8 context graphs (A0 [0,10) · A1 [10,20) · B [20,35)), chained by the C++ host-op gemma3n_split_generate (streams handed off part→part; donor 18/19 K/V routed A1→B).
- NPU fp16 lm-head (x @ Eᵀ on HMX): replaces a host-side dot over the 262 400-row embed (101 → 23 ms/tok), bit-identical.
- W8A16 decode (weight-only int8, fp16 activations) at O3 / VTCM 8 MB: the v81 production quant floor.
- max-abs RMSNorm in the exporter: gemma3n's huge post_laurel_norm weights inflate the residual until mean(x²) overflows f16 (= +inf → NaN) on the f16 HTP; scaling by max|x| before squaring is math-exact and is what makes the model run at all on-device.
- Host does scaled embed + PLE projection (threaded) + dual RoPE + causal mask + the KV ring; prefill is decode-over-prompt.
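The max-abs RMSNorm point can be illustrated with a small stdlib-only sketch. It is not the exporter's code: it emulates fp16 rounding with `struct`'s half-precision format, omits the learned weight and the epsilon term, and uses made-up input values. It relies on the identity x / rms(x) = u / rms(u) for u = x / max|x|, which keeps every square at or below 1.

```python
import math
import struct


def f16(v):
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    if math.isinf(v) or math.isnan(v):
        return v
    try:
        return struct.unpack("<e", struct.pack("<e", v))[0]
    except OverflowError:  # |v| too large for fp16 (max finite value is 65504)
        return math.copysign(math.inf, v)


def mean_square_f16(x):
    """mean(x^2) with every intermediate rounded to fp16."""
    acc = 0.0
    for v in x:
        acc = f16(acc + f16(v * v))
    return f16(acc / len(x))


def rmsnorm_maxabs_f16(x):
    """x / rms(x), computed as u / rms(u) with u = x / max|x| so |u| <= 1."""
    m = max(abs(v) for v in x)
    if m == 0.0:
        return [0.0 for _ in x]
    u = [f16(v / m) for v in x]
    rms_u = f16(math.sqrt(mean_square_f16(u)))
    return [f16(v / rms_u) for v in u]


def rmsnorm_ref(x):
    """Float64 reference, no learned weight, no epsilon."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


x = [1000.0, -2000.0, 500.0, 3000.0]  # hypothetical large residual values
naive_ms = mean_square_f16(x)          # squares exceed 65504, so this is inf
stable = rmsnorm_maxabs_f16(x)         # finite, close to the float64 reference
reference = rmsnorm_ref(x)
```

In this emulation the naive mean-square is +inf while the max-abs form stays finite and tracks the float64 reference to fp16 precision. How the overflow turns into NaN on the device depends on the ops that follow in the real graph, which this sketch does not model.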
Files (v81/)
| file | role |
|---|---|
| gemma-3n-E4B-it.json | QHexRT manifest (the declarative run plan) |
| gemma3ne4bit_decode_p0_w8.bin | decode part A0: layers 0–9 (W8) |
| gemma3ne4bit_decode_p1_w8.bin | decode part A1: layers 10–19 (W8, owns donors 18/19) |
| gemma3ne4bit_decode_p2_w8.bin | decode part B: layers 20–34 (W8) |
| gemma3ne4bit_lmh_f16.bin | tied lm-head x @ Eᵀ (fp16, NPU/HMX) |
| gemma3ne4bit_embed_f16.bin | input embedding table (fp16, host lookup) |
| gemma3ne4bit_ple_table_f16.bin | Per-Layer-Embedding table (fp16, host) |
| gemma3ne4bit_ple_proj_f16.bin | PLE projection weight (fp16, host) |
| gemma3ne4bit_ple_norm_f16.bin | PLE norm weight (fp16, host) |
| tokenizer.json | tokenizer (Gemma BPE) |
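
As a reading aid, the layer ranges and file names listed above can be written as a small lookup. This is only a sketch of the partition as documented here; the names `DECODE_PARTS`, `KV_DONOR_LAYERS`, and `part_for_layer` are hypothetical and are not part of QHexRT or the manifest schema.

```python
# Half-open layer ranges per decode part, taken from the file table.
DECODE_PARTS = [
    ("A0", range(0, 10), "gemma3ne4bit_decode_p0_w8.bin"),
    ("A1", range(10, 20), "gemma3ne4bit_decode_p1_w8.bin"),
    ("B", range(20, 35), "gemma3ne4bit_decode_p2_w8.bin"),
]

# KV-sharing donor layers; their K/V is routed from part A1 to part B.
KV_DONOR_LAYERS = (18, 19)


def part_for_layer(layer):
    """Return (part name, context binary) for a decoder layer index."""
    for name, layers, filename in DECODE_PARTS:
        if layer in layers:
            return name, filename
    raise ValueError(f"layer {layer} is outside the 35-layer decode")


donor_parts = {part_for_layer(layer)[0] for layer in KV_DONOR_LAYERS}
covered = sorted(layer for _, layers, _ in DECODE_PARTS for layer in layers)
```

Both donor layers resolve to part A1, and the three ranges together cover layers 0 through 34 with no gap or overlap.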
Run (QHexRT qhx_generate)
The QNN runtime libs (libQnnHtp.so, libQnnSystem.so, the v81 HTP skel) come from the QAIRT SDK, not
this repo. Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> ….
hf download runanywhere/gemma3n_e4b_HNPU --local-dir gemma3n_e4b_HNPU
adb push gemma3n_e4b_HNPU/v81 /data/local/tmp/wq/gemma_e4b # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_generate gemma_e4b/gemma-3n-E4B-it.json libQnnHtp.so libQnnSystem.so gemma_e4b \
64 'The capital of France is'"
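
If the command is assembled from a script, the invariant argument order can be captured in one helper. The sketch below only builds the argument list and a shell-quoted string; it does not call adb or the tool. The helper name `qhx_argv` is hypothetical, and reading the trailing `64` and the quoted string as a token count and a prompt is an assumption based on the example above.

```python
import shlex

# Runtime libraries in the fixed positions the tool expects.
QNN_LIBS = ["libQnnHtp.so", "libQnnSystem.so"]


def qhx_argv(tool, manifest, artifacts_root, *trailing):
    """Build <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> ..."""
    return [tool, manifest, *QNN_LIBS, artifacts_root, *[str(t) for t in trailing]]


argv = qhx_argv(
    "./qhx_generate",
    "gemma_e4b/gemma-3n-E4B-it.json",
    "gemma_e4b",
    64,                          # assumed: number of tokens to generate
    "The capital of France is",  # assumed: the prompt
)
command = shlex.join(argv)       # shell-quoted form, ready to embed in adb shell
```

`shlex.join` quotes only the prompt, so `command` reproduces the tool invocation shown in the example.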
Measured performance (v81 / SM8850, burst-TURBO)
- Decode: 4.1 tok/s (243 ms/tok = 3-part graph 221 ms + NPU lm-head 22 ms), stable across context length.
- Prefill: decode-over-prompt, ~221 ms/prompt-token (a batched prefill graph is future work).
- Memory: ~10 GB of artifacts (3 W8 decode parts + fp16 lm-head + fp16 embed + 4.7 GB fp16 PLE table).
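
The figures above translate directly into latency estimates. The sketch below is plain arithmetic on the quoted numbers; the function names and the 700-token prompt are illustrative, and it assumes the per-token costs stay constant, as the "stable across context length" note suggests.

```python
GRAPH_MS = 221.0    # 3-part decode graph, per token (from the measurements)
LM_HEAD_MS = 22.0   # NPU lm-head, per token (from the measurements)


def decode_tok_per_s():
    """Decode throughput implied by the per-token latency breakdown."""
    return 1000.0 / (GRAPH_MS + LM_HEAD_MS)


def prefill_seconds(prompt_tokens):
    """Decode-over-prompt prefill: roughly 221 ms for each prompt token."""
    return prompt_tokens * GRAPH_MS / 1000.0


def total_seconds(prompt_tokens, new_tokens):
    """Prefill plus generation time for one request."""
    return prefill_seconds(prompt_tokens) + new_tokens / decode_tok_per_s()


throughput = decode_tok_per_s()        # about 4.1 tok/s
long_prompt = prefill_seconds(700)     # about 154.7 s for a 700-token prompt
request = total_seconds(700, 64)       # prefill plus 64 generated tokens
```

A 700-token prompt therefore takes about two and a half minutes to ingest before the first generated token, which is why long prompts are the main cost with decode-over-prompt prefill.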
Validation
Greedy (temp=0) vs the HF fp32 reference (unsloth/gemma-3n-E4B-it), swept over input length up to 70 % of
MAXCTX (1024): 6/6 lengths coherent, zero degenerate repetition; 10/10 diverse prompts correct
(factual / arithmetic / reasoning / code / translation / story). W8A16 drifts from the exact fp32 greedy
chain on near-ties (the gate is coherence, not token-identity; at 103 tokens it matched 12/12).
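
One simple way to quantify drift from the reference greedy chain is the length of the shared token prefix. The sketch below is an illustration only, not the validation harness used for this bundle, and the token ids are made up.

```python
def common_prefix_len(a, b):
    """Number of leading positions where two token-id sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical greedy outputs: fp32 reference vs. W8A16 on-device.
reference_ids = [651, 6037, 576, 6081, 603, 7127, 235265]
device_ids = [651, 6037, 576, 6081, 603, 573, 3413]

agree = common_prefix_len(reference_ids, device_ids)  # 5 tokens match here
```

A short shared prefix does not by itself indicate a bad run: after a near-tie flips one token, the two chains diverge while both can remain coherent.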
Caveats
- v81-only here (re-export for v79/v83). W8A16 decode: greedy is coherent + correct, not bit-identical to fp32 (expected for int8 weights). Long prompts are slow to ingest (linear decode-over-prompt prefill).
- Built + run with QHexRT + the QAIRT 2.47 SDK. Base model © Google, under the Gemma license; your use must comply with it.