Gemma-3n-E4B — Qualcomm Hexagon NPU (QHexRT) bundle

A prebuilt QHexRT bundle that runs Gemma-3n-E4B (3.93B) fully on the Qualcomm Hexagon NPU — no CPU/GPU fallback in the compute loop. Decode runs W8A16 on the HTP. Validated on a Samsung S25 (SM8850 / Hexagon v81).

Arch-pinned. These context binaries are finalized for v81 (soc_model 87) and will not load on another Hexagon arch, because the dsp_arch and soc_model are baked in. The single flat v81/ directory is the artifacts root passed to the tools.

What's inside / what's optimized

Gemma-3n is unusually hard to put on an NPU: a 4-stream AltUp hidden state, Per-Layer Embeddings (PLE), the Laurel augmented residual, KV sharing (donor layers 18/19), dual local/global RoPE, per-head QK-norm, gaussian-topk MLP sparsity, a tied lm-head, and a logit softcap. This bundle handles all of them:

3-part AltUp-stream decode split. The 35-layer decode exceeds the ~3.5 GiB HTP per-context serialize limit, so it is partitioned into 3 contiguous layer-range W8 context graphs — A0[0,10) · A1[10,20) · B[20,35) — chained by the C++ host-op gemma3n_split_generate (streams hand-off part→part; donor 18/19 K/V routed A1→B).
NPU fp16 lm-head (x @ Eᵀ on HMX). Replaces a host-side dot product over the 262 400-row embedding table (101 → 23 ms/tok) with bit-identical results.
W8A16 decode (weight-only int8, fp16 activations) at O3 / VTCM 8 MB — the v81 production quant floor.
Max-abs RMSNorm in the exporter. Gemma-3n's very large post_laurel_norm weights inflate the residual until mean(x²) overflows fp16 (+inf, then NaN) on the fp16 HTP. Dividing x by max|x| before squaring is math-exact, since RMSNorm is unchanged by a uniform rescale of its input apart from the epsilon term, and it is what makes the model run at all on-device.
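The host handles the scaled embedding lookup, the PLE projection (threaded), dual RoPE, the causal mask, and the KV ring; prefill is decode-over-prompt, i.e. the prompt is fed through the decode graphs one token at a time.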
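The max-abs RMSNorm point above can be illustrated with a small NumPy sketch. This is not the exporter's code: the function names, the `eps` handling (applied in the scaled domain), and the test vector are assumptions chosen only to show why squaring overflows fp16 and why pre-scaling by max|x| avoids it.

```python
import numpy as np

def rmsnorm_ref(x, w, eps=1e-6):
    """Float64 reference RMSNorm."""
    x = x.astype(np.float64)
    return x / np.sqrt(np.mean(x * x) + eps) * w.astype(np.float64)

def rmsnorm_naive_fp16(x, w, eps=1e-6):
    """All intermediates in fp16: x*x exceeds the fp16 max (65504) for |x| > ~256."""
    x = x.astype(np.float16)
    with np.errstate(over="ignore", invalid="ignore"):
        mean_sq = np.mean(x * x, dtype=np.float16)  # becomes +inf for large x
        return x / np.sqrt(mean_sq + np.float16(eps)) * w.astype(np.float16)

def rmsnorm_maxabs_fp16(x, w, eps=1e-6):
    """Divide by max|x| first so every squared term is <= 1 and stays finite.

    Assumption: eps is added in the scaled domain, which differs from the
    unscaled formula only through the (tiny) epsilon term.
    """
    x = x.astype(np.float16)
    m = np.max(np.abs(x))
    if m == 0:
        return np.zeros_like(x)
    xs = x / m                                       # |xs| <= 1
    mean_sq = np.mean(xs * xs, dtype=np.float16)     # <= 1, no overflow
    return xs / np.sqrt(mean_sq + np.float16(eps)) * w.astype(np.float16)

rng = np.random.default_rng(0)
x = rng.standard_normal(256) * 1000.0   # large residual, still representable in fp16
w = np.ones(256)

ref = rmsnorm_ref(x, w)
naive = rmsnorm_naive_fp16(x, w).astype(np.float64)
safe = rmsnorm_maxabs_fp16(x, w).astype(np.float64)
print("naive max error  :", np.max(np.abs(naive - ref)))
print("max-abs max error:", np.max(np.abs(safe - ref)))
```

In this NumPy emulation the naive path collapses to zeros rather than NaN (x divided by +inf); the exact failure mode on the HTP depends on how the graph computes the reciprocal square root.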

Files (`v81/`)

| File | Role |
| --- | --- |
| `gemma-3n-E4B-it.json` | QHexRT manifest (the declarative run plan) |
| `gemma3ne4bit_decode_p0_w8.bin` | decode part A0: layers 0–9 (W8) |
| `gemma3ne4bit_decode_p1_w8.bin` | decode part A1: layers 10–19 (W8, owns donors 18/19) |
| `gemma3ne4bit_decode_p2_w8.bin` | decode part B: layers 20–34 (W8) |
| `gemma3ne4bit_lmh_f16.bin` | tied lm-head `x@Eᵀ` (fp16, NPU/HMX) |
| `gemma3ne4bit_embed_f16.bin` | input embedding table (fp16, host lookup) |
| `gemma3ne4bit_ple_table_f16.bin` | Per-Layer-Embedding table (fp16, host) |
| `gemma3ne4bit_ple_proj_f16.bin` | PLE projection weight (fp16, host) |
| `gemma3ne4bit_ple_norm_f16.bin` | PLE norm weight (fp16, host) |
| `tokenizer.json` | tokenizer (Gemma BPE) |
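As a sanity check on the three decode parts listed above, the following sketch records their layer ranges and verifies that they tile all 35 layers without gaps or overlap. The `PARTS` structure and helper names are hypothetical and only mirror the table; the real run plan lives in the manifest.

```python
# (part name, file, first layer, one-past-last layer), copied from the file list.
PARTS = [
    ("A0", "gemma3ne4bit_decode_p0_w8.bin", 0, 10),
    ("A1", "gemma3ne4bit_decode_p1_w8.bin", 10, 20),
    ("B", "gemma3ne4bit_decode_p2_w8.bin", 20, 35),
]
NUM_LAYERS = 35
DONOR_LAYERS = (18, 19)  # KV-sharing donor layers

def part_for_layer(layer):
    """Return the (name, file) of the decode part that contains `layer`."""
    for name, path, start, stop in PARTS:
        if start <= layer < stop:
            return name, path
    raise ValueError(f"layer {layer} is outside [0, {NUM_LAYERS})")

def check_partition(parts, num_layers):
    """True if the half-open ranges are contiguous and cover [0, num_layers)."""
    expected_start = 0
    for _, _, start, stop in parts:
        if start != expected_start or stop <= start:
            return False
        expected_start = stop
    return expected_start == num_layers

print(check_partition(PARTS, NUM_LAYERS))
print({layer: part_for_layer(layer)[0] for layer in DONOR_LAYERS})
```

Both donor layers fall in part A1, which is why their K/V has to be routed forward to part B.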

Run (QHexRT `qhx_generate`)

The QNN runtime libs (libQnnHtp.so, libQnnSystem.so, the v81 HTP skel) come from the QAIRT SDK, not this repo. Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> ….

hf download runanywhere/gemma3n_e4b_HNPU --local-dir gemma3n_e4b_HNPU
adb push gemma3n_e4b_HNPU/v81 /data/local/tmp/wq/gemma_e4b      # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma_e4b/gemma-3n-E4B-it.json libQnnHtp.so libQnnSystem.so gemma_e4b \
  64 'The capital of France is'"
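When scripting runs from a host machine, a small Python helper can assemble the same command while keeping the fixed argument order. This is only a sketch: the function names are hypothetical, the trailing arguments (`64` and the prompt) are passed through as opaque values because their meaning is not specified here, and the paths simply repeat the example above.

```python
import shlex

def qhx_generate_argv(manifest, artifacts_root, *extra,
                      tool="./qhx_generate",
                      htp_lib="libQnnHtp.so",
                      system_lib="libQnnSystem.so"):
    """Build argv in the fixed order: tool, manifest, HTP lib, System lib, root, extras."""
    return [tool, manifest, htp_lib, system_lib, artifacts_root,
            *[str(e) for e in extra]]

def adb_shell_command(argv, workdir, adsp_library_path):
    """Wrap argv into a single `adb shell` invocation (list form, for subprocess.run)."""
    inner = (
        f"cd {shlex.quote(workdir)} && "
        f"export ADSP_LIBRARY_PATH={shlex.quote(adsp_library_path)}; "
        f"LD_LIBRARY_PATH=. {shlex.join(argv)}"
    )
    return ["adb", "shell", inner]

argv = qhx_generate_argv(
    "gemma_e4b/gemma-3n-E4B-it.json", "gemma_e4b",
    64, "The capital of France is",
)
cmd = adb_shell_command(
    argv,
    workdir="/data/local/tmp/wq",
    adsp_library_path="/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp",
)
print(cmd[2])
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `adb` is on the PATH.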

Measured performance (v81 / SM8850, burst-TURBO)

Decode: 4.1 tok/s (243 ms/tok = 3-part graph 221 ms + NPU lm-head 22 ms) — stable across context length.
Prefill: decode-over-prompt, ~221 ms/prompt-token (a batched prefill graph is future work).
Memory: ~10 GB of artifacts (3 W8 decode parts + fp16 lm-head + fp16 embed + 4.7 GB fp16 PLE table).
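The figures above can be turned into a rough latency and size estimate. The sketch below uses only the measured per-token numbers from this section; the PLE table shape (262144 rows, 35 layers, 256 values per layer) is an assumption taken from the public Gemma-3n configuration, not from this bundle, and the helper names are hypothetical.

```python
GRAPH_MS = 221.0     # 3-part decode graph, per token (measured above)
LM_HEAD_MS = 22.0    # NPU lm-head, per token (measured above)

def decode_tok_per_s():
    """Steady-state decode rate implied by the per-token costs."""
    return 1000.0 / (GRAPH_MS + LM_HEAD_MS)

def estimated_latency_s(prompt_tokens, new_tokens):
    """Rough end-to-end time: decode-over-prompt prefill plus generation."""
    prefill_ms = prompt_tokens * GRAPH_MS
    decode_ms = new_tokens * (GRAPH_MS + LM_HEAD_MS)
    return (prefill_ms + decode_ms) / 1000.0

# Assumed PLE table shape (from the public Gemma-3n config, not this bundle).
PLE_ROWS, PLE_LAYERS, PLE_DIM, FP16_BYTES = 262144, 35, 256, 2
ple_table_bytes = PLE_ROWS * PLE_LAYERS * PLE_DIM * FP16_BYTES

print(f"decode rate: {decode_tok_per_s():.2f} tok/s")
print(f"700-token prompt + 64 new tokens: {estimated_latency_s(700, 64):.1f} s")
print(f"PLE table: {ple_table_bytes / 1e9:.2f} GB")
```

Under these assumptions a 700-token prompt spends about 155 s in prefill alone, which is the cost the batched prefill graph would remove.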

Validation

Greedy (temp=0) output was compared against the HF fp32 reference (unsloth/gemma-3n-E4B-it), swept over input lengths up to 70% of MAXCTX (1024): 6/6 lengths coherent, zero degenerate repetition; 10/10 diverse prompts correct (factual / arithmetic / reasoning / code / translation / story). W8A16 drifts from the exact fp32 greedy chain on near-ties, so the gate is coherence, not token identity (at 103 tokens it matched 12/12).
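Two of the checks mentioned above, agreement with the reference chain and absence of degenerate repetition, can be approximated with simple token-level helpers. This is a hypothetical sketch, not the validation harness used for this bundle; the n-gram size and repeat threshold are arbitrary illustrative defaults.

```python
def matching_prefix_len(reference, candidate):
    """Number of leading tokens on which the two greedy chains agree."""
    n = 0
    for a, b in zip(reference, candidate):
        if a != b:
            break
        n += 1
    return n

def has_degenerate_repetition(tokens, ngram=4, min_repeats=3):
    """True if some n-gram occurs at least `min_repeats` times back-to-back."""
    tokens = list(tokens)
    span = ngram * min_repeats
    for i in range(len(tokens) - span + 1):
        unit = tokens[i:i + ngram]
        if all(tokens[i + k * ngram:i + (k + 1) * ngram] == unit
               for k in range(1, min_repeats)):
            return True
    return False

ref_ids = [10, 11, 12, 13, 14, 15]
npu_ids = [10, 11, 12, 99, 14, 15]   # diverges at position 3 (a near-tie, say)
print(matching_prefix_len(ref_ids, npu_ids))
print(has_degenerate_repetition([1, 2, 3, 4] * 3))
print(has_degenerate_repetition(list(range(20))))
```

A coherence gate of this kind accepts a chain that diverges from the reference as long as it does not loop, which matches the stated intent that token identity is not required.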

Caveats

This bundle is v81-only; v79/v83 need a re-export. W8A16 decode: greedy output is coherent and correct, but not bit-identical to fp32 (expected for int8 weights). Long prompts are slow to ingest because decode-over-prompt prefill is linear in prompt length.
Built + run with QHexRT + the QAIRT 2.47 SDK. Base model © Google, under the Gemma license — your use must comply with it.
