Gemma-4-E2B: QHexRT NPU bundle (Hexagon v79 + v81)
Prebuilt QNN context binaries to run Google Gemma-4-E2B (text and multimodal image→text) on the Qualcomm Hexagon NPU with QHexRT, a thin C++ runtime that executes these binaries from a declarative manifest (no Genie, no Python in the hot path).
Arches present: v79 (SM8750 / Galaxy S25) and v81 (SM8850). A QNN context binary is arch-pinned
(a v79 bin won't load on v81), so each arch has its own flat <arch>/ dir. Pick the one matching your device.
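Picking the arch directory can be scripted. The sketch below encodes only the two SoC-to-arch mappings stated above; the helper name `arch_for_soc` is hypothetical, and reading the SoC from the `ro.soc.model` property is an assumption about the device rather than something this bundle documents.

```shell
# Map a SoC model string to the bundle's arch directory.
# Only the two mappings given in this README are encoded; anything else is rejected.
arch_for_soc() {
  case "$1" in
    SM8750) echo "v79" ;;
    SM8850) echo "v81" ;;
    *) echo "unsupported SoC: $1" >&2; return 1 ;;
  esac
}

# Hypothetical usage (assumes the device reports its SoC via ro.soc.model):
#   soc=$(adb shell getprop ro.soc.model | tr -d '\r')
#   arch=$(arch_for_soc "$soc") && adb push "gemma4_e2b_HNPU/$arch" /data/local/tmp/wq/gemma
```

Failing on an unknown SoC is deliberate: a context binary for the wrong arch will not load, so falling back to a default directory would only move the error onto the device.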
v81 (SM8850): text + vision device-validated, audio experimental
Device-validated on SM8850 (soc_model 87) · 2026-06-27 · decode-over-prompt (no prefill chunks) · W8A16 decode + NPU fp16 lm-head.
| modality | manifest | status | notes |
|---|---|---|---|
| text | gemma4-e2b.json | ✅ validated | ~12.4 tok/s; "The capital of France is Paris", 15×12→"180", coherent across prompts |
| vision | gemma4-e2b-vlm.json | ✅ validated | image-grounded (768 px → 256 image tokens); caption depth base-model-limited |
| audio | gemma4-e2b-audio.json | ⚠️ experimental | encoder + soft tokens correct (cos 0.996 vs HF) but the on-device decode drifts and does not transcribe (hardware/toolchain limit, below) |
Audio caveat (honest). The audio pipeline is fully ported, and the conformer and projected soft tokens are
correct: HF transcribes the device's own soft tokens ("Mr Quilter is the apostle of the middle class"). The
on-device decode, however, emits dates/numbers instead of the transcript. This is conclusively root-caused to
HTP f16 accumulation in the decode MLP. The W8 weights are ruled out: HF with per-row-int8 decoder weights
still transcribes, fp16-single decode won't load, and mixed-precision is toolchain-blocked on QAIRT 2.47. This
is the same wall as gemma-4-E4B. Two faults compound (deep-dive). First, the fp16 NPU lm-head flips the
262k-vocab first-token near-tie; the fp32 host lm-head variant (gemma4-e2b-audio-hostlm.json, no lm-head graph →
host fp32 dot) recovers the correct first token "Mr" on some builds. Second, the decode sits on the f16 edge,
so compile nondeterminism can flip the first token. A full fix needs both the fp32 lm-head and a more precise
(int16/int32) decode. It will transcribe on a newer QAIRT with mixed-precision MLP accumulation, with no bundle
change needed. Treat audio as experimental: it runs end-to-end and the soft tokens are correct (HF transcribes
them), but the on-device transcript is not reliable on v81/QAIRT-2.47.
Reuse note. The v81 vision and audio encoders are the gemma-4 shared towers (weights are byte-identical to gemma-4-E4B); only the to-text projectors + the text decoder are E2B-specific.
What's inside (v81/):
- Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text), gemma4-e2b-audio.json (experimental), gemma4-e2b-audio-hostlm.json (experimental audio with the fp32 host lm-head; see the audio caveat).
- LLM: gemma4_dec_i8.bin (W8A16 35-layer/15-KV-cache decode), gemma4_lmhead_f16.bin (NPU fp16 tied lm-head).
- Vision: gemma4_vis_16blk_r2_f16.bin (16-layer rank-2 encoder, fp16).
- Audio: gemma4_audio_enc_f16.bin (12-layer conformer, fp16) + audio_host/ (mel / subsample / embed_audio_proj) + audio_fix/ (a staged clip's encoder features for the experimental test).
- Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json, test.jpg (a VLM demo image). (v81 has no depth-chunked prefill graphs; it decodes over the prompt.)
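Because the bundle is large, it can be worth confirming the download is complete before pushing it. The following is a minimal sketch: `missing_files` and `V81_TEXT_FILES` are hypothetical names, the file list is copied from the inventory above, and treating exactly these files as what the text manifest needs is an assumption (the manifest itself is the authority). It also assumes gemma4_rope_inv_{s,f}.bin expands to the two names shown.

```shell
# Print each expected file that is missing from a bundle directory.
# Usage: missing_files <dir> <file>...
# Returns 0 if every file is present, 1 otherwise.
missing_files() {
  dir=$1; shift
  rc=0
  for f in "$@"; do
    if [ ! -e "$dir/$f" ]; then
      echo "$f"
      rc=1
    fi
  done
  return $rc
}

# Files assumed to be needed for the v81 text path (names from the list above).
V81_TEXT_FILES="gemma4-e2b.json gemma4_dec_i8.bin gemma4_lmhead_f16.bin \
gemma4_embed_f16.bin gemma4_ple_table_f16.bin gemma4_ple_projw_f16.bin \
gemma4_ple_norm_f16.bin gemma4_rope_inv_s.bin gemma4_rope_inv_f.bin tokenizer.json"

# Example (word splitting of the unquoted list is intentional):
#   missing_files gemma4_e2b_HNPU/v81 $V81_TEXT_FILES || echo "bundle incomplete"
```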
Run (v81):
hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v81 /data/local/tmp/wq/gemma # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"
# VLM: ./qhx_generate gemma/gemma4-e2b-vlm.json libQnnHtp.so libQnnSystem.so gemma 32 'What animal is shown?' gemma/test.jpg
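The text and VLM invocations differ only in the manifest, the prompt, and an optional trailing image path, so the on-device command line can be assembled by a small helper. This is a sketch inferred from the two commands above, not a documented interface: `qhx_cmd` is a hypothetical name, and the meaning of the fixed arguments (`gemma` as the asset directory, the number as a generation budget) is assumed from their position.

```shell
# Build the on-device qhx_generate command line.
# Usage: qhx_cmd <manifest> <count> <prompt> [image]
# Argument order mirrors the commands above; <count> is passed through as-is
# (assumed, not confirmed, to be the number of tokens to generate).
# Limitation: the prompt must not contain single quotes.
qhx_cmd() {
  manifest=$1
  count=$2
  prompt=$3
  image=${4:-}
  cmd="./qhx_generate gemma/$manifest libQnnHtp.so libQnnSystem.so gemma $count '$prompt'"
  if [ -n "$image" ]; then
    cmd="$cmd gemma/$image"
  fi
  printf '%s\n' "$cmd"
}

# Example:
#   qhx_cmd gemma4-e2b-vlm.json 32 'What animal is shown?' test.jpg
```

The helper only prints the command; the `adb shell` wrapper with `ADSP_LIBRARY_PATH` and `LD_LIBRARY_PATH` from the Run step still has to be applied around it.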
v79 (SM8750 / Galaxy S25): text + vision, device-validated
Device-validated. Benchmarks (S25 / v79):
| mode | manifest | first-token | decode | notes |
|---|---|---|---|---|
| text | gemma4-e2b.json | 0.65–1.4 s (S=16–505) | ~10–11 tok/s | depth-chunked P=512 variable-length prefill (any prompt ≤512 in one pass) |
| VLM | gemma4-e2b-vlm.json | ~1.9 s (vision+prefill) | ~10 tok/s | 768 px → 256 image tokens, S≈273 |
What's inside (v79/):
- Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text).
- LLM graphs: gemma4_dec_wqo_o3.bin (int8 decode, 35 layers/15 KV caches), gemma4_lmhead_f16_o3.bin (fp16 tied lm-head), gemma4_vlayer{0..34}_i8.bin (35 P=512 depth-chunked prefill layers).
- Vision graph (VLM): gemma4_vis_16blk_r2_o3.bin (16-layer encoder, fp16, rank-2).
- Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_fnorm.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json.
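The v79 bundle ships one prefill graph per layer, so a partial download is easy to miss. A minimal sketch of a completeness check follows; `missing_prefill_layers` is a hypothetical name, and it assumes gemma4_vlayer{0..34}_i8.bin expands to unpadded indices (gemma4_vlayer0_i8.bin through gemma4_vlayer34_i8.bin).

```shell
# List any of the 35 depth-chunked prefill graphs that are absent from <dir>.
# Returns 0 only when all layers 0..34 are present.
missing_prefill_layers() {
  dir=$1
  i=0
  missing=0
  while [ "$i" -le 34 ]; do
    if [ ! -e "$dir/gemma4_vlayer${i}_i8.bin" ]; then
      echo "gemma4_vlayer${i}_i8.bin"
      missing=$((missing + 1))
    fi
    i=$((i + 1))
  done
  [ "$missing" -eq 0 ]
}

# Example:
#   missing_prefill_layers gemma4_e2b_HNPU/v79 || echo "re-run hf download"
```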
Run (v79):
hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v79 /data/local/tmp/wq/gemma # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"
Each arch bundle is ~10–13 GB (the PLE table dominates); it's mmap'd on device (peak RSS ~3.5 GB). The architecture, the recipe, and the per-arch retarget knob (forge/recipes/gemma4-e2b/convert.sh <arch>) are in the QHexRT repo. Private bundle: request access.