Gemma-4-E2B: QHexRT NPU bundle (Hexagon v79 + v81)

Prebuilt QNN context binaries to run Google Gemma-4-E2B (text and multimodal image→text) on the Qualcomm Hexagon NPU with QHexRT — a thin C++ runtime that executes these binaries from a declarative manifest (no Genie, no Python in the hot path).

Arches present: v79 (SM8750 / Galaxy S25) and v81 (SM8850). A QNN context binary is arch-pinned (a v79 bin won't load on v81), so each arch has its own flat <arch>/ dir; pick the one matching your device.
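
Because the binaries are arch-pinned, choosing the directory can be scripted. The sketch below encodes only the two SoC-to-arch pairings stated above; the helper name arch_for_soc is hypothetical, and reading the SoC from the ro.soc.model property is an assumption about the device, not something this bundle documents.

```shell
# Map a SoC model string to the bundle's arch directory.
# Only the two pairings named in this README are accepted; anything else fails.
arch_for_soc() {
  case "$1" in
    SM8750) echo v79 ;;
    SM8850) echo v81 ;;
    *) echo "unsupported SoC: $1" >&2; return 1 ;;
  esac
}

# Assumption: the device reports its SoC through the ro.soc.model property.
# soc=$(adb shell getprop ro.soc.model | tr -d '\r')
# arch=$(arch_for_soc "$soc") && adb push "gemma4_e2b_HNPU/$arch" /data/local/tmp/wq/gemma
```

Failing on an unknown SoC is deliberate: pushing the wrong arch costs a multi-gigabyte transfer before the load error appears.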

v81 (SM8850): text + vision device-validated, audio experimental

Device-validated on SM8850 (soc_model 87) · 2026-06-27 · decode-over-prompt (no prefill chunks) · W8A16 decode + NPU fp16 lm-head.

modality | manifest | status | notes
text | gemma4-e2b.json | ✅ validated | ~12.4 tok/s; "The capital of France is Paris", 15×12→"180", coherent across prompts
vision | gemma4-e2b-vlm.json | ✅ validated | image-grounded (768 px → 256 image tokens); caption depth base-model-limited
audio | gemma4-e2b-audio.json | ⚠️ experimental | encoder + soft tokens correct (cos 0.996 vs HF) but the on-device decode drifts → does not transcribe (hardware/toolchain limit, below)

Audio caveat (honest). The audio pipeline is fully ported, and the conformer and projected soft tokens are correct: HF transcribes the device's own soft tokens ("Mr Quilter is the apostle of the middle class"). The on-device decode, however, emits dates/numbers instead of the transcript.

The drift is root-caused to HTP f16 accumulation in the decode MLP. The W8 weights are proven innocent: HF with per-row-int8 decoder weights still transcribes. The obvious workarounds are unavailable: an fp16-single decode won't load, and mixed precision is toolchain-blocked on QAIRT 2.47. This is the same wall as gemma-4-E4B.

Two compounding faults (deep-dive): (1) the fp16 NPU lm-head also flips the 262k-vocab first-token near-tie; the fp32 host lm-head variant (gemma4-e2b-audio-hostlm.json, no lm-head graph → host fp32 dot) recovers the correct first token "Mr" on some builds. (2) The decode sits on the f16 edge, so compile nondeterminism flips the first token. A full fix needs both the fp32 lm-head and a more-precise (int16/int32) decode. It is expected to transcribe on a newer QAIRT with mixed-precision MLP accumulation, with no bundle change needed.

Use audio as experimental: it runs end-to-end and the soft tokens are correct (HF transcribes them), but the on-device transcript is not reliable on v81/QAIRT-2.47.

Reuse note. The v81 vision and audio encoders are the gemma-4 shared towers (weights are byte-identical to gemma-4-E4B); only the to-text projectors + the text decoder are E2B-specific.

What's inside (v81/):

  • Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text), gemma4-e2b-audio.json (experimental), gemma4-e2b-audio-hostlm.json (experimental audio with the fp32 host lm-head; see the audio caveat).
  • LLM: gemma4_dec_i8.bin (W8A16 35-layer/15-KV-cache decode), gemma4_lmhead_f16.bin (NPU fp16 tied lm-head).
  • Vision: gemma4_vis_16blk_r2_f16.bin (16-layer rank-2 encoder, fp16).
  • Audio: gemma4_audio_enc_f16.bin (12-layer conformer, fp16) + audio_host/ (mel / subsample / embed_audio_proj) + audio_fix/ (a staged clip's encoder features for the experimental test).
  • Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json, test.jpg (a VLM demo image). (v81 has no depth-chunked prefill graphs; it decodes over the prompt.)
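
With this many large files, it can help to confirm the download is complete before pushing. The sketch below is a generic presence check; missing_files is a hypothetical helper, and the file list is only a guess at what the text path needs, drawn from the names above (with gemma4_rope_inv_{s,f}.bin read as two files). The manifest remains the authoritative list.

```shell
# Print every expected file that is absent from the given directory.
# Usage: missing_files DIR FILE...
missing_files() {
  dir=$1; shift
  for f in "$@"; do
    [ -e "$dir/$f" ] || echo "$f"
  done
}

# Assumed subset for the v81 text manifest, taken from the list above.
V81_TEXT_FILES="gemma4-e2b.json gemma4_dec_i8.bin gemma4_lmhead_f16.bin \
gemma4_embed_f16.bin gemma4_ple_table_f16.bin gemma4_ple_projw_f16.bin \
gemma4_ple_norm_f16.bin gemma4_rope_inv_s.bin gemma4_rope_inv_f.bin tokenizer.json"

# Example (left unquoted on purpose so the list splits into words):
# missing_files gemma4_e2b_HNPU/v81 $V81_TEXT_FILES
```

Empty output means every listed file was found.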

Run (v81):

hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v81 /data/local/tmp/wq/gemma          # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"
# VLM:  ./qhx_generate gemma/gemma4-e2b-vlm.json libQnnHtp.so libQnnSystem.so gemma 32 'What animal is shown?' gemma/test.jpg
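
The text and VLM invocations above differ only in the manifest, the prompt, and a trailing image path. A small wrapper can build either command line; this is a sketch in which sq and qhx_cmd are hypothetical helpers, the positional order is copied from the commands above, and treating the number 32 as a generation budget is an assumption.

```shell
# Wrap a string in single quotes, escaping any embedded single quote.
sq() { printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"; }

# Build the on-device command line in the same positional order as above.
# Usage: qhx_cmd MANIFEST COUNT PROMPT [IMAGE]
# COUNT is the number shown as 32 above (assumed to be a token budget).
qhx_cmd() {
  manifest=$1; count=$2; prompt=$3; image=${4:-}
  cmd="./qhx_generate gemma/$manifest libQnnHtp.so libQnnSystem.so gemma $count $(sq "$prompt")"
  if [ -n "$image" ]; then
    cmd="$cmd gemma/$image"
  fi
  echo "$cmd"
}

# Example: qhx_cmd gemma4-e2b-vlm.json 32 'What animal is shown?' test.jpg
```

The printed string is meant to replace the ./qhx_generate portion inside the adb shell "..." line; the library-path exports stay as shown above.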

v79 (SM8750 / Galaxy S25): text + vision, device-validated

Device-validated. Benchmarks (S25 / v79):

mode | manifest | first-token | decode | notes
text | gemma4-e2b.json | 0.65–1.4 s (S=16→505) | ~10–11 tok/s | depth-chunked P=512 variable-length prefill (any prompt ≤512 in one pass)
VLM | gemma4-e2b-vlm.json | ~1.9 s (vision+prefill) | ~10 tok/s | 768 px → 256 image tokens, S≈273

What's inside (v79/):

  • Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text).
  • LLM graphs: gemma4_dec_wqo_o3.bin (int8 decode, 35 layers/15 KV caches), gemma4_lmhead_f16_o3.bin (fp16 tied lm-head), gemma4_vlayer{0..34}_i8.bin (35 P=512 depth-chunked prefill layers).
  • Vision graph (VLM): gemma4_vis_16blk_r2_o3.bin (16-layer encoder, fp16, rank-2).
  • Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_fnorm.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json.

Run (v79):

hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v79 /data/local/tmp/wq/gemma          # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"

Each arch bundle is ~10–13 GB (the PLE table dominates); it's mmap'd on device, with peak RSS ~3.5 GB. Architecture, recipe, and the per-arch retarget knob (forge/recipes/gemma4-e2b/convert.sh <arch>) are in the QHexRT repo. Private bundle: request access.
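
The ~4.7 GB figure for the PLE table can be sanity-checked with a back-of-envelope product. This is a sketch under assumptions: the exact vocabulary size (262144) and the per-layer embedding width (256) are not stated in this README, so both are guesses; only the 35-layer count and fp16 storage come from the file lists above.

```shell
vocab=262144      # ASSUMPTION: exact size behind the "262k-vocab" figure
layers=35         # decode layers listed for this bundle
width=256         # ASSUMPTION: per-layer embedding width, not stated here
bytes_per_elem=2  # fp16 storage, per the _f16 file name
ple_bytes=$((vocab * layers * width * bytes_per_elem))
echo "$ple_bytes"   # 4697620480 bytes, i.e. about 4.7 GB
```

If those assumptions hold, the table alone accounts for roughly 4.7 GB of each bundle, which also explains why it is mmap'd rather than loaded into RAM.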
