Gemma-4-E2B — QHexRT NPU bundle (Hexagon v79 + v81)

Prebuilt QNN context binaries to run Google Gemma-4-E2B (text and multimodal image→text) on the Qualcomm Hexagon NPU with QHexRT — a thin C++ runtime that executes these binaries from a declarative manifest (no Genie, no Python in the hot path).

Arches present: v79 (SM8750 / Galaxy S25) and v81 (SM8850). A QNN context binary is arch-pinned (a v79 bin won't load on v81), so each arch is its own flat <arch>/ dir — pick the one matching your device.
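
Picking the directory can be scripted. The sketch below maps the two SoC names listed above to their bundle directories; `arch_for_soc` is a hypothetical helper, not part of QHexRT, and the usage comment assumes the device reports its SoC part number through the `ro.soc.model` property, which may vary by vendor.

```shell
#!/bin/sh
# Map a Qualcomm SoC name to the bundle directory for its Hexagon arch.
# Only the two SoCs named in this README are covered; anything else is rejected.
arch_for_soc() {
  case "$1" in
    SM8750) echo v79 ;;
    SM8850) echo v81 ;;
    *) echo "unsupported SoC: $1" >&2; return 1 ;;
  esac
}

# Usage (assumption: ro.soc.model returns the SM part number on your device):
#   soc=$(adb shell getprop ro.soc.model | tr -d '\r')
#   arch=$(arch_for_soc "$soc") && adb push "gemma4_e2b_HNPU/$arch" /data/local/tmp/wq/gemma
```

Failing on an unknown SoC is deliberate: a context binary built for the wrong arch won't load, so guessing a default would only move the error to the device.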

v81 (SM8850) — text + vision device-validated, audio experimental

Device-validated on SM8850 (soc_model 87) · 2026-06-27 · decode-over-prompt (no prefill chunks) · W8A16 decode + NPU fp16 lm-head.

| modality | manifest | status | notes |
|---|---|---|---|
| text | `gemma4-e2b.json` | ✅ validated | ~12.4 tok/s; "The capital of France is Paris", 15×12→"180", coherent across prompts |
| vision | `gemma4-e2b-vlm.json` | ✅ validated | image-grounded (768 px → 256 image tokens); caption depth base-model-limited |
| audio | `gemma4-e2b-audio.json` | ⚠️ experimental | encoder + soft tokens correct (cos 0.996 vs HF) but the on-device decode drifts → does not transcribe (hardware/toolchain limit, below) |

Audio caveat. The audio pipeline is fully ported, and the conformer output and the projected soft tokens are correct: HF transcribes the device's own soft tokens ("Mr Quilter is the apostle of the middle class"). The on-device decode, however, emits dates and numbers instead of the transcript.

The root cause is HTP f16 accumulation in the decode MLP. The W8 weights are ruled out (HF with per-row-int8 decoder weights still transcribes), an fp16-single decode won't load, and mixed precision is toolchain-blocked on QAIRT 2.47. This is the same wall as gemma-4-E4B.

Two faults compound (deep-dive). First, the fp16 NPU lm-head flips the first-token near-tie across the 262k vocab; the fp32 host lm-head variant (gemma4-e2b-audio-hostlm.json: no lm-head graph, host fp32 dot instead) recovers the correct first token "Mr" on some builds. Second, the decode itself sits on the f16 edge, so compile nondeterminism alone can flip the first token. A full fix needs both the fp32 lm-head and a more precise (int16/int32) decode.

A newer QAIRT with mixed-precision MLP accumulation is expected to transcribe with no bundle change. Until then, treat audio as experimental: it runs end-to-end and the soft tokens are correct (HF transcribes them), but the on-device transcript is not reliable on v81/QAIRT-2.47.

Reuse note. The v81 vision and audio encoders are the gemma-4 shared towers (weights are byte-identical to gemma-4-E4B); only the to-text projectors + the text decoder are E2B-specific.

What's inside (v81/):

Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text), gemma4-e2b-audio.json (experimental), gemma4-e2b-audio-hostlm.json (experimental audio with the fp32 host lm-head — see the audio caveat).
LLM: gemma4_dec_i8.bin (W8A16 35-layer/15-KV-cache decode), gemma4_lmhead_f16.bin (NPU fp16 tied lm-head).
Vision: gemma4_vis_16blk_r2_f16.bin (16-layer rank-2 encoder, fp16).
Audio: gemma4_audio_enc_f16.bin (12-layer conformer, fp16) + audio_host/ (mel / subsample / embed_audio_proj) + audio_fix/ (a staged clip's encoder features for the experimental test).
Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json, test.jpg (a VLM demo image). (v81 has no depth-chunked prefill graphs — it decodes over the prompt.)

Run (v81):

hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v81 /data/local/tmp/wq/gemma          # on Windows, run from PowerShell and use native paths
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"
# VLM:  ./qhx_generate gemma/gemma4-e2b-vlm.json libQnnHtp.so libQnnSystem.so gemma 32 'What animal is shown?' gemma/test.jpg
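
The text and VLM invocations differ only in the manifest and a trailing image path, so the command line can be composed by a small helper. This is a sketch: `qhx_cmd` is a hypothetical function, and the argument roles are inferred from the examples above, not from QHexRT documentation. In particular the numeric argument (32 above) is passed through as-is without assuming what it controls.

```shell
#!/bin/sh
# Compose the on-device qhx_generate command line shown in the run examples.
# $1 = manifest file name, $2 = numeric argument (32 in the examples),
# $3 = prompt, $4 = optional image file name (VLM manifest only)
qhx_cmd() {
  manifest=$1; count=$2; prompt=$3; image=${4:-}
  cmd="LD_LIBRARY_PATH=. ./qhx_generate gemma/$manifest libQnnHtp.so libQnnSystem.so gemma $count '$prompt'"
  # Append the image path only when one was given.
  if [ -n "$image" ]; then
    cmd="$cmd gemma/$image"
  fi
  printf '%s\n' "$cmd"
}

# Usage (the ADSP_LIBRARY_PATH export is copied from the run example):
#   adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
#     $(qhx_cmd gemma4-e2b-vlm.json 32 'What animal is shown?' test.jpg)"
```

The prompt is wrapped in single quotes for the device shell, so a prompt that itself contains a single quote would need escaping first.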

v79 (SM8750 / Galaxy S25) — text + vision, device-validated

Device-validated. Benchmarks (S25 / v79):

| mode | manifest | first-token | decode | notes |
|---|---|---|---|---|
| text | `gemma4-e2b.json` | 0.65–1.4 s (S=16→505) | ~10–11 tok/s | depth-chunked P=512 variable-length prefill (any prompt ≤512 in one pass) |
| VLM | `gemma4-e2b-vlm.json` | ~1.9 s (vision+prefill) | ~10 tok/s | 768 px → 256 image tokens, S≈273 |
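
To turn these figures into a rough end-to-end time, add the first-token latency to the remaining tokens divided by the decode rate. The helper below is a back-of-envelope sketch, not a measurement: `est_seconds` is a hypothetical name, and it assumes the decode rate stays constant over the whole generation.

```shell
#!/bin/sh
# Rough wall-clock estimate: first-token latency plus the remaining
# (n - 1) tokens at a constant decode rate.
# $1 = first-token latency in seconds, $2 = decode rate in tok/s, $3 = tokens to generate
est_seconds() {
  awk -v ttft="$1" -v rate="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", ttft + (n - 1) / rate }'
}

est_seconds 1.4 10 32   # text row, slowest first token, 32 tokens
est_seconds 1.9 10 32   # VLM row, 32 tokens
```

With the table's numbers this gives about 4.5 s for a long text prompt and about 5.0 s for the VLM case.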

What's inside (v79/):

Manifests: gemma4-e2b.json (text), gemma4-e2b-vlm.json (image+text).
LLM graphs: gemma4_dec_wqo_o3.bin (int8 decode, 35 layers/15 KV caches), gemma4_lmhead_f16_o3.bin (fp16 tied lm-head), gemma4_vlayer{0..34}_i8.bin (35 P=512 depth-chunked prefill layers).
Vision graph (VLM): gemma4_vis_16blk_r2_o3.bin (16-layer encoder, fp16, rank-2).
Host data: gemma4_embed_f16.bin, gemma4_ple_table_f16.bin (~4.7 GB Per-Layer-Embeddings), gemma4_ple_projw_f16.bin, gemma4_ple_norm_f16.bin, gemma4_fnorm.bin, gemma4_rope_inv_{s,f}.bin, vision host weights g4v_*, tokenizer.json.

hf download runanywhere/gemma4_e2b_HNPU --local-dir gemma4_e2b_HNPU
adb push gemma4_e2b_HNPU/v79 /data/local/tmp/wq/gemma          # on Windows, run from PowerShell and use native paths
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/data/local/tmp/wq/dsp;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate gemma/gemma4-e2b.json libQnnHtp.so libQnnSystem.so gemma 32 'What is the capital of France?'"

Each arch bundle is ~10–13 GB (the PLE table dominates). It is mmap'd on device, so peak RSS is only ~3.5 GB. The architecture, the conversion recipe, and the per-arch retarget knob (forge/recipes/gemma4-e2b/convert.sh <arch>) are in the QHexRT repo. This is a private bundle; request access to download it.
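
Given the bundle size, it can be worth checking free space on the device before pushing. A minimal sketch follows: `has_room` is a hypothetical helper, and the usage comment assumes the device's `df` accepts `-k` and prints available space in the fourth column.

```shell
#!/bin/sh
# Succeed if avail_kb (1K blocks, as printed by `df -k`) covers need_gb GiB.
# $1 = available space in KiB, $2 = required space in GiB
has_room() {
  avail_kb=$1; need_gb=$2
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage (assumption: "Available" is column 4 of the device's df output):
#   avail=$(adb shell df -k /data/local/tmp | awk 'NR==2 {print $4}')
#   has_room "$avail" 13 || echo "not enough free space for the bundle" >&2
```

The 13 GiB threshold uses the upper end of the size range quoted above; adjust it to the arch you are pushing.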
