gemma4_e4b_HNPU — Gemma-4-E4B on the Qualcomm Hexagon NPU (v81)

Prebuilt QHexRT bundle that runs google/gemma-4-E4B on the Hexagon v81 NPU (SM8850 / Snapdragon 8 Elite Gen-class, soc_model 87) — text, vision, and the audio encoder — entirely on-device (no Python in the compute loop). Download → adb push → run.

Arch-pinned. These context binaries are finalized for v81 (soc_model 87). A v81 binary will not load on another Hexagon arch (the soc/arch are baked in). Each arch is its own vXX/ dir.

Modalities & status

| Modality | Status on v81 | Notes |
|---|---|---|
| Text (LLM) | Working, device-validated | greedy-exact vs HF; ~4.4 tok/s, W8A16 |
| Vision (image→caption) | Working, device-validated | encoder cos 0.999992, soft-token 0.999996 vs HF; image-grounded captions |
| Audio encoder (conformer) | Working, device-validated | 12-layer conformer cos 0.999919 vs HF |
| Audio transcription (speech→text) | ⚠️ Experimental / not functional on v81 | see the caveat below |
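
The "cos" figures in the table are cosine similarities between a device output and the matching HF reference tensor. The sketch below shows the metric in plain Python. The function name and the sample vectors are illustrative assumptions; the actual validation harness is not part of this bundle and may compute the metric differently (for example per token or per layer).

```python
import math


def cosine_similarity(device_out, reference):
    """Cosine similarity of two equal-length flat vectors (1.0 = same direction)."""
    if len(device_out) != len(reference):
        raise ValueError("vectors must have the same length")
    dot = math.fsum(x * y for x, y in zip(device_out, reference))
    norm_a = math.sqrt(math.fsum(x * x for x in device_out))
    norm_b = math.sqrt(math.fsum(y * y for y in reference))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up values standing in for a flattened reference tensor and a
    # slightly perturbed device tensor; these are not real model outputs.
    reference = [0.5, -1.25, 2.0, 0.125]
    device_out = [0.501, -1.249, 1.998, 0.126]
    print(f"cos = {cosine_similarity(device_out, reference):.6f}")
```

Note that a cosine close to 1 only says the two tensors point in nearly the same direction. As the audio caveat shows, a high encoder cosine does not by itself guarantee that a greedy decode downstream matches.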

⚠️ Audio transcription caveat (read before using the audio path)

The audio front end is correct: the encoder matches HF at cos 0.9999, the soft tokens match at cos 0.996, and HF transcribes the clip perfectly when fed the device's own soft tokens. On v81, however, the decode emits numbers and dates instead of the transcript.

Root cause (conclusively bisected): the decode MLP's (gelu(gate)·up) @ down_proj runs with f16 accumulation, which drifts on this HTP. Over 42 layers the drift flips the sensitive audio-conditioned greedy chain. Text greedy decoding is unambiguous, so the LLM and VLM paths are unaffected.

The one clean fix, int16/int32 accumulation on the down_proj only, fails to compose on the v81 HTP (mixed int16/float in one context is unsupported), and int16-everything is worse than f16. This is a hardware/toolchain limit, not a port bug.

The full diagnosis, the complete what-was-tried table, and the ranked future work to fix it are in AUDIO_FINDINGS.md. The audio encoder and pipeline are included here as a validated, experimental base for that work.
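
To make the accumulation issue concrete, here is a minimal, host-only sketch of how a half-precision accumulator loses information in a long dot product. It is a generic illustration of f16 rounding using Python's `struct` half-float format, under the assumption that every product and partial sum is rounded to f16. It does not reproduce the HTP's actual arithmetic or the drift measured on device, and the function names are hypothetical.

```python
import math
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_f16_accum(a, b):
    """Dot product with every product and partial sum rounded to f16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(x * y))
    return acc


def dot_wide_accum(a, b):
    """Same inputs, accumulated in double precision (a stand-in for a wide accumulator)."""
    return math.fsum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Deliberately extreme case: 4096 products of exactly 1.0.
    # Above 2048 the spacing between f16 values is 2, so adding 1.0
    # rounds back to 2048 and the narrow accumulator stops growing.
    a = [1.0] * 4096
    b = [1.0] * 4096
    print(dot_f16_accum(a, b))   # 2048.0
    print(dot_wide_accum(a, b))  # 4096.0
```

Real activations do not stall this dramatically, but the same rounding happens at every add. A small per-layer error of this kind is the sort of effect that can accumulate across a deep stack and tip a close greedy decision.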

What's optimized

  • W8A16 weight-only quant on the decode + lm-head (the v81 floor; W4 is HTP-blocked), fp16 vision/encoder.
  • 3-part split decode [0,12) [12,24) [24,42) (each ~1.1–1.7 GB, under the HTP serialize budget), chained by the C++ host-op gemma4_split_generate with an f16 hidden hand-off.
  • Cross-layer KV-sharing (shared layers 24–41 reuse a same-type donor's K/V; donors 22 sliding / 23 full), dual head_dim (sliding 256 / full 512) + partial-0.25 RoPE on full-attn layers, per-layer PLE, in-graph lm-head.
  • Vision: 16-layer ViT (768px → 256 soft tokens) with host patch-embed/pos-embed/projector.
  • Audio: 12-layer conformer encoder (max-abs RMSNorm + 2D per-head attention + dim-1-gather lightconv — three v81 device fixes; see the recipe).
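
The split and KV-sharing layout above can be written down as a small lookup. The sketch below is one reading of the numbers in this list, not the actual `gemma4_split_generate` logic: the helper names are hypothetical, and whether a layer uses sliding or full attention is passed in by the caller because the per-layer attention pattern is not listed here.

```python
from bisect import bisect_right

# Figures taken from the list above.
NUM_LAYERS = 42
PART_STARTS = (0, 12, 24)   # parts cover [0,12) [12,24) [24,42)
FIRST_SHARED_LAYER = 24     # layers 24-41 reuse a donor's K/V
DONOR_SLIDING = 22          # donor for sliding-attention layers
DONOR_FULL = 23             # donor for full-attention layers


def _check_layer(layer):
    if not 0 <= layer < NUM_LAYERS:
        raise IndexError(f"layer {layer} is outside [0, {NUM_LAYERS})")


def part_of(layer):
    """Index (0, 1 or 2) of the decode part that contains `layer`."""
    _check_layer(layer)
    return bisect_right(PART_STARTS, layer) - 1


def kv_source(layer, is_full_attention):
    """Layer whose K/V `layer` reads: itself, or the same-type donor if shared."""
    _check_layer(layer)
    if layer < FIRST_SHARED_LAYER:
        return layer
    return DONOR_FULL if is_full_attention else DONOR_SLIDING


if __name__ == "__main__":
    for layer, is_full in [(5, False), (23, True), (24, False), (41, True)]:
        src = kv_source(layer, is_full)
        print(f"layer {layer:2d} (part {part_of(layer)}) "
              f"reads K/V from layer {src} (part {part_of(src)})")
```

Under this reading, both donors sit in the second part and every shared layer sits in the third, so the third part depends on K/V produced by the second. How the runtime passes that K/V between parts is not described here.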

Files (v81/)

| file | role |
|---|---|
| gemma-4-E4B.json / gemma-4-E4B-vlm.json / gemma-4-E4B-audio.json | QHexRT manifests (text / vision / audio) |
| gemma4e4b_decode_p{0,1,2}_w8.bin | W8A16 decode, 3 parts |
| gemma4e4b_lmh_f16.bin | tied lm-head (f16) |
| gemma4e4b_embed_f16.bin | token embedding table |
| gemma4e4b_ple_table_f16.bin / ple_proj_f16.bin / ple_norm_f16.bin | per-layer-input (PLE) tables |
| gemma4_vis_16blk_r2_f16.bin + g4v_patch_embed_f32.bin / g4v_pos_table_f32.bin / g4v_proj_w_f32.bin / g4v_rope_inv.bin | vision ViT graph + host weights |
| gemma4_audio_enc_f16.bin + audio_host/ + audio_fix/ | audio conformer encoder + host fixtures + staged encoder_features (experimental) |
| tokenizer.json | tokenizer |

Run (after one-time QAIRT runtime-lib staging — see QHexRT docs/DEPLOY.md)

hf download runanywhere/gemma4_e4b_HNPU --local-dir gemma4_e4b_HNPU
adb push gemma4_e4b_HNPU/v81 /data/local/tmp/wq/g4         # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate g4/gemma-4-E4B.json libQnnHtp.so libQnnSystem.so g4 40 'The capital of France is'"
# VLM:   ./qhx_generate g4/gemma-4-E4B-vlm.json libQnnHtp.so libQnnSystem.so g4 60 'Describe this image.' g4/my.jpg
# Audio (EXPERIMENTAL — emits numbers, see caveat): ./qhx_generate g4/gemma-4-E4B-audio.json libQnnHtp.so libQnnSystem.so g4 48 $'\nTranscription: '

Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> …. QNN runtime libs come from the QAIRT SDK (lib/aarch64-android/) + the v81 HTP skel — not in this repo.
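
If you script these runs from a host machine, the fixed argument order is easy to encode once. The helper below is a hypothetical convenience, not part of QHexRT: it only assembles the argument list in the order stated above and treats everything after the artifacts root as opaque, tool-specific arguments copied from the examples.

```python
import shlex

# The two runtime libraries always follow the manifest, in this order.
QNN_LIBS = ("libQnnHtp.so", "libQnnSystem.so")


def build_command(tool, manifest, artifacts_root, *tail):
    """Return argv as: <tool> <manifest> <QNN libs> <artifacts_root> <tail...>."""
    return [tool, manifest, *QNN_LIBS, artifacts_root, *(str(t) for t in tail)]


if __name__ == "__main__":
    argv = build_command(
        "./qhx_generate",
        "g4/gemma-4-E4B.json",
        "g4",
        40,
        "The capital of France is",
    )
    # shlex.join quotes the prompt so the line can be pasted into a POSIX shell.
    print(shlex.join(argv))
```

The printed line matches the on-device part of the text example above. The `adb shell` wrapper and the `ADSP_LIBRARY_PATH` / `LD_LIBRARY_PATH` setup still have to be added around it.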

Caveats

  • Base model (not -it): use the completion style shown; chat templating degenerates on the base.
  • Audio transcription is experimental/blocked on v81 (above). Raw-wav input also needs a host mel+subsample frontend (not yet shipped); audio_fix/ stages encoder_features for one test clip.
  • Built and validated with QAIRT 2.45/2.47, serial 8977b1dd (v81). The single source of truth (SSOT) for conversion is recipes/gemma-4-e4b/.