QHexRT Is Live: Full-Stack NPU Inference for Qualcomm Hexagon

Published June 26, 2026

We shipped MetalRT for Apple Silicon — the first engine to run LLM, speech, vision, and speech-to-speech in one runtime, entirely on the GPU. Today we're launching QHexRT for Qualcomm: the same bet on the NPU.

QHexRT runs inference 100% on the Hexagon NPU. No Python in the hot path. No CPU fallback during inference. We are building the widest model catalog on any Qualcomm NPU stack, with same-day support for the models the community ships.

No one has shipped a single runtime that covers LLM, VLM, STT, TTS, and embeddings fully on Qualcomm NPUs. MetalRT did it for Apple Silicon. QHexRT is built to do it for Hexagon, starting with LLMs.

First model: LFM 2.5 230M

LiquidAI released LFM 2.5 230M on June 25, 2026. QHexRT supports it on day one — our first catalog entry. The NPU bundle is here: runanywhere/lfm2_5_230m_HNPU.

Every tensor in the inference path stays on the HTP (Hexagon Tensor Processor): decode graph, prefill graph, lm-head, embeddings. Greedy output matches the source model ("The capital of France is" → " Paris.").

All modalities, one NPU runtime

Modality     Status
LLM          Live: LFM 2.5 230M on Hexagon v81
VLM          In development
STT          In development
TTS          In development
Embeddings   In development

One engine, one deployment model, every modality on the NPU — the same architecture MetalRT proved on Apple Silicon, now on Qualcomm.

The headline numbers (LFM 2.5 230M · Hexagon v81)

Benchmarked on Hexagon v81 (Snapdragon 8 Elite Gen 5, SM8850). We also ran llama.cpp on the CPU of the same chip: same die, same phone. Both stacks produce identical output.

  • Prefill: 12,540 tok/s on the NPU vs 871 on the CPU (Q8_0). 14.4x faster.
  • Time-to-first-token: 36ms flat for any prompt up to 512 tokens. The CPU takes 588ms at 512 tokens.
  • Decode: the CPU wins at 250 tok/s vs 172 on the NPU. At 230M params, decode is memory-bandwidth-bound.
  • End-to-end: the NPU wins once the prompt exceeds ~1.5x the generation length.
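The end-to-end bullet can be sanity-checked with a back-of-the-envelope model. The sketch below assumes total latency is simply prompt tokens divided by prefill rate plus generated tokens divided by decode rate, using the Q8_0 figures quoted above. The function names are illustrative and not part of QHexRT. Under this simplification the break-even ratio comes out near 1.7, in the same neighborhood as the ~1.5x quoted above; measured runs include effects this model ignores.

```python
# Headline figures quoted above (LFM 2.5 230M, SM8850), in tokens per second.
NPU_PREFILL, NPU_DECODE = 12_540.0, 172.0
CPU_PREFILL, CPU_DECODE = 871.0, 250.0


def end_to_end_seconds(prompt_tokens, generated_tokens, prefill_tps, decode_tps):
    """Simplified latency model (an assumption): prefill time plus decode time."""
    return prompt_tokens / prefill_tps + generated_tokens / decode_tps


def break_even_ratio():
    """Prompt/generation ratio at which both stacks take the same time.

    Solves P/NPU_PREFILL + G/NPU_DECODE = P/CPU_PREFILL + G/CPU_DECODE for P/G.
    """
    # Extra seconds per generated token on the NPU (slower decode).
    decode_penalty = 1 / NPU_DECODE - 1 / CPU_DECODE
    # Seconds saved per prompt token on the NPU (faster prefill).
    prefill_gain = 1 / CPU_PREFILL - 1 / NPU_PREFILL
    return decode_penalty / prefill_gain


if __name__ == "__main__":
    print(f"prefill speedup: {NPU_PREFILL / CPU_PREFILL:.1f}x")
    print(f"break-even prompt/generation ratio: {break_even_ratio():.2f}")
    for prompt, gen in [(512, 64), (64, 256)]:
        npu = end_to_end_seconds(prompt, gen, NPU_PREFILL, NPU_DECODE)
        cpu = end_to_end_seconds(prompt, gen, CPU_PREFILL, CPU_DECODE)
        print(f"P={prompt} G={gen}: NPU {npu * 1000:.0f} ms, CPU {cpu * 1000:.0f} ms")
```

In this model a long prompt with a short reply (512 in, 64 out) favors the NPU, while a short prompt with a long reply (64 in, 256 out) favors the CPU.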

QHexRT's W8 graph matched the Hugging Face fp32 reference (the oracle) at a logits cosine similarity of 1.000000.
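For readers unfamiliar with the metric, here is a minimal sketch of how a cosine comparison between two logit vectors can be computed. It assumes the comparison is made per position between the reference logits and the quantized graph's logits; the article does not describe the exact procedure, and the numbers below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up logits for illustration only; these are not measured values.
reference_logits = [2.10, -0.35, 7.80, 0.02]
quantized_logits = [2.11, -0.34, 7.79, 0.02]

print(f"{cosine_similarity(reference_logits, quantized_logits):.6f}")
```

A value that rounds to 1.000000 means the two vectors point in almost exactly the same direction; on its own it does not imply bit-identical logits.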

Run it

hf download runanywhere/lfm2_5_230m_HNPU --local-dir lfm2_5_230m_HNPU
adb push lfm2_5_230m_HNPU/v81 /data/local/tmp/lfm230
adb shell "cd /data/local/tmp/lfm230 && LD_LIBRARY_PATH=. \
  ./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . 64 'The capital of France is'"

Before running, stage the QAIRT v81 runtime libraries and the qhx_generate tool into the same directory as the bundle (/data/local/tmp/lfm230 in the commands above). The context binaries are pinned to Hexagon v81.
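A missing library is an easy way for the run command to fail, so a pre-flight check on the host can help. The sketch below only verifies that the file names appearing in the command exist in the local directory before `adb push`. The helper name `missing_files` is hypothetical, and the assumption that all four files sit at the top level of the v81 directory is inferred from the command line, not from QHexRT documentation.

```python
from pathlib import Path

# File names taken from the run command above.
REQUIRED = ("qhx_generate", "lfm2-5-230m.json", "libQnnHtp.so", "libQnnSystem.so")


def missing_files(directory, required=REQUIRED):
    """Return the required file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Local staging directory used in the commands above.
    missing = missing_files("lfm2_5_230m_HNPU/v81")
    if missing:
        print("missing before adb push:", ", ".join(missing))
    else:
        print("all required files are staged")
```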

Throughput

Engine                 Prefill (tok/s)   Decode (tok/s)   Peak RAM (MB)
QHexRT NPU (v81, W8)   12,540            172              445
llama.cpp CPU Q8_0     871               250              299
llama.cpp CPU Q4_K_M   680               264              209

Figure 1: Prefill throughput on LFM 2.5 230M, SM8850 (log scale). QHexRT on Hexagon v81 reaches 12,540 tok/s, 14.4x faster than llama.cpp CPU Q8_0 on the same die.

Time to first token

Prompt (tokens)   NPU (ms)   CPU Q8_0 (ms)
16                36         17
128               36         132
512               36         588

Figure 2: Time to first token vs. prompt length (LFM 2.5 230M). TTFT on the NPU is flat at ~36 ms for any prompt up to 512 tokens.
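The TTFT table implies a crossover: the CPU is faster at 16 tokens and slower at 128. As a rough estimate of where the lines cross, the sketch below linearly interpolates the CPU measurements between table rows. Linear interpolation is an assumption here; no points between 16 and 128 tokens were reported.

```python
# (prompt tokens, CPU Q8_0 TTFT in ms) from the table above.
CPU_TTFT = [(16, 17.0), (128, 132.0), (512, 588.0)]
NPU_TTFT_MS = 36.0  # flat for prompts up to 512 tokens


def crossover_prompt_length(cpu_points, npu_ms):
    """Prompt length where interpolated CPU TTFT first reaches the NPU's TTFT.

    Returns None if the NPU value falls outside the measured CPU range.
    """
    for (p0, t0), (p1, t1) in zip(cpu_points, cpu_points[1:]):
        if t0 <= npu_ms <= t1:
            # Linear interpolation between two adjacent measurements.
            return p0 + (npu_ms - t0) * (p1 - p0) / (t1 - t0)
    return None


print(f"{crossover_prompt_length(CPU_TTFT, NPU_TTFT_MS):.1f}")
```

Under this assumption the NPU pulls ahead on time to first token at roughly 35 prompt tokens.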

The model catalog

LFM 2.5 230M on Hexagon v81 is the first entry. We're adding models as fast as the community ships them.

Next: VLM, STT, TTS, embeddings on Hexagon, more LLM models, W4 quantization, and power metering.

Summary

  • QHexRT is live — full-stack NPU inference for Qualcomm Hexagon
  • 100% on NPU — no Python, no CPU fallback during inference
  • First model: LFM 2.5 230M (runanywhere/lfm2_5_230m_HNPU)
  • 12,540 tok/s prefill · 36ms flat TTFT
  • All modalities — LLM live; VLM, STT, TTS, embeddings in development
  • Goal: the widest model catalog for Qualcomm NPUs, expanding continuously

Full write-up: runanywhere.ai/blog


Benchmarked on Qualcomm SM8850 / Hexagon v81. Model: LiquidAI/LFM 2.5 230M. NPU: QHexRT W8 weight-only, QAIRT 2.47.0.260601.
