QHexRT Is Live: Full-Stack NPU Inference for Qualcomm Hexagon
QHexRT runs inference 100% on the Hexagon NPU. No Python in the hot path. No CPU fallback during inference. We are building the widest model catalog on any Qualcomm NPU stack, with same-day support for the models the community ships.
To our knowledge, no one has shipped a single runtime that covers LLM, VLM, STT, TTS, and embeddings fully on Qualcomm NPUs. MetalRT did it for Apple Silicon. QHexRT does it for Hexagon.
First model: LFM 2.5 230M
LiquidAI released LFM 2.5 230M on June 25, 2026. QHexRT supports it on day one as our first catalog entry. The NPU bundle is on Hugging Face: runanywhere/lfm2_5_230m_HNPU.
Every tensor in the inference path stays on the HTP (Hexagon Tensor Processor): decode graph, prefill graph, lm-head, embeddings. Greedy decoding reproduces the source model's output (for example, "The capital of France is" → " Paris.").
All modalities, one NPU runtime
| Modality | Status |
|---|---|
| LLM | Live — LFM 2.5 230M on Hexagon v81 |
| VLM | In development |
| STT | In development |
| TTS | In development |
| Embeddings | In development |
One engine, one deployment model, every modality on the NPU — the same architecture MetalRT proved on Apple Silicon, now on Qualcomm.
The headline numbers (LFM 2.5 230M · Hexagon v81)
Benchmarked on Hexagon v81 (Snapdragon 8 Elite Gen 5, SM8850). For comparison, we ran llama.cpp on the CPU of the same phone, so both stacks share one die. Both stacks produce identical output.
- Prefill: 12,540 tok/s on the NPU vs 871 tok/s on the CPU (Q8_0), 14.4x faster.
- Time-to-first-token: a flat 36 ms on the NPU for any prompt up to 512 tokens. The CPU is faster on very short prompts (17 ms at 16 tokens) but takes 588 ms at 512 tokens.
- Decode: the CPU wins at 250 tok/s (Q8_0) vs 172 tok/s on the NPU. At 230M params, decode is memory-bandwidth-bound.
- End-to-end: the NPU wins once the prompt exceeds ~1.5x the generation length.
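The end-to-end trade-off can be sanity-checked with a back-of-the-envelope model. The sketch below assumes total latency is simply prefill time plus decode time at the quoted throughputs, with no fixed overheads; the function names are illustrative and not part of QHexRT. With these rounded figures the idealized break-even comes out at roughly 1.7x, in the same range as the ~1.5x quoted above. Treat it as a rough check rather than a reproduction of the measured figure.

```python
# Throughput figures quoted in this post (tokens per second).
NPU_PREFILL, NPU_DECODE = 12_540.0, 172.0   # QHexRT, Hexagon v81, W8
CPU_PREFILL, CPU_DECODE = 871.0, 250.0      # llama.cpp, CPU, Q8_0


def end_to_end_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Simplified latency model: prefill time plus decode time, no fixed overheads."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


def break_even_ratio():
    """Prompt/generation ratio at which both engines finish at the same time."""
    prefill_saving_per_token = 1.0 / CPU_PREFILL - 1.0 / NPU_PREFILL  # NPU gain
    decode_cost_per_token = 1.0 / NPU_DECODE - 1.0 / CPU_DECODE       # NPU loss
    return decode_cost_per_token / prefill_saving_per_token


if __name__ == "__main__":
    print(f"break-even prompt/generation ratio: {break_even_ratio():.2f}")
    for prompt, gen in [(64, 64), (512, 128)]:
        npu = end_to_end_seconds(prompt, gen, NPU_PREFILL, NPU_DECODE)
        cpu = end_to_end_seconds(prompt, gen, CPU_PREFILL, CPU_DECODE)
        print(f"prompt={prompt} gen={gen}: NPU {npu * 1e3:.0f} ms, CPU {cpu * 1e3:.0f} ms")
```

Under this model a chat-style request (short prompt, long reply) favors the CPU, while a summarization-style request (long prompt, short reply) favors the NPU.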
QHexRT's W8 graph matched the Hugging Face fp32 reference model with a logits cosine similarity of 1.000000.
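For readers unfamiliar with the metric, here is a minimal sketch of how a cosine similarity between two logit vectors is computed. It is not the QHexRT validation harness; the function name and the toy logits are hypothetical and only illustrate the formula: the dot product of the two vectors divided by the product of their Euclidean norms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy logits, not real model output.
reference_logits = [2.10, -0.35, 0.80, 4.25]
quantized_logits = [2.08, -0.33, 0.81, 4.22]

print(f"{cosine_similarity(reference_logits, quantized_logits):.6f}")
```

A value of 1.0 means the two vectors point in exactly the same direction. Cosine similarity ignores overall scale, so it is usually read alongside a direct comparison of the generated tokens.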
Run it
hf download runanywhere/lfm2_5_230m_HNPU --local-dir lfm2_5_230m_HNPU
adb push lfm2_5_230m_HNPU/v81 /data/local/tmp/lfm230
adb shell "cd /data/local/tmp/lfm230 && LD_LIBRARY_PATH=. \
./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . 64 'The capital of France is'"
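Before running, stage the QAIRT v81 runtime libs and the qhx_generate tool into the same directory on the device (/data/local/tmp/lfm230 in the commands above). Context binaries are pinned to Hexagon v81, so this bundle targets v81 devices only.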
Throughput
| Engine | Prefill (tok/s) | Decode (tok/s) | Peak RAM (MB) |
|---|---|---|---|
| QHexRT NPU (v81, W8) | 12,540 | 172 | 445 |
| llama.cpp CPU Q8_0 | 871 | 250 | 299 |
| llama.cpp CPU Q4_K_M | 680 | 264 | 209 |
Figure 1 — Prefill throughput (log scale). QHexRT on Hexagon v81 hits 12,540 tok/s prefill on LFM 2.5 230M. 14.4x faster than llama.cpp CPU Q8_0 on the same die.
Time to first token
| Prompt (tokens) | NPU (ms) | CPU Q8_0 (ms) |
|---|---|---|
| 16 | 36 | 17 |
| 128 | 36 | 132 |
| 512 | 36 | 588 |
Figure 2 — Time to first token (LFM 2.5 230M). Flat ~36 ms TTFT on the NPU for any prompt up to 512 tokens.
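The table also shows where the two engines cross over. The sketch below linearly interpolates the CPU measurements to estimate the prompt length at which the flat NPU TTFT starts to win. This is an interpolation between three data points, not a measurement, and the helper names are illustrative; it suggests the crossover sits in the mid-30s of tokens.

```python
# (prompt tokens, CPU Q8_0 TTFT in ms) from the table above.
CPU_TTFT_MS = [(16, 17.0), (128, 132.0), (512, 588.0)]
NPU_TTFT_MS = 36.0  # flat for prompts up to 512 tokens


def cpu_ttft_ms(prompt_tokens):
    """Linearly interpolate CPU TTFT between the measured prompt lengths."""
    points = CPU_TTFT_MS
    if not points[0][0] <= prompt_tokens <= points[-1][0]:
        raise ValueError("prompt length outside the measured range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= prompt_tokens <= x1:
            return y0 + (y1 - y0) * (prompt_tokens - x0) / (x1 - x0)


def crossover_prompt_tokens():
    """Smallest prompt length in the measured range where the CPU is no faster."""
    for n in range(CPU_TTFT_MS[0][0], CPU_TTFT_MS[-1][0] + 1):
        if cpu_ttft_ms(n) >= NPU_TTFT_MS:
            return n
    return None


if __name__ == "__main__":
    print(f"estimated crossover: {crossover_prompt_tokens()} tokens")
```

Below the crossover the CPU reaches the first token sooner; above it the fixed NPU cost is amortized and the gap widens with prompt length.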
The model catalog
LFM 2.5 230M on Hexagon v81 is the first entry. We're adding models as fast as the community ships them.
Next: VLM, STT, TTS, embeddings on Hexagon, more LLM models, W4 quantization, and power metering.
Summary
- QHexRT is live — full-stack NPU inference for Qualcomm Hexagon
- 100% on NPU — no Python, no CPU fallback during inference
- First model: LFM 2.5 230M (runanywhere/lfm2_5_230m_HNPU)
- 12,540 tok/s prefill · 36 ms flat TTFT
- All modalities — LLM live; VLM, STT, TTS, embeddings in development
- Widest model catalog for Qualcomm NPUs, expanding continuously
Full write-up: runanywhere.ai/blog
Benchmarked on Qualcomm SM8850 / Hexagon v81. Model: LiquidAI/LFM 2.5 230M. NPU: QHexRT W8 weight-only, QAIRT 2.47.0.260601.


