QHexRT Is Live: Full-Stack NPU Inference for Qualcomm Hexagon

Published June 26, 2026

We shipped MetalRT for Apple Silicon — the first engine to run LLM, speech, vision, and speech-to-speech in one runtime, entirely on the GPU. Today we're launching QHexRT for Qualcomm: the same bet on the NPU.

QHexRT runs inference 100% on the Hexagon NPU. No Python in the hot path. No CPU fallback during inference. We are building the widest model catalog on any Qualcomm NPU stack, with same-day support for the models the community ships.

No one has shipped a single runtime that covers LLM, VLM, STT, TTS, and embeddings fully on Qualcomm NPUs. MetalRT did it for Apple Silicon. QHexRT is built to do the same for Hexagon, starting with LLMs today.

First model: LFM 2.5 230M

LiquidAI released LFM 2.5 230M on June 25, 2026. QHexRT supports it on day one — our first catalog entry. The NPU bundle is here: runanywhere/lfm2_5_230m_HNPU.

Every tensor in the inference path stays on the HTP (Hexagon Tensor Processor): decode graph, prefill graph, lm-head, embeddings. Greedy output matches the source model ("The capital of France is" → " Paris.").

All modalities, one NPU runtime

Modality	Status
LLM	Live — LFM 2.5 230M on Hexagon v81
VLM	In development
STT	In development
TTS	In development
Embeddings	In development

One engine, one deployment model, every modality on the NPU — the same architecture MetalRT proved on Apple Silicon, now on Qualcomm.

The headline numbers (LFM 2.5 230M · Hexagon v81)

Benchmarked on Hexagon v81 (Snapdragon 8 Elite Gen 5, SM8850). We also ran llama.cpp on the CPU of the same phone, so both stacks run on the same die. Both stacks produce identical output.

Prefill: 12,540 tok/s on the NPU vs 871 on the CPU (Q8_0). 14.4x faster.
Time-to-first-token: a flat 36 ms for any prompt up to 512 tokens, against 588 ms on the CPU at 512 tokens. The CPU is ahead only on very short prompts (17 ms vs 36 ms at 16 tokens).
Decode: the CPU wins at 250 tok/s vs 172 on the NPU. At 230M params, decode is memory-bandwidth-bound.
End-to-end: the NPU wins once the prompt exceeds ~1.5x the generation length.
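The end-to-end trade-off can be sanity-checked with a small latency model built only from the figures in this post. This is a rough sketch, not the benchmark methodology: it assumes NPU latency is the flat 36 ms TTFT plus decode time, and that CPU prefill time scales linearly with prompt length at 871 tok/s. The function names are illustrative.

```python
# Figures reported in this post (LFM 2.5 230M, Hexagon v81 vs llama.cpp CPU Q8_0).
NPU_TTFT_S = 0.036       # flat time-to-first-token, reported for prompts up to 512 tokens
NPU_DECODE_TPS = 172.0
CPU_PREFILL_TPS = 871.0
CPU_DECODE_TPS = 250.0
MAX_PROMPT = 512         # the flat TTFT figure is only reported up to this length


def npu_latency(prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated end-to-end seconds on the NPU (simplified model)."""
    if not 0 < prompt_tokens <= MAX_PROMPT:
        raise ValueError("this model only covers prompts of 1..512 tokens")
    return NPU_TTFT_S + gen_tokens / NPU_DECODE_TPS


def cpu_latency(prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated end-to-end seconds on the CPU (linear prefill assumption)."""
    return prompt_tokens / CPU_PREFILL_TPS + gen_tokens / CPU_DECODE_TPS


def break_even_prompt(gen_tokens: int) -> float:
    """Prompt length above which the NPU estimate beats the CPU estimate."""
    decode_penalty = gen_tokens * (1 / NPU_DECODE_TPS - 1 / CPU_DECODE_TPS)
    return CPU_PREFILL_TPS * (NPU_TTFT_S + decode_penalty)


if __name__ == "__main__":
    for gen in (32, 64, 128, 256):
        print(f"generate {gen:>3} tokens: NPU ahead above ~{break_even_prompt(gen):.0f} prompt tokens")
```

Under these assumptions the break-even prompt grows by about 1.58 tokens per generated token, on top of a fixed offset of roughly 31 tokens that comes from the 36 ms TTFT. That is in the same range as the ~1.5x rule of thumb above; the exact crossover on a device will depend on details this model ignores.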

QHexRT's W8 (8-bit weight-only) graph matched the HuggingFace fp32 oracle with a logits cosine similarity of 1.000000.
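For readers unfamiliar with the metric, logits cosine similarity compares the direction of two logit vectors, one from the reference model and one from the quantized graph. The sketch below shows the calculation in plain Python with made-up vectors. How the figure above was aggregated (which token positions, how many prompts) is not stated in this post, so treat this as an illustration of the metric only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest logit, i.e. the token greedy decoding would pick."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Hypothetical logits over a 4-token vocabulary, for illustration only.
    reference = [2.0, -1.0, 0.5, 4.0]      # stands in for the fp32 oracle
    quantized = [1.98, -1.01, 0.52, 3.97]  # stands in for the W8 graph
    print(f"cosine: {cosine_similarity(reference, quantized):.6f}")
    print("same greedy token:", argmax(reference) == argmax(quantized))
```

A cosine near 1 means the two logit vectors point the same way. Checking that the argmax also agrees is a direct test of whether greedy decoding picks the same token.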

Run it

hf download runanywhere/lfm2_5_230m_HNPU --local-dir lfm2_5_230m_HNPU
adb push lfm2_5_230m_HNPU/v81 /data/local/tmp/lfm230
adb shell "cd /data/local/tmp/lfm230 && LD_LIBRARY_PATH=. \
  ./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . 64 'The capital of France is'"

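Before running the last command, stage the QAIRT v81 runtime libs and the qhx_generate tool in the same device directory as the bundle (/data/local/tmp/lfm230). The context binaries are pinned to Hexagon v81, so this bundle targets v81 devices only.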
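When scripting the run step with arbitrary prompts, shell quoting is the main thing to get right. The helper below is a small sketch that rebuilds the same `adb shell` invocation shown above and quotes the prompt with `shlex.quote`. The positional argument order is copied from the command in this post. The name `build_adb_command` and the parameter name `count` are illustrative, and the meaning of the `64` argument is not documented here (it is presumably a token count, but that is an assumption).

```python
import shlex

DEVICE_DIR = "/data/local/tmp/lfm230"  # device directory used in the commands above


def build_adb_command(prompt: str, count: int = 64, device_dir: str = DEVICE_DIR) -> list:
    """Build the argv for the adb invocation shown in this post.

    `count` is passed through in the position where the post uses 64;
    its exact meaning is an assumption (likely a token count).
    """
    remote = (
        f"cd {shlex.quote(device_dir)} && LD_LIBRARY_PATH=. "
        f"./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . "
        f"{int(count)} {shlex.quote(prompt)}"
    )
    return ["adb", "shell", remote]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., check=True) on a machine with adb and a v81 device.
    print(build_adb_command("The capital of France is"))
```

Passing the command as a list avoids a second layer of quoting on the host, while `shlex.quote` protects prompts that contain apostrophes or spaces from the device shell.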

Throughput

Engine	Prefill (tok/s)	Decode (tok/s)	Peak RAM (MB)
QHexRT NPU (v81, W8)	12,540	172	445
llama.cpp CPU Q8_0	871	250	299
llama.cpp CPU Q4_K_M	680	264	209

Figure 1 — Prefill throughput (log scale). QHexRT on Hexagon v81 hits 12,540 tok/s prefill on LFM 2.5 230M. 14.4x faster than llama.cpp CPU Q8_0 on the same die.

Time to first token

Prompt (tokens)	NPU (ms)	CPU Q8_0 (ms)
16	36	17
128	36	132
512	36	588

Figure 2 — Time to first token (LFM 2.5 230M). Flat ~36 ms TTFT on the NPU for any prompt up to 512 tokens.
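The table also shows where the NPU starts to win on first-token latency. The sketch below interpolates linearly between the three measured CPU points to estimate the crossover prompt length. Linear interpolation between sparse measurements is an assumption, so the result is a rough guide and not a measured value.

```python
# (prompt tokens) -> (NPU ms, CPU Q8_0 ms), copied from the table above.
TTFT_MS = {16: (36, 17), 128: (36, 132), 512: (36, 588)}


def _cpu_points():
    """Measured CPU points as (prompt_tokens, ttft_ms), sorted by prompt length."""
    return sorted((prompt, cpu) for prompt, (_npu, cpu) in TTFT_MS.items())


def cpu_ttft_ms(prompt_tokens: float) -> float:
    """Piecewise-linear estimate of CPU TTFT between the measured points."""
    points = _cpu_points()
    if not points[0][0] <= prompt_tokens <= points[-1][0]:
        raise ValueError("prompt length is outside the measured range")
    for (p0, t0), (p1, t1) in zip(points, points[1:]):
        if p0 <= prompt_tokens <= p1:
            return t0 + (t1 - t0) * (prompt_tokens - p0) / (p1 - p0)


def crossover_prompt(npu_ms: float = 36.0):
    """Interpolated prompt length at which CPU TTFT reaches the NPU's flat TTFT."""
    points = _cpu_points()
    for (p0, t0), (p1, t1) in zip(points, points[1:]):
        if t0 <= npu_ms <= t1:
            return p0 + (npu_ms - t0) * (p1 - p0) / (t1 - t0)
    return None  # no crossover inside the measured range


if __name__ == "__main__":
    print(f"estimated crossover: ~{crossover_prompt():.0f} prompt tokens")
    print(f"CPU/NPU TTFT ratio at 512 tokens: {TTFT_MS[512][1] / TTFT_MS[512][0]:.1f}x")
```

With these three points the interpolated crossover falls at roughly 35 prompt tokens: below that the CPU returns the first token sooner, above it the NPU's flat 36 ms wins.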

The model catalog

LFM 2.5 230M on Hexagon v81 is the first entry. We're adding models as fast as the community ships them.

Next: VLM, STT, TTS, embeddings on Hexagon, more LLM models, W4 quantization, and power metering.

Summary

QHexRT is live — full-stack NPU inference for Qualcomm Hexagon
100% on NPU — no Python, no CPU fallback during inference
First model: LFM 2.5 230M (runanywhere/lfm2_5_230m_HNPU)
12,540 tok/s prefill · 36 ms flat TTFT
All modalities — LLM live; VLM, STT, TTS, embeddings in development
Building the widest model catalog for Qualcomm NPUs, expanding continuously

Full write-up: runanywhere.ai/blog

Benchmarked on Qualcomm SM8850 / Hexagon v81. Model: LiquidAI/LFM 2.5 230M. NPU: QHexRT W8 weight-only, QAIRT 2.47.0.260601.
