# RunAnywhere Genie NPU Models
Pre-compiled QNN context binaries for Qualcomm Genie SDK, optimized for Snapdragon 8 Elite NPU inference.
## Models
| Model | Params | Quantization | Size | NPU Performance |
|---|---|---|---|---|
| Llama 3.2 1B Instruct | 1B | W4 | ~1.0 GB | ~23 tok/s on S25 |
| Qwen 2.5 7B Instruct | 7B | W8A16 | ~3.9 GB | ~5-8 tok/s on S25 |
## Usage
These models are designed for use with the RunAnywhere SDK's Genie NPU backend.
### Download
```shell
# Llama 3.2 1B (recommended for fast inference)
wget https://huggingface.co/runanywhere/genie-npu-models/resolve/main/llama-3.2-1b-instruct-genie-w4.tar.gz

# Qwen 2.5 7B (higher quality, slower)
wget https://huggingface.co/runanywhere/genie-npu-models/resolve/main/qwen2.5-7b-instruct-genie-w8a16.tar.gz
```
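After downloading, the archive is unpacked with standard `tar`. A minimal sketch; the `models/llama-3.2-1b` target directory is an arbitrary choice, and the first two lines only fabricate a tiny stand-in archive so the sketch runs without the real ~1 GB download (skip them and use the tar.gz fetched above):

```shell
# Stand-in archive so this sketch is self-contained; with a real
# download, start at the mkdir line instead.
printf '{}' > genie_config.json
tar -czf llama-3.2-1b-instruct-genie-w4.tar.gz genie_config.json

# Unpack the bundle into a per-model directory.
mkdir -p models/llama-3.2-1b
tar -xzf llama-3.2-1b-instruct-genie-w4.tar.gz -C models/llama-3.2-1b
ls models/llama-3.2-1b
```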
## Bundle Contents
Each tar.gz contains:
- `genie_config.json` -- Genie SDK configuration
- `htp_backend_ext_config.json` -- HTP backend config (Snapdragon 8 Elite, v79)
- `tokenizer.json` -- model tokenizer
- `*_part_N_of_M.bin` -- QNN context binary parts
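A quick sanity check that an extracted bundle is complete can be scripted from the file list above. A sketch under assumptions: `bundle-demo` and the `weights_part_*` filenames are placeholders (the first few lines fabricate a dummy bundle so the check is self-contained; point `BUNDLE` at your extracted directory instead):

```shell
# Fabricated dummy bundle -- replace BUNDLE with your extraction dir.
BUNDLE=./bundle-demo
mkdir -p "$BUNDLE"
touch "$BUNDLE"/genie_config.json "$BUNDLE"/htp_backend_ext_config.json \
      "$BUNDLE"/tokenizer.json \
      "$BUNDLE"/weights_part_1_of_2.bin "$BUNDLE"/weights_part_2_of_2.bin

# Every bundle must carry the two configs and the tokenizer...
for f in genie_config.json htp_backend_ext_config.json tokenizer.json; do
  [ -f "$BUNDLE/$f" ] || { echo "missing $f"; exit 1; }
done
# ...plus at least one QNN context binary part.
ls "$BUNDLE"/*_part_*_of_*.bin > /dev/null || { echo "missing context binary parts"; exit 1; }
echo "bundle OK"
```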
## Requirements
- Qualcomm Snapdragon 8 Elite (or compatible) device
- QAIRT SDK 2.42.0 runtime libraries
- RunAnywhere SDK with Genie backend AAR
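The first requirement can be spot-checked from a connected device via its SoC identifier. A hedged sketch assuming `adb` is on `PATH` and the device exposes the `ro.soc.model` system property (present on recent Android builds; the exact value string varies by chipset):

```shell
# Read the SoC model over adb; degrades gracefully when adb or the
# device is unavailable, leaving SOC empty.
if command -v adb >/dev/null 2>&1; then
  SOC=$(adb shell getprop ro.soc.model 2>/dev/null || true)
else
  SOC=""
fi
echo "${SOC:-unknown SoC (adb/device not available)}"
```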
## Compilation Details
Models were compiled using Qualcomm AI Hub with:
- QAIRT SDK: 2.42.0
- Target: Snapdragon 8 Elite QRD (soc_model=69, dsp_arch=v79)
- Context length: 4096 tokens