	---
	license: other
	tags:
	- qualcomm
	- qcs9075
	- edge-ai
	- genie
	- qnn-htp
	- llm
	base_model: meta-llama/Llama-3.2-3B-Instruct
	---

	# Llama-3.2-3B-Instruct-QCS9075-HTP

	This is a pre-compiled version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) optimized for the Qualcomm QCS9075 SoC using the Qualcomm Genie SDK.

	## Model Details

	- Base Model: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
	- Target Hardware: Qualcomm QCS9075 (IQ-9075 EVK)
	- Backend: QnnHtp (NPU)
	- Quantization: W4A16
	- Compilation: Qualcomm AI Hub (QAIRT 2.42)
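
W4A16 denotes 4-bit weights with 16-bit activations. As a rough sanity check on artifact size, the sketch below estimates the raw weight footprint from a parameter count. The figure of about 3.21 billion parameters is an assumption taken from the base model's card, not from this repository, and the helper name `weight_bytes` is illustrative. The estimate ignores quantization scales, any tensors kept at higher precision, and packaging overhead in the context binaries, so treat it as a lower bound and not as a prediction of the published size.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Raw storage for the weights alone, in bytes (no scales, no metadata)."""
    return num_params * bits_per_weight // 8


# Assumption: ~3.21e9 parameters, as reported on the base model card.
approx_params = 3_210_000_000

# 4-bit weights: roughly 1.6 GB before any overhead.
print(f"{weight_bytes(approx_params, 4) / 1e9:.1f} GB")
```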

	## Performance

	| Model | Backend | Throughput | Size |
	|-------|---------|------------|------|
	| Llama-3.2-3B-Instruct-QCS9075-HTP | QnnHtp (NPU) | ~18.7 TPS on QCS9075 | 2.5 GB |

	TPS = Tokens Per Second (generation speed)

	## Hardware Requirements

	- Device: Qualcomm IQ-9075 EVK or QCS9075-based device
	- OS: Ubuntu 22.04 (recommended)
	- SDK: Qualcomm Genie SDK
	- QAIRT: Version 2.42 or later

	## Usage

	### Prerequisites

	1. Install the Qualcomm Genie SDK on your QCS9075 device
	2. Download all model files from this repository
	3. Ensure QAIRT 2.42 libraries are available

	### Environment Setup

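	For HTP models, the order of entries in `LD_LIBRARY_PATH` is critical. The dynamic loader searches the entries left to right, so the QAIRT 2.42 library directory must come before the Genie QNN library directory: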

	```bash
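	# Append the existing value only if it is set; a trailing ":" would add the current directory to the search path.
	export LD_LIBRARY_PATH="/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"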
	```
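
Because a wrong ordering is easy to miss, it can help to check the variable before starting the server. The following is a minimal sketch that reuses the two directories from the export above; the function name `qairt_precedes_genie` is illustrative and not part of any Qualcomm tooling. It only inspects the string and does not verify that the directories exist or which libraries the loader actually picks up.

```python
import os

# Directories as used in this card; adjust if your install paths differ.
QAIRT_LIB = "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu"
GENIE_LIB = "/opt/qcom/aistack/genie/qnn/libs"


def qairt_precedes_genie(ld_library_path: str) -> bool:
    """Return True if both directories are listed and QAIRT comes first."""
    entries = [e.rstrip("/") for e in ld_library_path.split(":") if e]
    if QAIRT_LIB not in entries or GENIE_LIB not in entries:
        return False
    return entries.index(QAIRT_LIB) < entries.index(GENIE_LIB)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("ordering ok" if qairt_precedes_genie(current) else "ordering NOT ok")
```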

	### Configuration

	Create a `genie_config.json` file:

	```json
	{
	  "model_path": "/path/to/model/files",
	  "backend": "QnnHtp",
	  "device": "0"
	}
	```
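
If the model directory differs between devices, the file can be generated instead of edited by hand. This is a minimal sketch that writes exactly the three keys shown above; the helper name `write_genie_config` is hypothetical, and the sketch does not validate the result against the Genie SDK's own configuration schema.

```python
import json
from pathlib import Path


def write_genie_config(model_path: str, out_file: str = "genie_config.json") -> dict:
    """Write the three-key config from this card and return it as a dict."""
    config = {
        "model_path": model_path,  # directory holding the downloaded model files
        "backend": "QnnHtp",       # HTP (NPU) backend, as used in this card
        "device": "0",             # kept as a string, matching the example above
    }
    Path(out_file).write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `write_genie_config("/path/to/model/files")` from the working directory produces the same file as the JSON example.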

	### Running the Model

	```bash
	# Using the Genie server
	python3 /opt/qcom/aistack/genie/examples/server_persistent.py \
	  --config genie_config.json \
	  --port 8000
	```
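
Scripts that start the server and then send requests may need to wait until it is ready. This card does not document the server's request API, so the sketch below assumes only that the process accepts TCP connections on the `--port` value once it is up. The helper name `wait_for_port` and the timeout values are illustrative.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not listening yet; retry shortly
    return False
```

For the command above, `wait_for_port("127.0.0.1", 8000)` returns `True` once the port is open. An open port shows that the process is listening, not necessarily that the model has finished loading.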

	### Kubernetes Deployment

	To deploy on a Kubernetes cluster with QCS9075 nodes, use a Pod manifest along these lines:

	```yaml
	apiVersion: v1
	kind: Pod
	metadata:
	  name: genie-llm-server
	spec:
	  containers:
	    - name: genie
	      image: your-registry/genie-runtime:latest
	      env:
	        - name: LD_LIBRARY_PATH
	          # QAIRT libraries first, then Genie's QNN libraries
	          value: "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs"
	      volumeMounts:
	        - name: model-storage
	          mountPath: /models
	        - name: qcom-libs
	          mountPath: /opt/qcom/aistack
	  volumes:
	    - name: model-storage
	      hostPath:
	        path: /mnt/models/llama-3.2-3b-instruct-qcs9075-htp
	    - name: qcom-libs
	      hostPath:
	        path: /opt/qcom/aistack
	```

	## File Structure

	This repository contains:
	- Compiled model artifacts (`.bin` files)
	- Configuration files (`genie_config.json`)
	- QNN HTP context binaries
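
After downloading, a quick check that the expected file types are present can catch an incomplete transfer. The sketch below assumes a flat layout, with the `.bin` files and `genie_config.json` at the top level of the download directory; this card does not list exact file names, so the check looks only at the extension and the config file name. The helper name `summarize_model_dir` is illustrative.

```python
from pathlib import Path


def summarize_model_dir(model_dir: str) -> dict:
    """Report which of the file types listed above are present in model_dir."""
    root = Path(model_dir)
    return {
        # Compiled artifacts and context binaries, matched by extension only.
        "bin_files": sorted(p.name for p in root.glob("*.bin")),
        "has_config": (root / "genie_config.json").is_file(),
    }
```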

	## Benchmarking Notes

	- Performance metrics measured on Qualcomm IQ-9075 EVK
	- TPS (Tokens Per Second) measured during generation phase
	- Results may vary based on prompt length and complexity
	- HTP backend utilizes the NPU for acceleration
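
This card does not describe the harness behind the reported figure. As one common way to obtain a generation-phase number, the sketch below starts the clock when the first token arrives, so prompt processing is excluded, and divides the remaining tokens by the elapsed time. The function names and the idea of consuming a token stream are assumptions for illustration; any iterable that yields tokens as they are produced would work.

```python
import time
from typing import Iterable


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens produced divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def measure_generation_tps(token_stream: Iterable[str]) -> float:
    """Consume a token stream and return tokens/s for the generation phase.

    The clock starts at the first token, so that token is not counted.
    """
    start = None
    generated = 0
    for _ in token_stream:
        if start is None:
            start = time.perf_counter()
        else:
            generated += 1
    if start is None or generated == 0:
        raise ValueError("need at least two tokens to measure throughput")
    return tokens_per_second(generated, time.perf_counter() - start)
```

At the reported rate of about 18.7 TPS, a 256-token reply would take roughly 256 / 18.7 ≈ 13.7 seconds of generation time.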

	## License

	This model follows the license of the base model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). Please refer to the original model card for license details.

	## Acknowledgments

	- Base model: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
	- Compiled using Qualcomm AI Hub with QAIRT 2.42
	- Target hardware: Qualcomm QCS9075 SoC

	## Support

	For issues related to:
	- Model compilation: Contact Qualcomm AI Hub support
	- Genie SDK: Refer to Qualcomm Genie documentation
	- Deployment: Open an issue in this repository

	---

	This model is optimized for edge deployment on Qualcomm QCS9075 devices and may not work on other hardware platforms.