---
license: other
tags:
  - qualcomm
  - qcs9075
  - edge-ai
  - genie
  - qnn-htp
  - llm
base_model: meta-llama/Llama-3.2-3B-Instruct
---

# Llama-3.2-3B-Instruct-QCS9075-HTP

This is a pre-compiled version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) optimized for the **Qualcomm QCS9075 SoC** using the **Qualcomm Genie SDK**.

## Model Details

- **Base Model**: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- **Target Hardware**: Qualcomm QCS9075 (IQ-9075 EVK)
- **Backend**: QnnHtp (NPU)
- **Quantization**: W4A16
- **Compilation**: Qualcomm AI Hub (QAIRT 2.42)

## Performance

| Model | Backend | Throughput | Size |
|-------|---------|------------|------|
| Llama-3.2-3B-Instruct-QCS9075-HTP | QnnHtp (NPU) | ~18.7 TPS on QCS9075 | 2.5 GB |

*TPS = Tokens Per Second (generation speed)*

## Hardware Requirements

- **Device**: Qualcomm IQ-9075 EVK or QCS9075-based device
- **OS**: Ubuntu 22.04 (recommended)
- **SDK**: Qualcomm Genie SDK
- **QAIRT**: Version 2.42 or later

## Usage

### Prerequisites

1. Install the Qualcomm Genie SDK on your QCS9075 device
2. Download all model files from this repository
3. Ensure QAIRT 2.42 libraries are available

### Environment Setup

For HTP models, the order of entries in `LD_LIBRARY_PATH` is critical. List the QAIRT library directory first, then the Genie QNN libraries:

```bash
export LD_LIBRARY_PATH=/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs:$LD_LIBRARY_PATH
```
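
Because a misordered path can be hard to spot, a small check can help before launching the server. The sketch below assumes that "ordering" means the QAIRT directory must appear before the Genie QNN directory, as in the export line above. The function name `qairt_precedes_genie` is hypothetical and not part of any Qualcomm tooling.

```python
import os

# Directories copied from the export line above; adjust to your installation.
QAIRT_LIB = "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu"
GENIE_LIB = "/opt/qcom/aistack/genie/qnn/libs"


def qairt_precedes_genie(ld_library_path, qairt_lib=QAIRT_LIB, genie_lib=GENIE_LIB):
    """Return True if both directories are listed and QAIRT comes first.

    Hypothetical helper: it only inspects the string, it does not load libraries.
    """
    # Split on ':' and ignore empty entries and trailing slashes.
    entries = [e.rstrip("/") for e in ld_library_path.split(":") if e]
    if qairt_lib not in entries or genie_lib not in entries:
        return False
    return entries.index(qairt_lib) < entries.index(genie_lib)


if __name__ == "__main__":
    ok = qairt_precedes_genie(os.environ.get("LD_LIBRARY_PATH", ""))
    print("LD_LIBRARY_PATH order OK" if ok else "LD_LIBRARY_PATH missing or misordered")
```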

### Configuration

Create a `genie_config.json` file:

```json
{
  "model_path": "/path/to/model/files",
  "backend": "QnnHtp",
  "device": "0"
}
```
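
If you prefer to generate the file from a script, for example to fill in the model path per device, something like the following may work. It writes only the three keys shown above; `write_genie_config` is a hypothetical helper name, and whether the Genie runtime accepts additional keys is not covered here.

```python
import json
from pathlib import Path


def write_genie_config(model_dir, out_path="genie_config.json"):
    """Write a genie_config.json containing the three keys from this model card."""
    config = {
        "model_path": str(model_dir),  # directory holding the downloaded model files
        "backend": "QnnHtp",
        "device": "0",
    }
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Placeholder path from the example above; replace with your model directory.
    print(write_genie_config("/path/to/model/files"))
```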

### Running the Model

```bash
# Using the Genie server
python3 /opt/qcom/aistack/genie/examples/server_persistent.py \
  --config genie_config.json \
  --port 8000
```
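
Loading the model can take some time, so a client may want to wait until the server accepts connections. The sketch below only checks that the TCP port is open. It makes no assumption about the server's request format, which this model card does not document, and `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    ready = wait_for_port(port=8000, timeout=5.0)
    print("server is listening" if ready else "server not reachable on port 8000")
```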

### Kubernetes Deployment

To deploy on a Kubernetes cluster with QCS9075 nodes, the following Pod manifest shows one possible pattern:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: genie-llm-server
spec:
  containers:
  - name: genie
    image: your-registry/genie-runtime:latest
    env:
    - name: LD_LIBRARY_PATH
      value: "/opt/qcom/aistack/qairt/2.42.0.250923/lib/aarch64-linux-gnu:/opt/qcom/aistack/genie/qnn/libs"
    volumeMounts:
    - name: model-storage
      mountPath: /models
    - name: qcom-libs
      mountPath: /opt/qcom/aistack
  volumes:
  - name: model-storage
    hostPath:
      path: /mnt/models/llama-3.2-3b-instruct-qcs9075-htp
  - name: qcom-libs
    hostPath:
      path: /opt/qcom/aistack
```

## File Structure

This repository contains:
- Compiled model artifacts (.bin files)
- Configuration files (genie_config.json)
- QNN HTP context binaries
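
After downloading, a quick inventory can confirm that the expected files are present. This is a minimal sketch that assumes a flat directory with `.bin` files and `genie_config.json` at the top level; the actual layout of this repository may differ, and `summarize_model_dir` is a hypothetical helper.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """List top-level .bin files and report whether genie_config.json exists."""
    root = Path(model_dir)
    bins = sorted(p for p in root.glob("*.bin") if p.is_file())
    return {
        "has_config": (root / "genie_config.json").is_file(),
        "bin_files": [p.name for p in bins],
        "total_bin_bytes": sum(p.stat().st_size for p in bins),
    }


if __name__ == "__main__":
    # Current directory used as a stand-in; point this at your download location.
    print(summarize_model_dir("."))
```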

## Benchmarking Notes

- Performance metrics measured on Qualcomm IQ-9075 EVK
- TPS (Tokens Per Second) measured during generation phase
- Results may vary based on prompt length and complexity
- HTP backend utilizes the NPU for acceleration
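
To reproduce a generation-phase TPS figure on your own prompts, one option is to time the token stream starting from the first generated token, so that prompt processing is excluded. The sketch below is not the procedure used for the numbers in this card; `measure_generation_tps` is a hypothetical helper, and `token_stream` stands for any iterable that yields tokens as they are produced.

```python
import time


def tokens_per_second(num_tokens, elapsed_seconds):
    """Generation speed: tokens produced divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def measure_generation_tps(token_stream, clock=time.perf_counter):
    """Time a token stream from the first token to the last.

    The first token marks the start of the generation phase, so it is not
    counted; every later token is.
    """
    first = None
    last = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now
        else:
            count += 1
        last = now
    if first is None or count == 0:
        raise ValueError("need at least two tokens to measure generation speed")
    return tokens_per_second(count, last - first)


if __name__ == "__main__":
    # Arithmetic example only: 187 tokens over 10 seconds.
    print(tokens_per_second(187, 10.0))
```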

## License

This model follows the license of the base model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). Please refer to the original model card for license details.

## Acknowledgments

- Base model: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
- Compiled using Qualcomm AI Hub with QAIRT 2.42
- Target hardware: Qualcomm QCS9075 SoC

## Support

For issues related to:
- **Model compilation**: Contact Qualcomm AI Hub support
- **Genie SDK**: Refer to Qualcomm Genie documentation
- **Deployment**: Open an issue in this repository

---

*This model is optimized for edge deployment on Qualcomm QCS9075 devices and may not work on other hardware platforms.*