---
license: openrail
base_model:
- Supertone/supertonic
tags:
- tts
- qualcomm
- qnn
- quantized
- qcs6490
pipeline_tag: text-to-speech
language:
- en
---
# Supertonic TTS Quantization for Qualcomm chipsets
A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using QAIRT/QNN.

> **Note:** For optimal performance and accuracy, consider generating context (ctx) binaries or serialized binaries specific to your target's HTP architecture (v68 for the QCS6490 used here).

This guide walks through the steps on the QCS6490 chipset.
## Sample Output
Audio generated on a QCS6490 board using the quantized models (10 diffusion steps; the raw output is still somewhat noisy):
<audio controls src="https://huggingface.co/dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490/resolve/main/final_output.wav"></audio>
## Requirements
- QAIRT/QNN SDK **v2.37**
- Python 3.8+
- Target device: **QCS6490**
## Pipeline Architecture
```text
            text + style
                 │
        ┌────────┴────────┐
        │                 │
duration_predictor   text_encoder
        │                 │
duration (scalar)   text_emb (1,128,256)
        │                 │
latent_mask (1,1,256)     │
        └────────┬────────┘
                 │
 vector_estimator (10 diffusion steps)
                 │
         denoised_latent
                 │
             vocoder
                 │
         audio (44.1 kHz)
```
The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech vs padding.
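A minimal sketch of this post-processing step. The helper name and the frames-per-second conversion factor are assumptions for illustration, not values taken from the model:

```python
import numpy as np

# Hypothetical post-processing: turn the duration_predictor's scalar output
# into the (1, 1, 256) latent_mask consumed by the vector_estimator.
# frames_per_second is an assumed conversion factor, not from the model.
def build_latent_mask(duration_s: float, frames_per_second: float = 86.0,
                      max_frames: int = 256) -> np.ndarray:
    n_active = min(max_frames, int(round(duration_s * frames_per_second)))
    mask = np.zeros((1, 1, max_frames), dtype=np.float32)
    mask[:, :, :n_active] = 1.0  # 1 = active speech frame, 0 = padding
    return mask

mask = build_latent_mask(1.5)  # ~1.5 s of speech
```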
## Workflow
### 1. Input Preparation
Prepare calibration inputs for model quantization.
`Input_Preparation.ipynb`
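As a rough sketch of what this notebook produces: QNN's converter consumes calibration tensors as headerless little-endian float32 `.raw` dumps plus an `input_list.txt` listing them (multiple inputs per line as `name:=path`). The shapes, input names, and dtypes below are illustrative placeholders, not verified against the actual ONNX exports:

```python
import numpy as np
from pathlib import Path

out_dir = Path("inputs/text_encoder")
out_dir.mkdir(parents=True, exist_ok=True)

lines = []
for i in range(4):  # a handful of calibration samples
    tokens = np.random.randint(0, 100, size=(1, 128)).astype(np.float32)
    style = np.random.randn(1, 256).astype(np.float32)
    tok_path = out_dir / f"tokens_{i}.raw"
    sty_path = out_dir / f"style_{i}.raw"
    tokens.tofile(tok_path)  # raw little-endian float32, no header
    style.tofile(sty_path)
    lines.append(f"tokens:={tok_path} style:={sty_path}")

Path("inputs/input_list_text_encoder.txt").write_text("\n".join(lines) + "\n")
```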
### 2. Step-by-Step Quantization
Convert ONNX models to QNN format with quantization for HTP backend.
`Supertonic_TTS_StepbyStep.ipynb`
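The core of the conversion, sketched as shell commands for one model. Paths are placeholders for this repo's layout; consult the notebook and your SDK install for the exact invocations (passing `--input_list` is what triggers quantization in `qnn-onnx-converter`):

```bash
source ${QNN_SDK_ROOT}/bin/envsetup.sh

# Convert ONNX -> quantized QNN model source
qnn-onnx-converter \
    --input_network assets/onnx/text_encoder.onnx \
    --input_list inputs/input_list_text_encoder.txt \
    --output_path QNN_Models/text_encoder_htp.cpp

# Build the model library for the target toolchain
qnn-model-lib-generator \
    -c QNN_Models/text_encoder_htp.cpp \
    -b QNN_Models/text_encoder_htp.bin \
    -t aarch64-oe-linux-gcc11.2 \
    -o QNN_Model_lib
```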
### 3. Correlation Verification
Verify quantized model outputs against reference using cosine similarity.
`Correlation_Verification.ipynb`
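The check itself is simple: flatten the reference and quantized outputs and compute their cosine similarity. A self-contained sketch (the real notebook would load raw float32 dumps, e.g. `np.fromfile("board_output/Result_0/text_emb.raw", dtype=np.float32)`; here quantization error is simulated with noise so the example runs standalone):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ref = rng.standard_normal(1 * 128 * 256).astype(np.float32)        # stand-in for the float reference
qnt = ref + 0.01 * rng.standard_normal(ref.size).astype(np.float32)  # simulated quantized output
sim = cosine_similarity(ref, qnt)
print(f"cosine similarity: {sim:.4f}")  # close to 1.0 = good correlation
```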
## Project Structure
```text
βββ Input_Preparation.ipynb # Prepare calibration inputs
βββ Supertonic_TTS_StepbyStep.ipynb # ONNX β QNN quantization guide
βββ Correlation_Verification.ipynb # Output verification
βββ assets/ # ONNX models (git submodule)
β βββ onnx/
β βββ text_encoder.onnx
β βββ duration_predictor.onnx
β βββ vector_estimator.onnx
β βββ vocoder.onnx
βββ QNN_Models/ # Quantized QNN models (.bin, .cpp)
βββ QNN_Model_lib/ # QNN runtime libraries (aarch64)
βββ qnn_calibration/ # Calibration data for verification
βββ inputs/ # Prepared input data
βββ board_output/ # Inference outputs from board
```
## Models
| Model | Description |
|--------------------|---------------------------------------------|
| text_encoder | Encodes text tokens with style embedding |
| duration_predictor | Predicts phoneme durations |
| vector_estimator | Diffusion-based latent generator (10 steps) |
| vocoder | Converts latent to audio waveform |
### ONNX Models (Source)
Located in `assets/onnx/` (git submodule from Hugging Face):
- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`
### QNN Models (Quantized)
Located in `QNN_Models/`:
- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`
### Compiled Libraries (Ready for Deployment)
Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:
- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`
These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
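A hedged deployment sketch; the board address, user, and on-device paths are placeholders. On the device, `qnn-net-run` executes a single graph against the HTP backend:

```bash
# Copy the model library and prepared inputs to the board
scp QNN_Model_lib/aarch64-oe-linux-gcc11.2/libtext_encoder_htp.so \
    root@<board-ip>:/data/supertonic/
scp -r inputs/ root@<board-ip>:/data/supertonic/

# On the board:
qnn-net-run \
    --backend libQnnHtp.so \
    --model libtext_encoder_htp.so \
    --input_list input_list_text_encoder.txt \
    --output_dir board_output
```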
> **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.
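For orientation, the 10-step loop around the `vector_estimator` can be sketched as a fixed-step Euler integration over the latent. The estimator call signature and time conditioning below are placeholders, not the model's actual interface:

```python
import numpy as np

rng = np.random.default_rng(0)

def vector_estimator(latent, t, text_emb, latent_mask):
    # Placeholder for the quantized QNN graph call; for demonstration it
    # simply drives the latent toward zero.
    return -latent

text_emb = rng.standard_normal((1, 128, 256)).astype(np.float32)
latent_mask = np.ones((1, 1, 256), dtype=np.float32)

num_steps = 10
dt = 1.0 / num_steps
noise = rng.standard_normal((1, 128, 256)).astype(np.float32)
latent = noise.copy()  # start from noise
for i in range(num_steps):
    t = i * dt
    v = vector_estimator(latent, t, text_emb, latent_mask)
    latent = latent + dt * v  # Euler step

denoised_latent = latent * latent_mask  # zero out padded frames
```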
## Getting Started
1. Clone with submodules:
```bash
git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
```
2. Follow the notebooks in order:
- `Input_Preparation.ipynb`
- `Supertonic_TTS_StepbyStep.ipynb`
- `Correlation_Verification.ipynb`
## Note
> An inference script and sample application are not yet provided; optimization work is ongoing and they will be released soon.
## License
This model inherits the licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):
- **Model:** OpenRAIL-M License
- **Code:** MIT License
Copyright (c) 2026 Supertone Inc. (original model) |