---
license: openrail
base_model:
- Supertone/supertonic
tags:
- tts
- qualcomm
- qnn
- quantized
- qcs6490
pipeline_tag: text-to-speech
language:
- en
---

# Supertonic TTS Quantization for Qualcomm chipsets

A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using QAIRT/QNN.

> **Note:** For optimal performance and accuracy, consider generating context (serialized) binaries specific to your target's HTP architecture (v68 for the QCS6490 used in this guide).

This walkthrough uses the QCS6490 throughout.

## Sample Output

Audio generated on a QCS6490 board using the quantized models (10 diffusion steps; the output is raw and still somewhat noisy):

<audio controls src="https://huggingface.co/dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490/resolve/main/final_output.wav"></audio>

## Requirements

- QAIRT/QNN SDK **v2.37**
- Python 3.8+
- Target device: **QCS6490**

## Pipeline Architecture

```text
                text + style
                     β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                       β”‚
  duration_predictor        text_encoder
         β”‚                       β”‚
    duration (scalar)       text_emb (1,128,256)
         β”‚                       β”‚
   latent_mask (1,1,256)         β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
              vector_estimator (10 diffusion steps)
                     β”‚
               denoised_latent
                     β”‚
                  vocoder
                     β”‚
              audio (44.1kHz)
```

The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech vs padding.
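As a sketch, the post-processing might look like the following. The 256-frame latent length is taken from the diagram above; the function name and the assumption that the scalar is already expressed in latent frames (rather than seconds needing a frame-rate conversion) are mine, so check the notebooks for the exact convention.

```python
import numpy as np

LATENT_FRAMES = 256  # fixed latent length expected by vector_estimator

def build_latent_mask(duration_frames: float) -> np.ndarray:
    """Turn the duration_predictor's scalar output into a (1, 1, 256)
    binary mask: 1 = active speech frame, 0 = padding."""
    n_active = int(np.clip(round(duration_frames), 0, LATENT_FRAMES))
    mask = np.zeros((1, 1, LATENT_FRAMES), dtype=np.float32)
    mask[:, :, :n_active] = 1.0
    return mask

mask = build_latent_mask(180.4)
print(mask.shape, mask.sum())  # (1, 1, 256) 180.0
```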

## Workflow

### 1. Input Preparation

Prepare calibration inputs for model quantization.

`Input_Preparation.ipynb`

### 2. Step-by-Step Quantization

Convert ONNX models to QNN format with quantization for HTP backend.

`Supertonic_TTS_StepbyStep.ipynb`
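As a rough sketch of what this notebook does: the tool names below come from the QNN SDK, but the exact flags, file names, and input-list paths here are assumptions — follow the notebook for the working invocations.

```shell
# Convert an ONNX model to a quantized QNN model (.cpp/.bin).
# --input_list points at the calibration inputs prepared in step 1.
qnn-onnx-converter \
    --input_network assets/onnx/text_encoder.onnx \
    --input_list inputs/text_encoder_input_list.txt \
    --output_path QNN_Models/text_encoder_htp.cpp

# Compile the generated model into a shared library for the board.
qnn-model-lib-generator \
    -c QNN_Models/text_encoder_htp.cpp \
    -b QNN_Models/text_encoder_htp.bin \
    -t aarch64-oe-linux-gcc11.2 \
    -o QNN_Model_lib
```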

### 3. Correlation Verification

Verify quantized model outputs against reference using cosine similarity.

`Correlation_Verification.ipynb`
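The core check is cosine similarity between flattened reference and quantized output tensors. A minimal sketch (the commented-out `.raw` file names are illustrative, not the notebook's actual paths):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened output tensors.
    Values close to 1.0 mean the quantized model tracks the
    float reference closely."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative usage -- file names are assumptions:
# ref = np.fromfile("qnn_calibration/text_encoder_ref.raw", dtype=np.float32)
# out = np.fromfile("board_output/text_encoder_out.raw", dtype=np.float32)
# print(cosine_similarity(ref, out))

# Self-contained demo with synthetic data:
rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float32)
noisy = x + 0.01 * rng.standard_normal(1000).astype(np.float32)
print(cosine_similarity(x, noisy))
```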

## Project Structure

```text
β”œβ”€β”€ Input_Preparation.ipynb         # Prepare calibration inputs
β”œβ”€β”€ Supertonic_TTS_StepbyStep.ipynb # ONNX β†’ QNN quantization guide
β”œβ”€β”€ Correlation_Verification.ipynb  # Output verification
β”œβ”€β”€ assets/                         # ONNX models (git submodule)
β”‚   └── onnx/
β”‚       β”œβ”€β”€ text_encoder.onnx
β”‚       β”œβ”€β”€ duration_predictor.onnx
β”‚       β”œβ”€β”€ vector_estimator.onnx
β”‚       └── vocoder.onnx
β”œβ”€β”€ QNN_Models/                     # Quantized QNN models (.bin, .cpp)
β”œβ”€β”€ QNN_Model_lib/                  # QNN runtime libraries (aarch64)
β”œβ”€β”€ qnn_calibration/                # Calibration data for verification
β”œβ”€β”€ inputs/                         # Prepared input data
└── board_output/                   # Inference outputs from board
```

## Models

| Model              | Description                                 |
|--------------------|---------------------------------------------|
| text_encoder       | Encodes text tokens with style embedding    |
| duration_predictor | Predicts phoneme durations                  |
| vector_estimator   | Diffusion-based latent generator (10 steps) |
| vocoder            | Converts latent to audio waveform           |

### ONNX Models (Source)

Located in `assets/onnx/` (git submodule from Hugging Face):

- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`

### QNN Models (Quantized)

Located in `QNN_Models/`:

- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`

### Compiled Libraries (Ready for Deployment)

Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:

- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`

These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.

> **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.

## Getting Started

1. Clone with submodules:

   ```bash
   git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
   ```

2. Follow the notebooks in order:
   - `Input_Preparation.ipynb`
   - `Supertonic_TTS_StepbyStep.ipynb`
   - `Correlation_Verification.ipynb`

## Note

> Inference script and sample application are not provided. Optimization work is ongoing and will be released soon.

## License

This model inherits the licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):

- **Model:** OpenRAIL-M License
- **Code:** MIT License

Copyright (c) 2026 Supertone Inc. (original model)