---
license: openrail
base_model:
- Supertone/supertonic
tags:
- tts
- qualcomm
- qnn
- quantized
- qcs6490
pipeline_tag: text-to-speech
language:
- en
---
# Supertonic TTS Quantization for Qualcomm chipsets
A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using the QAIRT/QNN SDK.

> **Note:** For optimal performance and accuracy, consider generating context (serialized) binaries targeting your device's specific HTP architecture (v68 for the QCS6490 used in this guide).

This walkthrough uses the QCS6490 throughout.
## Sample Output
Audio generated on a QCS6490 board using the quantized models (10 diffusion steps; the output is still raw and somewhat noisy):
<audio controls src="https://huggingface.co/dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490/resolve/main/final_output.wav"></audio>
## Requirements
- QAIRT/QNN SDK **v2.37**
- Python 3.8+
- Target device: **QCS6490**
## Pipeline Architecture
```text
text + style
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚
duration_predictor text_encoder
β”‚ β”‚
duration (scalar) text_emb (1,128,256)
β”‚ β”‚
latent_mask (1,1,256) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
vector_estimator (10 diffusion steps)
β”‚
denoised_latent
β”‚
vocoder
β”‚
audio (44.1kHz)
```
The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech vs padding.
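As a rough illustration of that post-processing step (the exact logic lives in the notebooks, and the conversion from the predictor's raw scalar to a frame count may involve an additional hop-size factor), the duration can be turned into a binary mask over the 256 latent frames like this:

```python
import numpy as np

LATENT_FRAMES = 256  # fixed latent length expected by the vector_estimator

def duration_to_latent_mask(duration_frames: float) -> np.ndarray:
    """Build a (1, 1, 256) mask: 1.0 for active speech frames, 0.0 for padding."""
    active = int(np.clip(round(duration_frames), 0, LATENT_FRAMES))
    mask = np.zeros((1, 1, LATENT_FRAMES), dtype=np.float32)
    mask[:, :, :active] = 1.0
    return mask

mask = duration_to_latent_mask(100.4)  # first 100 frames active, rest padding
```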
## Workflow
### 1. Input Preparation
Prepare calibration inputs for model quantization.
`Input_Preparation.ipynb`
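The QNN quantizer consumes calibration samples as raw float32 tensor files referenced by an input-list file. A minimal sketch of the idea (input names, shapes, and paths here are illustrative placeholders; the real ones come from `Input_Preparation.ipynb`):

```python
import numpy as np
from pathlib import Path

out_dir = Path("inputs/text_encoder")
out_dir.mkdir(parents=True, exist_ok=True)

# Illustrative calibration sample: token IDs and a style embedding.
# Actual shapes/dtypes must match the ONNX model's input signature.
tokens = np.random.randint(0, 100, size=(1, 128)).astype(np.float32)
style = np.random.randn(1, 256).astype(np.float32)

tokens.tofile(out_dir / "tokens_0.raw")
style.tofile(out_dir / "style_0.raw")

# One line per calibration sample; "name:=path" pairs map model inputs
# to their raw files (see the QNN SDK docs for the exact format).
with open(out_dir / "input_list.txt", "w") as f:
    f.write(f"tokens:={out_dir / 'tokens_0.raw'} style:={out_dir / 'style_0.raw'}\n")
```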
### 2. Step-by-Step Quantization
Convert ONNX models to QNN format with quantization for HTP backend.
`Supertonic_TTS_StepbyStep.ipynb`
### 3. Correlation Verification
Verify quantized model outputs against reference using cosine similarity.
`Correlation_Verification.ipynb`
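The verification metric is cosine similarity between the quantized model's output tensor and the float reference. A minimal standalone version of the check (the tensors below are synthetic; the notebook loads real board outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins: a reference output and a slightly perturbed
# "quantized" output, mimicking small quantization error.
reference = np.random.randn(1, 128, 256).astype(np.float32)
quantized = reference + 0.01 * np.random.randn(1, 128, 256).astype(np.float32)

sim = cosine_similarity(reference, quantized)  # close to 1.0 for a good match
```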
## Project Structure
```text
β”œβ”€β”€ Input_Preparation.ipynb # Prepare calibration inputs
β”œβ”€β”€ Supertonic_TTS_StepbyStep.ipynb # ONNX β†’ QNN quantization guide
β”œβ”€β”€ Correlation_Verification.ipynb # Output verification
β”œβ”€β”€ assets/ # ONNX models (git submodule)
β”‚ └── onnx/
β”‚ β”œβ”€β”€ text_encoder.onnx
β”‚ β”œβ”€β”€ duration_predictor.onnx
β”‚ β”œβ”€β”€ vector_estimator.onnx
β”‚ └── vocoder.onnx
β”œβ”€β”€ QNN_Models/ # Quantized QNN models (.bin, .cpp)
β”œβ”€β”€ QNN_Model_lib/ # QNN runtime libraries (aarch64)
β”œβ”€β”€ qnn_calibration/ # Calibration data for verification
β”œβ”€β”€ inputs/ # Prepared input data
└── board_output/ # Inference outputs from board
```
## Models
| Model | Description |
|--------------------|---------------------------------------------|
| text_encoder | Encodes text tokens with style embedding |
| duration_predictor | Predicts phoneme durations |
| vector_estimator | Diffusion-based latent generator (10 steps) |
| vocoder | Converts latent to audio waveform |
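The 10-step `vector_estimator` stage is a fixed-step iterative denoising loop over the latent. A schematic of the control flow, with a toy stand-in for the quantized model call (the real step rule and tensor shapes are defined by the Supertonic model, not by this sketch):

```python
import numpy as np

NUM_STEPS = 10  # diffusion steps used in this project

def toy_vector_estimator(latent, text_emb, mask, t):
    """Stand-in for the quantized vector_estimator; returns a velocity field."""
    return -latent  # toy dynamics that contracts the latent toward zero

def run_denoising(text_emb, mask, steps=NUM_STEPS, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((1, 128, 256)).astype(np.float32)
    dt = 1.0 / steps
    for i in range(steps):
        v = toy_vector_estimator(latent, text_emb, mask, i * dt)
        latent = latent + dt * v  # fixed-step (Euler-style) update
    return latent * mask  # zero out padding frames via the latent_mask
```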
### ONNX Models (Source)
Located in `assets/onnx/` (git submodule from Hugging Face):
- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`
### QNN Models (Quantized)
Located in `QNN_Models/`:
- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`
### Compiled Libraries (Ready for Deployment)
Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:
- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`
These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
> **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.
## Getting Started
1. Clone with submodules:
```bash
git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
```
2. Follow the notebooks in order:
- `Input_Preparation.ipynb`
- `Supertonic_TTS_StepbyStep.ipynb`
- `Correlation_Verification.ipynb`
## Note
> An inference script and sample application are not yet provided. Optimization work is ongoing, and they will be released soon.
## License
This model inherits the licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):
- **Model:** OpenRAIL-M License
- **Code:** MIT License
Copyright (c) 2026 Supertone Inc. (original model)