---
license: openrail
base_model:
- Supertone/supertonic
tags:
- tts
- qualcomm
- qnn
- quantized
- qcs6490
pipeline_tag: text-to-speech
language:
- en
---

# Supertonic TTS Quantization for Qualcomm Chipsets

A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using QAIRT/QNN.

> **Note:** For optimal performance and accuracy, consider generating context (ctx) binaries or serialized binaries specific to your target architecture (v68 in my case).

This walkthrough uses the QCS6490 chip architecture throughout.
## Sample Output

Audio generated on a QCS6490 board using the quantized models (10 diffusion steps; raw and somewhat noisy):

<audio controls src="https://huggingface.co/dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490/resolve/main/final_output.wav"></audio>
## Requirements

- QAIRT/QNN SDK **v2.37**
- Python 3.8+
- Target device: **QCS6490**
## Pipeline Architecture

```text
               text + style
                     │
         ┌───────────┴───────────┐
         │                       │
duration_predictor         text_encoder
         │                       │
 duration (scalar)      text_emb (1,128,256)
         │                       │
latent_mask (1,1,256)            │
         └───────────┬───────────┘
                     │
  vector_estimator (10 diffusion steps)
                     │
             denoised_latent
                     │
                 vocoder
                     │
             audio (44.1 kHz)
```
The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech versus padding.
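The scalar-to-mask post-processing can be sketched as follows. This is an illustrative helper, not the model's actual code: the function name and the `frame_rate` (latent frames per second) value are assumptions.

```python
import numpy as np

def build_latent_mask(duration_s: float, total_frames: int = 256,
                      frame_rate: float = 21.5) -> np.ndarray:
    """Turn a predicted duration (seconds) into a (1, 1, total_frames) mask.

    frame_rate is a placeholder value, not the model's real latent rate.
    """
    active = min(total_frames, int(round(duration_s * frame_rate)))
    mask = np.zeros((1, 1, total_frames), dtype=np.float32)
    mask[:, :, :active] = 1.0  # 1 = active speech frame, 0 = padding
    return mask

mask = build_latent_mask(2.0)  # 2 s of speech → first 43 frames active
```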
## Workflow

### 1. Input Preparation

Prepare calibration inputs for model quantization.

`Input_Preparation.ipynb`
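The quantizer consumes calibration tensors as headerless float32 `.raw` files listed in a plain-text input list. A minimal sketch of that preparation step, where the tensor name, shape, and sample count are placeholder assumptions:

```python
import numpy as np
from pathlib import Path

out = Path("inputs")
out.mkdir(exist_ok=True)

lines = []
for i in range(4):  # a handful of representative calibration samples
    # Placeholder tensor; real inputs come from the tokenizer/style encoder.
    text_emb = np.random.randn(1, 128, 256).astype(np.float32)
    path = out / f"text_emb_{i}.raw"
    text_emb.tofile(path)  # headerless little-endian float32
    lines.append(str(path))

# One input path per line, the format the QNN converter's input list expects.
Path("input_list.txt").write_text("\n".join(lines) + "\n")
```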
### 2. Step-by-Step Quantization

Convert the ONNX models to QNN format with quantization for the HTP backend.

`Supertonic_TTS_StepbyStep.ipynb`
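For each sub-model, conversion and compilation look roughly like this. Treat the exact flag set and paths as a sketch against QNN SDK v2.37 rather than a copy-paste recipe; consult the notebook for the full commands.

```shell
# Convert ONNX → quantized QNN model sources; the calibration input list
# drives quantization for the HTP backend.
qnn-onnx-converter \
    --input_network assets/onnx/text_encoder.onnx \
    --input_list input_list.txt \
    --output_path QNN_Models/text_encoder_htp.cpp

# Compile the generated .cpp/.bin into a deployable shared library.
qnn-model-lib-generator \
    -c QNN_Models/text_encoder_htp.cpp \
    -b QNN_Models/text_encoder_htp.bin \
    -t aarch64-oe-linux-gcc11.2 \
    -o QNN_Model_lib
```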
### 3. Correlation Verification

Verify the quantized model outputs against the FP32 reference using cosine similarity.

`Correlation_Verification.ipynb`
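The verification boils down to comparing flattened output tensors. A minimal sketch of the metric; the file paths in the usage comment are illustrative:

```python
import numpy as np

def cosine_similarity(ref: np.ndarray, test: np.ndarray) -> float:
    # Flatten both tensors and compare directions; values near 1.0
    # indicate good quantization fidelity.
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Usage (paths are placeholders):
# ref = np.fromfile("qnn_calibration/text_emb_ref.raw", dtype=np.float32)
# out = np.fromfile("board_output/text_emb.raw", dtype=np.float32)
# print(cosine_similarity(ref, out))
```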
## Project Structure

```text
├── Input_Preparation.ipynb          # Prepare calibration inputs
├── Supertonic_TTS_StepbyStep.ipynb  # ONNX → QNN quantization guide
├── Correlation_Verification.ipynb   # Output verification
├── assets/                          # ONNX models (git submodule)
│   └── onnx/
│       ├── text_encoder.onnx
│       ├── duration_predictor.onnx
│       ├── vector_estimator.onnx
│       └── vocoder.onnx
├── QNN_Models/                      # Quantized QNN models (.bin, .cpp)
├── QNN_Model_lib/                   # QNN runtime libraries (aarch64)
├── qnn_calibration/                 # Calibration data for verification
├── inputs/                          # Prepared input data
└── board_output/                    # Inference outputs from board
```
## Models

| Model              | Description                                 |
|--------------------|---------------------------------------------|
| text_encoder       | Encodes text tokens with style embedding    |
| duration_predictor | Predicts phoneme durations                  |
| vector_estimator   | Diffusion-based latent generator (10 steps) |
| vocoder            | Converts latent to audio waveform           |
### ONNX Models (Source)

Located in `assets/onnx/` (git submodule from Hugging Face):

- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`
### QNN Models (Quantized)

Located in `QNN_Models/`:

- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`
### Compiled Libraries (Ready for Deployment)

Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:

- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`

These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
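Deployment is a plain file copy. A sketch, where the board address and target directory are assumptions for my setup:

```shell
# Copy the compiled model libraries to the board (address/path are examples).
scp QNN_Model_lib/aarch64-oe-linux-gcc11.2/*.so \
    root@192.168.1.50:/data/local/tmp/supertonic/
```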
> **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow, since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.

## Getting Started

1. Clone with submodules:

   ```bash
   git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
   ```

2. Follow the notebooks in order:
   - `Input_Preparation.ipynb`
   - `Supertonic_TTS_StepbyStep.ipynb`
   - `Correlation_Verification.ipynb`
## Note

> An inference script and sample application are not provided yet. Optimization work is ongoing and will be released soon.
## License

This model inherits its licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):

- **Model:** OpenRAIL-M License
- **Code:** MIT License

Copyright (c) 2026 Supertone Inc. (original model)