--- license: openrail base_model: - Supertone/supertonic tags: - tts - qualcomm - qnn - quantized - qcs6490 pipeline_tag: text-to-speech language: - en --- # Supertonic TTS Quantization for Qualcomm chipsets A step-by-step guide to quantize the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for Qualcomm QCS6490 using QAIRT/QNN. Note: To achieve the optimal performance and accuracy consider generating ctx binaries or serialised binaries specific to the architecture (v68 - mine) ### I would be using QCS6490 chip architecture for walking through the steps. ## Sample Output Audio generated on QCS6490 board using quantized models (10 diffusion steps, raw and noisy): ## Requirements - QAIRT/QNN SDK **v2.37** - Python 3.8+ - Target device: **QCS6490** ## Pipeline Architecture ```text text + style │ ┌───────────┴───────────┐ │ │ duration_predictor text_encoder │ │ duration (scalar) text_emb (1,128,256) │ │ latent_mask (1,1,256) │ └───────────┬───────────┘ │ vector_estimator (10 diffusion steps) │ denoised_latent │ vocoder │ audio (44.1kHz) ``` The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech vs padding. ## Workflow ### 1. Input Preparation Prepare calibration inputs for model quantization. `Input_Preparation.ipynb` ### 2. Step-by-Step Quantization Convert ONNX models to QNN format with quantization for HTP backend. `Supertonic_TTS_StepbyStep.ipynb` ### 3. Correlation Verification Verify quantized model outputs against reference using cosine similarity. `Correlation_Verification.ipynb` ## Project Structure ```text ├── Input_Preparation.ipynb # Prepare calibration inputs ├── Supertonic_TTS_StepbyStep.ipynb # ONNX → QNN quantization guide ├── Correlation_Verification.ipynb # Output verification ├── assets/ # ONNX models (git submodule) │ └── onnx/ │ ├── text_encoder.onnx │ ├── duration_predictor.onnx │ ├── vector_estimator.onnx │ └── vocoder.onnx ├── QNN_Models/ # Quantized QNN models (.bin, .cpp) ├── QNN_Model_lib/ # QNN runtime libraries (aarch64) ├── qnn_calibration/ # Calibration data for verification ├── inputs/ # Prepared input data └── board_output/ # Inference outputs from board ``` ## Models | Model | Description | |--------------------|---------------------------------------------| | text_encoder | Encodes text tokens with style embedding | | duration_predictor | Predicts phoneme durations | | vector_estimator | Diffusion-based latent generator (10 steps) | | vocoder | Converts latent to audio waveform | ### ONNX Models (Source) Located in `assets/onnx/` (git submodule from Hugging Face): - `text_encoder.onnx` - `duration_predictor.onnx` - `vector_estimator.onnx` - `vocoder.onnx` ### QNN Models (Quantized) Located in `QNN_Models/`: - `text_encoder_htp.bin` / `.cpp` - `vector_estimator_htp.bin` / `.cpp` - `vocoder_htp.bin` / `.cpp` ### Compiled Libraries (Ready for Deployment) Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`: - `libtext_encoder_htp.so` - `libvector_estimator_htp.so` - `libvocoder_htp.so` - `libduration_predictor_htp.so` These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference. > **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`. ## Getting Started 1. Clone with submodules: ```bash git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490 ``` 2. Follow the notebooks in order: - `Input_Preparation.ipynb` - `Supertonic_TTS_StepbyStep.ipynb` - `Correlation_Verification.ipynb` ## Note > Inference script and sample application are not provided. Optimization work is ongoing and will be released soon. ## License This model inherits the licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2): - **Model:** OpenRAIL-M License - **Code:** MIT License Copyright (c) 2026 Supertone Inc. (original model)