---
license: openrail
base_model:
- Supertone/supertonic
tags:
- tts
- qualcomm
- qnn
- quantized
- qcs6490
pipeline_tag: text-to-speech
language:
- en
---

# Supertonic TTS Quantization for Qualcomm Chipsets

A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using QAIRT/QNN.

> **Note:** For optimal performance and accuracy, consider generating context (ctx) binaries or serialized binaries specific to your target architecture (v68 in my case).

This walkthrough uses the QCS6490 chip architecture throughout.
## Sample Output

Audio generated on a QCS6490 board using the quantized models (10 diffusion steps; raw and somewhat noisy):

<audio controls src="https://huggingface.co/dev-ansh-r/qualcomm-Supertonic-TTS-QCS6490/resolve/main/final_output.wav"></audio>
## Requirements

- QAIRT/QNN SDK **v2.37**
- Python 3.8+
- Target device: **QCS6490**
## Pipeline Architecture

```text
               text + style
                     │
         ┌───────────┴───────────┐
         │                       │
duration_predictor         text_encoder
         │                       │
 duration (scalar)      text_emb (1,128,256)
         │                       │
latent_mask (1,1,256)            │
         └───────────┬───────────┘
                     │
  vector_estimator (10 diffusion steps)
                     │
             denoised_latent
                     │
                 vocoder
                     │
             audio (44.1 kHz)
```
The `duration_predictor` outputs a single scalar representing the total speech duration. This is post-processed into a `latent_mask` that tells the `vector_estimator` how many of the 256 fixed-size latent frames are active speech versus padding.
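The scalar-to-mask post-processing can be sketched as follows. This is an illustrative helper, not the model's actual code: the function name and the `frame_rate` (latent frames per second) value are assumptions.

```python
import numpy as np

def build_latent_mask(duration_s: float, total_frames: int = 256,
                      frame_rate: float = 21.5) -> np.ndarray:
    """Turn a predicted duration (seconds) into a (1, 1, total_frames) mask.

    frame_rate is a placeholder value, not the model's real latent rate.
    """
    active = min(total_frames, int(round(duration_s * frame_rate)))
    mask = np.zeros((1, 1, total_frames), dtype=np.float32)
    mask[:, :, :active] = 1.0  # 1 = active speech frame, 0 = padding
    return mask

mask = build_latent_mask(2.0)  # 2 s of speech → first 43 frames active
```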
## Workflow

### 1. Input Preparation

Prepare calibration inputs for model quantization.

`Input_Preparation.ipynb`
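The quantizer consumes calibration tensors as headerless float32 `.raw` files listed in a plain-text input list. A minimal sketch of that preparation step, where the tensor name, shape, and sample count are placeholder assumptions:

```python
import numpy as np
from pathlib import Path

out = Path("inputs")
out.mkdir(exist_ok=True)

lines = []
for i in range(4):  # a handful of representative calibration samples
    # Placeholder tensor; real inputs come from the tokenizer/style encoder.
    text_emb = np.random.randn(1, 128, 256).astype(np.float32)
    path = out / f"text_emb_{i}.raw"
    text_emb.tofile(path)  # headerless little-endian float32
    lines.append(str(path))

# One input path per line, the format the QNN converter's input list expects.
Path("input_list.txt").write_text("\n".join(lines) + "\n")
```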
### 2. Step-by-Step Quantization

Convert the ONNX models to QNN format with quantization for the HTP backend.

`Supertonic_TTS_StepbyStep.ipynb`
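For each sub-model, conversion and compilation look roughly like this. Treat the exact flag set and paths as a sketch against QNN SDK v2.37 rather than a copy-paste recipe; consult the notebook for the full commands.

```shell
# Convert ONNX → quantized QNN model sources; the calibration input list
# drives quantization for the HTP backend.
qnn-onnx-converter \
    --input_network assets/onnx/text_encoder.onnx \
    --input_list input_list.txt \
    --output_path QNN_Models/text_encoder_htp.cpp

# Compile the generated .cpp/.bin into a deployable shared library.
qnn-model-lib-generator \
    -c QNN_Models/text_encoder_htp.cpp \
    -b QNN_Models/text_encoder_htp.bin \
    -t aarch64-oe-linux-gcc11.2 \
    -o QNN_Model_lib
```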
### 3. Correlation Verification

Verify the quantized model outputs against the FP32 reference using cosine similarity.

`Correlation_Verification.ipynb`
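The verification boils down to comparing flattened output tensors. A minimal sketch of the metric; the file paths in the usage comment are illustrative:

```python
import numpy as np

def cosine_similarity(ref: np.ndarray, test: np.ndarray) -> float:
    # Flatten both tensors and compare directions; values near 1.0
    # indicate good quantization fidelity.
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Usage (paths are placeholders):
# ref = np.fromfile("qnn_calibration/text_emb_ref.raw", dtype=np.float32)
# out = np.fromfile("board_output/text_emb.raw", dtype=np.float32)
# print(cosine_similarity(ref, out))
```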
## Project Structure

```text
├── Input_Preparation.ipynb          # Prepare calibration inputs
├── Supertonic_TTS_StepbyStep.ipynb  # ONNX → QNN quantization guide
├── Correlation_Verification.ipynb   # Output verification
├── assets/                          # ONNX models (git submodule)
│   └── onnx/
│       ├── text_encoder.onnx
│       ├── duration_predictor.onnx
│       ├── vector_estimator.onnx
│       └── vocoder.onnx
├── QNN_Models/                      # Quantized QNN models (.bin, .cpp)
├── QNN_Model_lib/                   # QNN runtime libraries (aarch64)
├── qnn_calibration/                 # Calibration data for verification
├── inputs/                          # Prepared input data
└── board_output/                    # Inference outputs from board
```
## Models

| Model              | Description                                 |
|--------------------|---------------------------------------------|
| text_encoder       | Encodes text tokens with style embedding    |
| duration_predictor | Predicts phoneme durations                  |
| vector_estimator   | Diffusion-based latent generator (10 steps) |
| vocoder            | Converts latent to audio waveform           |
### ONNX Models (Source)

Located in `assets/onnx/` (git submodule from Hugging Face):

- `text_encoder.onnx`
- `duration_predictor.onnx`
- `vector_estimator.onnx`
- `vocoder.onnx`
### QNN Models (Quantized)

Located in `QNN_Models/`:

- `text_encoder_htp.bin` / `.cpp`
- `vector_estimator_htp.bin` / `.cpp`
- `vocoder_htp.bin` / `.cpp`
### Compiled Libraries (Ready for Deployment)

Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:

- `libtext_encoder_htp.so`
- `libvector_estimator_htp.so`
- `libvocoder_htp.so`
- `libduration_predictor_htp.so`

These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
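Deployment is a plain file copy. A sketch, where the board address and target directory are assumptions for my setup:

```shell
# Copy the compiled model libraries to the board (address/path are examples).
scp QNN_Model_lib/aarch64-oe-linux-gcc11.2/*.so \
    root@192.168.1.50:/data/local/tmp/supertonic/
```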
> **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow, since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.

## Getting Started

1. Clone with submodules:

   ```bash
   git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
   ```

2. Follow the notebooks in order:
   - `Input_Preparation.ipynb`
   - `Supertonic_TTS_StepbyStep.ipynb`
   - `Correlation_Verification.ipynb`
## Note

> An inference script and sample application are not provided yet. Optimization work is ongoing and will be released soon.
## License

This model inherits its licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):

- **Model:** OpenRAIL-M License
- **Code:** MIT License

Copyright (c) 2026 Supertone Inc. (original model)