dev-ansh-r Claude Opus 4.6 committed on
Commit cbe27d6 · 1 Parent(s): b453630

Add quantized QNN model libraries for QCS6490

Compiled .so libraries for HTP backend deployment:
- libtext_encoder_htp.so
- libvector_estimator_htp.so
- libvocoder_htp.so
- libduration_predictor_htp.so

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.so filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,11 +1,152 @@
  ---
  license: openrail
- base_model:
- - Supertone/supertonic
- language:
- - en
- pipeline_tag: text-to-speech
+ base_model: Supertone/supertonic-2
  tags:
- - qualcomm
- - LPAI
- ---
+ - tts
+ - text-to-speech
+ - qualcomm
+ - qnn
+ - quantized
+ - qcs6490
+ - hexagon
+ pipeline_tag: text-to-speech
+ ---
+
+ # Supertonic TTS Quantization for QCS6490
+
+ A step-by-step guide to quantizing the [Supertonic TTS](https://huggingface.co/Supertone/supertonic) model for the Qualcomm QCS6490 using QAIRT/QNN.
+
+ ## Requirements
+
+ - QAIRT/QNN SDK **v2.37**
+ - Python 3.8+
+ - Target device: **QCS6490**
+
+ ## Pipeline Architecture
+
+ ```text
+                 text + style
+                       │
+           ┌───────────┴───────────┐
+           │                       │
+  duration_predictor         text_encoder
+           │                       │
+   duration (scalar)     text_emb (1,128,256)
+           │                       │
+ latent_mask (1,1,256)             │
+           └───────────┬───────────┘
+                       │
+     vector_estimator (10 diffusion steps)
+                       │
+                denoised_latent
+                       │
+                    vocoder
+                       │
+                audio (44.1kHz)
+ ```
+
+ The `duration_predictor` outputs a single scalar: the total speech duration. Post-processing turns this scalar into a `latent_mask` that tells the `vector_estimator` which of the 256 fixed-size latent frames are active speech and which are padding.
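
As a rough illustration of that post-processing step, the sketch below turns a duration in seconds into a `(1, 1, 256)` mask of ones and zeros. The function name and the `frames_per_second` parameter are assumptions made for this example; the real latent frame rate comes from the model configuration and is not stated in this README.

```python
import math

MAX_FRAMES = 256  # fixed latent length shown in the pipeline diagram


def make_latent_mask(duration_s, frames_per_second, max_frames=MAX_FRAMES):
    """Return a nested list shaped (1, 1, max_frames): 1.0 = speech, 0.0 = padding.

    frames_per_second is a hypothetical parameter; take the real value
    from the model's configuration.
    """
    if duration_s < 0 or frames_per_second <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    # Round up so a trailing partial frame of speech is kept,
    # then clamp to the fixed latent length.
    active = min(max_frames, math.ceil(duration_s * frames_per_second))
    row = [1.0] * active + [0.0] * (max_frames - active)
    return [[row]]


mask = make_latent_mask(duration_s=2.5, frames_per_second=20.0)
print(int(sum(mask[0][0])), "active frames of", len(mask[0][0]))
```

Clamping to 256 frames means any utterance longer than the fixed window is truncated rather than rejected; whether that matches the notebooks' behavior is not confirmed here.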
+
+ ## Workflow
+
+ ### 1. Input Preparation
+
+ Prepare calibration inputs for model quantization.
+
+ `Input_Preparation.ipynb`
+
+ ### 2. Step-by-Step Quantization
+
+ Convert the ONNX models to QNN format, quantized for the HTP backend.
+
+ `Supertonic_TTS_StepbyStep.ipynb`
+
+ ### 3. Correlation Verification
+
+ Verify the quantized model outputs against the reference outputs using cosine similarity.
+
+ `Correlation_Verification.ipynb`
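
For readers unfamiliar with the metric used in step 3, here is a minimal sketch of a cosine-similarity check in plain Python. It is not taken from `Correlation_Verification.ipynb`; the function name and the toy values are illustrative assumptions, standing in for a flattened reference output and the corresponding quantized output.

```python
import math


def cosine_similarity(reference, candidate):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        # Undefined for an all-zero vector; report 0.0 instead of dividing by zero.
        return 0.0
    return dot / (norm_ref * norm_cand)


# Toy values standing in for a flattened reference output and a quantized output.
reference = [0.10, -0.52, 0.33, 0.90]
quantized = [0.11, -0.50, 0.30, 0.88]
print(f"cosine similarity: {cosine_similarity(reference, quantized):.4f}")
```

A value close to 1.0 indicates that the quantized output points in nearly the same direction as the reference. The acceptance threshold is a project decision and is not specified here.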
+
+ ## Project Structure
+
+ ```text
+ ├── Input_Preparation.ipynb          # Prepare calibration inputs
+ ├── Supertonic_TTS_StepbyStep.ipynb  # ONNX → QNN quantization guide
+ ├── Correlation_Verification.ipynb   # Output verification
+ ├── assets/                          # ONNX models (git submodule)
+ │   └── onnx/
+ │       ├── text_encoder.onnx
+ │       ├── duration_predictor.onnx
+ │       ├── vector_estimator.onnx
+ │       └── vocoder.onnx
+ ├── QNN_Models/                      # Quantized QNN models (.bin, .cpp)
+ ├── QNN_Model_lib/                   # QNN runtime libraries (aarch64)
+ ├── qnn_calibration/                 # Calibration data for verification
+ ├── inputs/                          # Prepared input data
+ └── board_output/                    # Inference outputs from board
+ ```
+
+ ## Models
+
+ | Model              | Description                                 |
+ |--------------------|---------------------------------------------|
+ | text_encoder       | Encodes text tokens with style embedding    |
+ | duration_predictor | Predicts total speech duration (one scalar) |
+ | vector_estimator   | Diffusion-based latent generator (10 steps) |
+ | vocoder            | Converts latent to audio waveform           |
+
+ ### ONNX Models (Source)
+
+ Located in `assets/onnx/` (git submodule from Hugging Face):
+
+ - `text_encoder.onnx`
+ - `duration_predictor.onnx`
+ - `vector_estimator.onnx`
+ - `vocoder.onnx`
+
+ ### QNN Models (Quantized)
+
+ Located in `QNN_Models/`:
+
+ - `text_encoder_htp.bin` / `.cpp`
+ - `vector_estimator_htp.bin` / `.cpp`
+ - `vocoder_htp.bin` / `.cpp`
+
+ ### Compiled Libraries (Ready for Deployment)
+
+ Located in `QNN_Model_lib/aarch64-oe-linux-gcc11.2/`:
+
+ - `libtext_encoder_htp.so`
+ - `libvector_estimator_htp.so`
+ - `libvocoder_htp.so`
+ - `libduration_predictor_htp.so`
+
+ These `.so` files are compiled from the `.cpp` sources and are ready to be deployed (via SCP) to the board for inference.
+
+ > **Note:** The `duration_predictor` is quantized and compiled but not used in the current calibration-based workflow since `latent_mask` is precomputed. For an end-to-end pipeline with arbitrary text input, the duration predictor must run first to dynamically generate the `latent_mask`.
+
+ ## Getting Started
+
+ 1. Clone with submodules:
+
+ ```bash
+ git clone --recurse-submodules https://github.com/dev-ansh-r/Supertonic-TTS-QCS6490
+ ```
+
+ 2. Follow the notebooks in order:
+ - `Input_Preparation.ipynb`
+ - `Supertonic_TTS_StepbyStep.ipynb`
+ - `Correlation_Verification.ipynb`
+
+ ## Note
+
+ > An inference script and sample application are not provided yet. Optimization work is ongoing and will be released soon.
+
+ ## License
+
+ This model inherits its licensing from [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2):
+
+ - **Model:** OpenRAIL-M License
+ - **Code:** MIT License
+
+ Copyright (c) 2026 Supertone Inc. (original model)
libduration_predictor_htp.so ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b581701a8b4a10cbca55d428fe94de005dc8e6322e4de596228b45da5d2bee6
+ size 1027296
libtext_encoder_htp.so ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8f8d69ce4c134f340ddf4bee8a2f2be3e43f47e55c446dfe68a2d46d19b5d8d
+ size 7819640
libvector_estimator_htp.so ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7ea703c8e37afd08f37d27d975ac110b1b20c78fa4d467d5e1c7c89f5ec1c036
+ size 34901904
libvocoder_htp.so ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f1aba5afbd21acb5fe69ef7b048a3b794ee410907458b9180470067e604796db
+ size 25864496
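
Each added `.so` entry above is a Git LFS pointer, not the binary itself: it records the SHA-256 digest and byte size of the real file. The sketch below, which is not part of the commit, shows one way to check a downloaded library against its pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are made up for this example.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into (sha256_hex, size_in_bytes)."""
    # Every pointer line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(data, pointer_text):
    """True if the bytes have the size and SHA-256 recorded in the pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    return len(data) == size and hashlib.sha256(data).hexdigest() == digest


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0b581701a8b4a10cbca55d428fe94de005dc8e6322e4de596228b45da5d2bee6
size 1027296
"""
print(parse_lfs_pointer(pointer))
```

To check a real file, read it in binary mode and pass the bytes to `matches_pointer` together with the matching pointer text. A mismatch usually means the clone fetched the pointer stub instead of the LFS object.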