---
base_model:
- Supertone/supertonic
pipeline_tag: text-to-speech
---

# Supertonic Quantized INT8: Offline TTS (Shadow0482)

This repository contains **INT8 optimized ONNX models** for the Supertonic Text-To-Speech
pipeline. These models are quantized versions of the official Supertonic models and are
designed for **offline, low-latency, CPU-friendly inference**.

FP16 versions exist for experimentation, but the vocoder currently contains a type mismatch
(`float32` vs `float16`) in a `Div` node, so the FP16 vocoder fails to load and FP16 inference does not work.  
Therefore, **INT8 is the recommended format** for real-world offline use.

---

# 🚀 Features

### ✔ 100% Offline Execution  
No network needed. Load ONNX models directly using ONNX Runtime.

### ✔ Full Supertonic Inference Stack  
- Text Encoder  
- Duration Predictor  
- Vector Estimator  
- Vocoder  

### ✔ INT8 Dynamic Quantization  
- Reduces model sizes dramatically  
- CPU-friendly inference  
- Very low memory usage  
- Compatible with ONNX Runtime CPUExecutionProvider  
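
To make the idea behind INT8 quantization concrete, here is a minimal NumPy sketch of symmetric per-tensor weight quantization. It is an illustration only, not the procedure used to produce the models in this repository, and the helper names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> (int8 values, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

# Random stand-in for a weight matrix (not taken from the real models).
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Each int8 value takes one byte instead of four, which is where the size reduction comes from; the rounding error per weight is bounded by about half the scale.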

### ✔ Intelligible Speech Output  
Produces understandable speech while running faster on CPU than the FP32 baseline (roughly 2–4× in the benchmark below).

---

# 📦 Repository Structure

```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (Contains experimental FP16 models; vocoder currently unstable)
```

Only the **INT8 directory** is guaranteed stable.
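
Before running inference, it can help to confirm that the four INT8 files listed above are actually present. The following is a small sketch using only the standard library; the helper name `missing_int8_models` is hypothetical, and the directory name assumes the layout shown above.

```python
from pathlib import Path

# File names taken from the repository structure listed above.
EXPECTED_INT8_FILES = [
    "duration_predictor.int8.onnx",
    "text_encoder.int8.onnx",
    "vector_estimator.int8.onnx",
    "vocoder.int8.onnx",
]

def missing_int8_models(model_dir):
    """Return the expected INT8 file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_INT8_FILES if not (model_dir / name).is_file()]

missing = missing_int8_models("int8_dynamic")
if missing:
    print("Missing INT8 models:", ", ".join(missing))
else:
    print("All four INT8 models found.")
```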

---

# 🔊 Test Sentence Used in Benchmark

```

Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.

```

---

# 📈 Benchmark Summary (CPU)

| Model | Precision | Time (s) | Output | Status |
|-------|-----------|---------:|--------|--------|
| INT8 Dynamic | int8 | _varies: ~3.0–7.0 s_ | `*.wav` | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower than INT8 | `*.wav` | ✅ OK |
| FP16 | mixed | ❌ FAILED | n/a | 🚫 Cannot load vocoder |
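
Timings like these depend heavily on the machine, so it is worth measuring on your own hardware. Below is a minimal timing sketch using `time.perf_counter`; it is not the benchmark harness that produced the table, and `time_call` is a hypothetical helper name.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; replace it with a function that runs the full TTS pipeline.
seconds, total = time_call(sum, range(1_000_000))
print(f"best of 3: {seconds:.4f}s")
```

Taking the best of several runs reduces the influence of one-off effects such as cold caches on the first call.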

---

# 🖥️ Offline Inference Guide (Python)

Below is a clean Python script to run **fully offline INT8 inference**.

---

# 🧩 Requirements

```
pip install onnxruntime numpy soundfile
```

---

# 📜 offline_tts_int8.py

```python
import onnxruntime as ort
import numpy as np
import json
import soundfile as sf
from pathlib import Path

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")   # folder containing *.int8.onnx
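# Path to a voice style file; it is not loaded below, where zero-filled
# style tensors are passed to the models instead.
VOICE_STYLE = "assets/voice_styles/M1.json"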

text_encoder_path      = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path     = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path  = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path           = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
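unicode_path = Path("assets/onnx/unicode_indexer.json")
# Use a context manager so the file handle is closed after loading.
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    token2idx = tokenizer["token2idx"]
    ids = [token2idx[ch] if ch in token2idx else token2idx["<unk>"] for ch in text]
    # Shape (1, num_chars): a batch containing a single sequence.
    return np.array([ids], dtype=np.int64)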

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"]
    )

sess_text = load_session(text_encoder_path)
sess_dur  = load_session(duration_pred_path)
sess_vec  = load_session(vector_estimator_path)
sess_voc  = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl
    }
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp
    }
)[0]

durations = np.maximum(dur_out.astype(int), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER → WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
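
The input names passed to each `run` call in the script must match what the ONNX graphs declare. If a call fails with an unknown or missing input, a quick way to investigate is to list each session's declared inputs and outputs through ONNX Runtime's `get_inputs()` and `get_outputs()`. The sketch below assumes sessions created as in the script; `describe_io` and `print_io` are hypothetical helper names.

```python
def describe_io(session):
    """Return {'inputs': [...], 'outputs': [...]} of (name, type, shape) tuples."""
    def as_tuples(nodes):
        return [(n.name, n.type, list(n.shape)) for n in nodes]
    return {
        "inputs": as_tuples(session.get_inputs()),
        "outputs": as_tuples(session.get_outputs()),
    }

def print_io(label, session):
    """Print the declared inputs and outputs of one session."""
    info = describe_io(session)
    print(f"[{label}]")
    for kind in ("inputs", "outputs"):
        for name, dtype, shape in info[kind]:
            print(f"  {kind[:-1]:6s} {name}: {dtype} {shape}")

# Example usage with the sessions from the script:
#   print_io("text_encoder", sess_text)
#   print_io("vector_estimator", sess_vec)
```

Comparing the printed names, dtypes, and shapes against the dictionaries passed to `run` shows which inputs a given model actually expects.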

---

# 🎧 Output

After running:

```
python offline_tts_int8.py
```

You will get:

```
output_int8.wav
```

Playable offline on any system.
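
As a quick sanity check on the result, the duration and sample rate of the file can be read back with the standard library. This sketch assumes the WAV was written with PCM encoding, which the `wave` module requires; `wav_duration_seconds` is a hypothetical helper name.

```python
import wave

def wav_duration_seconds(path):
    """Return (duration_seconds, sample_rate) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return wf.getnframes() / rate, rate

# Example usage after running the script:
#   duration, rate = wav_duration_seconds("output_int8.wav")
#   print(f"{duration:.2f}s at {rate} Hz")
```

A duration near zero, or a sample rate other than the one passed to `sf.write`, suggests the pipeline did not produce the expected audio.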

---

# 📝 Notes

* Only the **INT8** models are stable & recommended.
* FP16 vocoder currently fails due to a type mismatch in a `Div` node.
* No internet connection is required for INT8 inference.
* These models are ideal for embedded or low-spec machines.

---

# 📄 License

Models follow Supertone's licensing terms.
Quantized versions follow the same licensing.