How to use Kyumdroid/supertonic-3-quant with Supertonic:
    from supertonic import TTS

    tts = TTS(auto_download=True)
    style = tts.get_voice_style(voice_name="M1")
    text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
    wav, duration = tts.synthesize(text, voice_style=style)
    tts.save_audio(wav, "output.wav")
Update README: add int8 variant, voice catalog, conversion details
README.md
CHANGED
@@ -2,24 +2,80 @@
 license: openrail
 base_model: Supertone/supertonic-3
 base_model_relation: quantized
 tags:
 - text-to-speech
 - onnx
 - quantized
 - supertonic
 ---

-# Supertonic-3 Quantized

-Quantized derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS.

 ## Variants

-| Folder |
-|--------|------
-| `fp16/` |

-

 ## Layout

@@ -36,38 +92,69 @@ voice_styles/
 {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```

-- **`<variant>/onnx/`**:
-- **`voice_styles/`**: variant-independent voice embeddings shared

 ## Download

-Snapshot the `fp16` variant only:
-
 ```bash
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
 ```

-

-
-
-
-voice_styles/... # 10 JSON
-```

-

 ## Conversion

-`fp16/`

 ## License

-OpenRAIL-M

 ## Credits

-- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
--
 license: openrail
 base_model: Supertone/supertonic-3
 base_model_relation: quantized
+language:
+- en
+- ko
+- ja
+- ar
+- bg
+- cs
+- da
+- de
+- el
+- es
+- et
+- fi
+- fr
+- hi
+- hr
+- hu
+- id
+- it
+- lt
+- lv
+- nl
+- pl
+- pt
+- ro
+- ru
+- sk
+- sl
+- sv
+- tr
+- uk
+- vi
+pipeline_tag: text-to-speech
+library_name: supertonic
 tags:
 - text-to-speech
+- tts
+- speech-synthesis
 - onnx
 - quantized
+- fp16
+- int8
 - supertonic
+- multilingual
+- on-device
+- diffusion
+- flow-matching
 ---

+# Supertonic-3 Quantized (ONNX)

+Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacements for the official ONNX assets: same Python/C++/Node SDK, smaller and faster.
+
+31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).

 ## Variants

+| Folder | Total size | Method | Quality | Use case |
+|--------|---:|---|---|---|
+| **`fp16/`** | **191 MB** | All 4 models float16 | Reference (≈99% of fp32) | Highest quality on CoreML/DirectML EP |
+| **`int8/`** | **131 MB** | `vector_estimator` int8 dynamic + others fp16 (selective) | Near-identical to fp16 by ear | Smallest viable for production |
+
+Both variants share `voice_styles/` (unchanged from upstream).
+
+### Why selective quantization for `int8/`?

+Full dynamic int8 on all 4 models causes audible artifacts on `vocoder` (conv-based waveform generation) and `text_encoder` (attention/LayerNorm). Selective quantization applies int8 only to `vector_estimator` (a diffusion U-Net with built-in redundancy that tolerates weight-only int8), keeping the sensitive layers in fp16. This mirrors the production configuration used in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
+
+| Model | Role | `int8/` precision | Sensitivity to int8 |
+|---|---|---|---|
+| `vector_estimator` | Diffusion U-Net (8× denoising) | **int8 dynamic** | Low (redundancy across steps) |
+| `vocoder` | Vocos-style waveform decoder | fp16 | **High** (direct audio output) |
+| `text_encoder` | Multilingual transformer | fp16 | High (attention + LayerNorm) |
+| `duration_predictor` | Length regressor | fp16 | Low (but tiny, no win from int8) |
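
As a rough sanity check on the totals quoted in the Variants table, the sketch below back-solves how large `vector_estimator` would have to be if moving it from fp16 to int8 accounts for the whole 191 MB to 131 MB difference. It assumes int8 weights take exactly half the bytes of fp16 and ignores scale/zero-point overhead and non-weight data, so the per-model figures are estimates, not measured file sizes.

```python
# Back-of-envelope estimate only. Assumptions (not stated in the README):
# int8 weights use half the bytes of fp16, and vector_estimator is the only
# model whose size differs between the fp16/ and int8/ variants.
FP16_TOTAL_MB = 191  # total size of fp16/ from the Variants table
INT8_TOTAL_MB = 131  # total size of int8/ from the Variants table


def estimate_vector_estimator_mb(fp16_total: float, int8_total: float) -> tuple[float, float]:
    """Return the (fp16, int8) size of the quantized model implied by the totals."""
    saved = fp16_total - int8_total  # MB saved by quantizing one model
    fp16_size = 2 * saved            # halving a model saves half of its size
    return fp16_size, fp16_size / 2


fp16_mb, int8_mb = estimate_vector_estimator_mb(FP16_TOTAL_MB, INT8_TOTAL_MB)
print(f"vector_estimator: ~{fp16_mb:.0f} MB in fp16, ~{int8_mb:.0f} MB in int8")
print(f"other three models combined: ~{FP16_TOTAL_MB - fp16_mb:.0f} MB in fp16")
```

Under these assumptions `vector_estimator` would make up most of the fp16 total, which is consistent with it being the one model worth quantizing.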

 ## Layout

@@ -36,38 +92,69 @@ voice_styles/
 {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```

+- **`<variant>/onnx/`**: 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`). Filenames have no variant infix; the folder is the variant.
+- **`voice_styles/`**: variant-independent voice embeddings, shared across all variants.

 ## Download

 ```bash
+# fp16 variant (highest quality)
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
+
+# int8 variant (smallest, near-identical quality)
+hf download Kyumdroid/supertonic-3-quant \
+  --include="int8/onnx/**" --include="voice_styles/**" \
+  --local-dir ./supertonic
 ```

+`voice_styles/` is shared; if you fetch both variants, you only need it once.
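
The same selective download can probably be done from Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of the CLI's `--include`. This is a minimal sketch, not part of the repository: `variant_patterns` and `download_variant` are hypothetical helper names, and the patterns simply mirror the ones in the shell commands above.

```python
from pathlib import Path

REPO_ID = "Kyumdroid/supertonic-3-quant"
VARIANTS = ("fp16", "int8")


def variant_patterns(variant: str) -> list[str]:
    """Glob patterns for one variant's ONNX folder plus the shared voice styles."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [f"{variant}/onnx/**", "voice_styles/**"]


def download_variant(variant: str, local_dir: str = "./supertonic") -> Path:
    """Fetch a single variant into local_dir and return the snapshot path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(variant),
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_variant("int8")` would fetch only `int8/onnx/` and `voice_styles/`; calling it again for `"fp16"` with the same `local_dir` should reuse the voice style files already on disk.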

+## Voice catalog
+
+Display names follow the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):

+| File | Name | Description |
+|---|---|---|
+| `M1.json` | **Alex** | Lively, upbeat male |
+| `M2.json` | **James** | Deep, composed male |
+| `M3.json` | **Robert** | Polished, authoritative male *(demo default)* |
+| `M4.json` | **Sam** | Soft, neutral, youthful male |
+| `M5.json` | **Daniel** | Warm, soothing male |
+| `F1.json` | **Sarah** | Calm, steady female |
+| `F2.json` | **Lily** | Bright, cheerful female |
+| `F3.json` | **Jessica** | Broadcast-style female |
+| `F4.json` | **Olivia** | Crisp, confident female |
+| `F5.json` | **Emily** | Gentle, soothing female |
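
For applications that expose the display names rather than the file names, the catalog can be carried as a small lookup table. The sketch below copies the name-to-file mapping from the table above; `voice_style_path` is a hypothetical helper, and the default `./supertonic` root is only assumed from the download commands.

```python
from pathlib import Path

# Display name -> voice style file, copied from the catalog table.
VOICE_FILES = {
    "Alex": "M1.json",
    "James": "M2.json",
    "Robert": "M3.json",  # demo default
    "Sam": "M4.json",
    "Daniel": "M5.json",
    "Sarah": "F1.json",
    "Lily": "F2.json",
    "Jessica": "F3.json",
    "Olivia": "F4.json",
    "Emily": "F5.json",
}


def voice_style_path(name: str, root: str = "./supertonic") -> Path:
    """Resolve a display name (case-insensitive) to its JSON file under voice_styles/."""
    lookup = {key.lower(): value for key, value in VOICE_FILES.items()}
    try:
        filename = lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown voice {name!r}") from None
    return Path(root) / "voice_styles" / filename
```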

 ## Conversion

+- **`fp16/`**: `onnxruntime.transformers.float16.convert_float_to_float16` with `keep_io_types=True`, `op_block_list=['Cast']`, and ONNX shape inference applied first.
+- **`int8/`**: `vector_estimator` only via `onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)`; others copied from the fp16 variant. Identical method to [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)'s `vector_estimator_int8.onnx`.
+
+Conversion scripts are available in the project repository.
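
The two conversion steps described above might look roughly like the following. This is a sketch under stated assumptions, not the repository's actual scripts: the `<model>.onnx` file names, the helper names, and the choice to quantize `vector_estimator` from the upstream fp32 file are all assumptions; only the library calls and their arguments come from the description.

```python
import shutil
from pathlib import Path

MODELS = ("duration_predictor", "text_encoder", "vector_estimator", "vocoder")
INT8_MODELS = {"vector_estimator"}  # the only model quantized in int8/


def int8_plan() -> dict[str, str]:
    """Precision each model gets in the int8/ variant."""
    return {m: "int8-dynamic" if m in INT8_MODELS else "fp16" for m in MODELS}


def convert_fp16(src: Path, dst: Path) -> None:
    """fp32 -> fp16 with shape inference first, keeping fp32 inputs/outputs."""
    import onnx
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = shape_inference.infer_shapes(onnx.load(str(src)))
    model = convert_float_to_float16(model, keep_io_types=True, op_block_list=["Cast"])
    onnx.save(model, str(dst))


def quantize_int8(src: Path, dst: Path) -> None:
    """Dynamic per-channel int8 weight quantization."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(str(src), str(dst), weight_type=QuantType.QInt8, per_channel=True)


def build_int8_variant(fp32_dir: Path, fp16_dir: Path, out_dir: Path) -> None:
    """Quantize vector_estimator and copy the other models from the fp16 variant."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, precision in int8_plan().items():
        target = out_dir / f"{name}.onnx"  # assumed file naming
        if precision == "fp16":
            shutil.copy2(fp16_dir / f"{name}.onnx", target)
        else:
            quantize_int8(fp32_dir / f"{name}.onnx", target)
```

The ONNX and ONNX Runtime imports are deferred into the functions so that the plan can be inspected without either package installed.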
+
+## Performance (Apple Silicon CPU, M-series)
+
+Short Korean utterance ("안녕하세요. 오늘 날씨가 정말 좋네요.", i.e. "Hello. The weather is really nice today."), CPU EP only:
+
+| Variant | Size | Synthesis time | Quality (auditory) |
+|---|---:|---:|---|
+| fp32 baseline (upstream) | 380 MB | ~0.7 s | Reference |
+| **fp16/** | 191 MB | ~0.7 s | Indistinguishable from fp32 |
+| **int8/** | 131 MB | ~0.7-5 s | Indistinguishable from fp16 |
+
+> The CPU EP runs weight-only int8 by dequantizing to fp32 before the matmul, so int8 is not faster on CPU. Use the CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration; int8/fp16 then run faster than fp32 with significantly lower memory.
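
When loading the ONNX files directly with ONNX Runtime, the execution provider advice above could be applied along these lines. A minimal sketch: `pick_providers` and `open_session` are hypothetical helpers, the provider strings are ONNX Runtime's standard names for CoreML, DirectML and CPU, and whether the accelerated providers are present depends on the installed ONNX Runtime build.

```python
# Preference order: accelerated providers first, CPU as the universal fallback.
PREFERRED = ("CoreMLExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: list[str]) -> list[str]:
    """Keep the preferred providers that are available, always ending with CPU."""
    chosen = [p for p in PREFERRED if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def open_session(model_path: str):
    """Create an InferenceSession using the best providers this build offers."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing the CPU provider last lets ONNX Runtime fall back to it for any operator the accelerated provider does not support.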

 ## License

+OpenRAIL-M, inherited from [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3). See [LICENSE](./LICENSE).
+
+Use restrictions (Attachment A) apply: no impersonation/deepfakes without consent, no AI-generated content without disclosure, no medical advice, no illegal activities, etc.

 ## Credits

+- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
+- Reference quantization pattern: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
+- Quantization (this repo): selective fp16/int8 ONNX for Electron / desktop on-device deployment