Commit 99e5935 (verified) by Kyumdroid · 1 parent: 56ec6b6

Update README: add int8 variant, voice catalog, conversion details

Files changed (1): README.md (+108 -21)
@@ -2,24 +2,80 @@
  license: openrail
  base_model: Supertone/supertonic-3
  base_model_relation: quantized
  tags:
  - text-to-speech
  - onnx
  - quantized
  - supertonic
  ---

- # Supertonic-3 Quantized

- Quantized derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS.

  ## Variants

- | Folder | Format | Notes |
- |--------|--------|-------|
- | `fp16/` | ONNX fp16 | float16 weights, CPU-friendly |

- More variants (`int8/`, `int4/`, `mixed/`) may be added later as sibling folders.

  ## Layout

@@ -36,38 +92,69 @@ voice_styles/
  {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
  ```

- - **`<variant>/onnx/`** (variant-scoped): 4 ONNX weights + the architecture config (`tts.json`) and tokenizer table (`unicode_indexer.json`). Both JSON files live next to the weights because future variants may carry a different `latent_dim` / `n_style` / unicode coverage. Filenames carry no variant infix; the folder is the variant.
- - **`voice_styles/`**: variant-independent voice embeddings shared by every variant. Quantizing the ONNX graph does not change the style vectors, so the same files work across `fp16`, `int8`, `int4`, and any future mixed variant.

  ## Download

- Snapshot the `fp16` variant only:
-
  ```bash
  hf download Kyumdroid/supertonic-3-quant \
  --include="fp16/onnx/**" --include="voice_styles/**" \
  --local-dir ./supertonic
  ```

- Resulting tree (16 files, ~203 MB):

- ```
- supertonic/
-   fp16/onnx/... # 4 ONNX + tts.json + unicode_indexer.json
-   voice_styles/... # 10 JSON
- ```

- When a new variant is published (e.g. `int8`), only the `<new-variant>/onnx/` tree needs to be fetched; `voice_styles/` is reused.

  ## Conversion

- `fp16/` was produced via `onnxconverter_common.float16.convert_float_to_float16_model_path` against the fp32 ONNX models in `Supertone/supertonic-3`.

  ## License

- OpenRAIL-M (inherited from base model). See [LICENSE](./LICENSE).

  ## Credits

- - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
- - Quantization: this repo

  license: openrail
  base_model: Supertone/supertonic-3
  base_model_relation: quantized
+ language:
+ - en
+ - ko
+ - ja
+ - ar
+ - bg
+ - cs
+ - da
+ - de
+ - el
+ - es
+ - et
+ - fi
+ - fr
+ - hi
+ - hr
+ - hu
+ - id
+ - it
+ - lt
+ - lv
+ - nl
+ - pl
+ - pt
+ - ro
+ - ru
+ - sk
+ - sl
+ - sv
+ - tr
+ - uk
+ - vi
+ pipeline_tag: text-to-speech
+ library_name: supertonic
  tags:
  - text-to-speech
+ - tts
+ - speech-synthesis
  - onnx
  - quantized
+ - fp16
+ - int8
  - supertonic
+ - multilingual
+ - on-device
+ - diffusion
+ - flow-matching
  ---

+ # Supertonic-3 Quantized (ONNX)

+ Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacements for the official ONNX assets: same Python/C++/Node SDK, smaller and faster.
+
+ 31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).

  ## Variants

+ | Folder | Total size | Method | Quality | Use case |
+ |--------|---:|---|---|---|
+ | **`fp16/`** | **191 MB** | All 4 models float16 | Reference (≈99% of fp32) | Highest quality on CoreML/DirectML EP |
+ | **`int8/`** | **131 MB** | `vector_estimator` int8 dynamic + others fp16 (selective) | Near-identical to fp16 by ear | Smallest viable for production |
+
+ Both variants share `voice_styles/` (unchanged from upstream).
+
+ ### Why selective quantization for `int8/`?

+ Full dynamic int8 on all 4 models causes audible artifacts on `vocoder` (conv-based waveform generation) and `text_encoder` (attention/LayerNorm). Selective quantization applies int8 only to `vector_estimator` (a diffusion U-Net with built-in redundancy that tolerates weight-only int8), keeping the sensitive layers in fp16. This mirrors the production configuration used in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
+
+ | Model | Role | `int8/` precision | Sensitivity to int8 |
+ |---|---|---|---|
+ | `vector_estimator` | Diffusion U-Net (8× denoising) | **int8 dynamic** | Low (redundancy across steps) |
+ | `vocoder` | Vocos-style waveform decoder | fp16 | **High** (direct audio output) |
+ | `text_encoder` | Multilingual transformer | fp16 | High (attention + LayerNorm) |
+ | `duration_predictor` | Length regressor | fp16 | Low (but tiny, no win from int8) |

  ## Layout

 
  {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
  ```

+ - **`<variant>/onnx/`**: 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`). Filenames have no variant infix; the folder is the variant.
+ - **`voice_styles/`**: variant-independent voice embeddings, shared across all variants.
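
As a quick sanity check of the layout described above, a small Python sketch like the following could verify a downloaded tree. It assumes only what the bullets state (four `.onnx` files plus `tts.json` and `unicode_indexer.json` under `<variant>/onnx/`, and a shared `voice_styles/` folder). The function name `check_variant_layout` is hypothetical and not part of any Supertonic SDK.

```python
from pathlib import Path
from typing import List

REQUIRED_JSON = ("tts.json", "unicode_indexer.json")
EXPECTED_ONNX_COUNT = 4  # "4 ONNX weights" per the layout description


def check_variant_layout(root: Path, variant: str) -> List[str]:
    """Return a list of problems; an empty list means the layout looks complete."""
    onnx_dir = root / variant / "onnx"
    if not onnx_dir.is_dir():
        return [f"missing directory: {onnx_dir}"]

    problems = []
    onnx_files = sorted(onnx_dir.glob("*.onnx"))
    if len(onnx_files) != EXPECTED_ONNX_COUNT:
        problems.append(
            f"expected {EXPECTED_ONNX_COUNT} .onnx files, found {len(onnx_files)}"
        )
    for name in REQUIRED_JSON:
        if not (onnx_dir / name).is_file():
            problems.append(f"missing {name}")
    # voice_styles/ is variant-independent, so it sits next to the variant folders.
    if not (root / "voice_styles").is_dir():
        problems.append("missing shared voice_styles/ directory")
    return problems
```

For example, `check_variant_layout(Path("./supertonic"), "int8")` would return an empty list for a complete `int8` download.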

  ## Download

  ```bash
+ # fp16 variant (highest quality)
  hf download Kyumdroid/supertonic-3-quant \
  --include="fp16/onnx/**" --include="voice_styles/**" \
  --local-dir ./supertonic
+
+ # int8 variant (smallest, near-identical quality)
+ hf download Kyumdroid/supertonic-3-quant \
+ --include="int8/onnx/**" --include="voice_styles/**" \
+ --local-dir ./supertonic
  ```

+ `voice_styles/` is shared; if you fetch both variants, you only need it once.
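
The same selective download can probably be done from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns`; the helper names `variant_patterns` and `download_variant` are made up for illustration.

```python
from typing import List

REPO_ID = "Kyumdroid/supertonic-3-quant"


def variant_patterns(variant: str) -> List[str]:
    """Glob patterns for one variant folder plus the shared voice styles."""
    return [f"{variant}/onnx/**", "voice_styles/**"]


def download_variant(variant: str, local_dir: str = "./supertonic") -> str:
    """Fetch a single variant; returns the local folder path."""
    # Imported lazily so variant_patterns() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(variant),
        local_dir=local_dir,
    )
```

Calling `download_variant("int8")` should mirror the second `hf download` command above.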

+ ## Voice catalog
+
+ Display names follow the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):

+ | File | Name | Description |
+ |---|---|---|
+ | `M1.json` | **Alex** | Lively, upbeat male |
+ | `M2.json` | **James** | Deep, composed male |
+ | `M3.json` | **Robert** | Polished, authoritative male *(demo default)* |
+ | `M4.json` | **Sam** | Soft, neutral, youthful male |
+ | `M5.json` | **Daniel** | Warm, soothing male |
+ | `F1.json` | **Sarah** | Calm, steady female |
+ | `F2.json` | **Lily** | Bright, cheerful female |
+ | `F3.json` | **Jessica** | Broadcast-style female |
+ | `F4.json` | **Olivia** | Crisp, confident female |
+ | `F5.json` | **Emily** | Gentle, soothing female |
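
An application that exposes the display names rather than the file IDs might keep the table above as a lookup. This is a minimal sketch: the mapping is copied from the table, while `voice_style_path` and the default `./supertonic` root are illustrative assumptions.

```python
from pathlib import Path

# File ID -> display name, copied from the voice catalog table.
VOICE_NAMES = {
    "M1": "Alex", "M2": "James", "M3": "Robert", "M4": "Sam", "M5": "Daniel",
    "F1": "Sarah", "F2": "Lily", "F3": "Jessica", "F4": "Olivia", "F5": "Emily",
}


def voice_style_path(display_name: str, root: Path = Path("./supertonic")) -> Path:
    """Resolve a display name (case-insensitive) to its voice style JSON file."""
    wanted = display_name.strip().lower()
    for voice_id, name in VOICE_NAMES.items():
        if name.lower() == wanted:
            return root / "voice_styles" / f"{voice_id}.json"
    raise KeyError(f"unknown voice: {display_name!r}")
```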

  ## Conversion

+ - **`fp16/`**: `onnxruntime.transformers.float16.convert_float_to_float16` with `keep_io_types=True`, `op_block_list=['Cast']`, and ONNX shape inference applied first.
+ - **`int8/`**: `vector_estimator` only via `onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)`; others copied from the fp16 variant. Identical method to [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)'s `vector_estimator_int8.onnx`.
+
+ Conversion scripts available in the project repository.
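
The actual conversion scripts are not reproduced here. The following is only a rough sketch of the two steps listed above, assuming `onnx` and `onnxruntime` are installed. It also assumes the int8 step runs on the upstream fp32 `vector_estimator` model, which the text does not state explicitly. The names `INT8_VARIANT_PLAN`, `precision_for`, `to_fp16`, and `to_int8_dynamic` are hypothetical.

```python
# Per-model precision inside int8/, taken from the selective quantization table.
INT8_VARIANT_PLAN = {
    "vector_estimator": "int8",
    "vocoder": "fp16",
    "text_encoder": "fp16",
    "duration_predictor": "fp16",
}


def precision_for(model_name: str, variant: str) -> str:
    """fp16/ keeps every model in float16; int8/ follows the selective plan."""
    if variant == "fp16":
        return "fp16"
    if variant == "int8":
        return INT8_VARIANT_PLAN[model_name]
    raise ValueError(f"unknown variant: {variant!r}")


def to_fp16(src_path: str, dst_path: str) -> None:
    """Shape inference first, then float16 conversion with the listed options."""
    import onnx
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = shape_inference.infer_shapes(onnx.load(src_path))
    model_fp16 = convert_float_to_float16(
        model, keep_io_types=True, op_block_list=["Cast"]
    )
    onnx.save(model_fp16, dst_path)


def to_int8_dynamic(src_path: str, dst_path: str) -> None:
    """Dynamic int8 quantization with per-channel QInt8 weights."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        src_path, dst_path, per_channel=True, weight_type=QuantType.QInt8
    )
```

The heavy imports sit inside the functions so that the plan itself can be inspected without the ONNX toolchain installed.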
+
+ ## Performance (Apple Silicon CPU, M-series)
+
+ Short Korean utterance ("안녕하세요. 오늘 날씨가 정말 좋네요.", i.e. "Hello. The weather is really nice today."), CPU EP only:
+
+ | Variant | Size | Synthesis time | Quality (auditory) |
+ |---|---:|---:|---|
+ | fp32 baseline (upstream) | 380 MB | ~0.7 s | Reference |
+ | **fp16/** | 191 MB | ~0.7 s | Indistinguishable from fp32 |
+ | **int8/** | 131 MB | ~0.7-5 s | Indistinguishable from fp16 |
+
+ > CPU EP performs int8 weight-only as fp32 dequant + matmul, so int8 is not faster on CPU. Use CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration; int8/fp16 then run faster than fp32 with significantly lower memory.
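
To act on the note above, an ONNX Runtime session can be asked to prefer an accelerated execution provider and fall back to CPU. The sketch below uses the standard provider names (`CoreMLExecutionProvider`, `DmlExecutionProvider`, `CPUExecutionProvider`); the helpers `pick_providers` and `open_session` are illustrative, and the official Supertonic SDK may configure providers differently.

```python
from typing import Iterable, List

# Preference order: accelerated providers first, CPU as the fallback.
PREFERRED_PROVIDERS = (
    "CoreMLExecutionProvider",  # macOS
    "DmlExecutionProvider",     # Windows (DirectML)
    "CPUExecutionProvider",
)


def pick_providers(available: Iterable[str]) -> List[str]:
    """Keep the preferred providers that are actually available, in order."""
    available = set(available)
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    return chosen or ["CPUExecutionProvider"]


def open_session(model_path: str):
    """Create an InferenceSession using the best available provider."""
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```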

  ## License

+ OpenRAIL-M, inherited from [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3). See [LICENSE](./LICENSE).
+
+ Use restrictions (Attachment A) apply: no impersonation/deepfakes without consent, no AI-generated content without disclosure, no medical advice, no illegal activities, etc.

  ## Credits

+ - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
+ - Reference quantization pattern: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
+ - Quantization (this repo): selective fp16/int8 ONNX for Electron / desktop on-device deployment