Commit d9bfd0a by Kyumdroid · verified · 1 Parent(s): 049b896

Simplify README: fp16-only variant. Document int8 ConvInteger limitation.

Files changed (1)
  1. README.md +27 -38
README.md CHANGED
@@ -43,7 +43,6 @@ tags:
43
  - onnx
44
  - quantized
45
  - fp16
46
- - int8
47
  - supertonic
48
  - multilingual
49
  - on-device
@@ -53,7 +52,7 @@ tags:
53
 
54
  # Supertonic-3 Quantized (ONNX)
55
 
56
- Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacements for the official ONNX assets: same Python/C++/Node SDK, smaller and faster.
57
 
58
  31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).
59
 
@@ -61,26 +60,24 @@ Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Su
61
 
62
  | Folder | Total size | Method | Quality | Use case |
63
  |--------|---:|---|---|---|
64
- | **`fp16/`** | **191 MB** | All 4 models float16 | Reference (≈99% of fp32) | Highest quality on CoreML/DirectML EP |
65
- | **`int8/`** | **131 MB** | `vector_estimator` int8 dynamic + others fp16 (selective) | Near-identical to fp16 by ear | Smallest viable for production |
66
 
67
- Both variants share `voice_styles/` (unchanged from upstream).
68
 
69
- ### Why selective quantization for `int8/`?
70
 
71
- Full dynamic int8 on all 4 models causes audible artifacts on `vocoder` (conv-based waveform generation) and `text_encoder` (attention/LayerNorm). Selective quantization applies int8 only to `vector_estimator` (a diffusion U-Net with built-in redundancy that tolerates weight-only int8), keeping the sensitive layers in fp16. This mirrors the production configuration used in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
 
 
72
 
73
- | Model | Role | `int8/` precision | Sensitivity to int8 |
74
- |---|---|---|---|
75
- | `vector_estimator` | Diffusion U-Net (8× denoising) | **int8 dynamic** | Low (redundancy across steps) |
76
- | `vocoder` | Vocos-style waveform decoder | fp16 | **High** (direct audio output) |
77
- | `text_encoder` | Multilingual transformer | fp16 | High (attention + LayerNorm) |
78
- | `duration_predictor` | Length regressor | fp16 | Low (but tiny, no win from int8) |
79
 
80
  ## Layout
81
 
82
  ```
83
- <variant>/onnx/
84
  text_encoder.onnx
85
  duration_predictor.onnx
86
  vector_estimator.onnx
@@ -92,28 +89,20 @@ voice_styles/
92
  {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
93
  ```
94
 
95
- - **`<variant>/onnx/`**: 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`). Filenames have no variant infix; the folder is the variant.
96
- - **`voice_styles/`**: variant-independent voice embeddings, shared across all variants.
97
 
98
  ## Download
99
 
100
  ```bash
101
- # fp16 variant (highest quality)
102
  hf download Kyumdroid/supertonic-3-quant \
103
  --include="fp16/onnx/**" --include="voice_styles/**" \
104
  --local-dir ./supertonic
105
-
106
- # int8 variant (smallest, near-identical quality)
107
- hf download Kyumdroid/supertonic-3-quant \
108
- --include="int8/onnx/**" --include="voice_styles/**" \
109
- --local-dir ./supertonic
110
  ```
111
 
112
- `voice_styles/` is shared: if you fetch both variants, you only need it once.
113
-
114
  ## Voice catalog
115
 
116
- Display names follow the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):
117
 
118
  | File | Name | Description |
119
  |---|---|---|
@@ -130,22 +119,23 @@ Display names follow the official [Supertonic demo Space](https://huggingface.co
130
 
131
  ## Conversion
132
 
133
- - **`fp16/`**: `onnxruntime.transformers.float16.convert_float_to_float16` with `keep_io_types=True`, `op_block_list=['Cast']`, and ONNX shape inference applied first.
134
- - **`int8/`**: `vector_estimator` only via `onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)`; others copied from the fp16 variant. Identical method to [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)'s `vector_estimator_int8.onnx`.
 
 
135
 
136
- Conversion scripts available in the project repository.
137
 
138
- ## Performance (Apple Silicon CPU, M-series)
139
 
140
- Short Korean utterance ("안녕하세요. 오늘 날씨가 정말 좋네요.", "Hello. The weather is really nice today."), CPU EP only:
141
 
142
- | Variant | Size | Synthesis time | Quality (auditory) |
143
- |---|---:|---:|---|
144
- | fp32 baseline (upstream) | 380 MB | ~0.7 s | Reference |
145
- | **fp16/** | 191 MB | ~0.7 s | Indistinguishable from fp32 |
146
- | **int8/** | 131 MB | ~0.7-5 s | Indistinguishable from fp16 |
147
 
148
- > CPU EP performs int8 weight-only as fp32 dequant + matmul, so int8 is not faster on CPU. Use CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration; int8/fp16 then run faster than fp32 with significantly lower memory.
149
 
150
  ## License
151
 
@@ -156,5 +146,4 @@ Use restrictions (Attachment A) apply: no impersonation/deepfakes without consen
156
  ## Credits
157
 
158
  - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
159
- - Reference quantization pattern: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
160
- - Quantization (this repo): selective fp16/int8 ONNX for Electron / desktop on-device deployment
 
43
  - onnx
44
  - quantized
45
  - fp16
 
46
  - supertonic
47
  - multilingual
48
  - on-device
 
52
 
53
  # Supertonic-3 Quantized (ONNX)
54
 
55
+ Quantized ONNX derivative of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacement for the official ONNX assets: same Python / C++ / Node SDK, smaller weights.
56
 
57
  31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).
58
 
 
60
 
61
  | Folder | Total size | Method | Quality | Use case |
62
  |--------|---:|---|---|---|
63
+ | **`fp16/`** | **191 MB** | All 4 models float16 (`onnxruntime.transformers.float16`) | ≈99% of fp32 | On-device desktop/mobile, ORT/CoreML/DirectML |
 
64
 
65
+ `voice_styles/` is shared and unchanged from upstream.
66
 
67
+ ### Why no int8 variant?
68
 
69
+ Dynamic int8 quantization was tested on `vector_estimator` (the largest model, a ConvNeXt-based diffusion U-Net), but the resulting model emits `ConvInteger` nodes, which are **not implemented in many ORT CPU builds**:
70
+ - Common error: `NOT_IMPLEMENTED: Could not find an implementation for ConvInteger(10) node`
71
+ - Affects: `onnxruntime-node`, minimal builds, older ORT versions, some mobile builds
72
 
73
+ Restricting dynamic quantization to MatMul ops (skipping Conv) gives only ~6% size reduction because `vector_estimator` is Conv-dominated. Static int8 (QDQ) with calibration would work universally but requires capturing intermediate diffusion states, which is out of scope for this repo.
74
+
75
+ For now, `fp16` is the recommended on-device variant: universal ORT compatibility, near-lossless quality, ~50% smaller than fp32.
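One way to catch the `ConvInteger` problem before shipping a quantized model is to scan its op types. The following is a minimal sketch, assuming the `onnx` Python package is installed; `RISKY_OPS`, `count_risky_ops`, and `scan_model` are illustrative names, not part of this repo, and only the top-level graph is inspected.

```python
from collections import Counter
from typing import Iterable

# Op types reported above as missing from many ORT CPU builds.
# Assumption: only ConvInteger is tracked; extend the set for your runtime.
RISKY_OPS = frozenset({"ConvInteger"})


def count_risky_ops(op_types: Iterable[str], risky=RISKY_OPS) -> Counter:
    """Count op types that the target runtime may not implement."""
    return Counter(op for op in op_types if op in risky)


def scan_model(path: str) -> Counter:
    """Load an ONNX file and count risky ops in its top-level graph.

    Nested subgraphs (If/Loop bodies) are not traversed in this sketch.
    """
    import onnx  # lazy import so count_risky_ops works without onnx installed

    model = onnx.load(path, load_external_data=False)
    return count_risky_ops(node.op_type for node in model.graph.node)
```

Calling `scan_model` on a dynamically quantized `vector_estimator` file should return a non-empty counter if the `NOT_IMPLEMENTED` error above is likely on restricted ORT builds.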
 
 
 
76
 
77
  ## Layout
78
 
79
  ```
80
+ fp16/onnx/
81
  text_encoder.onnx
82
  duration_predictor.onnx
83
  vector_estimator.onnx
 
89
  {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
90
  ```
91
 
92
+ - **`fp16/onnx/`**: 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`).
93
+ - **`voice_styles/`**: voice embeddings, identical to upstream.
94
 
95
  ## Download
96
 
97
  ```bash
 
98
  hf download Kyumdroid/supertonic-3-quant \
99
  --include="fp16/onnx/**" --include="voice_styles/**" \
100
  --local-dir ./supertonic
 
 
 
 
 
101
  ```
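The same download can be scripted from Python. This is a sketch that assumes the `huggingface_hub` package is installed; `ALLOW_PATTERNS`, `would_download`, and `download_fp16` are illustrative names, and `would_download` only approximates the library's pattern filtering with `fnmatch`.

```python
from fnmatch import fnmatch

# Same include patterns as the CLI command above.
ALLOW_PATTERNS = ["fp16/onnx/**", "voice_styles/**"]


def would_download(repo_path: str) -> bool:
    """Approximate check of whether a repo file matches the include patterns."""
    return any(fnmatch(repo_path, pattern) for pattern in ALLOW_PATTERNS)


def download_fp16(local_dir: str = "./supertonic") -> str:
    """Fetch the fp16 weights and voice styles; returns the local folder path."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(
        repo_id="Kyumdroid/supertonic-3-quant",
        allow_patterns=ALLOW_PATTERNS,
        local_dir=local_dir,
    )
```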
102
 
 
 
103
  ## Voice catalog
104
 
105
+ Display names from the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):
106
 
107
  | File | Name | Description |
108
  |---|---|---|
 
119
 
120
  ## Conversion
121
 
122
+ `fp16/` was produced via `onnxruntime.transformers.float16.convert_float_to_float16` with:
123
+ - `keep_io_types=True` (fp32 IO for SDK compatibility)
124
+ - `op_block_list=['Cast']` (avoid Cast type mismatch)
125
+ - ONNX `shape_inference.infer_shapes_path` applied to upstream fp32 first
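The steps above can be sketched roughly as follows. This is not the repo's actual conversion script; it assumes `onnx` and `onnxruntime` are installed and that the upstream fp32 model stores its weights inside the `.onnx` file (no external data). `CONVERT_KWARGS` and `convert_to_fp16` are illustrative names.

```python
import os
import tempfile

# Options listed above, passed straight to convert_float_to_float16.
CONVERT_KWARGS = {
    "keep_io_types": True,      # keep fp32 inputs/outputs for SDK compatibility
    "op_block_list": ["Cast"],  # leave Cast nodes untouched
}


def convert_to_fp16(src_path: str, dst_path: str) -> None:
    """Run shape inference on an fp32 model, then convert it to float16."""
    import onnx  # lazy imports so the module loads without these packages
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    with tempfile.TemporaryDirectory() as tmp:
        inferred_path = os.path.join(tmp, "inferred.onnx")
        # Write a shape-annotated copy of the fp32 model to a temp file.
        shape_inference.infer_shapes_path(src_path, inferred_path)
        model = onnx.load(inferred_path)

    model_fp16 = convert_float_to_float16(model, **CONVERT_KWARGS)
    onnx.save(model_fp16, dst_path)
```

Note that passing `op_block_list` replaces the converter's default block list rather than extending it, so only `Cast` is kept out of the fp16 conversion here.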
126
 
127
+ Conversion script available in the project repository.
128
 
129
+ ## Performance (Apple Silicon CPU)
130
 
131
+ Short Korean utterance, ORT CPU EP only:
132
 
133
+ | Variant | Size | Synthesis time |
134
+ |---|---:|---:|
135
+ | fp32 baseline (upstream) | 380 MB | ~0.7 s |
136
+ | **fp16** | 191 MB | ~0.7 s |
 
137
 
138
+ CPU EP performs fp16 as fp32 upcast, so wall-clock time is similar. Use **CoreML EP** (macOS) or **DirectML EP** (Windows) for fp16-native acceleration: 2-3× faster and ~50% lower RAM.
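Selecting an execution provider in Python might look like the sketch below. It assumes an `onnxruntime` build that includes the CoreML or DirectML provider; `pick_providers` and `make_session` are illustrative helpers, and actual speedups depend on which ops the provider supports for these models.

```python
# Provider names as registered in ONNX Runtime.
ACCELERATED = ("CoreMLExecutionProvider", "DmlExecutionProvider")
CPU = "CPUExecutionProvider"


def pick_providers(available):
    """Prefer CoreML/DirectML when present, always falling back to CPU."""
    chosen = [name for name in ACCELERATED if name in available]
    chosen.append(CPU)
    return chosen


def make_session(model_path: str):
    """Create an inference session with the best available provider."""
    import onnxruntime as ort  # lazy import

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Keeping the CPU provider last in the list lets ORT fall back for any node the accelerated provider cannot run.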
139
 
140
  ## License
141
 
 
146
  ## Credits
147
 
148
  - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
149
+ - Quantization (this repo): fp16 ONNX for Electron / desktop on-device deployment