
@@ -43,7 +43,6 @@ tags:
 - onnx
 - quantized
 - fp16
-- int8
 - supertonic
 - multilingual
 - on-device
@@ -53,7 +52,7 @@ tags:
 # Supertonic-3 Quantized (ONNX)
-Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacements for the official ONNX assets — same Python/C++/Node SDK, smaller and faster.
 31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).
@@ -61,26 +60,24 @@ Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Su
 | Folder | Total size | Method | Quality | Use case |
 |--------|---:|---|---|---|
-| **`fp16/`** | **191 MB** | All 4 models float16 | Reference (≈99% of fp32) | Highest quality on CoreML/DirectML EP |
-| **`int8/`** | **131 MB** | `vector_estimator` int8 dynamic + others fp16 (selective) | Near-identical to fp16 by ear | Smallest viable for production |
-Both variants share `voice_styles/` (unchanged from upstream).
-### Why selective quantization for `int8/`?
-Full dynamic int8 on all 4 models causes audible artifacts on `vocoder` (conv-based waveform generation) and `text_encoder` (attention/LayerNorm). Selective quantization applies int8 only to `vector_estimator` (a diffusion U-Net with built-in redundancy that tolerates weight-only int8), keeping the sensitive layers in fp16. This mirrors the production configuration used in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
-| Model | Role | `int8/` precision | Sensitivity to int8 |
-|---|---|---|---|
-| `vector_estimator` | Diffusion U-Net (8× denoising) | **int8 dynamic** | Low (redundancy across steps) |
-| `vocoder` | Vocos-style waveform decoder | fp16 | **High** (direct audio output) |
-| `text_encoder` | Multilingual transformer | fp16 | High (attention + LayerNorm) |
-| `duration_predictor` | Length regressor | fp16 | Low (but tiny, no win from int8) |
 ## Layout
 ```
-<variant>/onnx/
     text_encoder.onnx
     duration_predictor.onnx
     vector_estimator.onnx
@@ -92,28 +89,20 @@ voice_styles/
     {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```
-- **`<variant>/onnx/`** — 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`). Filenames have no variant infix — the folder is the variant.
-- **`voice_styles/`** — variant-independent voice embeddings, shared across all variants.
 ## Download
 ```bash
-# fp16 variant (highest quality)
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
-# int8 variant (smallest, near-identical quality)
-hf download Kyumdroid/supertonic-3-quant \
-  --include="int8/onnx/**" --include="voice_styles/**" \
-  --local-dir ./supertonic
 ```
-`voice_styles/` is shared — if you fetch both variants, you only need it once.
 ## Voice catalog
-Display names follow the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):
 | File | Name | Description |
 |---|---|---|
@@ -130,22 +119,23 @@ Display names follow the official [Supertonic demo Space](https://huggingface.co
 ## Conversion
-- **`fp16/`** — `onnxruntime.transformers.float16.convert_float_to_float16` with `keep_io_types=True`, `op_block_list=['Cast']`, and ONNX shape inference applied first.
-- **`int8/`** — `vector_estimator` only via `onnxruntime.quantization.quantize_dynamic(QInt8, per_channel=True)`; others copied from the fp16 variant. Identical method to [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)'s `vector_estimator_int8.onnx`.
-Conversion scripts available in the project repository.
-## Performance (Apple Silicon CPU, M-series)
-Short Korean utterance ("안녕하세요. 오늘 날씨가 정말 좋네요."), CPU EP only:
-| Variant | Size | Synthesis time | Quality (auditory) |
-|---|---:|---:|---|
-| fp32 baseline (upstream) | 380 MB | ~0.7 s | Reference |
-| **fp16/** | 191 MB | ~0.7 s | Indistinguishable from fp32 |
-| **int8/** | 131 MB | ~0.7-5 s | Indistinguishable from fp16 |
-> CPU EP performs int8 weight-only as fp32 dequant + matmul, so int8 is not faster on CPU. Use CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration — int8/fp16 then run faster than fp32 with significantly lower memory.
 ## License
@@ -156,5 +146,4 @@ Use restrictions (Attachment A) apply: no impersonation/deepfakes without consen
 ## Credits
 - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
-- Reference quantization pattern: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
-- Quantization (this repo): selective fp16/int8 ONNX for Electron / desktop on-device deployment

 - onnx
 - quantized
 - fp16
 - supertonic
 - multilingual
 - on-device
 # Supertonic-3 Quantized (ONNX)
+Quantized ONNX derivative of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacement for the official ONNX assets — same Python / C++ / Node SDK, smaller weights.
 31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).
 | Folder | Total size | Method | Quality | Use case |
 |--------|---:|---|---|---|
+| **`fp16/`** | **191 MB** | All 4 models float16 (`onnxruntime.transformers.float16`) | ≈99% of fp32 | On-device desktop/mobile, ORT/CoreML/DirectML |
+`voice_styles/` is shared and unchanged from upstream.
+### Why no int8 variant?
+Dynamic int8 quantization was tested on `vector_estimator` (the largest model, a ConvNeXt-based diffusion U-Net), but the quantized model contains `ConvInteger` nodes, which are **not implemented in many ORT CPU builds**:
+- Common error: `NOT_IMPLEMENTED: Could not find an implementation for ConvInteger(10) node`
+- Affects: `onnxruntime-node`, minimal builds, older ORT versions, some mobile builds
+Restricting dynamic quantization to MatMul ops (skipping Conv) yields only a ~6% size reduction because `vector_estimator` is Conv-dominated. Static int8 (QDQ) with calibration should avoid `ConvInteger` and run on standard ORT builds, but it requires capturing intermediate diffusion states as calibration data, which is out of scope for this repo.
+For now, `fp16` is the recommended on-device variant: universal ORT compatibility, near-lossless quality, ~50% smaller than fp32.
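To check whether a given quantized model would hit the `ConvInteger` problem before shipping it, the graph's op types can be inspected directly. The following is a minimal sketch, not part of this repo: the helper names `count_integer_ops` and `op_types_in_model` are hypothetical, and it only looks at top-level graph nodes (subgraphs inside control-flow ops are not traversed).

```python
from collections import Counter
from typing import Dict, Iterable, List

# Integer ops that dynamic quantization can emit. ConvInteger is the one
# reported as unimplemented above; MatMulInteger is listed for comparison.
INTEGER_OPS = ("ConvInteger", "MatMulInteger")


def count_integer_ops(op_types: Iterable[str]) -> Dict[str, int]:
    """Return how often each integer op appears in a list of op types."""
    counts = Counter(op_types)
    return {op: counts[op] for op in INTEGER_OPS if counts[op] > 0}


def op_types_in_model(model_path: str) -> List[str]:
    """Collect the op type of every top-level node in an ONNX file."""
    import onnx  # imported lazily so the counting helper works without onnx

    # Only the graph structure is needed, so external weight data is skipped.
    model = onnx.load(model_path, load_external_data=False)
    return [node.op_type for node in model.graph.node]
```

For example, `count_integer_ops(op_types_in_model("vector_estimator.onnx"))` returning a non-empty dict with a `ConvInteger` key would indicate the model depends on that op.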
 ## Layout
 ```
+fp16/onnx/
     text_encoder.onnx
     duration_predictor.onnx
     vector_estimator.onnx
     {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```
+- **`fp16/onnx/`** — 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`).
+- **`voice_styles/`** — voice embeddings, identical to upstream.
 ## Download
 ```bash
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
 ```
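The same download can likely be done from Python with `huggingface_hub.snapshot_download`, which accepts glob patterns via `allow_patterns`. This is a sketch under that assumption; the helper names `download_kwargs` and `download` are hypothetical and only mirror the CLI call above.

```python
from typing import Any, Dict


def download_kwargs(local_dir: str = "./supertonic") -> Dict[str, Any]:
    """Build the arguments that mirror the `hf download` command above."""
    return {
        "repo_id": "Kyumdroid/supertonic-3-quant",
        # Same include patterns as the CLI example.
        "allow_patterns": ["fp16/onnx/**", "voice_styles/**"],
        "local_dir": local_dir,
    }


def download(local_dir: str = "./supertonic") -> str:
    """Fetch the fp16 weights and voice styles; returns the local folder path."""
    from huggingface_hub import snapshot_download  # lazy: needs network access

    return snapshot_download(**download_kwargs(local_dir))
```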
 ## Voice catalog
+Display names from the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):
 | File | Name | Description |
 |---|---|---|
 ## Conversion
+`fp16/` was produced via `onnxruntime.transformers.float16.convert_float_to_float16` with:
+- `keep_io_types=True` (fp32 IO for SDK compatibility)
+- `op_block_list=['Cast']` (avoid Cast type mismatch)
+- ONNX `shape_inference.infer_shapes_path` applied to upstream fp32 first
+The conversion script is available in the project repository.
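The settings listed above can be sketched roughly as follows. This is not the repo's actual conversion script: the function names, the temporary `.inferred.onnx` file, and the `fp16/onnx` default output folder are assumptions made for illustration. Only the library calls and the two keyword arguments come from the description above.

```python
from pathlib import Path


def fp16_output_path(fp32_path: str, out_dir: str = "fp16/onnx") -> Path:
    """Keep the upstream filename; only the folder marks the variant."""
    return Path(out_dir) / Path(fp32_path).name


def convert_to_fp16(fp32_path: str, out_dir: str = "fp16/onnx") -> Path:
    """Convert one upstream fp32 ONNX file to float16."""
    # Lazy imports so the path helper is usable without onnx/onnxruntime.
    import onnx
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    out_path = fp16_output_path(fp32_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: run shape inference on the fp32 model, writing a temporary file.
    inferred_path = out_path.with_name(out_path.stem + ".inferred.onnx")
    shape_inference.infer_shapes_path(str(fp32_path), str(inferred_path))

    # Step 2: convert to float16, keeping fp32 inputs/outputs and leaving
    # Cast nodes untouched, as described above.
    model = onnx.load(str(inferred_path))
    model_fp16 = convert_float_to_float16(
        model, keep_io_types=True, op_block_list=["Cast"]
    )
    onnx.save(model_fp16, str(out_path))

    inferred_path.unlink()  # remove the temporary shape-inferred copy
    return out_path
```

Calling `convert_to_fp16` once per upstream `.onnx` file would produce the four files under `fp16/onnx/`.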
+## Performance (Apple Silicon CPU)
+Short Korean utterance, ORT CPU EP only:
+| Variant | Size | Synthesis time |
+|---|---:|---:|
+| fp32 baseline (upstream) | 380 MB | ~0.7 s |
+| **fp16** | 191 MB | ~0.7 s |
+The CPU EP upcasts fp16 to fp32 for computation, so wall-clock time is similar to fp32. Use **CoreML EP** (macOS) or **DirectML EP** (Windows) for fp16-native acceleration: 2-3× faster with ~50% lower RAM.
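One way to apply this recommendation is to prefer an accelerated execution provider when the installed ONNX Runtime build offers one, and fall back to CPU otherwise. The sketch below assumes the standard ORT provider names; `pick_providers` and `make_session` are hypothetical helpers, not part of the Supertonic SDK.

```python
from typing import Iterable, List

# Preference order: CoreML (macOS), DirectML (Windows), then CPU fallback.
PREFERRED = (
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Iterable[str]) -> List[str]:
    """Order the available providers by preference, always ending on CPU."""
    available = set(available)
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]


def make_session(model_path: str):
    """Create an inference session using the best provider in this build."""
    import onnxruntime as ort  # lazy import; pick_providers has no ORT dependency

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Keeping `CPUExecutionProvider` last in the list lets ORT fall back to CPU for any node the accelerated provider does not support.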
 ## License
 ## Credits
 - Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
+- Quantization (this repo): fp16 ONNX for Electron / desktop on-device deployment