@@ -2,24 +2,80 @@
 license: openrail
 base_model: Supertone/supertonic-3
 base_model_relation: quantized
 tags:
 - text-to-speech
 - onnx
 - quantized
 - supertonic
 ---
-# Supertonic-3 Quantized
-Quantized derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS.
 ## Variants
-| Folder | Format | Notes |
-|--------|--------|-------|
-| `fp16/` | ONNX fp16 | float16 weights, CPU-friendly |
-More variants (`int8/`, `int4/`, `mixed/`) may be added later as sibling folders.
 ## Layout
@@ -36,38 +92,69 @@ voice_styles/
     {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```
-- **`<variant>/onnx/`** — variant-scoped: 4 ONNX weights + the architecture config (`tts.json`) and tokenizer table (`unicode_indexer.json`). Both JSON files live next to the weights because future variants may carry a different `latent_dim` / `n_style` / unicode coverage. Filenames carry no variant infix — the folder is the variant.
-- **`voice_styles/`** — variant-independent voice embeddings shared by every variant. Quantizing the ONNX graph does not change the style vectors, so the same files work across `fp16`, `int8`, `int4`, and any future mixed variant.
 ## Download
-Snapshot the `fp16` variant only:
 ```bash
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
 ```
-Resulting tree (16 files, ~203 MB):
-```
-supertonic/
-  fp16/onnx/...        # 4 ONNX + tts.json + unicode_indexer.json
-  voice_styles/...     # 10 JSON
-```
-When a new variant is published (e.g. `int8`), only the `<new-variant>/onnx/` tree needs to be fetched — `voice_styles/` is reused.
 ## Conversion
-`fp16/` was produced via `onnxconverter_common.float16.convert_float_to_float16_model_path` against the fp32 ONNX models in `Supertone/supertonic-3`.
 ## License
-OpenRAIL-M (inherited from base model). See [LICENSE](./LICENSE).
 ## Credits
-- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)
-- Quantization: this repo

 license: openrail
 base_model: Supertone/supertonic-3
 base_model_relation: quantized
+language:
+- en
+- ko
+- ja
+- ar
+- bg
+- cs
+- da
+- de
+- el
+- es
+- et
+- fi
+- fr
+- hi
+- hr
+- hu
+- id
+- it
+- lt
+- lv
+- nl
+- pl
+- pt
+- ro
+- ru
+- sk
+- sl
+- sv
+- tr
+- uk
+- vi
+pipeline_tag: text-to-speech
+library_name: supertonic
 tags:
 - text-to-speech
+- tts
+- speech-synthesis
 - onnx
 - quantized
+- fp16
+- int8
 - supertonic
+- multilingual
+- on-device
+- diffusion
+- flow-matching
 ---
+# Supertonic-3 Quantized (ONNX)
+Quantized ONNX derivatives of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. They are drop-in replacements for the official ONNX assets: the same Python/C++/Node SDK loads them, the files are smaller, and inference is faster on execution providers with native fp16 support (CPU timings are unchanged; see Performance).
+Supports 31 languages: en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi.
 ## Variants
+| Folder | Total size | Method | Quality | Use case |
+|--------|---:|---|---|---|
+| **`fp16/`** | **191 MB** | All 4 models float16 | Reference (≈99% of fp32) | Highest quality on CoreML/DirectML EP |
+| **`int8/`** | **131 MB** | `vector_estimator` int8 dynamic + others fp16 (selective) | Near-identical to fp16 by ear | Smallest viable for production |
+Both variants share `voice_styles/` (unchanged from upstream).
+### Why selective quantization for `int8/`?
+Full dynamic int8 on all 4 models causes audible artifacts on `vocoder` (conv-based waveform generation) and `text_encoder` (attention/LayerNorm). Selective quantization applies int8 only to `vector_estimator` (a diffusion U-Net with built-in redundancy that tolerates weight-only int8), keeping the sensitive layers in fp16. This mirrors the production configuration used in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
+| Model | Role | `int8/` precision | Sensitivity to int8 |
+|---|---|---|---|
+| `vector_estimator` | Diffusion U-Net (8× denoising) | **int8 dynamic** | Low (redundancy across steps) |
+| `vocoder` | Vocos-style waveform decoder | fp16 | **High** (direct audio output) |
+| `text_encoder` | Multilingual transformer | fp16 | High (attention + LayerNorm) |
+| `duration_predictor` | Length regressor | fp16 | Low (but tiny, no win from int8) |
 ## Layout
     {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
 ```
+- **`<variant>/onnx/`** — 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`). Filenames have no variant infix — the folder is the variant.
+- **`voice_styles/`** — variant-independent voice embeddings, shared across all variants.
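A minimal sketch of a local sanity check for this layout. The ONNX filenames (`<model>.onnx`, built from the four model names in the Variants section) are an assumption, since this README does not spell them out; `tts.json`, `unicode_indexer.json`, and the ten voice files are taken from the layout described here.

```python
from pathlib import Path

# Assumption: each model is stored as "<model>.onnx"; the README only says
# that filenames carry no variant infix.
ONNX_MODELS = ["duration_predictor", "text_encoder", "vector_estimator", "vocoder"]
CONFIG_FILES = ["tts.json", "unicode_indexer.json"]
VOICES = [f"{gender}{index}" for gender in ("F", "M") for index in range(1, 6)]


def expected_files(variant: str) -> list[str]:
    """Relative paths that a complete download of one variant should contain."""
    onnx_dir = f"{variant}/onnx"
    paths = [f"{onnx_dir}/{name}.onnx" for name in ONNX_MODELS]
    paths += [f"{onnx_dir}/{name}" for name in CONFIG_FILES]
    paths += [f"voice_styles/{voice}.json" for voice in VOICES]
    return paths


def missing_files(root: str, variant: str) -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [path for path in expected_files(variant) if not (base / path).is_file()]
```

Calling `missing_files("./supertonic", "fp16")` after a download should return an empty list if the filename assumption holds for this repository.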
 ## Download
 ```bash
+# fp16 variant (highest quality)
 hf download Kyumdroid/supertonic-3-quant \
   --include="fp16/onnx/**" --include="voice_styles/**" \
   --local-dir ./supertonic
+# int8 variant (smallest, near-identical quality)
+hf download Kyumdroid/supertonic-3-quant \
+  --include="int8/onnx/**" --include="voice_styles/**" \
+  --local-dir ./supertonic
 ```
+`voice_styles/` is shared — if you fetch both variants, you only need it once.
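The same selection can be expressed from Python. This is a sketch that assumes the `huggingface_hub` package is installed; `allow_patterns` and `download` are hypothetical helper names, while `snapshot_download` and its `allow_patterns` / `local_dir` parameters belong to `huggingface_hub`.

```python
REPO_ID = "Kyumdroid/supertonic-3-quant"


def allow_patterns(variants: list[str]) -> list[str]:
    """Glob patterns for the chosen variants plus the shared voice styles, listed once."""
    patterns = [f"{variant}/onnx/**" for variant in variants]
    patterns.append("voice_styles/**")
    return patterns


def download(variants: list[str], local_dir: str = "./supertonic") -> str:
    """Fetch the selected variants and return the local snapshot path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns(variants),
        local_dir=local_dir,
    )
```

For example, `download(["fp16", "int8"])` would request both variant folders and `voice_styles/` in a single call.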
+## Voice catalog
+Display names follow the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):
+| File | Name | Description |
+|---|---|---|
+| `M1.json` | **Alex** | Lively, upbeat male |
+| `M2.json` | **James** | Deep, composed male |
+| `M3.json` | **Robert** | Polished, authoritative male *(demo default)* |
+| `M4.json` | **Sam** | Soft, neutral, youthful male |
+| `M5.json` | **Daniel** | Warm, soothing male |
+| `F1.json` | **Sarah** | Calm, steady female |
+| `F2.json` | **Lily** | Bright, cheerful female |
+| `F3.json` | **Jessica** | Broadcast-style female |
+| `F4.json` | **Olivia** | Crisp, confident female |
+| `F5.json` | **Emily** | Gentle, soothing female |
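For applications that expose the display names, a small lookup table may be convenient. The sketch below copies the mapping from the table above; `voice_file` and the default `./supertonic` root are illustrative assumptions, not part of any SDK.

```python
# Display name -> style file, as listed in the voice catalog.
VOICE_FILES = {
    "Alex": "M1.json",
    "James": "M2.json",
    "Robert": "M3.json",
    "Sam": "M4.json",
    "Daniel": "M5.json",
    "Sarah": "F1.json",
    "Lily": "F2.json",
    "Jessica": "F3.json",
    "Olivia": "F4.json",
    "Emily": "F5.json",
}


def voice_file(name: str, root: str = "./supertonic") -> str:
    """Resolve a display name (case-insensitive) to its voice style path."""
    key = name.strip().title()
    if key not in VOICE_FILES:
        raise KeyError(f"unknown voice {name!r}; choose from {sorted(VOICE_FILES)}")
    return f"{root}/voice_styles/{VOICE_FILES[key]}"
```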
 ## Conversion
+- **`fp16/`**: `onnxruntime.transformers.float16.convert_float_to_float16` with `keep_io_types=True` and `op_block_list=['Cast']`, run after ONNX shape inference.
+- **`int8/`**: `vector_estimator` only, via `onnxruntime.quantization.quantize_dynamic(..., weight_type=QuantType.QInt8, per_channel=True)`; the other three models are copied from the fp16 variant. This is the same method used for `vector_estimator_int8.onnx` in [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert).
+Conversion scripts are available in the project repository.
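The two conversion paths could look roughly like the sketch below. It is not the project's actual script: the `<model>.onnx` filenames, the directory arguments, and the choice to quantize `vector_estimator` from the fp32 source are assumptions. The `onnx` and `onnxruntime` packages are required to run the conversion functions.

```python
import shutil
from pathlib import Path

MODELS = ["duration_predictor", "text_encoder", "vector_estimator", "vocoder"]
INT8_MODELS = {"vector_estimator"}  # the only model quantized to int8


def precision_plan() -> dict[str, str]:
    """Precision of each model inside the int8/ variant."""
    return {name: ("int8" if name in INT8_MODELS else "fp16") for name in MODELS}


def to_fp16(src: Path, dst: Path) -> None:
    """Shape inference first, then float16 conversion with fp32 inputs/outputs kept."""
    import onnx
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = onnx.shape_inference.infer_shapes(onnx.load(str(src)))
    fp16_model = convert_float_to_float16(
        model, keep_io_types=True, op_block_list=["Cast"]
    )
    onnx.save(fp16_model, str(dst))


def to_int8(src: Path, dst: Path) -> None:
    """Dynamic int8 quantization with per-channel weight scales."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(str(src), str(dst), per_channel=True, weight_type=QuantType.QInt8)


def build_variants(fp32_dir: Path, out_dir: Path) -> None:
    """Write fp16/onnx and int8/onnx under out_dir (filenames are assumed)."""
    fp16_dir = out_dir / "fp16" / "onnx"
    int8_dir = out_dir / "int8" / "onnx"
    fp16_dir.mkdir(parents=True, exist_ok=True)
    int8_dir.mkdir(parents=True, exist_ok=True)
    for name, precision in precision_plan().items():
        source = fp32_dir / f"{name}.onnx"
        to_fp16(source, fp16_dir / f"{name}.onnx")
        if precision == "int8":
            # Assumption: int8 is produced from the fp32 source, not the fp16 file.
            to_int8(source, int8_dir / f"{name}.onnx")
        else:
            shutil.copy2(fp16_dir / f"{name}.onnx", int8_dir / f"{name}.onnx")
```

Keeping the precision plan in one small function makes it easy to try other splits, for example checking by ear whether another model tolerates int8.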
+## Performance (Apple Silicon CPU, M-series)
+Short Korean utterance ("안녕하세요. 오늘 날씨가 정말 좋네요.", i.e. "Hello. The weather is really nice today."), CPU EP only:
+| Variant | Size | Synthesis time | Quality (auditory) |
+|---|---:|---:|---|
+| fp32 baseline (upstream) | 380 MB | ~0.7 s | Reference |
+| **fp16/** | 191 MB | ~0.7 s | Indistinguishable from fp32 |
+| **int8/** | 131 MB | ~0.7-5 s | Indistinguishable from fp16 |
+> The CPU EP runs weight-only int8 by dequantizing the weights to fp32 and then doing a regular matmul, so int8 is not faster on CPU. Use the CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration; on those providers, int8/fp16 run faster than fp32 with significantly lower memory.
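A minimal sketch of picking an execution provider along these lines, assuming the `onnxruntime` Python package. The provider strings are ONNX Runtime's registered names; `pick_providers` and `open_session` are hypothetical helpers, and whether the official SDK lets you pass providers through is not covered here.

```python
# Preference order: accelerated providers first, CPU as the fallback.
PREFERRED = [
    "CoreMLExecutionProvider",  # macOS
    "DmlExecutionProvider",     # Windows (DirectML)
    "CPUExecutionProvider",
]


def pick_providers(available: list[str]) -> list[str]:
    """Keep the preferred providers that this onnxruntime build offers, in order."""
    chosen = [provider for provider in PREFERRED if provider in available]
    return chosen or ["CPUExecutionProvider"]


def open_session(model_path: str):
    """Create an InferenceSession using the best available provider."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing `CPUExecutionProvider` last lets ONNX Runtime fall back to CPU for any nodes the accelerated provider does not support.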
 ## License
+OpenRAIL-M, inherited from [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3). See [LICENSE](./LICENSE).
+Use restrictions (Attachment A) apply: no impersonation/deepfakes without consent, no AI-generated content without disclosure, no medical advice, no illegal activities, etc.
 ## Credits
+- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
+- Reference quantization pattern: [Reza2kn/supertonic-3-litert](https://huggingface.co/Reza2kn/supertonic-3-litert)
+- Quantization (this repo): selective fp16/int8 ONNX for Electron / desktop on-device deployment