Supertonic-3 Quantized (ONNX)

Quantized ONNX derivative of Supertone/supertonic-3 for on-device TTS. Drop-in replacement for the official ONNX assets: same Python / C++ / Node SDK, smaller weights.

31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).

Variants

| Folder | Total size | Method | Quality | Use case |
|---|---|---|---|---|
| fp16/ | 191 MB | All 4 models float16 (onnxruntime.transformers.float16) | ≈99% of fp32 | On-device desktop/mobile, ORT/CoreML/DirectML |

voice_styles/ is shared and unchanged from upstream.

Why no int8 variant?

Dynamic int8 quantization was tested on vector_estimator (the largest model, a ConvNeXt-based diffusion U-Net), but the resulting model emits ConvInteger nodes, which are not implemented in many ORT CPU builds:

  • Common error: NOT_IMPLEMENTED: Could not find an implementation for ConvInteger(10) node
  • Affects: onnxruntime-node, minimal builds, older ORT versions, some mobile builds

Restricting dynamic quantization to MatMul ops (skipping Conv) gives only ~6% size reduction because vector_estimator is Conv-dominated. Static int8 (QDQ) with calibration would work universally but requires capturing intermediate diffusion states, which is out of scope for this repo.

For now, fp16 is the recommended on-device variant: universal ORT compatibility, near-lossless quality, ~50% smaller than fp32.
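The MatMul-only experiment and the ConvInteger check described above can be sketched roughly as follows. This is a minimal illustration, not the script used for this repo: the helper names (count_op_types, has_conv_integer, quantize_matmul_only, load_nodes) are hypothetical, and only top-level graph nodes are inspected.

```python
from collections import Counter


def count_op_types(nodes):
    """Count op types; `nodes` is any iterable of objects with an `op_type` attribute."""
    return Counter(node.op_type for node in nodes)


def has_conv_integer(nodes):
    """True if the graph contains ConvInteger, the op missing from many ORT CPU builds."""
    return count_op_types(nodes)["ConvInteger"] > 0


def load_nodes(model_path):
    """Return the top-level graph nodes of an ONNX file (subgraphs are not traversed)."""
    import onnx  # imported lazily so the helpers above work without onnx installed

    return onnx.load(model_path).graph.node


def quantize_matmul_only(src_path, dst_path):
    """Dynamic int8 restricted to MatMul, so Conv layers stay in float."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        src_path,
        dst_path,
        op_types_to_quantize=["MatMul"],  # skip Conv to avoid ConvInteger nodes
        weight_type=QuantType.QInt8,
    )
```

Running has_conv_integer(load_nodes(path)) on a quantized file before shipping it is a cheap way to catch the NOT_IMPLEMENTED error ahead of deployment.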

Layout

fp16/onnx/
    text_encoder.onnx
    duration_predictor.onnx
    vector_estimator.onnx
    vocoder.onnx
    tts.json
    unicode_indexer.json

voice_styles/
    {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
  • fp16/onnx/: 4 ONNX weights + architecture config (tts.json) + tokenizer table (unicode_indexer.json).
  • voice_styles/: voice embeddings, identical to upstream.
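As a sanity check on the layout above, a small helper can report which expected files are missing from a local copy. This is a sketch that assumes the directory tree matches the listing exactly; the names EXPECTED_FILES and missing_files are illustrative, not part of any SDK.

```python
from pathlib import Path

# Files listed in the layout above: 6 under fp16/onnx/ plus 10 voice styles.
EXPECTED_FILES = [
    "fp16/onnx/text_encoder.onnx",
    "fp16/onnx/duration_predictor.onnx",
    "fp16/onnx/vector_estimator.onnx",
    "fp16/onnx/vocoder.onnx",
    "fp16/onnx/tts.json",
    "fp16/onnx/unicode_indexer.json",
] + [f"voice_styles/{gender}{index}.json" for gender in "FM" for index in range(1, 6)]


def missing_files(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from missing_files("./supertonic") means the download is complete with respect to this listing.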

Download

hf download Kyumdroid/supertonic-3-quant \
  --include="fp16/onnx/**" --include="voice_styles/**" \
  --local-dir ./supertonic
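The same download can be done from Python with huggingface_hub.snapshot_download. The sketch below reuses the repo id and patterns from the command above; the wrapper names (download_assets, is_selected) are hypothetical, and is_selected only approximates the library's pattern matching with fnmatch.

```python
from fnmatch import fnmatch

REPO_ID = "Kyumdroid/supertonic-3-quant"
ALLOW_PATTERNS = ["fp16/onnx/**", "voice_styles/**"]


def is_selected(repo_path):
    """Approximate check of whether a repo file matches the include patterns."""
    return any(fnmatch(repo_path, pattern) for pattern in ALLOW_PATTERNS)


def download_assets(local_dir="./supertonic"):
    """Fetch only the fp16 models and voice styles into `local_dir`."""
    from huggingface_hub import snapshot_download  # lazy import: needs network access

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=ALLOW_PATTERNS,
        local_dir=local_dir,
    )
```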

Voice catalog

Display names from the official Supertonic demo Space:

| File | Name | Description |
|---|---|---|
| M1.json | Alex | Lively, upbeat male |
| M2.json | James | Deep, composed male |
| M3.json | Robert | Polished, authoritative male (demo default) |
| M4.json | Sam | Soft, neutral, youthful male |
| M5.json | Daniel | Warm, soothing male |
| F1.json | Sarah | Calm, steady female |
| F2.json | Lily | Bright, cheerful female |
| F3.json | Jessica | Broadcast-style female |
| F4.json | Olivia | Crisp, confident female |
| F5.json | Emily | Gentle, soothing female |

Conversion

fp16/ was produced via onnxruntime.transformers.float16.convert_float_to_float16 with:

  • keep_io_types=True (fp32 IO for SDK compatibility)
  • op_block_list=['Cast'] (avoid Cast type mismatch)
  • ONNX shape_inference.infer_shapes_path applied to upstream fp32 first

The conversion script is available in the project repository.
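For orientation, the steps listed above could look roughly like the sketch below. It is not the repository's actual script: the function names, the directory arguments, and the temporary ".shapes.onnx" file name are assumptions made for illustration. Only the two library calls and the options come from the description above.

```python
from pathlib import Path

MODEL_NAMES = ["text_encoder", "duration_predictor", "vector_estimator", "vocoder"]

# Options named in the list above.
FP16_OPTIONS = {"keep_io_types": True, "op_block_list": ["Cast"]}


def plan_conversion(src_dir, dst_dir):
    """Return (source, shape-inferred temp file, destination) paths per model."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [
        (src_dir / f"{name}.onnx", dst_dir / f"{name}.shapes.onnx", dst_dir / f"{name}.onnx")
        for name in MODEL_NAMES
    ]


def convert_all(src_dir, dst_dir):
    """Run shape inference on each fp32 model, then convert it to float16."""
    import onnx
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, inferred, dst in plan_conversion(src_dir, dst_dir):
        # Step 1: annotate the upstream fp32 graph with inferred shapes.
        shape_inference.infer_shapes_path(str(src), str(inferred))
        # Step 2: convert weights to fp16, keeping fp32 inputs and outputs.
        model = onnx.load(str(inferred))
        model_fp16 = convert_float_to_float16(model, **FP16_OPTIONS)
        onnx.save(model_fp16, str(dst))
        inferred.unlink()  # remove the temporary shape-inferred file
```

Keeping the path planning separate from the conversion makes it easy to inspect what will be written before touching any model files.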

Performance (Apple Silicon CPU)

Short Korean utterance, ORT CPU EP only:

| Variant | Size | Synthesis time |
|---|---|---|
| fp32 baseline (upstream) | 380 MB | ~0.7 s |
| fp16 | 191 MB | ~0.7 s |

The CPU EP upcasts fp16 to fp32 for computation, so wall-clock time is similar. Use the CoreML EP (macOS) or DirectML EP (Windows) for fp16-native acceleration: 2-3× faster and ~50% lower RAM.
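One way to pick an accelerated execution provider when it is present, and fall back to CPU otherwise, is sketched below. The helper names are hypothetical; the provider strings are the standard ONNX Runtime identifiers, and which ones are available depends on the installed onnxruntime package.

```python
# Preference order: fp16-capable accelerators first, CPU as the universal fallback.
PREFERENCE = ["CoreMLExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]


def preferred_providers(available):
    """Filter PREFERENCE down to the providers this ORT build actually offers."""
    return [provider for provider in PREFERENCE if provider in available]


def open_session(model_path):
    """Create an InferenceSession using the best available provider."""
    import onnxruntime as ort  # lazy import so preferred_providers is usable alone

    providers = preferred_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing CPUExecutionProvider last lets ONNX Runtime fall back to it for any node the accelerator does not support.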

License

OpenRAIL-M, inherited from Supertone/supertonic-3. See LICENSE.

Use restrictions (Attachment A) apply: no impersonation/deepfakes without consent, no AI-generated content without disclosure, no medical advice, no illegal activities, etc.

Credits

  • Original model: Supertone/supertonic-3 by Supertone Inc.
  • Quantization (this repo): fp16 ONNX for Electron / desktop on-device deployment