	---
	license: openrail
	base_model: Supertone/supertonic-3
	base_model_relation: quantized
	language:
	- en
	- ko
	- ja
	- ar
	- bg
	- cs
	- da
	- de
	- el
	- es
	- et
	- fi
	- fr
	- hi
	- hr
	- hu
	- id
	- it
	- lt
	- lv
	- nl
	- pl
	- pt
	- ro
	- ru
	- sk
	- sl
	- sv
	- tr
	- uk
	- vi
	pipeline_tag: text-to-speech
	library_name: supertonic
	tags:
	- text-to-speech
	- tts
	- speech-synthesis
	- onnx
	- quantized
	- fp16
	- supertonic
	- multilingual
	- on-device
	- diffusion
	- flow-matching
	---

	# Supertonic-3 Quantized (ONNX)

	Quantized ONNX derivative of [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) for on-device TTS. Drop-in replacement for the official ONNX assets — same Python / C++ / Node SDK, smaller weights.

	31 languages (en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi).

	## Variants

	| Folder | Total size | Method | Quality | Use case |
	|--------|---:|---|---|---|
	| `fp16/` | 191 MB | All 4 models float16 (`onnxruntime.transformers.float16`) | ≈99% of fp32 | On-device desktop/mobile, ORT/CoreML/DirectML |

	`voice_styles/` is shared and unchanged from upstream.

	### Why no int8 variant?

	Dynamic int8 quantization was tested on `vector_estimator` (the largest model, a ConvNeXt-based diffusion U-Net), but the resulting model emits `ConvInteger` op nodes, which are not implemented in many ORT CPU builds:
	- Common error: `NOT_IMPLEMENTED: Could not find an implementation for ConvInteger(10) node`
	- Affects: `onnxruntime-node`, minimal builds, older ORT versions, some mobile builds
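
To check whether a given model file would hit this error before shipping it, one option is to count `ConvInteger` nodes in its graph. The following is a minimal sketch, assuming the `onnx` Python package is installed; the helper names `count_op_types` and `conv_integer_count` are hypothetical and not part of this repo or the SDK.

```python
from collections import Counter


def count_op_types(op_types):
    """Count how often each op type appears in an iterable of op-type strings."""
    return Counter(op_types)


def conv_integer_count(model_path):
    """Return the number of ConvInteger nodes in the top-level graph of an ONNX file.

    Nodes nested inside control-flow subgraphs (If/Loop) are not visited here.
    """
    import onnx  # imported lazily so count_op_types works without onnx installed

    # Only the graph structure is needed, so external weight files are skipped.
    model = onnx.load(model_path, load_external_data=False)
    return count_op_types(node.op_type for node in model.graph.node)["ConvInteger"]
```

A non-zero result means the model relies on a `ConvInteger` kernel being present in the target runtime.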

	Restricting dynamic quantization to MatMul ops (skipping Conv) gives only ~6% size reduction because `vector_estimator` is Conv-dominated. Static int8 (QDQ) with calibration would work universally but requires capturing intermediate diffusion states — out of scope for this repo.
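
For reference, a MatMul-only dynamic quantization pass could look roughly like the sketch below. It assumes `onnxruntime.quantization.quantize_dynamic` with its `op_types_to_quantize` argument; the function names and paths are hypothetical, and this is not necessarily the exact command used for the experiment described above.

```python
def quantize_matmul_only(fp32_path, int8_path):
    """Dynamically quantize only MatMul weights to int8, leaving Conv nodes in float."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        op_types_to_quantize=["MatMul"],  # skip Conv so no ConvInteger nodes are emitted
        weight_type=QuantType.QInt8,
    )


def size_reduction(before_bytes, after_bytes):
    """Fraction of the original file size that was saved (0.06 means 6% smaller)."""
    return 1.0 - after_bytes / before_bytes
```

Comparing file sizes before and after with `size_reduction` is a quick way to see how little a Conv-dominated model gains from this restriction.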

	For now, `fp16` is the recommended on-device variant: universal ORT compatibility, near-lossless quality, ~50% smaller than fp32.

	## Layout

	```
	fp16/onnx/
	  text_encoder.onnx
	  duration_predictor.onnx
	  vector_estimator.onnx
	  vocoder.onnx
	  tts.json
	  unicode_indexer.json

	voice_styles/
	  {F1,F2,F3,F4,F5,M1,M2,M3,M4,M5}.json
	```

	- `fp16/onnx/` — 4 ONNX weights + architecture config (`tts.json`) + tokenizer table (`unicode_indexer.json`).
	- `voice_styles/` — voice embeddings, identical to upstream.
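
After downloading, the layout above can be sanity-checked with a few lines of Python. This is a small sketch using only the standard library; the file list is taken from the layout shown here, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Files expected under the download root, per the layout above.
EXPECTED_FILES = [
    "fp16/onnx/text_encoder.onnx",
    "fp16/onnx/duration_predictor.onnx",
    "fp16/onnx/vector_estimator.onnx",
    "fp16/onnx/vocoder.onnx",
    "fp16/onnx/tts.json",
    "fp16/onnx/unicode_indexer.json",
] + [f"voice_styles/{g}{i}.json" for g in ("F", "M") for i in range(1, 6)]


def missing_files(root):
    """Return the expected relative paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files("./supertonic")` indicates that every expected file is present.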

	## Download

	```bash
	hf download Kyumdroid/supertonic-3-quant \
	  --include="fp16/onnx/" --include="voice_styles/" \
	  --local-dir ./supertonic
	```
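
The same download can be done from Python. The sketch below assumes the `huggingface_hub` package and its `snapshot_download` function; the glob patterns are written with an explicit `*`, and `download_fp16` is a hypothetical wrapper name.

```python
# Glob patterns covering the fp16 weights and the shared voice styles.
ALLOW_PATTERNS = ["fp16/onnx/*", "voice_styles/*"]


def download_fp16(local_dir="./supertonic"):
    """Download only the fp16 models and voice styles; returns the local folder path."""
    from huggingface_hub import snapshot_download  # lazy import; needs network access

    return snapshot_download(
        repo_id="Kyumdroid/supertonic-3-quant",
        allow_patterns=ALLOW_PATTERNS,
        local_dir=local_dir,
    )
```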

	## Voice catalog

	Display names from the official [Supertonic demo Space](https://huggingface.co/spaces/Supertone/supertonic-3):

	| File | Name | Description |
	|---|---|---|
	| `M1.json` | Alex | Lively, upbeat male |
	| `M2.json` | James | Deep, composed male |
	| `M3.json` | Robert | Polished, authoritative male (demo default) |
	| `M4.json` | Sam | Soft, neutral, youthful male |
	| `M5.json` | Daniel | Warm, soothing male |
	| `F1.json` | Sarah | Calm, steady female |
	| `F2.json` | Lily | Bright, cheerful female |
	| `F3.json` | Jessica | Broadcast-style female |
	| `F4.json` | Olivia | Crisp, confident female |
	| `F5.json` | Emily | Gentle, soothing female |
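
If an application exposes these display names in its UI, a plain lookup table is enough to map them back to files. This is an illustrative sketch only; `VOICES` and `voice_file` are hypothetical names, and the directory default assumes the layout described earlier.

```python
# Display name -> voice style file, copied from the table above.
VOICES = {
    "Alex": "M1.json", "James": "M2.json", "Robert": "M3.json",
    "Sam": "M4.json", "Daniel": "M5.json",
    "Sarah": "F1.json", "Lily": "F2.json", "Jessica": "F3.json",
    "Olivia": "F4.json", "Emily": "F5.json",
}


def voice_file(name, styles_dir="voice_styles"):
    """Return the relative path of the style file for a display name (KeyError if unknown)."""
    return f"{styles_dir}/{VOICES[name]}"
```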

	## Conversion

	`fp16/` was produced via `onnxruntime.transformers.float16.convert_float_to_float16` with:
	- `keep_io_types=True` (fp32 IO for SDK compatibility)
	- `op_block_list=['Cast']` (avoid Cast type mismatch)
	- `onnx.shape_inference.infer_shapes_path` applied to the upstream fp32 models first

	Conversion script available in the project repository.
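
For readers who want to reproduce the steps without the script, the options listed above could be combined roughly as follows. This is a sketch under assumptions, not the project's actual conversion script: it assumes the `onnx` and `onnxruntime` packages, and the function name, the `FP16_OPTIONS` constant, and the intermediate `inferred_path` file are hypothetical.

```python
# Conversion options as documented in this section.
FP16_OPTIONS = {"keep_io_types": True, "op_block_list": ["Cast"]}


def convert_to_fp16(fp32_path, inferred_path, fp16_path):
    """Run shape inference on an fp32 model, then convert it to float16."""
    import onnx
    from onnx import shape_inference
    from onnxruntime.transformers.float16 import convert_float_to_float16

    # Step 1: write a shape-annotated copy of the upstream fp32 model.
    shape_inference.infer_shapes_path(fp32_path, inferred_path)

    # Step 2: convert weights and activations to fp16, keeping fp32 inputs/outputs.
    model = onnx.load(inferred_path)
    model_fp16 = convert_float_to_float16(model, **FP16_OPTIONS)

    onnx.save(model_fp16, fp16_path)
```

The function would be called once per model file, for example `convert_to_fp16("vocoder.onnx", "vocoder.inferred.onnx", "fp16/onnx/vocoder.onnx")`.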

	## Performance (Apple Silicon CPU)

	Short Korean utterance, ORT CPU EP only:

	| Variant | Size | Synthesis time |
	|---|---:|---:|
	| fp32 baseline (upstream) | 380 MB | ~0.7 s |
	| fp16 | 191 MB | ~0.7 s |

	The CPU EP upcasts fp16 to fp32 for computation, so wall-clock time is similar to the baseline. Use the CoreML EP (macOS) or DirectML EP (Windows) for native fp16 acceleration: 2-3× faster and ~50% lower RAM.
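
When creating sessions directly with ONNX Runtime rather than through the SDK, the execution provider can be chosen at load time. The sketch below assumes the `onnxruntime` Python package; the preference order and the helper names `pick_providers` and `make_session` are illustrative choices, not part of the Supertonic SDK.

```python
# Preferred order: fp16-capable accelerators first, CPU as the fallback.
PREFERRED = ["CoreMLExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available):
    """Return the preferred providers that are available, always ending up with CPU at least."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]


def make_session(model_path):
    """Create an InferenceSession using the best providers present in this ORT build."""
    import onnxruntime as ort  # lazy import so pick_providers is usable on its own

    return ort.InferenceSession(
        model_path, providers=pick_providers(ort.get_available_providers())
    )
```

Keeping the CPU provider last in the list lets ONNX Runtime fall back to it for any node the accelerator does not support.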

	## License

	OpenRAIL-M, inherited from [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3). See [LICENSE](./LICENSE).

	Use restrictions (Attachment A) apply: no impersonation/deepfakes without consent, no AI-generated content without disclosure, no medical advice, no illegal activities, etc.

	## Credits

	- Original model: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3) by Supertone Inc.
	- Quantization (this repo): fp16 ONNX for Electron / desktop on-device deployment