Duplicate from Supertone/supertonic-2

58cb83c 6 days ago

5.77 kB

	---
	license: openrail
	language:
	- en
	- ko
	- es
	- pt
	- fr
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- speech-synthesis
	- tts
	- onnx
	library_name: supertonic
	---

	# Supertonic 2 — Lightning Fast, On-Device TTS, Multilingual TTS

	![Supertonic Preview](img/supertonic_preview_0.1.jpg)

	<p align="center">
	<a href="https://huggingface.co/spaces/Supertone/supertonic-2"><img src="https://img.shields.io/badge/🤗_Demo-Hugging_Face-yellow?style=for-the-badge" alt="Demo"></a>
	<a href="https://github.com/supertone-inc/supertonic"><img src="https://img.shields.io/badge/💻_Code-GitHub-black?style=for-the-badge&logo=github" alt="Code"></a>
	</p>

	Supertonic is a lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.

	## What's New in Supertonic 2

	Supertonic 2 extends multilingual capabilities while maintaining the same inference speed and efficiency as the original.

	### 🌍 Multilingual Support

	\| Language \| Code \|
	\|----------\|------\|
	\| English \| `en` \|
	\| Korean \| `ko` \|
	\| Spanish \| `es` \|
	\| Portuguese \| `pt` \|
	\| French \| `fr` \|

	### ⚡ Same Speed, More Languages

	- No speed degradation: Supertonic 2 delivers the same ultra-fast inference speed as the original—up to 167× faster than real-time
	- Efficient architecture: Only 66M parameters, optimized for on-device deployment
	- Cross-language consistency: All supported languages share the same model architecture and inference pipeline

	## Performance

	We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

	Metrics:
	- Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
	- Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).

	### Characters per Second
	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 912 \| 1048 \| 1263 \|
	\| Supertonic (M4 pro - WebGPU) \| 996 \| 1801 \| 2509 \|
	\| Supertonic (RTX4090) \| 2615 \| 6548 \| 12164 \|
	\| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) \| 144 \| 209 \| 287 \|
	\| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) \| 37 \| 55 \| 82 \|
	\| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) \| 12 \| 18 \| 24 \|
	\| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) \| 38 \| 64 \| 92 \|
	\| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) \| 104 \| 107 \| 117 \|
	\| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) \| 37 \| 42 \| 47 \|

	> Notes:
	> `API` = Cloud-based API services (measured from Seoul)
	> `Open` = Open-source models
	> Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
	> Supertonic (RTX4090): Tested with PyTorch model
	> Kokoro: Tested on M4 Pro CPU with ONNX
	> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

	### Real-time Factor

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 0.015 \| 0.013 \| 0.012 \|
	\| Supertonic (M4 pro - WebGPU) \| 0.014 \| 0.007 \| 0.006 \|
	\| Supertonic (RTX4090) \| 0.005 \| 0.002 \| 0.001 \|
	\| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) \| 0.133 \| 0.077 \| 0.057 \|
	\| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) \| 0.471 \| 0.302 \| 0.201 \|
	\| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) \| 1.060 \| 0.673 \| 0.541 \|
	\| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) \| 0.372 \| 0.206 \| 0.163 \|
	\| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) \| 0.144 \| 0.124 \| 0.126 \|
	\| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) \| 0.390 \| 0.338 \| 0.343 \|

	<details>
	<summary><b>Additional Performance Data (5-step inference)</b></summary>

	<br>

	Characters per Second (5-step)

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 596 \| 691 \| 850 \|
	\| Supertonic (M4 pro - WebGPU) \| 570 \| 1118 \| 1546 \|
	\| Supertonic (RTX4090) \| 1286 \| 3757 \| 6242 \|

	Real-time Factor (5-step)

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 0.023 \| 0.019 \| 0.018 \|
	\| Supertonic (M4 pro - WebGPU) \| 0.024 \| 0.012 \| 0.010 \|
	\| Supertonic (RTX4090) \| 0.011 \| 0.004 \| 0.002 \|

	</details>

	## License

	This project’s sample code is released under the MIT License. - see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.

	The accompanying model is released under the OpenRAIL-M License. - see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.

	This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project. - see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.

	Copyright (c) 2026 Supertone Inc.

	---
	license: openrail
	language:
	- en
	- ko
	- es
	- pt
	- fr
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- speech-synthesis
	- tts
	- onnx
	library_name: supertonic
	---

	# Supertonic 2 — Lightning Fast, On-Device TTS, Multilingual TTS

	![Supertonic Preview](img/supertonic_preview_0.1.jpg)

	<p align="center">
	<a href="https://huggingface.co/spaces/Supertone/supertonic-2"><img src="https://img.shields.io/badge/🤗_Demo-Hugging_Face-yellow?style=for-the-badge" alt="Demo"></a>
	<a href="https://github.com/supertone-inc/supertonic"><img src="https://img.shields.io/badge/💻_Code-GitHub-black?style=for-the-badge&logo=github" alt="Code"></a>
	</p>

	Supertonic is a lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.

	## What's New in Supertonic 2

	Supertonic 2 extends multilingual capabilities while maintaining the same inference speed and efficiency as the original.

	### 🌍 Multilingual Support

	\| Language \| Code \|
	\|----------\|------\|
	\| English \| `en` \|
	\| Korean \| `ko` \|
	\| Spanish \| `es` \|
	\| Portuguese \| `pt` \|
	\| French \| `fr` \|

	### ⚡ Same Speed, More Languages

	- No speed degradation: Supertonic 2 delivers the same ultra-fast inference speed as the original—up to 167× faster than real-time
	- Efficient architecture: Only 66M parameters, optimized for on-device deployment
	- Cross-language consistency: All supported languages share the same model architecture and inference pipeline

	## Performance

	We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

	Metrics:
	- Characters per Second: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
	- Real-time Factor (RTF): Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).

	### Characters per Second
	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 912 \| 1048 \| 1263 \|
	\| Supertonic (M4 pro - WebGPU) \| 996 \| 1801 \| 2509 \|
	\| Supertonic (RTX4090) \| 2615 \| 6548 \| 12164 \|
	\| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) \| 144 \| 209 \| 287 \|
	\| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) \| 37 \| 55 \| 82 \|
	\| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) \| 12 \| 18 \| 24 \|
	\| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) \| 38 \| 64 \| 92 \|
	\| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) \| 104 \| 107 \| 117 \|
	\| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) \| 37 \| 42 \| 47 \|

	> Notes:
	> `API` = Cloud-based API services (measured from Seoul)
	> `Open` = Open-source models
	> Supertonic (M4 pro - CPU) and (M4 pro - WebGPU): Tested with ONNX
	> Supertonic (RTX4090): Tested with PyTorch model
	> Kokoro: Tested on M4 Pro CPU with ONNX
	> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

	### Real-time Factor

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 0.015 \| 0.013 \| 0.012 \|
	\| Supertonic (M4 pro - WebGPU) \| 0.014 \| 0.007 \| 0.006 \|
	\| Supertonic (RTX4090) \| 0.005 \| 0.002 \| 0.001 \|
	\| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) \| 0.133 \| 0.077 \| 0.057 \|
	\| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) \| 0.471 \| 0.302 \| 0.201 \|
	\| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) \| 1.060 \| 0.673 \| 0.541 \|
	\| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) \| 0.372 \| 0.206 \| 0.163 \|
	\| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) \| 0.144 \| 0.124 \| 0.126 \|
	\| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) \| 0.390 \| 0.338 \| 0.343 \|

	<details>
	<summary><b>Additional Performance Data (5-step inference)</b></summary>

	<br>

	Characters per Second (5-step)

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 596 \| 691 \| 850 \|
	\| Supertonic (M4 pro - WebGPU) \| 570 \| 1118 \| 1546 \|
	\| Supertonic (RTX4090) \| 1286 \| 3757 \| 6242 \|

	Real-time Factor (5-step)

	\| System \| Short (59 chars) \| Mid (152 chars) \| Long (266 chars) \|
	\|--------\|-----------------\|----------------\|-----------------\|
	\| Supertonic (M4 pro - CPU) \| 0.023 \| 0.019 \| 0.018 \|
	\| Supertonic (M4 pro - WebGPU) \| 0.024 \| 0.012 \| 0.010 \|
	\| Supertonic (RTX4090) \| 0.011 \| 0.004 \| 0.002 \|

	</details>

	## License

	This project’s sample code is released under the MIT License. - see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.

	The accompanying model is released under the OpenRAIL-M License. - see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.

	This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project. - see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.

	Copyright (c) 2026 Supertone Inc.