---
license: openrail
language:
  - en
  - ko
  - es
  - pt
  - fr
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - speech-synthesis
  - tts
  - mlx
  - mlx-audio
library_name: mlx-audio
base_model: Supertone/supertonic-2
---

# Supertonic-2 (MLX)

**Supertonic-2-MLX** is a pure-MLX port of [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2), a lightning-fast on-device TTS system. It runs natively on Apple Silicon through [`mlx-audio`](https://github.com/typomonster/mlx-audio), with no ONNX Runtime and no Python inference server: just `mx.load` + Metal.

- **66M params**, 4 sub-models (duration predictor, text encoder, flow-matching vector estimator, Vocos-style vocoder).
- **5 languages**: English, Korean, Spanish, Portuguese, French.
- **10 preset voices**: `M1`–`M5` (male), `F1`–`F5` (female).
- **44.1 kHz** output, ~0.03 RTF on M4 Pro with 5 Euler steps.
- **float32 parity** with the upstream ONNX Runtime pipeline.

## Install

Supertonic support hasn't been upstreamed to `mlx-audio` yet, so install the fork [`typomonster/mlx-audio`](https://github.com/typomonster/mlx-audio):

```bash
pip install git+https://github.com/typomonster/mlx-audio.git
```

## Quick start

```python
from mlx_audio.tts import load

# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")

for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)
```

## Save to WAV

```python
import numpy as np
import soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

# Long inputs are yielded as several chunks; collect them as NumPy arrays.
pieces = [
    np.asarray(r.audio)
    for r in model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")
]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
```

## Multi-language, multi-voice

```python
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]

for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)
```

## Performance

Measured on **Apple M1 Max** with 5 Euler steps, post-warmup:

| Input                                     | Audio  | Wall  | RTF    |
| ----------------------------------------- | ------ | ----- | ------ |
| `"Hello world."` (en, M1)                 | 1.46 s | 42 ms | 0.029× |
| `"오늘 아침 공원을 산책했어요."` (ko, F1) | 2.63 s | 47 ms | 0.018× |

Lower RTF is better (<1× means faster than real-time).
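As a quick sanity check on the table above, the sketch below recomputes RTF from the reported wall time and audio duration, using the conventional definition (wall-clock synthesis time divided by audio duration). The helper names `real_time_factor` and `audio_duration` are hypothetical and are not part of `mlx-audio`; whether `r.real_time_factor` is computed in exactly this way inside the library is an assumption.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def audio_duration(num_samples: int, sample_rate: int = 44100) -> float:
    """Duration in seconds of a mono clip with num_samples samples."""
    return num_samples / sample_rate


# Wall time and audio duration copied from the two rows of the table.
rtf_en = real_time_factor(0.042, 1.46)
rtf_ko = real_time_factor(0.047, 2.63)
print(f"en: {rtf_en:.3f}x  ko: {rtf_ko:.3f}x")
```

Both values round to the figures in the RTF column (0.029 and 0.018). To time your own runs, the same ratio can be taken from a measured wall time and `audio_duration(len(wav), model.sample_rate)`.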
## Generation options

```python
model.generate(
    text,
    voice="M1",                  # one of M1–M5, F1–F5
    lang="en",                   # en | ko | es | pt | fr
    speed=1.05,                  # >1 speaks faster (scales predicted duration)
    steps=5,                     # Euler steps; more = higher quality, slower
    seed=0,                      # deterministic given the same seed + input
    chunk_max_len=None,          # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
```

## Files

- `config.json`: mlx-audio model config
- `{duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors`: MLX weights
- `unicode_indexer.json`, `voice_styles/*.json`: runtime assets
- `tts.json`: upstream pipeline config (preserved for reference)

## References

- Upstream model: [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2)
- Upstream code: [supertone-inc/supertonic](https://github.com/supertone-inc/supertonic) · fork with MLX integration: [typomonster/supertonic](https://github.com/typomonster/supertonic)
- mlx-audio (with Supertonic support): [typomonster/mlx-audio](https://github.com/typomonster/mlx-audio) · upstream: [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)

## License

OpenRAIL-M (inherited from the upstream model). See `LICENSE` for the full terms; redistribution must carry the use-based restrictions (Attachment A) forward.