---
license: openrail
language:
- en
- ko
- es
- pt
- fr
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- tts
- mlx
- mlx-audio
library_name: mlx-audio
base_model: Supertone/supertonic-2
---

# Supertonic-2 (MLX)

**Supertonic-2-MLX** is a pure-MLX port of
[Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2),
a lightning-fast on-device TTS system. It runs natively on Apple Silicon
through [`mlx-audio`](https://github.com/typomonster/mlx-audio): no ONNX
Runtime, no Python inference server, just `mx.load` + Metal.

- **66M params**, 4 sub-models (duration predictor, text encoder,
  flow-matching vector estimator, Vocos-style vocoder).
- **5 languages**: English, Korean, Spanish, Portuguese, French.
- **10 preset voices**: `M1`–`M5` (male), `F1`–`F5` (female).
- **44.1 kHz** output, ~0.03 RTF on M4 Pro with 5 Euler steps.
- **float32 parity** with the upstream ONNX Runtime pipeline.
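
The "Euler steps" in the list above refer to how a flow-matching sampler integrates a learned velocity field from noise toward the output latent. The sketch below is a generic, pure-Python illustration of fixed-step Euler integration, not the code in `mlx-audio`: `euler_sample` and `toy_velocity` are hypothetical names, and the toy velocity field stands in for the neural vector estimator, which in the real model is conditioned on text and voice style.

```python
def euler_sample(velocity, x0, steps=5):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        # One Euler update: move along the current velocity for dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned vector estimator (assumption): a constant
# velocity that carries `start` to `target` along a straight line.
start = [0.0, 0.0]
target = [1.0, -2.0]


def toy_velocity(x, t):
    return [tj - sj for tj, sj in zip(target, start)]


result = euler_sample(toy_velocity, start, steps=5)
print(result)
```

With a straight-line field like this one, any number of steps lands on the target. A learned field is generally curved, which is why more steps can trade speed for quality.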

## Install

Supertonic support hasn't been upstreamed to `mlx-audio` yet, so install the
fork [`typomonster/mlx-audio`](https://github.com/typomonster/mlx-audio):

```bash
pip install git+https://github.com/typomonster/mlx-audio.git
```

## Quick start

```python
from mlx_audio.tts import load

# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")

for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)
```

## Save to WAV

```python
import numpy as np, soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

pieces = [np.asarray(r.audio) for r in
          model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
```

## Multi-language, multi-voice

```python
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "λ§λ£μ°λ μμ± λΉμμ λλ€."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]

for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)
```

## Performance

Measured on **Apple M1 Max** with 5 Euler steps, post-warmup:

| Input | Audio | Wall | RTF |
| ----------------------------------------- | ------ | ----- | ------ |
| `"Hello world."` (en, M1) | 1.46 s | 42 ms | 0.029× |
| `"μ€λ μμΉ¨ 곡μμ μ°μ± νμ΄μ."` (ko, F1) | 2.63 s | 47 ms | 0.018× |

Lower RTF is better (<1× means faster than real-time).

More audio samples generated with MLX:
<https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic>
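
As a quick check on the table, RTF here is wall-clock synthesis time divided by the duration of the generated audio. The snippet below recomputes it from the two rows above; `real_time_factor` is a helper name chosen for this illustration, not part of `mlx-audio`.

```python
def real_time_factor(wall_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return wall_seconds / audio_seconds


# (label, audio duration in seconds, wall time in seconds) from the table above.
rows = [
    ("en, M1", 1.46, 0.042),
    ("ko, F1", 2.63, 0.047),
]

for label, audio_s, wall_s in rows:
    rtf = real_time_factor(wall_s, audio_s)
    print(f"{label}: RTF {rtf:.3f}")
```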

## Generation options

```python
model.generate(
    text,
    voice="M1",                  # one of M1–M5, F1–F5
    lang="en",                   # en | ko | es | pt | fr
    speed=1.05,                  # >1 speaks faster (scales predicted duration)
    steps=5,                     # Euler steps; more = higher quality, slower
    seed=0,                      # deterministic given the same seed + input
    chunk_max_len=None,          # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
```
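
To give a feel for what `chunk_max_len` and `silence_between_chunks` control, here is a minimal sketch of one way long text could be split and the resulting audio rejoined. It is an assumption-laden illustration, not the chunking logic `mlx-audio` actually uses: `split_into_chunks` and `join_with_silence` are hypothetical helpers, sentences are detected with a simple regex, and audio is modeled as plain Python lists of samples.

```python
import re


def split_into_chunks(text, max_len=300):
    """Greedily pack whole sentences into chunks of at most max_len characters.

    A single sentence longer than max_len is kept as one oversized chunk.
    """
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def join_with_silence(pieces, sample_rate=44100, silence_seconds=0.3):
    """Concatenate per-chunk sample lists with a gap of zeros between them."""
    gap = [0.0] * round(sample_rate * silence_seconds)
    out = []
    for i, piece in enumerate(pieces):
        if i > 0:
            out.extend(gap)
        out.extend(piece)
    return out


print(split_into_chunks("One. Two. Three.", max_len=9))
```

At 44.1 kHz, the default 0.3 s gap corresponds to 13,230 zero samples between consecutive chunks.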

## Files

- `config.json`: mlx-audio model config
- `{duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors`: MLX weights
- `unicode_indexer.json`, `voice_styles/*.json`: runtime assets
- `tts.json`: upstream pipeline config (preserved for reference)
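
If you keep a local copy of the repo, a small check like the one below can confirm that the files listed above are all present before loading. This is only a convenience sketch: `missing_files` is a hypothetical helper, and the list of names is taken directly from the bullet list rather than from any manifest shipped with the model.

```python
from pathlib import Path

# File names expanded from the list above.
EXPECTED_FILES = [
    "config.json",
    "duration_predictor.safetensors",
    "text_encoder.safetensors",
    "vector_estimator.safetensors",
    "vocoder.safetensors",
    "unicode_indexer.json",
    "tts.json",
]


def missing_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    styles = root / "voice_styles"
    # At least one voice style JSON is expected under voice_styles/.
    if not (styles.is_dir() and any(styles.glob("*.json"))):
        missing.append("voice_styles/*.json")
    return missing
```

Calling `missing_files("path/to/local/copy")` returns an empty list when everything is in place.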

## References

- Upstream model: [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2)
- Upstream code: [supertone-inc/supertonic](https://github.com/supertone-inc/supertonic) · fork with MLX integration: [typomonster/supertonic](https://github.com/typomonster/supertonic)
- mlx-audio (with Supertonic support): [typomonster/mlx-audio](https://github.com/typomonster/mlx-audio) · upstream: [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)

## License

OpenRAIL-M (inherited from the upstream model). See `LICENSE` for the full
terms; redistribution must carry the use-based restrictions (Attachment A)
forward.