---
license: openrail
language:
- en
- ko
- es
- pt
- fr
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- tts
- mlx
- mlx-audio
library_name: mlx-audio
base_model: Supertone/supertonic-2
---
# Supertonic-2 (MLX)
**Supertonic-2-MLX** is a pure-MLX port of
[Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2),
a lightning-fast on-device TTS system. It runs natively on Apple Silicon
through [`mlx-audio`](https://github.com/typomonster/mlx-audio): no ONNX
Runtime, no Python inference server, just `mx.load` + Metal.
- **66M params**, 4 sub-models (duration predictor, text encoder,
flow-matching vector estimator, Vocos-style vocoder).
- **5 languages**: English, Korean, Spanish, Portuguese, French.
- **10 preset voices**: `M1`–`M5` (male), `F1`–`F5` (female).
- **44.1 kHz** output, ~0.03 RTF on M4 Pro with 5 Euler steps.
- **float32 parity** with the upstream ONNX Runtime pipeline.
## Install
Supertonic support hasn't been upstreamed to `mlx-audio` yet, so install the
fork [`typomonster/mlx-audio`](https://github.com/typomonster/mlx-audio):
```bash
pip install git+https://github.com/typomonster/mlx-audio.git
```
## Quick start
```python
from mlx_audio.tts import load
# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")
for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)
```
## Save to WAV
```python
import numpy as np
import soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")
# generate() may yield more than one result; collect them all before writing.
pieces = [
    np.asarray(r.audio)
    for r in model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")
]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
```
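If `soundfile` is not installed, the standard-library `wave` module can write 16-bit PCM instead. The sketch below is an illustration, not part of `mlx-audio`: `write_pcm16` is a hypothetical helper, and it assumes the samples are mono floats in the range [-1.0, 1.0]. Values outside that range are clipped.

```python
import struct
import wave


def write_pcm16(path, samples, sample_rate=44100):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as a 16-bit PCM WAV."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return len(ints)
```

With the `wav` array from the block above, a call would look like `write_pcm16("out.wav", wav.tolist(), model.sample_rate)`.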
## Multi-language, multi-voice
```python
from mlx_audio.tts import load
model = load("typomonster/supertonic-2-mlx")
cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]
for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)
```
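A caller may want to reject unsupported combinations before invoking the model. The helper below is a hypothetical sketch, not part of `mlx-audio`, built only from the languages and preset voice names listed in this card. How the library itself reacts to unknown values is not covered here.

```python
SUPPORTED_LANGS = {"en", "ko", "es", "pt", "fr"}
# Preset voices M1..M5 and F1..F5, as listed in this model card.
SUPPORTED_VOICES = {f"{gender}{i}" for gender in "MF" for i in range(1, 6)}


def check_request(lang: str, voice: str) -> None:
    """Raise ValueError if lang or voice is not one of the documented presets."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}"
        )
    if voice not in SUPPORTED_VOICES:
        raise ValueError(
            f"unsupported voice {voice!r}; expected one of {sorted(SUPPORTED_VOICES)}"
        )


check_request("ko", "F1")  # a documented pair, so no exception is raised
```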
## Performance
Measured on **Apple M1 Max** with 5 Euler steps, post-warmup:
| Input                                     | Audio  | Wall  | RTF    |
| ----------------------------------------- | ------ | ----- | ------ |
| `"Hello world."` (en, M1)                 | 1.46 s | 42 ms | 0.029× |
| `"오늘 아침 공원을 산책했어요."` (ko, F1) | 2.63 s | 47 ms | 0.018× |
Lower RTF is better (<1× means faster than real-time).
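RTF here is wall-clock synthesis time divided by the duration of the generated audio, which is consistent with the figures in the table. A minimal sketch that recomputes the column (`real_time_factor` is an illustrative helper, not an `mlx-audio` function):

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Return wall-clock synthesis time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Rows from the table above: (wall time in seconds, audio length in seconds).
rows = {"en/M1": (0.042, 1.46), "ko/F1": (0.047, 2.63)}
rtf = {name: round(real_time_factor(w, a), 3) for name, (w, a) in rows.items()}
print(rtf)  # {'en/M1': 0.029, 'ko/F1': 0.018}
```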
More audio samples generated with MLX:
<https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic>
## Generation options
```python
model.generate(
    text,
    voice="M1",                  # one of M1–M5, F1–F5
    lang="en",                   # en | ko | es | pt | fr
    speed=1.05,                  # >1 speaks faster (scales predicted duration)
    steps=5,                     # Euler steps; more = higher quality, slower
    seed=0,                      # deterministic given the same seed + input
    chunk_max_len=None,          # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
```
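To make `chunk_max_len` and `silence_between_chunks` concrete, here is a rough sketch of how long text could be split and padded. It is an illustration under stated assumptions, not the splitter `mlx-audio` uses: `split_text` and `silence_samples` are hypothetical names, and splitting on sentence punctuation is a guess. Only the default limits (120 characters for `ko`, 300 otherwise), the 0.3 s gap, and the 44.1 kHz sample rate come from this card.

```python
import re

DEFAULT_MAX_LEN = {"ko": 120}  # other languages fall back to 300 (per this card)


def split_text(text, lang="en", chunk_max_len=None):
    """Greedily pack whole sentences into chunks of up to chunk_max_len characters.

    A single sentence longer than the limit is kept whole rather than cut.
    """
    limit = chunk_max_len or DEFAULT_MAX_LEN.get(lang, 300)
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def silence_samples(seconds=0.3, sample_rate=44100):
    """Number of zero samples that a gap of `seconds` occupies at `sample_rate`."""
    return int(round(seconds * sample_rate))
```

With the defaults, `silence_samples()` gives 13230 samples, which is 0.3 s at 44.1 kHz.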
## Files
- `config.json`: mlx-audio model config
- `{duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors`: MLX weights
- `unicode_indexer.json`, `voice_styles/*.json`: runtime assets
- `tts.json`: upstream pipeline config (preserved for reference)
## References
- Upstream model: [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2)
- Upstream code: [supertone-inc/supertonic](https://github.com/supertone-inc/supertonic) · fork with MLX integration: [typomonster/supertonic](https://github.com/typomonster/supertonic)
- mlx-audio (with Supertonic support): [typomonster/mlx-audio](https://github.com/typomonster/mlx-audio) · upstream: [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)
## License
OpenRAIL-M (inherited from the upstream model). See `LICENSE` for the full
terms; redistribution must carry the use-based restrictions (Attachment A)
forward.