	---
	license: openrail
	language:
	- en
	- ko
	- es
	- pt
	- fr
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- speech-synthesis
	- tts
	- mlx
	- mlx-audio
	library_name: mlx-audio
	base_model: Supertone/supertonic-2
	---

	# Supertonic-2 (MLX)

	Supertonic-2-MLX is a pure-MLX port of
	[Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2),
	a lightning-fast on-device TTS system. It runs natively on Apple Silicon
	through [`mlx-audio`](https://github.com/typomonster/mlx-audio) — no ONNX
	Runtime, no Python inference server, just `mx.load` + Metal.

	- 66M params, 4 sub-models (duration predictor, text encoder,
	flow-matching vector estimator, Vocos-style vocoder).
	- 5 languages: English, Korean, Spanish, Portuguese, French.
	- 10 preset voices: `M1`–`M5` (male), `F1`–`F5` (female).
	- 44.1 kHz output, ~0.03 RTF on M4 Pro with 5 Euler steps.
	- float32 parity with the upstream ONNX Runtime pipeline.
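
The "Euler steps" mentioned above refer to fixed-step numerical integration of an ODE, which is how flow-matching models are commonly sampled. The toy sketch below is only an illustration of that idea: it integrates dx/dt = x rather than the model's vector estimator, and `euler_integrate` and `growth` are hypothetical names, not part of `mlx-audio`.

```python
import math


def euler_integrate(velocity, x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return x


def growth(x: float, t: float) -> float:
    """Toy velocity field (assumption for illustration); exact solution at t=1 is x0 * e."""
    return x


# Compare the approximation error for a small and a large step count.
for steps in (5, 50):
    approx = euler_integrate(growth, 1.0, steps)
    print(steps, round(approx, 4), round(abs(math.e - approx), 4))
```

The error shrinks as the step count grows, while the number of velocity evaluations grows linearly with it. That is the same quality-versus-speed trade-off the step count controls here.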

	## Install

	Supertonic support hasn't been upstreamed to `mlx-audio` yet — install the
	fork [`typomonster/mlx-audio`](https://github.com/typomonster/mlx-audio):

	```bash
	pip install git+https://github.com/typomonster/mlx-audio.git
	```

	## Quick start

	```python
	from mlx_audio.tts import load

	# Downloads this repo on first run and caches under ~/.cache/huggingface/.
	model = load("typomonster/supertonic-2-mlx")

	for r in model.generate("Hello world.", voice="M1", lang="en"):
	    # r.audio is an mx.array at model.sample_rate (44100 Hz)
	    print(r.samples, r.real_time_factor)
	```

	## Save to WAV

	```python
	import numpy as np, soundfile as sf
	from mlx_audio.tts import load

	model = load("typomonster/supertonic-2-mlx")
	# Each result carries one chunk of audio; convert each to a NumPy array.
	pieces = [
	    np.asarray(r.audio)
	    for r in model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")
	]
	wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
	sf.write("out.wav", wav, model.sample_rate)
	```

	## Multi-language, multi-voice

	```python
	from mlx_audio.tts import load

	model = load("typomonster/supertonic-2-mlx")

	cases = [
	    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
	    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
	    ("es", "F3", "Hola, ¿cómo estás hoy?"),
	    ("pt", "M2", "Bom dia, tudo bem?"),
	    ("fr", "F5", "Bonjour, comment ça va ?"),
	]
	for lang, voice, text in cases:
	    for r in model.generate(text, voice=voice, lang=lang):
	        print(lang, voice, r.samples, r.real_time_factor)
	```

	## Performance

	Measured on Apple M1 Max with 5 Euler steps, post-warmup:

	| Input | Audio length | Wall time | RTF |
	| ----------------------------------------- | ------ | ----- | ------ |
	| `"Hello world."` (en, M1) | 1.46 s | 42 ms | 0.029× |
	| `"오늘 아침 공원을 산책했어요."` (ko, F1) | 2.63 s | 47 ms | 0.018× |

	Lower RTF is better (<1× means faster than real-time).
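
As a quick check of the figures above, the sketch below recomputes RTF from the table, assuming RTF is wall-clock synthesis time divided by the length of the generated audio (the table values are consistent with that definition). `real_time_factor` is a hypothetical helper written for this example, not an `mlx-audio` function.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Return wall-clock synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Wall time and audio length taken from the two table rows.
hello_rtf = real_time_factor(0.042, 1.46)
korean_rtf = real_time_factor(0.047, 2.63)
print(f"{hello_rtf:.3f} {korean_rtf:.3f}")  # prints "0.029 0.018"
```

The reciprocal gives the speed-up over real time: an RTF of 0.029 corresponds to roughly 34 seconds of audio per second of compute.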

	More audio samples generated with MLX:
	<https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic>

	## Generation options

	```python
	model.generate(
	    text,
	    voice="M1",  # one of M1–M5, F1–F5
	    lang="en",  # en | ko | es | pt | fr
	    speed=1.05,  # >1 speaks faster (scales predicted duration)
	    steps=5,  # Euler steps; more = higher quality, slower
	    seed=0,  # deterministic given the same seed + input
	    chunk_max_len=None,  # override default (ko=120 chars, others=300)
	    silence_between_chunks=0.3,  # seconds between chunks in long texts
	)
	```
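
The `chunk_max_len` and `silence_between_chunks` options suggest that long input is split into chunks, synthesized piece by piece, and joined with short pauses. The sketch below shows one way such splitting could work. `split_text` and `silence_samples` are hypothetical helpers for illustration only, and the chunking rules actually used by the fork may differ.

```python
import re


def split_text(text: str, max_len: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_len characters.

    Hypothetical helper: a single sentence longer than max_len is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_len:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def silence_samples(seconds: float = 0.3, sample_rate: int = 44100) -> int:
    """Number of zero samples needed for a pause of the given length."""
    return int(round(seconds * sample_rate))


chunks = split_text("One. Two. Three.", max_len=9)
print(chunks)             # ['One. Two.', 'Three.']
print(silence_samples())  # 13230
```

At 44.1 kHz the default 0.3 s pause corresponds to 13230 samples of silence between consecutive chunks.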

	## Files
	- `config.json` — mlx-audio model config
	- `{duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors` — MLX weights
	- `unicode_indexer.json`, `voice_styles/*.json` — runtime assets
	- `tts.json` — upstream pipeline config (preserved for reference)
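
To confirm that a local copy of the repository is complete, a small check along these lines may help. It is a sketch built only from the file names listed above; `missing_assets` is a hypothetical helper, and it assumes `voice_styles/` needs at least one JSON preset rather than any specific file name.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "duration_predictor.safetensors",
    "text_encoder.safetensors",
    "vector_estimator.safetensors",
    "vocoder.safetensors",
    "unicode_indexer.json",
    "tts.json",
]


def missing_assets(repo_dir: str) -> list[str]:
    """Return the expected entries that are absent from a local copy of the repo."""
    root = Path(repo_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Assumption: at least one voice preset must exist under voice_styles/.
    styles = root / "voice_styles"
    if not (styles.is_dir() and any(styles.glob("*.json"))):
        missing.append("voice_styles/*.json")
    return missing
```

Calling `missing_assets(path)` on a complete download should return an empty list; any names it returns point to files that still need to be fetched.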

	## References
	- Upstream model: [Supertone/supertonic-2](https://huggingface.co/Supertone/supertonic-2)
	- Upstream code: [supertone-inc/supertonic](https://github.com/supertone-inc/supertonic) · fork with MLX integration: [typomonster/supertonic](https://github.com/typomonster/supertonic)
	- mlx-audio (with Supertonic support): [typomonster/mlx-audio](https://github.com/typomonster/mlx-audio) · upstream: [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)

	## License
	OpenRAIL-M (inherited from the upstream model). See `LICENSE` for the full
	terms — redistribution must carry the use-based restrictions (Attachment A)
	forward.