metadata
license: openrail
language:
  - en
  - ko
  - es
  - pt
  - fr
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - speech-synthesis
  - tts
  - mlx
  - mlx-audio
library_name: mlx-audio
base_model: Supertone/supertonic-2

Supertonic-2 (MLX)

Supertonic-2-MLX is a pure-MLX port of Supertone/supertonic-2, a lightning-fast on-device TTS system. It runs natively on Apple Silicon through mlx-audio, with no ONNX Runtime and no Python inference server, just mx.load and Metal.

  • 66M params, 4 sub-models (duration predictor, text encoder, flow-matching vector estimator, Vocos-style vocoder).
  • 5 languages: English, Korean, Spanish, Portuguese, French.
  • 10 preset voices: M1–M5 (male), F1–F5 (female).
  • 44.1 kHz output, ~0.03 RTF on M4 Pro with 5 Euler steps.
  • float32 parity with the upstream ONNX Runtime pipeline.
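
The "Euler steps" mentioned in the list above refer to a fixed-step ODE solver of the kind flow-matching models use at sampling time. The following is a toy sketch of that idea, not the Supertonic vector estimator: `euler_integrate` and the velocity field `dx/dt = -x` are illustrative assumptions, chosen only because the exact solution is known.

```python
import math
from typing import Callable


def euler_integrate(velocity: Callable[[float, float], float],
                    x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity(x, t)  # one Euler update
    return x


def toy_velocity(x: float, t: float) -> float:
    # Hypothetical velocity field; its exact solution is x(1) = x0 * exp(-1).
    return -x


exact = math.exp(-1.0)
err_5 = abs(euler_integrate(toy_velocity, 1.0, steps=5) - exact)
err_50 = abs(euler_integrate(toy_velocity, 1.0, steps=50) - exact)
print(f"error with 5 steps:  {err_5:.4f}")
print(f"error with 50 steps: {err_50:.4f}")
```

More steps shrink the discretization error but cost proportionally more evaluations of the velocity function, which is the trade-off a step count controls.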

Install

Supertonic support hasn't been upstreamed to mlx-audio yet, so install the fork typomonster/mlx-audio:

pip install git+https://github.com/typomonster/mlx-audio.git

Quick start

from mlx_audio.tts import load

# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")

for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)

Save to WAV

import numpy as np, soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")
pieces = [np.asarray(r.audio) for r in
          model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")]
wav = np.concatenate(pieces) if len(pieces) > 1 else pieces[0]
sf.write("out.wav", wav, model.sample_rate)
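
If `soundfile` is not installed, the standard-library `wave` module can write the same audio as 16-bit PCM. This is a sketch that assumes the samples are mono floats in [-1, 1]; `write_wav_pcm16` is a hypothetical helper, not part of mlx-audio.

```python
import struct
import wave
from typing import Iterable


def write_wav_pcm16(path: str, samples: Iterable[float], sample_rate: int) -> int:
    """Write mono float samples (assumed in [-1, 1]) as 16-bit PCM; return frame count."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    ints = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return len(ints)
```

With the `wav` array from the snippet above, a call would look like `write_wav_pcm16("out.wav", wav, model.sample_rate)`.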

Multi-language, multi-voice

from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]
for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)
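
A small pre-check can catch a mistyped voice or language code before any audio is generated. `validate_request` is a hypothetical helper, and the allowed sets are copied from the voice and language lists in this card rather than read from the model.

```python
SUPPORTED_LANGS = {"en", "ko", "es", "pt", "fr"}
SUPPORTED_VOICES = {f"{g}{i}" for g in "MF" for i in range(1, 6)}  # M1..M5, F1..F5


def validate_request(lang: str, voice: str) -> None:
    """Raise ValueError if lang or voice is not listed in this model card."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}")
    if voice not in SUPPORTED_VOICES:
        raise ValueError(
            f"unsupported voice {voice!r}; expected one of {sorted(SUPPORTED_VOICES)}")


validate_request("ko", "F1")      # passes silently
try:
    validate_request("de", "M1")  # "de" is not in the supported list
except ValueError as exc:
    print(exc)
```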

Performance

Measured on Apple M1 Max with 5 Euler steps, post-warmup:

| Input | Audio length | Wall time | RTF |
|---|---|---|---|
| "Hello world." (en, M1) | 1.46 s | 42 ms | 0.029× |
| "오늘 아침 공원을 산책했어요." (ko, F1) | 2.63 s | 47 ms | 0.018× |
Lower RTF is better (<1× means faster than real time).
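
RTF here is wall-clock synthesis time divided by the duration of the audio produced. As a sanity check, the figures in the table above can be reproduced from the two measured columns; `real_time_factor` is an illustrative helper, not the mlx-audio implementation.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Values taken from the table above (wall time converted from ms to seconds).
print(round(real_time_factor(0.042, 1.46), 3))  # 0.029
print(round(real_time_factor(0.047, 2.63), 3))  # 0.018
```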

More audio samples generated with MLX: https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic

Generation options

model.generate(
    text,
    voice="M1",           # one of M1–M5, F1–F5
    lang="en",            # en | ko | es | pt | fr
    speed=1.05,           # >1 speaks faster (scales predicted duration)
    steps=5,              # Euler steps; more = higher quality, slower
    seed=0,               # deterministic given the same seed + input
    chunk_max_len=None,   # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
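
The `chunk_max_len` and `silence_between_chunks` options imply that long inputs are split into pieces and joined with short pauses. The sketch below shows one way such splitting could work; it is an assumption for illustration, not the splitting logic mlx-audio actually uses, and `split_text` and `silence_samples` are hypothetical names.

```python
import re
from typing import List


def split_text(text: str, max_len: int) -> List[str]:
    """Greedily pack sentences into chunks of at most max_len characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len or not current:
            current = candidate   # an over-long single sentence is kept whole
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def silence_samples(seconds: float, sample_rate: int) -> int:
    """Number of zero samples needed for a pause of the given length."""
    return int(round(seconds * sample_rate))


print(split_text("One. Two three. Four five six.", max_len=15))
print(silence_samples(0.3, 44100))
```

At the default 0.3 s pause and a 44.1 kHz sample rate, each gap between chunks corresponds to 13230 zero samples.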

Files

  • config.json: mlx-audio model config
  • {duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors: MLX weights
  • unicode_indexer.json, voice_styles/*.json: runtime assets
  • tts.json: upstream pipeline config (preserved for reference)

License

OpenRAIL-M (inherited from the upstream model). See LICENSE for the full terms; redistribution must carry the use-based restrictions (Attachment A) forward.