license: openrail
language:
  - en
  - ko
  - es
  - pt
  - fr
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - speech-synthesis
  - tts
  - mlx
  - mlx-audio
library_name: mlx-audio
base_model: Supertone/supertonic-2

Supertonic-2 (MLX)

Supertonic-2-MLX is a pure-MLX port of Supertone/supertonic-2, a lightning-fast on-device TTS system. It runs natively on Apple Silicon through mlx-audio — no ONNX Runtime, no Python inference server, just mx.load + Metal.

66M params, 4 sub-models (duration predictor, text encoder, flow-matching vector estimator, Vocos-style vocoder).
5 languages: English, Korean, Spanish, Portuguese, French.
10 preset voices: M1–M5 (male), F1–F5 (female).
44.1 kHz output, roughly 0.02–0.03 RTF with 5 Euler steps (see Performance for the measurement setup).
float32 parity with the upstream ONNX Runtime pipeline.
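
To make "5 Euler steps" concrete, here is a minimal sketch of fixed-step Euler integration for a flow-matching ODE. The velocity field is a toy stand-in that points straight at a known target; it is not the Supertonic vector estimator, and `toy_velocity` and `euler_sample` are hypothetical names used only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def toy_velocity(x: Vector, t: float, target: Vector) -> Vector:
    """Toy velocity field pointing from x toward a known target.

    Assumption: this stands in for a learned vector estimator, which would
    predict the direction from text and voice conditioning instead.
    """
    remaining = 1.0 - t  # time left until t = 1
    return [(g - xi) / remaining for xi, g in zip(x, target)]


def euler_sample(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    steps: int = 5,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    dt = 1.0 / steps
    x = list(x0)
    for k in range(steps):
        t = k * dt  # evaluated at the start of each step, so t stays below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.7]  # stands in for the initial noise sample
result = euler_sample(noise, lambda x, t: toy_velocity(x, t, target), steps=5)
print(result)
```

In this toy case the field is exact, so any step count lands on the target. With a learned field, more steps generally reduce integration error at the cost of more estimator calls, which is the quality/speed trade-off the `steps` option exposes.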

Install

Supertonic support hasn't been upstreamed to mlx-audio yet — install the fork typomonster/mlx-audio:

pip install git+https://github.com/typomonster/mlx-audio.git

Quick start

from mlx_audio.tts import load

# Downloads this repo on first run and caches under ~/.cache/huggingface/.
model = load("typomonster/supertonic-2-mlx")

for r in model.generate("Hello world.", voice="M1", lang="en"):
    # r.audio is an mx.array at model.sample_rate (44100 Hz)
    print(r.samples, r.real_time_factor)

Save to WAV

import numpy as np
import soundfile as sf
from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

# generate() may yield several results; collect each as a NumPy array.
pieces = [
    np.asarray(r.audio)
    for r in model.generate("오늘 날씨가 정말 좋네요.", voice="F1", lang="ko")
]
# np.concatenate also handles the single-piece case.
wav = np.concatenate(pieces)
sf.write("out.wav", wav, model.sample_rate)
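
If `soundfile` is not available, the standard-library `wave` module can write the same audio as 16-bit PCM. The sketch below assumes the samples are mono floats in [-1, 1]; `write_wav_int16` is a hypothetical helper, not part of mlx-audio.

```python
import math
import struct
import wave
from typing import Iterable


def write_wav_int16(path: str, samples: Iterable[float], sample_rate: int = 44100) -> int:
    """Write mono float samples in [-1, 1] as 16-bit PCM and return the frame count."""
    frames = bytearray()
    count = 0
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # clip peaks to avoid int16 overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
        count += 1
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono (assumption)
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return count


# Demo input: 0.1 s of a 440 Hz tone at 44.1 kHz, standing in for model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]
```

With a model loaded as above, a call such as `write_wav_int16("out.wav", np.asarray(r.audio).tolist(), model.sample_rate)` should work under the same mono, [-1, 1] assumption.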

Multi-language, multi-voice

from mlx_audio.tts import load

model = load("typomonster/supertonic-2-mlx")

cases = [
    ("en", "M1", "The quick brown fox jumps over the lazy dog."),
    ("ko", "F1", "말듣쓰는 음성 비서입니다."),
    ("es", "F3", "Hola, ¿cómo estás hoy?"),
    ("pt", "M2", "Bom dia, tudo bem?"),
    ("fr", "F5", "Bonjour, comment ça va ?"),
]
for lang, voice, text in cases:
    for r in model.generate(text, voice=voice, lang=lang):
        print(lang, voice, r.samples, r.real_time_factor)

Performance

Measured on Apple M1 Max with 5 Euler steps, post-warmup:

| Input | Audio duration | Wall time | RTF |
| --- | --- | --- | --- |
| `"Hello world."` (en, M1) | 1.46 s | 42 ms | 0.029× |
| `"오늘 아침 공원을 산책했어요."` (ko, F1) | 2.63 s | 47 ms | 0.018× |

RTF (real-time factor) is wall time divided by audio duration. Lower is better; below 1× means faster than real time.
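
The RTF column can be reproduced from the other two columns. The sketch below recomputes it from the wall times and audio durations in the table; the function names are illustrative and not part of mlx-audio.

```python
def audio_seconds(num_samples: int, sample_rate: int = 44100) -> float:
    """Duration of a clip in seconds, given its sample count."""
    return num_samples / sample_rate


def real_time_factor(wall_s: float, audio_s: float) -> float:
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return wall_s / audio_s


# (wall time in seconds, audio duration in seconds), taken from the table above.
rows = [(0.042, 1.46), (0.047, 2.63)]
for wall_s, audio_s in rows:
    rtf = real_time_factor(wall_s, audio_s)
    print(f"RTF = {rtf:.3f}x, i.e. about {1 / rtf:.0f}x faster than real time")
```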

More audio samples generated with MLX: https://github.com/typomonster/mlx-audio/tree/main/docs/supertonic

Generation options

model.generate(
    text,
    voice="M1",           # one of M1–M5, F1–F5
    lang="en",            # en | ko | es | pt | fr
    speed=1.05,           # >1 speaks faster (scales predicted duration)
    steps=5,              # Euler steps; more = higher quality, slower
    seed=0,               # deterministic given the same seed + input
    chunk_max_len=None,   # override default (ko=120 chars, others=300)
    silence_between_chunks=0.3,  # seconds between chunks in long texts
)
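
To illustrate what `chunk_max_len` and `silence_between_chunks` control, here is a minimal sketch of sentence-level chunking and of converting the silence gap into a sample count. It is an assumption-laden illustration, not the splitter mlx-audio actually uses; `chunk_text` and `silence_samples` are hypothetical helpers, and only the default limits (120 for Korean, 300 otherwise) come from the options above.

```python
import re
from typing import List, Optional

KO_MAX_LEN = 120       # default for Korean, per the option comment above
DEFAULT_MAX_LEN = 300  # default for the other languages


def chunk_text(text: str, lang: str = "en", chunk_max_len: Optional[int] = None) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_len characters.

    A single sentence longer than max_len is kept whole rather than cut mid-word.
    """
    max_len = chunk_max_len or (KO_MAX_LEN if lang == "ko" else DEFAULT_MAX_LEN)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def silence_samples(seconds: float = 0.3, sample_rate: int = 44100) -> int:
    """Number of zero samples needed for a pause of the given length."""
    return round(seconds * sample_rate)


print(chunk_text("One. Two. Three.", chunk_max_len=10))
print(silence_samples())
```

Splitting on sentence boundaries keeps each chunk prosodically self-contained, which is why the pause between chunks tends to sound natural.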

Files

config.json — mlx-audio model config
{duration_predictor,text_encoder,vector_estimator,vocoder}.safetensors — MLX weights
unicode_indexer.json, voice_styles/*.json — runtime assets
tts.json — upstream pipeline config (preserved for reference)
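
As a quick sanity check on a local copy of the repo, the file list above can be turned into a small verification script. This is a sketch only: `missing_files` is a hypothetical helper, and it assumes the files sit directly in the directory you pass in.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "duration_predictor.safetensors",
    "text_encoder.safetensors",
    "vector_estimator.safetensors",
    "vocoder.safetensors",
    "unicode_indexer.json",
    "tts.json",
]


def missing_files(repo_dir: str) -> List[str]:
    """Return the expected entries that are absent from repo_dir."""
    root = Path(repo_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # voice_styles/*.json: require the directory and at least one style file.
    styles = root / "voice_styles"
    if not (styles.is_dir() and any(styles.glob("*.json"))):
        missing.append("voice_styles/*.json")
    return missing
```

An empty return value means every listed file was found; it does not validate the contents of the weights or configs.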

References

Upstream model: Supertone/supertonic-2
Upstream code: supertone-inc/supertonic · fork with MLX integration: typomonster/supertonic
mlx-audio (with Supertonic support): typomonster/mlx-audio · upstream: Blaizzy/mlx-audio

License

OpenRAIL-M (inherited from the upstream model). See LICENSE for the full terms — redistribution must carry the use-based restrictions (Attachment A) forward.