Supertonic 3: MLX-native

31-language text-to-speech, ~x100 realtime on Apple Silicon. Native MLX port of Supertone/supertonic-3, runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder) without ONNX, CoreML or any C++ runtime: only MLX + NumPy.

Install

The package isn't on PyPI yet; install it directly from the source repository (or from a local checkout):

pip install git+https://github.com/ambassadia/supertonic-3-mlx.git

Runtime dependencies are just mlx, numpy, and huggingface_hub (the last for the one-line weight download). On first use the ~400 MB weight bundle is downloaded from ambassadia/supertonic-3-mlx into your Hugging Face cache.

One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo, run the script, and it will create a fresh venv, install everything, version-check MLX (with an optional auto-upgrade), download the weights and synthesise an utterance into hello.wav:

git clone https://github.com/ambassadia/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                              # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."             # custom lang / voice / text

Re-runs reuse the venv and the cached weights: the second invocation takes ~20 ms for the warm load plus ~30 ms per generate call.

Quickstart (after install)

import soundfile as sf  # only needed to write the WAV; install separately (pip install soundfile)

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
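soundfile is not among the runtime dependencies listed in the Install section. If you would rather not install it, the standard-library wave module can write the result as 16-bit PCM. The sketch below is not part of the package: write_wav_int16 is a hypothetical helper, it assumes the samples lie in [-1, 1], and a synthetic tone stands in for the array returned by pipe.generate.

```python
import os
import tempfile
import wave

import numpy as np


def write_wav_int16(path, samples, sample_rate):
    """Write a mono float waveform as 16-bit PCM (assumes values in [-1, 1])."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    pcm = (clipped * 32767.0).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 2 bytes per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Stand-in for the model output: 0.5 s of a 440 Hz tone at 44.1 kHz.
sample_rate = 44100
t = np.arange(sample_rate // 2) / sample_rate
tone = (0.2 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav_int16(out_path, tone, sample_rate)
```

With the real pipeline you would pass wav and pipe.sample_rate instead of the tone.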

Audio samples

Six languages, a mix of male and female voices, and a mix of short and long utterances, all generated by the MLX pipeline at the wall times reported below.

  EN ยท F1 ยท 2.79 s โ€” "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

  EN ยท M1 ยท 3.90 s โ€” "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

  FR ยท F2 ยท 3.41 s โ€” "Bonjour, ceci est un test de synthรจse vocale en franรงais. Le modรจle gรจre trente-et-une langues sur une puce M4."

  DE ยท M2 ยท 3.69 s โ€” "Guten Morgen. Dieses Modell lรคuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

  JA ยท F3 ยท 1.46 s โ€” "ใ“ใ‚“ใซใกใฏใ€‚ใ“ใ‚Œใฏใ‚ขใƒƒใƒ—ใƒซใ‚ทใƒชใ‚ณใƒณไธŠใงMLXใ‚’ไฝฟใฃใŸใƒ†ใ‚นใƒˆใงใ™ใ€‚"

  ES ยท M3 ยท 2.86 s โ€” "Hola, esto es una prueba de sรญntesis de voz en espaรฑol ejecutada en tiempo real sobre Apple Silicon."

Benchmarks (Apple M4, FP32, median of 3)

| Sample | Duration | MLX wall | RTF | ONNX SDK | Speedup |
|---|---|---|---|---|---|
| EN · F1 · short | 2.79 s | 36.6 ms | x76 | 1005 ms | 28× |
| EN · M1 · long | 3.90 s | 38.4 ms | x102 | 1356 ms | 35× |
| FR · F2 | 3.41 s | 37.9 ms | x90 | 1196 ms | 32× |
| DE · M2 | 3.69 s | 38.1 ms | x97 | 1314 ms | 35× |
| JA · F3 | 1.46 s | 32.1 ms | x46 | 848 ms | 26× |
| ES · M3 | 2.86 s | 37.0 ms | x77 | 1002 ms | 27× |

Raw numbers are in bench_results.csv (regenerable via a private development monorepo; this repository ships the consolidated release artefacts only).
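For readers who want to check the table, the realtime factor (RTF) column is the audio duration divided by the MLX wall time, and the speedup column is the ONNX SDK wall time divided by the MLX wall time. The sketch below recomputes both from the rounded values printed above; realtime_factor is a name chosen here for illustration, and recomputed figures can differ from the table by one unit, presumably because the table was built from unrounded measurements.

```python
def realtime_factor(audio_seconds, wall_ms):
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / (wall_ms / 1000.0)


# (sample, audio duration in s, MLX wall in ms, ONNX SDK wall in ms)
rows = [
    ("EN F1 short", 2.79, 36.6, 1005.0),
    ("EN M1 long", 3.90, 38.4, 1356.0),
    ("FR F2", 3.41, 37.9, 1196.0),
    ("DE M2", 3.69, 38.1, 1314.0),
    ("JA F3", 1.46, 32.1, 848.0),
    ("ES M3", 2.86, 37.0, 1002.0),
]

for name, duration_s, mlx_ms, onnx_ms in rows:
    rtf = realtime_factor(duration_s, mlx_ms)
    speedup = onnx_ms / mlx_ms
    print(f"{name}: x{rtf:.0f} realtime, {speedup:.1f}x vs ONNX SDK")
```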

Multi-machine comparison

Same French sentence ("Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"), 4 s of audio, median of 5 warm runs, MLX FP32:

| Hardware | Wall | RTF | ms / s audio | Notes |
|---|---|---|---|---|
| Mac Studio M3 Ultra (80 GPU cores, 96 GB) | 45.8 ms | x88 | 11.3 | best on this test |
| MacBook Air M4 (10 GPU cores, 16 GB) | 86.7 ms | x47 | 21.1 | reference consumer device |
| MacBook Air M4, CoreML (mlpackage, CPU + NE) | 303.5 ms | x27 | 37.7 | upstream CoreML build |
| MacBook Air M4, ONNX SDK (pip install supertonic) | ~1200 ms | ~x3 | ~350 | upstream reference Python SDK |

The MLX path is ~1.78× faster than the CoreML build on the same M4 hardware (MLX 21 ms per second of audio vs CoreML 38 ms per second of audio), and ~35–40× faster than the ONNX SDK reference. Memory footprint on the M3 Ultra is 750 MB active / 844 MB peak GPU memory; the M4 footprint is similar since the model size is fixed. Wall time on short utterances is dispatch-bound (24 attention + ConvNeXt blocks × 5 Euler steps plus the 10-block vocoder all run in ~45 ms on the Ultra): the M3 Ultra has 8× as many GPU cores as the M4 but is only ~2× faster in wall time because the workload doesn't fill them.
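The two ratios quoted in this paragraph follow directly from the table. The short calculation below reproduces them and adds one derived number that is not in the original text: a scaling efficiency, defined here as the wall-time speedup divided by the GPU-core ratio, which is just one way to quantify how little of the Ultra this workload fills.

```python
# Per-second-of-audio cost on the M4, from the table (ms per s of audio).
mlx_m4_ms_per_s = 21.1
coreml_m4_ms_per_s = 37.7
coreml_vs_mlx = coreml_m4_ms_per_s / mlx_m4_ms_per_s

# Wall times (ms) and GPU core counts for the two MLX rows.
wall_ms = {"M3 Ultra": 45.8, "M4": 86.7}
gpu_cores = {"M3 Ultra": 80, "M4": 10}

core_ratio = gpu_cores["M3 Ultra"] / gpu_cores["M4"]
wall_speedup = wall_ms["M4"] / wall_ms["M3 Ultra"]

# Illustrative metric (an assumption of this sketch, not from the benchmark):
# fraction of the extra cores that shows up as wall-time speedup.
scaling_efficiency = wall_speedup / core_ratio

print(f"CoreML / MLX on M4: {coreml_vs_mlx:.2f}x")
print(f"Cores: {core_ratio:.0f}x, wall speedup: {wall_speedup:.2f}x, "
      f"efficiency: {scaling_efficiency:.0%}")
```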

Cold load: 15 ms from the local safetensors snapshot, ~ 17 s on first from_pretrained from the Hub (downloads 379 MB of weights via hf_transfer).

Reference comparison: the CoreML build of the same model on the same hardware runs at x27 realtime. The MLX port is **2–4× faster** end-to-end while remaining bit-identical to the ONNX Runtime reference on the vocoder (cosine similarity 1.00) and reaching cosine similarity ≥ 0.98 on the full estimator output.

Voices

10 preset voices: five female (F1–F5) and five male (M1–M5). The voice_styles/ directory contains both style_ttl (50×256 latent style for the audio path) and style_dp (8×16 style for the duration head) for each voice. Pass the voice name as the voice= kwarg to Pipeline.generate.

Languages

31 languages supported. Pass the ISO 639-1 code as the lang= kwarg: en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu el da sv fi no he uk bg hr sk.
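If you accept voice and language choices from user input, it can help to validate them before calling Pipeline.generate. The helper below is a hypothetical sketch, not part of the package: check_request, VOICES and LANGS are names invented here, built only from the voice names and language codes listed above. How the library itself reacts to an unknown value is not documented in this README.

```python
# Language codes exactly as listed above.
LANGS = (
    "en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro "
    "hu el da sv fi no he uk bg hr sk"
).split()

# F1..F5 and M1..M5.
VOICES = [f"{gender}{index}" for gender in "FM" for index in range(1, 6)]


def check_request(voice, lang):
    """Raise ValueError early if the voice or language is not a listed preset."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGS}")
    return voice, lang


check_request("F2", "fr")  # passes silently
```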

Architecture (short)

Four sub-models, all in weights/*.safetensors:

| Sub-model | Role | Params | Size |
|---|---|---|---|
| vector_estimator | 24-block CFG flow-matching velocity | ~64 M | 256 MB |
| text_encoder | Character → 256-D text embedding | ~9 M | 36 MB |
| duration_predictor | Text → seconds | ~1 M | 3.5 MB |
| vocoder | Latent (B,144,T) → 44.1 kHz wav | ~25 M | 101 MB |
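The sizes in the table are what you would expect from 4-byte (FP32) weights: parameters in millions times 4 gives decimal megabytes. The check below assumes the files are stored as FP32, which matches the FP32 benchmarks but is not stated explicitly for the weight files; it uses the rounded parameter counts, so small sub-models such as duration_predictor will not match exactly.

```python
BYTES_PER_FP32 = 4

# Approximate parameter counts in millions, from the table above.
approx_params_millions = {
    "vector_estimator": 64,
    "text_encoder": 9,
    "duration_predictor": 1,
    "vocoder": 25,
}

# Millions of parameters x 4 bytes each = decimal megabytes.
estimated_mb = {
    name: millions * BYTES_PER_FP32
    for name, millions in approx_params_millions.items()
}
total_mb = sum(estimated_mb.values())

for name, mb in estimated_mb.items():
    print(f"{name}: ~{mb} MB")
print(f"total: ~{total_mb} MB")
```

The total lands close to the ~400 MB bundle size quoted in the Install section.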

The pipeline runs exactly 5 Euler steps with classifier-free guidance (4×cond − 3×uncond). This schedule is trained-in: reducing the step count or disabling CFG produces an essentially uncorrelated waveform (verified empirically; see the bench_n_steps.py script in the source repo).
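To make the sampling loop concrete, here is a minimal NumPy sketch of Euler integration with the guidance combination 4×cond − 3×uncond. It is not the port's actual code: the uniform time grid from 0 to 1, the velocity_fn signature, and the toy velocity field are all assumptions made for illustration, and the real VectorEstimator takes text and style inputs that are omitted here.

```python
import numpy as np


def cfg_velocity(v_cond, v_uncond, scale=4.0):
    """Classifier-free guidance: scale*cond - (scale-1)*uncond.

    With scale=4 this is the 4*cond - 3*uncond combination."""
    return scale * v_cond - (scale - 1.0) * v_uncond


def euler_sample(velocity_fn, x0, n_steps=5):
    """Integrate dx/dt = guided velocity with fixed-size Euler steps.

    Assumes a uniform grid on t in [0, 1); the real schedule may differ."""
    x = x0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v_cond = velocity_fn(x, t, conditioned=True)
        v_uncond = velocity_fn(x, t, conditioned=False)
        x = x + dt * cfg_velocity(v_cond, v_uncond)
    return x


# Toy stand-in for the estimator: the conditional branch pulls x toward a
# fixed target, the unconditional branch predicts zero velocity.
target = np.ones((144, 8), dtype=np.float32)


def toy_velocity(x, t, conditioned):
    return (target - x) if conditioned else np.zeros_like(x)


sample = euler_sample(toy_velocity, np.zeros_like(target))
```

Each step evaluates the estimator twice (conditional and unconditional), so 5 steps cost 10 estimator passes unless the two branches are batched together.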

Loading from a local snapshot

Three layouts are auto-detected by Pipeline.from_pretrained:

  1. Hugging Face repo id (e.g. "ambassadia/supertonic-3-mlx"): auto-download
  2. Local path containing weights/ (this layout): fastest cold-load
  3. Local path containing onnx/ (upstream snapshot): converts at load time
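One plausible way to tell the three layouts apart is to check for the weights/ and onnx/ subdirectories and otherwise treat the argument as a Hub repo id. The function below is a hypothetical sketch of that idea, not the logic inside Pipeline.from_pretrained; the name detect_layout, the returned labels, and the precedence of weights/ over onnx/ are assumptions.

```python
import tempfile
from pathlib import Path


def detect_layout(source):
    """Classify a from_pretrained-style argument (illustrative only)."""
    path = Path(source)
    if (path / "weights").is_dir():
        return "local-mlx-weights"     # layout 2: load safetensors directly
    if (path / "onnx").is_dir():
        return "local-onnx-snapshot"   # layout 3: convert at load time
    return "hub-repo-id"               # layout 1: download from the Hub


# Demonstrate with throwaway directories.
with tempfile.TemporaryDirectory() as root:
    mlx_dir = Path(root) / "mlx_snapshot"
    (mlx_dir / "weights").mkdir(parents=True)
    onnx_dir = Path(root) / "upstream_snapshot"
    (onnx_dir / "onnx").mkdir(parents=True)

    kind_mlx = detect_layout(mlx_dir)
    kind_onnx = detect_layout(onnx_dir)

kind_hub = detect_layout("ambassadia/supertonic-3-mlx")
```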

License

This release combines two artefact classes under two distinct licenses:

  • Model weights (weights/*.safetensors): BigScience Open RAIL-M. See LICENSE for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
  • Port code (src/supertonic_3_mlx/): Apache License 2.0. See LICENSE-CODE.

See NOTICE for the modifications statement and the upstream attribution.

OpenRAIL-M Attachment A: use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.

(c) To generate or disseminate verifiably false information and/or content with the purpose of harming others.

(d) To generate or disseminate personal identifiable information that can be used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent.

(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.

(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(l) To provide medical advice and medical results interpretation.

(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.

Citation

@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}

Please also cite the upstream Supertone Supertonic 3 model when using this port.
