Supertonic 3 — MLX-native

31-language text-to-speech, up to ~x100 realtime on Apple Silicon. This is a native MLX port of Supertone/supertonic-3 that runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator with 5 Euler steps → 10-block Vocos vocoder) without ONNX, CoreML or any C++ runtime: only MLX and NumPy.

Install

The package isn't on PyPI yet; install it directly from the source repository on GitHub (or from a local checkout):

pip install git+https://github.com/ambassadia/supertonic-3-mlx.git

Runtime dependencies are just mlx, numpy, and huggingface_hub (the last for the one-line weight download). On first use the ~ 400 MB weight bundle is downloaded from ambassadia/supertonic-3-mlx into your Hugging Face cache.

One-shot quickstart + sanity test

A zero-config end-to-end test script ships with the repo. Clone the repo, run the script, and it will create a fresh venv, install everything, version-check MLX (with an optional auto-upgrade), download the weights and synthesise an utterance into hello.wav:

git clone https://github.com/ambassadia/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh                              # en F1, default text
./setup_and_test.sh fr F2 "Bonjour."             # custom lang / voice / text

Re-runs reuse the venv and the cached weights, so a second invocation costs ~ 20 ms for the warm load plus ~ 30 ms per generate call.

Quickstart (after install)

from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav  = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")

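# wav is a 1-D numpy.float32 array at 44.1 kHz
# soundfile is not one of the package's runtime dependencies; install it separately (pip install soundfile)
import soundfile as sf
sf.write("hello.wav", wav, pipe.sample_rate)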
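If soundfile is not available (it is not among the runtime dependencies listed under Install), the standard-library wave module can write the same array. The following is a minimal sketch and not part of the package: write_wav_pcm16 is a hypothetical helper, it assumes the samples lie in [-1, 1], and a synthetic tone stands in for the output of pipe.generate.

```python
import tempfile
import wave
from pathlib import Path

import numpy as np


def write_wav_pcm16(path, samples, sample_rate=44100):
    """Write a mono float array (assumed to lie in [-1, 1]) as 16-bit PCM."""
    samples = np.asarray(samples, dtype=np.float32)
    # Clip first so out-of-range samples saturate instead of wrapping around.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Stand-in for `wav = pipe.generate(...)`: one second of a 440 Hz tone.
sample_rate = 44100
t = np.arange(sample_rate, dtype=np.float32) / sample_rate
wav = (0.2 * np.sin(2.0 * np.pi * 440.0 * t)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "hello.wav"
    write_wav_pcm16(out_path, wav, sample_rate)
    # Read the header back to confirm what was written.
    with wave.open(str(out_path), "rb") as f:
        n_frames, rate = f.getnframes(), f.getframerate()
```

With the real pipeline, the call would presumably be write_wav_pcm16("hello.wav", wav, pipe.sample_rate). The output is 16-bit, so it loses a little precision compared with a float WAV.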

Audio samples

Six languages, mix of male / female voices, mix of short and long utterances — all generated by the MLX pipeline at the wall times reported below.

EN · F1 · 2.79 s — "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."

EN · M1 · 3.90 s — "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."

FR · F2 · 3.41 s — "Bonjour, ceci est un test de synthèse vocale en français. Le modèle gère trente-et-une langues sur une puce M4."

DE · M2 · 3.69 s — "Guten Morgen. Dieses Modell läuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."

JA · F3 · 1.46 s — "こんにちは。これはアップルシリコン上でMLXを使ったテストです。"

ES · M3 · 2.86 s — "Hola, esto es una prueba de síntesis de voz en español ejecutada en tiempo real sobre Apple Silicon."

Benchmarks (Apple M4, FP32, median of 3)

Sample	Duration	MLX wall	RTF	ONNX SDK	Speedup
EN · F1 · short	2.79 s	36.6 ms	x76	1005 ms	28 ×
EN · M1 · long	3.90 s	38.4 ms	x102	1356 ms	35 ×
FR · F2	3.41 s	37.9 ms	x90	1196 ms	32 ×
DE · M2	3.69 s	38.1 ms	x97	1314 ms	35 ×
JA · F3	1.46 s	32.1 ms	x46	848 ms	26 ×
ES · M3	2.86 s	37.0 ms	x77	1002 ms	27 ×

Raw numbers are in bench_results.csv (regenerable via a private development monorepo; this repository ships the consolidated release artefacts only).
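The derived columns can be recomputed from the measured ones. The sketch below assumes that RTF means audio duration divided by MLX wall time and that Speedup means ONNX SDK wall time divided by MLX wall time; with those definitions the table's values are reproduced to within a unit of rounding. The function names are illustrative only.

```python
# (audio duration in s, MLX wall in ms, ONNX SDK wall in ms), copied from the table above
ROWS = {
    "EN F1 short": (2.79, 36.6, 1005),
    "EN M1 long": (3.90, 38.4, 1356),
    "FR F2": (3.41, 37.9, 1196),
    "DE M2": (3.69, 38.1, 1314),
    "JA F3": (1.46, 32.1, 848),
    "ES M3": (2.86, 37.0, 1002),
}


def realtime_factor(audio_s, wall_ms):
    """Seconds of audio produced per second of wall time (assumed RTF definition)."""
    return audio_s / (wall_ms / 1000.0)


def speedup(baseline_ms, wall_ms):
    """How many times faster `wall_ms` is than `baseline_ms`."""
    return baseline_ms / wall_ms


for name, (audio_s, mlx_ms, onnx_ms) in ROWS.items():
    rtf = realtime_factor(audio_s, mlx_ms)
    gain = speedup(onnx_ms, mlx_ms)
    print(f"{name:<12} x{rtf:6.1f} realtime   {gain:5.1f} x vs ONNX SDK")
```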

Multi-machine comparison

Same French sentence ("Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"), 4 s of audio, median of 5 warm runs, MLX FP32:

Hardware	Wall	RTF	ms / s audio	Notes
Mac Studio M3 Ultra (80 GPU cores, 96 GB)	45.8 ms	x88	11.3	best on this test
MacBook Air M4 (10 GPU cores, 16 GB)	86.7 ms	x47	21.1	reference consumer device
MacBook Air M4 — CoreML (mlpackage, CPU + NE)	303.5 ms	x27	37.7	upstream CoreML build
MacBook Air M4 — ONNX SDK (`pip install supertonic`)	~1200 ms	~x3	~350	upstream reference Python SDK

On the same M4 hardware the MLX path is ~ 1.8× faster than the CoreML build (21.1 vs 37.7 ms per second of audio) and, going by the wall times in the table, ~ 14× faster than the ONNX SDK reference (86.7 ms vs ~ 1200 ms). Memory footprint on the M3 Ultra is 750 MB active / 844 MB peak GPU memory; the M4 footprint is similar since the model size is fixed. Wall time on short utterances is dispatch-bound: 24 attention + ConvNeXt blocks × 5 Euler steps plus the 10-block vocoder all run in ~ 45 ms on the Ultra, so its 8× larger GPU core count (80 vs 10) buys only a ~ 1.9× lower wall time, because the workload doesn't fill the cores.
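A "median of N warm runs" measurement like the one used for this table can be sketched as follows. This is an assumed protocol, not the author's benchmark script: median_wall_ms is a hypothetical helper and fake_generate is a stand-in for a call such as pipe.generate. Because MLX evaluates lazily, the timed callable should return materialised output (a NumPy array, or the result of mx.eval) so that the timer covers the actual compute.

```python
import statistics
import time


def median_wall_ms(fn, warmup=1, runs=5):
    """Call `fn` `warmup` times untimed, then return the median wall time in ms over `runs` calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


calls = []


def fake_generate():
    # Placeholder workload; replace with the real synthesis call.
    calls.append(1)
    return sum(range(10_000))


wall_ms = median_wall_ms(fake_generate, warmup=1, runs=5)
audio_s = 4.0                         # utterance length used in the table
ms_per_audio_s = wall_ms / audio_s    # the "ms / s audio" column
print(f"{wall_ms:.3f} ms wall, {ms_per_audio_s:.3f} ms per second of audio")
```

The median is less sensitive than the mean to a single slow run caused by scheduling noise.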

Cold load: 15 ms from the local safetensors snapshot, ~ 17 s on first from_pretrained from the Hub (downloads 379 MB of weights via hf_transfer).

Reference comparison: the CoreML build of the same model on the same hardware runs at x27 realtime, so the MLX port (x47 on the M4) is ~ 1.8× faster end-to-end. Its output matches the ONNX Runtime reference on the vocoder (cosine similarity 1.00) and reaches cosine similarity ≥ 0.98 on the full estimator output.

Voices

10 preset voices — five female (F1–F5) and five male (M1–M5). The voice_styles/ directory contains both style_ttl (50×256 latent style for the audio path) and style_dp (8×16 style for the duration head) for each voice. Pass the voice name as the voice= kwarg to Pipeline.generate.

Languages

31 languages supported. Pass the ISO 639-1 code as the lang= kwarg: en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro hu el da sv fi no he uk bg hr sk.
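To catch typos before a synthesis call, the voice names and language codes listed above can be checked up front. The helper below is a hypothetical convenience and not part of the package, which may well do its own validation; only the voice names and language codes themselves are taken from this README.

```python
# Ten preset voices: F1..F5 and M1..M5.
VOICES = tuple(f"{gender}{index}" for gender in ("F", "M") for index in range(1, 6))

# The 31 ISO 639-1 codes listed above.
LANGS = tuple(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi vi th id cs ro "
    "hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice, lang):
    """Return (voice, lang) unchanged, or raise ValueError naming the bad argument."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    if lang not in LANGS:
        raise ValueError(f"unknown lang {lang!r}; expected one of {LANGS}")
    return voice, lang


voice, lang = check_request("F2", "fr")
# Afterwards, e.g.: pipe.generate("Bonjour.", voice=voice, lang=lang)
```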

Architecture (short)

Four sub-models, all in weights/*.safetensors:

Sub-model	Role	Params	Size
`vector_estimator`	24-block CFG flow-matching velocity	~64 M	256 MB
`text_encoder`	Character → 256-D text embedding	~9 M	36 MB
`duration_predictor`	Text → seconds	~1 M	3.5 MB
`vocoder`	Latent (B,144,T) → 44.1 kHz wav	~25 M	101 MB

The pipeline runs exactly 5 Euler steps with classifier-free guidance (4×cond − 3×uncond). This schedule is trained-in: reducing the step count or disabling CFG produces an essentially uncorrelated waveform (verified empirically — see the bench_n_steps.py script in the source repo).
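The sampling loop described above can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions, not the port's actual code: it assumes time runs from 0 to 1 in uniform steps, that the unconditional branch is obtained by passing a "null" embedding, and that the estimator has the signature velocity_fn(x, t, emb). The names euler_sample, cfg_velocity and toy_velocity are hypothetical, and the toy velocity field only exists to make the example runnable.

```python
import numpy as np


def cfg_velocity(velocity_fn, x, t, text_emb, null_emb, scale=4.0):
    """Classifier-free guidance: scale*cond - (scale-1)*uncond, i.e. 4*cond - 3*uncond."""
    v_cond = velocity_fn(x, t, text_emb)
    v_uncond = velocity_fn(x, t, null_emb)
    return scale * v_cond - (scale - 1.0) * v_uncond


def euler_sample(velocity_fn, x0, text_emb, null_emb, n_steps=5, scale=4.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        x = x + dt * cfg_velocity(velocity_fn, x, t, text_emb, null_emb, scale)
    return x


def toy_velocity(x, t, emb):
    # Toy stand-in for the 24-block estimator: a constant field set by the embedding.
    return np.broadcast_to(emb, x.shape)


x0 = np.zeros((1, 144, 8), dtype=np.float32)   # latent shaped (B, 144, T)
text_emb = np.float32(1.0)                     # "conditional" input
null_emb = np.float32(0.0)                     # "unconditional" input
latent = euler_sample(toy_velocity, x0, text_emb, null_emb)
```

With this toy field the guided velocity is 4·1 − 3·0 = 4 everywhere, so five steps of size 0.2 move every latent entry from 0 to 4. Each step costs two estimator evaluations, one conditional and one unconditional.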

Loading from a local snapshot

Pipeline.from_pretrained accepts three kinds of source and auto-detects which one it was given:

Hugging Face repo id (e.g. "ambassadia/supertonic-3-mlx") — auto-download
Local path containing weights/ (this layout) — fastest cold-load
Local path containing onnx/ (upstream snapshot) — converts at load time
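One plausible way to tell these three cases apart is to look at the filesystem first and fall back to treating the string as a Hub repo id. The sketch below only illustrates that idea: detect_layout and its return labels are hypothetical, and the real detection logic inside Pipeline.from_pretrained may differ.

```python
import tempfile
from pathlib import Path


def detect_layout(source):
    """Classify `source` as a local MLX snapshot, a local ONNX snapshot, or a Hub repo id."""
    path = Path(source)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "mlx-weights"      # safetensors layout, fastest cold load
        if (path / "onnx").is_dir():
            return "onnx-upstream"    # upstream snapshot, converted at load time
        raise ValueError(f"{source!r} contains neither weights/ nor onnx/")
    return "hub-repo-id"              # not a local directory: assume a Hub repo id


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights").mkdir()
    local_kind = detect_layout(tmp)

hub_kind = detect_layout("ambassadia/supertonic-3-mlx")
print(local_kind, hub_kind)
```

Checking weights/ before onnx/ means a directory that holds both would take the faster safetensors path; that ordering is a design choice of this sketch.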

License

This release combines two artefact classes under two distinct licenses:

Model weights (weights/*.safetensors) — BigScience Open RAIL-M. See LICENSE for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
Port code (src/supertonic_3_mlx/) — Apache License 2.0. See LICENSE-CODE.

See NOTICE for the modifications statement and the upstream attribution.

OpenRAIL-M Attachment A — use restrictions

You agree not to use the model or derivatives:

(a) In any way that violates any applicable national, federal, state, local or international law or regulation.

(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.

(d) To generate or disseminate personal identifiable information that can be used to harm an individual.

(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated.

(f) To defame, disparage or otherwise harass others.

(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent.

(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.

(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.

(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.

(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(l) To provide medical advice and medical results interpretation.

(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.

Citation

@misc{supertonic3-mlx,
  title  = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
  author = {Dupont, Olivier},
  year   = {2026},
  url    = {https://huggingface.co/ambassadia/supertonic-3-mlx},
  note   = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}

Please also cite the upstream Supertone Supertonic 3 model when using this port.
