- Libraries
- MLX
How to use ambassadia/supertonic-3-mlx with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"  # quoted so zsh does not glob the brackets
huggingface-cli download --local-dir supertonic-3-mlx ambassadia/supertonic-3-mlx
- Supertonic
How to use ambassadia/supertonic-3-mlx with Supertonic:
from supertonic import TTS
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
wav, duration = tts.synthesize(text, voice_style=style)
tts.save_audio(wav, "output.wav")
Supertonic 3: MLX-native
31-language text-to-speech, up to ~100× realtime on Apple Silicon. This is a native MLX port of Supertone/supertonic-3. It runs the full flow-matching + classifier-free-guidance pipeline (DurationPredictor → TextEncoder → 24-block VectorEstimator (5 Euler steps) → 10-block Vocos vocoder) without ONNX, CoreML or any C++ runtime, using only MLX + NumPy.
Install
The package isn't on PyPI yet; install it directly from the GitHub source repository (or from a local checkout):
pip install git+https://github.com/ambassadia/supertonic-3-mlx.git
Runtime dependencies are just mlx, numpy, and huggingface_hub (the
last for the one-line weight download). On first use the ~ 400 MB weight
bundle is downloaded from
ambassadia/supertonic-3-mlx
into your Hugging Face cache.
One-shot quickstart + sanity test
A zero-config end-to-end test script ships with the repo. Clone the repo,
run the script, and it will create a fresh venv, install everything,
version-check MLX (with an optional auto-upgrade), download the weights
and synthesise an utterance into hello.wav:
git clone https://github.com/ambassadia/supertonic-3-mlx.git
cd supertonic-3-mlx
./setup_and_test.sh # en F1, default text
./setup_and_test.sh fr F2 "Bonjour." # custom lang / voice / text
Re-runs reuse the venv and the cached weights; a second invocation takes ~20 ms for the warm load plus ~30 ms per generate call.
Quickstart (after install)
import soundfile as sf  # not a runtime dependency of this package: pip install soundfile
from supertonic_3_mlx import Pipeline

pipe = Pipeline.from_pretrained("ambassadia/supertonic-3-mlx")
wav = pipe.generate("Hello world from Apple Silicon.", voice="F1", lang="en")
# wav is a 1-D numpy.float32 array at 44.1 kHz
sf.write("hello.wav", wav, pipe.sample_rate)
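Since soundfile is not among the runtime dependencies listed above, the same array can also be written with Python's standard-library wave module. The sketch below is not part of supertonic_3_mlx: `write_wav_int16` is a hypothetical helper that assumes mono float samples in [-1, 1] and stores them as 16-bit PCM.

```python
import struct
import wave


def write_wav_int16(path, samples, sample_rate=44100):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM (hypothetical helper)."""
    # Clip each sample to the valid range, then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)                  # number of frames written
```

With the quickstart objects in scope this would be called as `write_wav_int16("hello.wav", wav, pipe.sample_rate)`. The conversion to 16-bit integers loses precision relative to the float32 output, which is usually acceptable for listening tests.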
Audio samples
Six languages, mix of male / female voices, mix of short and long utterances โ all generated by the MLX pipeline at the wall times reported below.
EN ยท F1 ยท 2.79 s โ "Hello world from Apple Silicon. Supertonic 3 runs at one hundred times real time."
EN ยท M1 ยท 3.90 s โ "A gentle breeze moved through the open window while the children, still half-asleep, listened to the distant sound of the harbour bells."
FR ยท F2 ยท 3.41 s โ "Bonjour, ceci est un test de synthรจse vocale en franรงais. Le modรจle gรจre trente-et-une langues sur une puce M4."
DE ยท M2 ยท 3.69 s โ "Guten Morgen. Dieses Modell lรคuft komplett auf Apple Silicon, ohne ONNX und ohne CoreML, in reinem MLX."
JA ยท F3 ยท 1.46 s โ "ใใใซใกใฏใใใใฏใขใใใซใทใชใณใณไธใงMLXใไฝฟใฃใใในใใงใใ"
ES ยท M3 ยท 2.86 s โ "Hola, esto es una prueba de sรญntesis de voz en espaรฑol ejecutada en tiempo real sobre Apple Silicon."
Benchmarks (Apple M4, FP32, median of 3)
| Sample | Audio duration | MLX wall time | RTF | ONNX SDK wall time | Speedup |
|---|---|---|---|---|---|
| EN · F1 · short | 2.79 s | 36.6 ms | x76 | 1005 ms | 28× |
| EN · M1 · long | 3.90 s | 38.4 ms | x102 | 1356 ms | 35× |
| FR · F2 | 3.41 s | 37.9 ms | x90 | 1196 ms | 32× |
| DE · M2 | 3.69 s | 38.1 ms | x97 | 1314 ms | 35× |
| JA · F3 | 1.46 s | 32.1 ms | x46 | 848 ms | 26× |
| ES · M3 | 2.86 s | 37.0 ms | x77 | 1002 ms | 27× |
Raw numbers are in bench_results.csv (regenerable via
a private development monorepo; this repository ships the consolidated release artefacts only).
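The RTF and Speedup columns can be reproduced from the other columns. The sketch below assumes RTF means audio duration divided by wall time and Speedup means ONNX SDK wall time divided by MLX wall time; both definitions are inferred from the table values, not stated by the benchmark script, and the function names are illustrative.

```python
def realtime_factor(audio_s, wall_ms):
    """Seconds of audio produced per second of wall time."""
    return audio_s / (wall_ms / 1000.0)


def speedup(baseline_ms, wall_ms):
    """How many times shorter wall_ms is than baseline_ms."""
    return baseline_ms / wall_ms


# EN · M1 · long row: 3.90 s of audio, 38.4 ms with MLX, 1356 ms with the ONNX SDK
rtf = realtime_factor(3.90, 38.4)
gain = speedup(1356, 38.4)
print(f"RTF x{rtf:.0f}, speedup {gain:.0f}x")  # prints "RTF x102, speedup 35x"
```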
Multi-machine comparison
Same French sentence
("Un jour, Isaac Newton se promène dans son jardin quand une pomme lui tombe sur la tête. Eurêka, j'ai trouvé la loi de la gravitation !"),
4 s of audio, median of 5 warm runs, MLX FP32:
| Hardware | Wall | RTF | ms / s audio | Notes |
|---|---|---|---|---|
| Mac Studio M3 Ultra (80 GPU cores, 96 GB) | 45.8 ms | x88 | 11.3 | best on this test |
| MacBook Air M4 (10 GPU cores, 16 GB) | 86.7 ms | x47 | 21.1 | reference consumer device |
| MacBook Air M4, CoreML (mlpackage, CPU + NE) | 303.5 ms | x27 | 37.7 | upstream CoreML build |
| MacBook Air M4, ONNX SDK (pip install supertonic) | ~1200 ms | ~x3 | ~350 | upstream reference Python SDK |
The MLX path is ~1.78× faster than the CoreML build on the same M4 hardware (MLX 21 ms per second of audio vs CoreML 38 ms per second of audio), and ~35–40× faster than the ONNX SDK reference. Memory footprint on the M3 Ultra is 750 MB active / 844 MB peak GPU memory; the M4 footprint is similar since the model size is fixed. Wall time on short utterances is dispatch-bound: the 24 attention + ConvNeXt blocks × 5 Euler steps plus the 10-block vocoder all run in ~45 ms on the Ultra. The M3 Ultra has 8× as many GPU cores as the M4 but is only ~2× faster in wall time, because the workload doesn't fill the cores.
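The two ratios discussed above follow directly from the table. The sketch below only re-derives them from the published figures; the "scaling efficiency" label is an illustrative name for wall-time gain divided by core-count ratio, not a metric reported by the project.

```python
# Values copied from the multi-machine table.
mlx_m4_ms_per_s = 21.1       # MacBook Air M4, MLX
coreml_m4_ms_per_s = 37.7    # MacBook Air M4, CoreML
m4_wall_ms, m4_gpu_cores = 86.7, 10
ultra_wall_ms, ultra_gpu_cores = 45.8, 80

# Cost per second of audio: lower is better, so the ratio is CoreML / MLX.
mlx_vs_coreml = coreml_m4_ms_per_s / mlx_m4_ms_per_s

# How much of the extra GPU hardware turns into lower wall time.
wall_gain = m4_wall_ms / ultra_wall_ms
core_ratio = ultra_gpu_cores / m4_gpu_cores
scaling_efficiency = wall_gain / core_ratio

print(f"MLX vs CoreML on M4: {mlx_vs_coreml:.1f}x")
print(f"M3 Ultra vs M4: {wall_gain:.1f}x wall gain from {core_ratio:.0f}x the cores")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

An efficiency well below 100% is what a dispatch-bound workload looks like: adding cores helps little when each kernel is too small to occupy them.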
Cold load: 15 ms from the local safetensors snapshot, ~ 17 s on first
from_pretrained from the Hub (downloads 379 MB of weights via
hf_transfer).
Reference comparison: the CoreML build of the same model on the same hardware
runs at x27 realtime. The MLX port is **2-4× faster** end-to-end while
remaining bit-identical to the ONNX Runtime reference on the vocoder
(cosine 1.00) and at cosine ≥ 0.98 on the full estimator output.
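For readers who want to run a similar parity check on their own outputs, here is a minimal cosine-similarity sketch in plain Python. It is not the project's validation script; the function and the sample values are illustrative, and on real audio arrays the same formula would normally be computed with NumPy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero input")
    return dot / (norm_a * norm_b)


reference = [0.0, 0.5, -0.25, 0.125]   # made-up samples standing in for the ONNX output
candidate = [0.0, 0.5, -0.25, 0.125]   # made-up samples standing in for the MLX output
similarity = cosine_similarity(reference, candidate)
print(f"cosine similarity: {similarity:.4f}")
```

Cosine similarity ignores overall gain, so it complements an exact element-wise comparison rather than replacing it.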
Voices
10 preset voices: five female (F1–F5) and five male (M1–M5). The
voice_styles/ directory contains both style_ttl (50×256 latent style for
the audio path) and style_dp (8×16 style for the duration head) for each
voice. Pass the voice name as the voice= kwarg to Pipeline.generate.
Languages
31 languages supported. Pass the ISO 639-1 code as the lang= kwarg:
en fr de es it pt ja ko zh ru pl nl tr ar hi
vi th id cs ro hu el da sv fi no he uk bg hr sk.
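A caller may want to reject an unknown voice or language code before invoking Pipeline.generate. The sketch below is an assumption-laden convenience, not part of the package: `check_request` is a hypothetical helper, and whether the library performs its own validation is not stated here. The voice names and language codes are the ones listed above.

```python
VOICES = {f"{gender}{i}" for gender in "FM" for i in range(1, 6)}  # F1..F5, M1..M5
LANGS = set(
    "en fr de es it pt ja ko zh ru pl nl tr ar hi "
    "vi th id cs ro hu el da sv fi no he uk bg hr sk".split()
)


def check_request(voice, lang):
    """Raise ValueError early for an unknown voice or language (hypothetical helper)."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if lang not in LANGS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGS)}")
    return voice, lang
```

Usage would look like `voice, lang = check_request("F2", "fr")` followed by `pipe.generate(text, voice=voice, lang=lang)`.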
Architecture (short)
Four sub-models, all in weights/*.safetensors:
| Sub-model | Role | Params | Size |
|---|---|---|---|
| vector_estimator | 24-block CFG flow-matching velocity | ~64 M | 256 MB |
| text_encoder | Character → 256-D text embedding | ~9 M | 36 MB |
| duration_predictor | Text → seconds | ~1 M | 3.5 MB |
| vocoder | Latent (B,144,T) → 44.1 kHz wav | ~25 M | 101 MB |
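The Size column is consistent with weights stored as FP32 at 4 bytes per parameter. That storage format is an assumption here (the benchmarks mention FP32, the table does not), and the parameter counts are rounded, so small differences are expected. A quick arithmetic check:

```python
BYTES_PER_FP32 = 4

# (approximate parameters in millions, listed size in MB), copied from the table
sub_models = {
    "vector_estimator": (64, 256),
    "text_encoder": (9, 36),
    "duration_predictor": (1, 3.5),
    "vocoder": (25, 101),
}

estimates = {}
for name, (params_m, listed_mb) in sub_models.items():
    # 1 M parameters * 4 bytes = 4 MB
    estimates[name] = params_m * BYTES_PER_FP32
    print(f"{name:20s} estimated {estimates[name]:5.1f} MB, listed {listed_mb} MB")

total_listed_mb = sum(listed for _, listed in sub_models.values())
print(f"total listed: {total_listed_mb} MB")  # prints "total listed: 396.5 MB"
```

The listed total of 396.5 MB matches the ~400 MB download mentioned in the install section.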
The pipeline runs exactly 5 Euler steps with classifier-free guidance
(4×cond − 3×uncond). This schedule is trained-in: reducing the step count
or disabling CFG produces an essentially uncorrelated waveform (verified
empirically; see the bench_n_steps.py script in the source repo).
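To make the sampling loop concrete, here is a toy sketch of fixed-step Euler integration with the 4×cond − 3×uncond guidance combination. It is not the package's implementation: the uniform time grid on [0, 1], the function names, and the constant `toy_velocity` standing in for the VectorEstimator are all assumptions made for illustration.

```python
def cfg_velocity(v_cond, v_uncond, scale=4.0):
    """Classifier-free guidance: scale*cond - (scale - 1)*uncond, i.e. 4*cond - 3*uncond."""
    return [scale * c - (scale - 1.0) * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, n_steps=5):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v_cond, v_uncond = velocity_fn(x, t)    # one conditional and one unconditional pass
        v = cfg_velocity(v_cond, v_uncond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Hypothetical stand-in for the velocity network: constant velocities."""
    v_cond = [1.0 for _ in x]
    v_uncond = [0.5 for _ in x]
    return v_cond, v_uncond


latent = euler_sample([0.0, 0.0], toy_velocity)
print(latent)
```

With this toy field the guided velocity is 4×1.0 − 3×0.5 = 2.5, so five steps of size 0.2 move each coordinate by 2.5 in total. In the real pipeline each step costs two network evaluations, which is why the step count dominates the wall time.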
Loading from a local snapshot
Pipeline.from_pretrained auto-detects three kinds of input:
- Hugging Face repo id (e.g. "ambassadia/supertonic-3-mlx") → auto-download
- Local path containing weights/ (this layout) → fastest cold-load
- Local path containing onnx/ (upstream snapshot) → converts at load time
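The detection described above could be sketched as follows. This is a guess at the decision order, not the package's code: `detect_source` and its return labels are hypothetical, and the real loader may check additional files.

```python
from pathlib import Path


def detect_source(name_or_path):
    """Classify an input along the three cases listed above (hypothetical helper)."""
    path = Path(name_or_path)
    if path.is_dir():
        if (path / "weights").is_dir():
            return "local-mlx"       # safetensors layout, fastest cold load
        if (path / "onnx").is_dir():
            return "local-onnx"      # upstream snapshot, converted at load time
        raise ValueError(f"{path} contains neither weights/ nor onnx/")
    return "hub-repo-id"             # anything else is treated as a Hub repo id
```

Checking for an existing directory first means a local checkout always wins over a Hub lookup, which keeps offline use predictable.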
License
This release combines two artefact classes under two distinct licenses:
- Model weights (weights/*.safetensors): BigScience Open RAIL-M. See LICENSE for the full text. The Attachment A use restrictions are reproduced below and apply to all downstream use of the model and of generated audio.
- Port code (src/supertonic_3_mlx/): Apache License 2.0. See LICENSE-CODE.
See NOTICE for the modifications statement and the upstream
attribution.
OpenRAIL-M Attachment A: use restrictions
You agree not to use the model or derivatives:
(a) In any way that violates any applicable national, federal, state, local or international law or regulation.
(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way.
(c) To generate or disseminate verifiably false information and/or content with the purpose of harming others.
(d) To generate or disseminate personal identifiable information that can be used to harm an individual.
(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated.
(f) To defame, disparage or otherwise harass others.
(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent.
(h) For fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.
(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics.
(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm.
(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
(l) To provide medical advice and medical results interpretation.
(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment.
Citation
@misc{supertonic3-mlx,
title = {Supertonic 3 MLX: native Apple Silicon port of Supertone's multilingual TTS},
author = {Dupont, Olivier},
year = {2026},
url = {https://huggingface.co/ambassadia/supertonic-3-mlx},
note = {Derivative of Supertone/supertonic-3 (https://huggingface.co/Supertone/supertonic-3)}
}
Please also cite the upstream Supertone Supertonic 3 model when using this port.