🧃 supertonic-3: Multilingual Text-to-Speech CoreML


On‑device multilingual TTS model converted to Core ML for Apple platforms. This is a hand‑port of Supertone Supertonic‑3 v1.7.3 from ONNX → PyTorch → Core ML, suitable for FluidAudio's TTS pipeline on macOS/iOS. 31 languages, 44.1 kHz output, flow‑matching diffusion with classifier‑free guidance (8 denoising steps).

The conversion script is here: https://github.com/FluidInference/mobius/tree/main/models/tts/supertonic-3/coreml

And the FluidAudio integration is here: https://github.com/FluidInference/FluidAudio/tree/main/Sources/FluidAudio/TTS/Supertonic3

Highlights

  • Core ML: Runs on‑device (ANE + CPU) on Apple Silicon.
  • Multilingual: 31 languages — see Supported Languages.
  • High quality: 44.1 kHz output via flow‑matching diffusion + ConvNeXt vocoder.
  • Voice styling: zero‑shot voice style embeddings (single JSON per voice).
  • Performance: end‑to‑end RTFx ≈ 8.5× on M2 (Core ML); ≈ 17–19× on M2 with the current ANE assignment (3 of 4 modules on ANE).
  • Privacy: No network calls required once models are downloaded.

Intended Use

  • Batch TTS for full text segments on macOS/iOS.
  • Local voice synthesis for note‑taking, accessibility, and creative tools.
  • Embedded TTS in production apps via the FluidAudio Swift framework.

Supported Platforms

  • macOS 14+ (Apple Silicon recommended)
  • iOS 17+

Model Details

  • Architecture: Supertonic‑3 v1.7.3 — 4‑stage pipeline:
    1. text_encoder — token embeddings → contextual text features [B, 256, T].
    2. duration_predictor — predicts utterance duration from text features.
    3. vector_estimator — flow‑matching diffusion in latent space (8 steps, classifier‑free guidance via batch‑2 duplication, ConvNeXt + cross‑attention to text + style attention).
    4. vocoder — ConvNeXt decoder → 44.1 kHz waveform.
  • Output audio: 44.1 kHz mono, Float32 PCM.
  • Languages: 31 (see below).
  • Precision: FP16 weights and activations (mlprogram, iOS 18+ minimum deployment target).
  • Granularity: vocoder frame ≈ 11.6 ms; latent tick ≈ 69.7 ms.
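
The 4-stage flow above can be sketched end to end. This is an illustrative NumPy mock, not the real models: the four functions are dummy stand-ins for the Core ML modules, and the latent dimension (24) and guidance scale (1.3) are guesses, not values from the shipped bundles. What it does show is the data flow, the 8-step Euler loop, and classifier-free guidance via batch-2 duplication as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(tokens):                 # stand-in: tokens -> [B, 256, T]
    return rng.standard_normal((1, 256, len(tokens))).astype(np.float32)

def duration_predictor(text_feat):        # stand-in: features -> seconds
    return 2.0

def vector_estimator(x, t, text_feat):    # stand-in: batch-2 latent -> velocity
    return -0.1 * x                       # row 0 = text-conditioned, row 1 = unconditioned

def vocoder(latent):                      # stand-in: latent -> waveform
    # one latent tick ~= 69.7 ms ~= 6 vocoder frames ~= 3072 samples @ 44.1 kHz
    return rng.standard_normal(latent.shape[-1] * 3072).astype(np.float32)

feat = text_encoder([1, 2, 3])
L = int(duration_predictor(feat) / 0.0697)       # latent ticks (~69.7 ms each)

x = rng.standard_normal((2, 24, L)).astype(np.float32)  # batch-2 duplication for CFG
steps, cfg = 8, 1.3
for i in range(steps):
    v = vector_estimator(x, i / steps, feat)
    v_guided = v[1] + cfg * (v[0] - v[1])        # CFG combine: uncond + scale * (cond - uncond)
    x += v_guided / steps                        # Euler step along the flow
    x[1] = x[0]                                  # keep both batch rows in sync

audio = vocoder(x[:1])
print(audio.shape, audio.dtype)
```

The batch-2 trick lets one forward pass produce both the conditioned and unconditioned velocity estimates, which is why the vector_estimator's batch axis is fixed at 2.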

Supported Languages

English, Korean, Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, Vietnamese.

Performance (Apple M2, macOS 26.5, FP16)

| Module | Size | Predict | Compute placement |
|---|---|---|---|
| duration_predictor | 1.8 MB | 0.82 ms | CPU (tiny) |
| text_encoder | 17 MB | 2.15 ms | 62 % ANE |
| vocoder | 48 MB | 1.17 ms | 100 % ANE |
| vector_estimator (fp16) | 122 MB | 9.29 ms | CPU + GPU (see notes) |
| vector_estimator (int8) | 62 MB | ~same | int8 weight-only / fp16 activations; ~10 % lower peak RSS, RMSE ≈ 0.016 vs FP16 |

End‑to‑end on M2: ≈ 0.74 s to synthesize 6.32 s of audio for a single English sentence (RTFx ≈ 8.5×), 8 denoising steps. Output verified against FluidAudio Parakeet TDT ASR.
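
The headline RTFx figure is simply seconds of audio produced per second of wall-clock compute, using the two numbers quoted above:

```python
# RTFx = audio duration / wall-clock synthesis time (M2 figures from above)
audio_s, wall_s = 6.32, 0.74
rtfx = audio_s / wall_s
print(f"RTFx = {rtfx:.1f}x")   # one second of compute yields ~8.5 s of audio
```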

Note on vector_estimator: 100 % of its ops are ANE‑eligible after the float‑mask + precompute refactor, but Apple's ANECCompile currently returns opaque error 11 on this graph and silently falls back to CPU/GPU. See coreml/trials.md in the conversion repo for the full investigation.

Files

Both .mlpackage (Core ML source bundle, includes weights + spec) and the precompiled .mlmodelc (ready for direct MLModel(contentsOf:) load) are shipped — use .mlmodelc to skip the on‑device compile step on first load.

  • TextEncoder.mlpackage / TextEncoder.mlmodelc — fixed T=128 text input.
  • DurationPredictor.mlpackage / DurationPredictor.mlmodelc — fixed T=128 text input.
  • VectorEstimator.mlpackage / VectorEstimator.mlmodelc — latent.L and text.T as RangeDim(17..512), FP16 weights (122 MB).
  • VectorEstimator_int8.mlpackage / VectorEstimator_int8.mlmodelc — same model, int8 weight-only (per-channel symmetric) + FP16 activations (62 MB; ~10 % lower peak RSS, RMSE ≈ 0.016 vs FP16).
  • Vocoder.mlpackage / Vocoder.mlmodelc — latent.L_ttl as RangeDim(4..512).
  • tts.json — token / text frontend configuration.
  • unicode_indexer.json — Unicode → token id mapping (multilingual frontend).
  • voice_styles/M1.json — example voice style embedding (single male reference).
  • manifest.json — file inventory (sha256 + sizes) for both .mlpackage and .mlmodelc.
  • infer.py — minimal self-contained Python demo (loads .mlmodelc / .mlpackage directly).
  • requirements.txt — Python deps for infer.py (coremltools, numpy, soundfile).
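
Since manifest.json ships sha256 digests for every bundle, downloads can be verified before loading. A minimal checker might look like this — it ASSUMES the manifest maps relative paths to objects with a "sha256" field; inspect the shipped manifest.json for the actual schema before relying on it:

```python
import hashlib
import json
import pathlib

def sha256_of(path: pathlib.Path) -> str:
    # Stream the file in 1 MiB chunks so large .mlmodelc weights don't
    # need to fit in memory at once.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(repo_dir: str) -> bool:
    # Assumed manifest shape: {"relative/path": {"sha256": "...", ...}, ...}
    root = pathlib.Path(repo_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    ok = True
    for rel, meta in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != meta["sha256"]:
            print(f"mismatch: {rel}")
            ok = False
    return ok
```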

Usage

Quick test (Python)

For the curious (and for sanity checking), this repo ships a small self‑contained script, infer.py, that loads all four modules directly via coremltools and writes a 44.1 kHz WAV. No external repo clone is required.

# 1. Download the repo (e.g. via huggingface_hub or `git lfs clone`).
git lfs clone https://huggingface.co/FluidInference/supertonic-3-coreml
cd supertonic-3-coreml

# 2. Install the 3 deps (macOS, Python 3.11+ recommended).
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Synthesize.
python infer.py "Hello, world." --voice-style voice_styles/M1.json -o hello.wav
python infer.py "Bonjour le monde." --lang fr --voice-style voice_styles/M1.json -o fr.wav

# Use the int8-quantized VectorEstimator (62 MB instead of 122 MB).
python infer.py "Hello, int8 build." --vector-estimator VectorEstimator_int8.mlpackage -o int8.wav

# Optional: pick a compute unit explicitly.
python infer.py "Test" --compute-units CPU_AND_NE -o ne.wav

The Python script loads the .mlpackage bundles (the format coremltools' MLModel class accepts directly); the .mlmodelc bundles are for direct Swift / Objective‑C use via MLModel(contentsOf:), where they skip the on‑device compile step.
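
The two bundle types map to two different coremltools loaders (MLModel for .mlpackage, CompiledMLModel for precompiled .mlmodelc). Actually instantiating them requires macOS with the model files present, so this sketch only shows the suffix dispatch; the commented calls are the real coremltools APIs:

```python
import pathlib

def loader_for(path: str) -> str:
    # Pick the coremltools class by bundle type.
    suffix = pathlib.Path(path).suffix
    if suffix == ".mlpackage":
        return "MLModel"          # ct.models.MLModel(path, compute_units=ct.ComputeUnit.CPU_AND_NE)
    if suffix == ".mlmodelc":
        return "CompiledMLModel"  # ct.models.CompiledMLModel(path, compute_units=ct.ComputeUnit.CPU_AND_NE)
    raise ValueError(f"not a Core ML bundle: {path}")

print(loader_for("VectorEstimator.mlpackage"))
print(loader_for("Vocoder.mlmodelc"))
```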

Production (Swift / FluidAudio)

For production use, the FluidAudio Swift framework handles model loading, text frontend, batching, chunking, and the diffusion / vocoder loop.

Swift (FluidAudio)

import AVFoundation
import FluidAudio

Task {
    // Download and load Supertonic-3 models (first run only)
    let models = try await Supertonic3Models.downloadAndLoad()

    // Initialize the TTS manager
    let tts = Supertonic3Manager(config: .default)
    try await tts.initialize(models: models)

    // Synthesize speech for some text with a voice style
    let style = try VoiceStyle.load(path: "voice_styles/M1.json")
    let audio = try await tts.synthesize(text: "Hello, world.", style: style)

    // audio.samples is 44.1 kHz Float32 PCM in [-1, 1]
    try AudioWriter.writeWav(audio.samples, sampleRate: 44_100, to: "hello.wav")

    tts.cleanup()
}

For more examples (including CLI usage and benchmarking), see the FluidAudio repository: https://github.com/FluidInference/FluidAudio

Limitations

  • 44.1 kHz output is high quality but heavier than 16/22.05 kHz TTS — plan for the bandwidth and storage cost.
  • vector_estimator currently runs on CPU + GPU instead of ANE due to an Apple‑side ANE compiler limitation (see Performance).
  • Text frontend currently uses fixed T=128 token windows; longer text must be segmented by the caller.
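
Because of the fixed T=128 window, the caller must split longer text before synthesis. A naive sentence-first chunker might look like this — purely illustrative: the real frontend tokenizes via unicode_indexer.json, and the 128 limit counts tokens, which this sketch approximates with characters:

```python
import re

MAX_TOKENS = 128  # fixed frontend window; characters stand in for tokens here

def segment(text: str, max_len: int = MAX_TOKENS) -> list[str]:
    # Split on sentence boundaries, then greedily pack sentences into
    # windows; hard-split any single sentence longer than one window.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sentences:
        if len(cur) + len(s) + 1 <= max_len:
            cur = f"{cur} {s}".strip()
        else:
            if cur:
                chunks.append(cur)
            while len(s) > max_len:
                chunks.append(s[:max_len])
                s = s[max_len:]
            cur = s
    if cur:
        chunks.append(cur)
    return chunks
```

In practice the FluidAudio Swift pipeline handles this chunking for you; the sketch is only for callers driving the models directly.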

License

OpenRAIL‑M (inherited from upstream Supertone/supertonic-3). The Core ML conversion tooling and FluidAudio integration are MIT‑licensed. See the FluidAudio repository for details and usage guidance.
