language:
- en
license: mit
library_name: coreml
tags:
- text-to-speech
- tts
- styletts2
- coreml
- apple-silicon
- voice-cloning
pipeline_tag: text-to-speech
StyleTTS2 LibriTTS - CoreML
Apple CoreML port of yl4579/StyleTTS2 (LibriTTS 2nd-stage checkpoint, epoch 20). A 9-stage .mlpackage chain with mixed precision and per-stage compute-unit assignments tuned for Apple Silicon (CPU + ANE + GPU).
24 kHz mono synthesis. Zero-shot voice cloning from a 3-10 second reference WAV.
Highlights
- 9 stages, 258 MB on disk, all fp16 except har_source (fp32 required for sin(2π·cumsum(f0)) numerical stability)
- ~390 ms warm CoreML predict per utterance (M-series, mixed CPU+ANE+GPU)
- RTFx ~9.4× end-to-end (3.7 s of audio in ~390 ms)
- ~13 s cold start (Apple anecompilerservice compiles ANE-targeted graphs on first call; fully cached afterwards)
- Per-stage placement: text_encoder / duration_predictor / decoder_upsample on CPU, bert / ref_encoder / diffusion_unet / f0n_predictor / decoder_pre on ANE, har_source on GPU
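The per-stage placement above can be written down as a small lookup table. The sketch below is an illustration, not the driver shipped with this port: the mapping from "CPU", "ANE" and "GPU" to coremltools `ComputeUnit` members is an assumption, and `package_path` and `load_stage` are hypothetical helper names. CoreML has no ANE-only setting, so `CPU_AND_NE` is used as the closest request.

```python
# Stage -> coremltools ComputeUnit member name (assumed mapping, see lead-in).
PLACEMENT = {
    "text_encoder": "CPU_ONLY",
    "duration_predictor": "CPU_ONLY",
    "decoder_upsample": "CPU_ONLY",
    "bert": "CPU_AND_NE",
    "ref_encoder": "CPU_AND_NE",
    "diffusion_unet": "CPU_AND_NE",
    "f0n_predictor": "CPU_AND_NE",
    "decoder_pre": "CPU_AND_NE",
    "har_source": "CPU_AND_GPU",
}


def package_path(stage: str, root: str = "packages") -> str:
    """Return the .mlpackage path for a stage; har_source is the only fp32 package."""
    suffix = "" if stage == "har_source" else "_fp16"
    return f"{root}/{stage}{suffix}.mlpackage"


def load_stage(stage: str, root: str = "packages"):
    """Load one stage with its assigned compute units (needs coremltools on macOS)."""
    import coremltools as ct  # imported lazily so the table is usable without it

    units = getattr(ct.ComputeUnit, PLACEMENT[stage])
    return ct.models.MLModel(package_path(stage, root), compute_units=units)
```

Keeping the table separate from the loader makes it easy to try a different placement for a single stage, for example moving decoder_upsample to `CPU_AND_GPU`.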
Repository contents
packages/                                9 mlpackages (258 MB)
  text_encoder_fp16.mlpackage            11 MB   text → 512-dim embedding (LSTM, RangeDim T)
  bert_fp16.mlpackage                    12 MB   Albert + bert_encoder (fixed T=57)
  ref_encoder_fp16.mlpackage             53 MB   reference mel → 256-dim style (CNN)
  diffusion_unet_fp16.mlpackage          48 MB   cross-attention U-Net (fixed T=57; ADPM2 sampler)
  duration_predictor_fp16.mlpackage      15 MB   LSTM + duration logits (RangeDim T)
  f0n_predictor_fp16.mlpackage           16 MB   F0 + noise prediction (RangeDim F)
  har_source.mlpackage                   12 KB   F0 → harmonic source (RangeDim F0_LEN, fp32)
  decoder_pre_fp16.mlpackage             64 MB   AdaIN encode/decode + F0/N convs (RangeDim F)
  decoder_upsample_fp16.mlpackage        40 MB   HiFi-GAN Generator (RangeDim F → audio)
voices/                                  17 reference clips (4 MB)
  Yinghao.wav, Nima.wav, Gavin.wav, Vinay.wav          Identity speakers
  amused.wav, anger.wav, disgusted.wav, sleepy.wav     Emotion clips
  *.wav                                                LibriTTS samples
samples/                                 End-to-end synthesis samples
  sample_swift.wav                       Produced by the Swift CoreML driver
  sample_python.wav                      Produced by the Python CoreML pipeline
manifest.json                            Machine-readable spec for all stages
README.md                                This file
Limits
- Phoneme cap: 57. bert and diffusion_unet are pinned to a fixed token axis of 57 because the CoreML CPU MLProgram backend rejects RangeDim on their cross-attention shape ops. Inputs that phonemize to >57 tokens will fail. The other 7 stages support flexible token (1-512) and frame (1-2048) axes.
- ANE compile fails for the HiFi-GAN ConvTranspose1d ups stack inside decoder_upsample. CPU is the most predictable placement; GPU has slightly lower warm latency but contends with har_source.
- Apple Silicon recommended. Intel Macs have not been validated for CoreML mlprogram inference at scale.
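Because of the 57-token cap, it helps to check the phonemized length before calling bert or diffusion_unet. The following is a minimal sketch of such a guard. It assumes shorter inputs are right-padded to the fixed axis; `fit_to_fixed_axis` is a hypothetical helper and `PAD_ID = 0` is a placeholder, so check the padding id and mask convention against the tokenizer and manifest.json actually in use.

```python
FIXED_T = 57  # fixed token axis of bert and diffusion_unet, per the limits above
PAD_ID = 0    # hypothetical padding id; confirm against the real tokenizer


def fit_to_fixed_axis(token_ids, fixed_t=FIXED_T, pad_id=PAD_ID):
    """Pad token_ids to fixed_t and return (padded_ids, mask); raise if too long."""
    n = len(token_ids)
    if n == 0:
        raise ValueError("empty token sequence")
    if n > fixed_t:
        raise ValueError(
            f"{n} tokens exceeds the fixed axis of {fixed_t}; split the text"
        )
    padded = list(token_ids) + [pad_id] * (fixed_t - n)
    mask = [1] * n + [0] * (fixed_t - n)  # 1 = real token, 0 = padding
    return padded, mask
```

Failing early with a clear message is usually preferable to a shape error from inside the CoreML runtime; longer texts can be split into sentences and synthesized one chunk at a time.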
Pipeline (per utterance)
            text → espeak-ng IPA → tokenize → token_ids
                                   │
     ┌─────────────────────────────┼─────────────────────────────┐
     │                             │                             │
     ▼                             ▼                             ▼
text_encoder               bert (fixed T=57)      reference WAV → mel → ref_encoder
t_en [1,512,T]             bert_dur [1,57,768]    ref_s [1,256]
                           d_en [1,512,57]
                                   │
                                   ▼
                  diffusion_unet × 5 ADPM2 steps (10 dispatches)
                                   │
                                   ▼
                             s_pred [1,256]
                          │ blend(α, β, ref_s) │
                        ref [1,128]      s [1,128]
                                   │
                                   ▼
                           duration_predictor
                    d [1,T,640]   pred_dur → pred_aln_trg
                                   │
                                   ▼ (matmul + hifigan tail-shift)
                      en [1,640,F]      asr [1,512,F]
                                   │
                                   ▼
                             f0n_predictor
                        f0_pred, n_pred [1, 2F]
                                   │
                                   ▼
                               har_source
                             har [1,1,600F]
                                   │
                                   ▼
                              decoder_pre
                            x_pre [1,512,2F]
                                   │
                                   ▼
                            decoder_upsample
                            audio [1,1,72k+]
                                   │
                                   ▼
                  tail-trim 50 samples → WAV @ 24 kHz
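The blend step in the diagram turns the 256-dim sampled style s_pred into two 128-dim vectors, ref and s, mixed with the reference style ref_s. The sketch below is one plausible reading, not the confirmed implementation: the order of the two halves and the weighting (α and β weight the sampled style, 1-α and 1-β weight the reference) are assumptions based on the upstream StyleTTS2 inference code, and `blend_style` is a hypothetical name.

```python
def blend_style(s_pred, ref_s, alpha, beta):
    """Mix sampled and reference style; returns two 128-dim lists (ref, s)."""
    if len(s_pred) != 256 or len(ref_s) != 256:
        raise ValueError("expected 256-dim style vectors")
    # First half: weighted by alpha toward the sampled style.
    ref = [alpha * p + (1.0 - alpha) * r for p, r in zip(s_pred[:128], ref_s[:128])]
    # Second half: weighted by beta toward the sampled style.
    s = [beta * p + (1.0 - beta) * r for p, r in zip(s_pred[128:], ref_s[128:])]
    return ref, s
```

Under this convention, alpha = beta = 0 reproduces the reference style exactly, while larger values lean on the diffusion sample.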
The 5 non-CoreML steps (espeak phonemize, ADPM2 sampler loop, mel extraction, alignment matrix, tail-shift) run host-side. See manifest.json#non_coreml_pipeline_steps for exact specs.
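Of these host-side steps, the alignment matrix is the easiest to show in isolation. The sketch below expands per-token durations into a 0/1 matrix (pred_dur → pred_aln_trg) and applies it with a plain matmul. It assumes durations are rounded and clamped to at least one frame, omits the tail-shift, and uses hypothetical names (`build_alignment`, `expand`); manifest.json remains the reference for the exact behaviour.

```python
def build_alignment(pred_dur):
    """Return a T x F 0/1 matrix; row i marks the frames assigned to token i."""
    durations = [max(1, round(d)) for d in pred_dur]  # assumed: >= 1 frame per token
    total = sum(durations)
    aln = [[0.0] * total for _ in durations]
    frame = 0
    for i, d in enumerate(durations):
        for j in range(frame, frame + d):
            aln[i][j] = 1.0
        frame += d
    return aln


def expand(features, aln):
    """Multiply C x T features by a T x F alignment to get C x F frame features."""
    tokens, frames = len(aln), len(aln[0])
    return [
        [sum(row[t] * aln[t][f] for t in range(tokens)) for f in range(frames)]
        for row in features
    ]
```

For example, durations `[2, 1, 3]` give a 3 x 6 matrix, and a single feature channel `[10, 20, 30]` expands to `[10, 10, 20, 30, 30, 30]`.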
Voices
voices/*.wav are zero-shot reference clips. The ref_encoder stage reads a mel of the chosen reference and produces a 256-dim style embedding that conditions every downstream stage. Bring your own clip: any 3-10 s mono recording at any sample rate works (resampled to 24 kHz internally). Quality is sensitive to reference cleanliness (background noise transfers).
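A quick way to vet a custom reference clip before synthesis is to read its header with the standard library. This is a minimal sketch: `inspect_reference` is a hypothetical helper, the 3-10 s and mono checks simply mirror the guidance above, and Python's `wave` module only reads PCM WAV files.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # recommended range from the text above


def inspect_reference(path):
    """Return (duration_s, sample_rate, channels) for a PCM WAV, raising if unsuitable."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / sample_rate
    if channels != 1:
        raise ValueError(f"expected mono, got {channels} channels")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(f"clip is {duration:.2f} s; expected 3-10 s")
    return duration, sample_rate, channels
```

The sample rate is reported but not enforced, since the pipeline resamples to 24 kHz internally.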
Quick demo (Swift)
A self-contained Swift demo exists that drives the last 4 stages directly from CoreML, given pre-computed inputs from the Python preprocessor. End-to-end Swift synthesis (no Python) requires porting espeak phonemize + mel + ADPM2 sampler + alignment, ~600 lines of Swift on top of these packages.
Quick demo (Python)
git clone https://github.com/yl4579/StyleTTS2 # for the espeak/text frontend + checkpoint config
# Place this repo's packages/ as coreml/packages/ in the StyleTTS2 working tree.
uv run python coreml/inference.py \
--text "StyleTTS 2 is a text to speech model." \
--reference voices/Yinghao.wav \
--output out.wav
Conversion notes
- Source: PyTorch StyleTTS2 LibriTTS 2nd-stage checkpoint (yl4579/StyleTTS2 epoch 20).
- coremltools mlprogram, deployment target macOS15, fp16 compute precision.
- Mixed-precision: 7 stages fp16, 1 stage fp32 (har_source), 1 stage split for ANE compatibility (decoder → decoder_pre + decoder_upsample).
- Trace parity: all 9 stages mse=0 against eager PyTorch on the trace input.
- Quantization trials (linear int8, 8-bit k-means palettization) tested on decoder_upsample; both rejected: int8 is slower than fp16 on CPU (no native ConvTranspose1d kernel), and palettization degrades quality (19 dB SNR). fp16 is the production setting.
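The fp32 requirement for har_source can be illustrated with a toy phase accumulator. The sketch below is not the converted graph: it assumes a constant 220 Hz F0, accumulates 2π·f0/sr once per sample, and simulates fp16 by rounding the running sum through `struct`'s half-precision format. The real stage may wrap or normalise the phase differently.

```python
import math
import struct

SAMPLE_RATE = 24_000  # 24 kHz synthesis, per the model card


def to_fp16(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulated_phase(f0_hz, n_samples, half_precision):
    """Running sum of 2*pi*f0/sr, optionally rounded to fp16 after every step."""
    step = 2.0 * math.pi * f0_hz / SAMPLE_RATE
    phase = 0.0
    for _ in range(n_samples):
        phase += step
        if half_precision:
            phase = to_fp16(phase)
    return phase


reference = accumulated_phase(220.0, SAMPLE_RATE, half_precision=False)
half = accumulated_phase(220.0, SAMPLE_RATE, half_precision=True)
print(f"float64 phase after 1 s: {reference:.1f} rad")  # 1382.3 rad
print(f"fp16 phase after 1 s:    {half:.1f} rad")       # 128.0 rad
```

In this toy setup the fp16 accumulator stalls at 128 rad: above that value the spacing between representable numbers (0.125) is more than twice the per-sample increment (about 0.058), so each addition rounds back to the same value and the sine stops oscillating.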
License
MIT (matches upstream yl4579/StyleTTS2). LibriTTS reference clips inherit their LibriTTS / Apache-2.0 licensing.
Citation
If you use this port, please cite the original StyleTTS2 paper:
@article{li2023styletts,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
journal={arXiv preprint arXiv:2306.07691},
year={2023}
}