| --- |
| language: |
| - en |
| license: mit |
| library_name: coreml |
| tags: |
| - text-to-speech |
| - tts |
| - styletts2 |
| - coreml |
| - apple-silicon |
| - voice-cloning |
| pipeline_tag: text-to-speech |
| --- |
| |
| # StyleTTS2 LibriTTS - CoreML |
|
|
| Apple CoreML port of [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2) (LibriTTS 2nd-stage checkpoint, epoch 20). The model is a chain of 9 `.mlpackage` stages with mixed precision and per-stage compute-unit assignments tuned for Apple Silicon (CPU + ANE + GPU). |
|
|
| 24 kHz mono synthesis. Zero-shot voice cloning from a 3-10 second reference WAV. |
|
|
| ## Highlights |
|
|
| - **9 stages, 258 MB on disk**, all fp16 except `har_source` (fp32 required for sin(2π·cumsum(f0)) numerical stability) |
| - **~390 ms warm CoreML predict** per utterance (M-series, mixed CPU+ANE+GPU) |
| - **RTFx ~9.4×** end-to-end (3.7 s of audio in ~390 ms) |
| - **~13 s cold start** (Apple `anecompilerservice` compiles ANE-targeted graphs on first call; fully cached afterwards) |
| - **Per-stage placement**: `text_encoder`/`duration_predictor`/`decoder_upsample` on CPU, `bert`/`ref_encoder`/`diffusion_unet`/`f0n_predictor`/`decoder_pre` on ANE, `har_source` on GPU |
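
The placement in the last bullet can be written down as a small lookup table. The sketch below is illustrative only: mapping "ANE" to `CPU_AND_NE` and "GPU" to `CPU_AND_GPU` is an assumption (CoreML treats compute units as a preference, not a guarantee), and `package_path` and `load_stage` are hypothetical helper names, not part of this repository.

```python
from pathlib import Path

# Stage -> compute-unit name, copied from the placement bullet above.
# The values are member names of coremltools.ComputeUnit (assumed mapping).
STAGE_COMPUTE_UNITS = {
    "text_encoder": "CPU_ONLY",
    "duration_predictor": "CPU_ONLY",
    "decoder_upsample": "CPU_ONLY",
    "bert": "CPU_AND_NE",
    "ref_encoder": "CPU_AND_NE",
    "diffusion_unet": "CPU_AND_NE",
    "f0n_predictor": "CPU_AND_NE",
    "decoder_pre": "CPU_AND_NE",
    "har_source": "CPU_AND_GPU",
}


def package_path(stage: str, packages_dir: str = "packages") -> Path:
    """Return the .mlpackage path for a stage (har_source has no _fp16 suffix)."""
    suffix = "" if stage == "har_source" else "_fp16"
    return Path(packages_dir) / f"{stage}{suffix}.mlpackage"


def load_stage(stage: str, packages_dir: str = "packages"):
    """Load one stage with its assigned compute units (needs coremltools on macOS)."""
    import coremltools as ct  # imported lazily so the table is usable anywhere

    units = getattr(ct.ComputeUnit, STAGE_COMPUTE_UNITS[stage])
    return ct.models.MLModel(str(package_path(stage, packages_dir)), compute_units=units)
```

Loading all nine stages once at startup and reusing the model objects keeps the ~13 s ANE compile out of the per-utterance path.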
|
|
| ## Repository contents |
|
|
| ``` |
| packages/                            9 mlpackages (258 MB) |
|   text_encoder_fp16.mlpackage        11 MB  text → 512-dim embedding (LSTM, RangeDim T) |
|   bert_fp16.mlpackage                12 MB  Albert + bert_encoder (fixed T=57) |
|   ref_encoder_fp16.mlpackage         53 MB  reference mel → 256-dim style (CNN) |
|   diffusion_unet_fp16.mlpackage      48 MB  cross-attention U-Net (fixed T=57; ADPM2 sampler) |
|   duration_predictor_fp16.mlpackage  15 MB  LSTM + duration logits (RangeDim T) |
|   f0n_predictor_fp16.mlpackage       16 MB  F0 + noise prediction (RangeDim F) |
|   har_source.mlpackage               12 KB  F0 → harmonic source (RangeDim F0_LEN, fp32) |
|   decoder_pre_fp16.mlpackage         64 MB  AdaIN encode/decode + F0/N convs (RangeDim F) |
|   decoder_upsample_fp16.mlpackage    40 MB  HiFi-GAN Generator (RangeDim F → audio) |
| voices/                              17 reference clips (4 MB) |
|   Yinghao.wav, Nima.wav, Gavin.wav, Vinay.wav       Identity speakers |
|   amused.wav, anger.wav, disgusted.wav, sleepy.wav  Emotion clips |
|   *.wav                                             LibriTTS samples |
| samples/                             End-to-end synthesis samples |
|   sample_swift.wav                   Produced by the Swift CoreML driver |
|   sample_python.wav                  Produced by the Python CoreML pipeline |
| manifest.json                        Machine-readable spec for all stages |
| README.md                            This file |
| ``` |
|
|
| ## Limits |
|
|
| - **Phoneme cap: 57.** `bert` and `diffusion_unet` are pinned to a fixed token axis of 57 because the CoreML CPU MLProgram backend rejects RangeDim on their cross-attention shape ops. Inputs that phonemize to >57 tokens will fail. The other 7 stages support flexible token (1-512) and frame (1-2048) axes. |
| - **ANE compile fails** for the HiFi-GAN ConvTranspose1d ups stack inside `decoder_upsample`. CPU is the most predictable placement; GPU has slightly lower warm latency but contends with `har_source`. |
| - **Apple Silicon recommended.** Intel Macs have not been validated for CoreML mlprogram inference at scale. |
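
Because `bert` and `diffusion_unet` take a fixed token axis of 57, the host code has to reject longer inputs and pad shorter ones before calling them. A minimal sketch of that guard follows; the pad id `0` and the helper name `pad_to_fixed_length` are assumptions, and the real pad id depends on the tokenizer's symbol table.

```python
MAX_TOKENS = 57  # fixed token axis of bert and diffusion_unet (from the limits above)
PAD_ID = 0       # assumption: the real pad id depends on the tokenizer


def pad_to_fixed_length(token_ids, max_tokens=MAX_TOKENS, pad_id=PAD_ID):
    """Right-pad token ids to the fixed axis and return (padded_ids, mask).

    mask holds 1 for real tokens and 0 for padding. Raises ValueError when
    the input is empty or longer than the fixed-shape stages can accept.
    """
    n = len(token_ids)
    if n == 0:
        raise ValueError("empty token sequence")
    if n > max_tokens:
        raise ValueError(f"{n} tokens exceeds the fixed cap of {max_tokens}; split the text")
    pad = max_tokens - n
    return list(token_ids) + [pad_id] * pad, [1] * n + [0] * pad
```

Splitting long text at sentence boundaries and synthesizing each piece separately is one way to stay under the cap.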
|
|
| ## Pipeline (per utterance) |
|
|
| ``` |
| text → espeak-ng IPA → tokenize → token_ids |
|                                       │ |
|       ┌───────────────────────────────┼───────────────────────────────┐ |
|       │                               │                               │ |
|       ▼                               ▼                               ▼ |
|  text_encoder                 bert (fixed T=57)       reference WAV → mel → ref_encoder |
| t_en [1,512,T]               bert_dur [1,57,768]                ref_s [1,256] |
|                                d_en [1,512,57] |
|                                       │ |
|                                       ▼ |
|                diffusion_unet × 5 ADPM2 steps (10 dispatches) |
|                                       │ |
|                                       ▼ |
|                                s_pred [1,256] |
|                           │  blend(α, β, ref_s)   │ |
|                      ref [1,128]              s [1,128] |
|                                       │ |
|                                       ▼ |
|                              duration_predictor |
|                     d [1,T,640]   pred_dur → pred_aln_trg |
|                                       │ |
|                                       ▼ (matmul + hifigan tail-shift) |
|                         en [1,640,F]   asr [1,512,F] |
|                                       │ |
|                                       ▼ |
|                                 f0n_predictor |
|                            f0_pred, n_pred [1, 2F] |
|                                       │ |
|                                       ▼ |
|                                  har_source |
|                                har [1,1,600F] |
|                                       │ |
|                                       ▼ |
|                                  decoder_pre |
|                               x_pre [1,512,2F] |
|                                       │ |
|                                       ▼ |
|                               decoder_upsample |
|                               audio [1,1,72k+] |
|                                       │ |
|                                       ▼ |
|                      tail-trim 50 samples → WAV @ 24 kHz |
| ``` |
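
The `blend(α, β, ref_s)` step in the diagram mixes the sampled style `s_pred` with the reference style `ref_s` and splits the result into two 128-dim halves. The sketch below assumes the split and weighting follow the upstream StyleTTS2 inference notebook; the defaults `alpha=0.3` and `beta=0.7` are illustrative assumptions, not values taken from this repository.

```python
def blend_style(s_pred, ref_s, alpha=0.3, beta=0.7):
    """Mix the sampled style with the reference style.

    s_pred, ref_s: flat sequences of 256 floats.
    Returns (ref, s), each a list of 128 floats.
    """
    if len(s_pred) != 256 or len(ref_s) != 256:
        raise ValueError("expected 256-dim style vectors")
    # First half: alpha weights the sampled style, (1 - alpha) the reference.
    ref = [alpha * p + (1 - alpha) * r for p, r in zip(s_pred[:128], ref_s[:128])]
    # Second half: beta weights the sampled style, (1 - beta) the reference.
    s = [beta * p + (1 - beta) * r for p, r in zip(s_pred[128:], ref_s[128:])]
    return ref, s
```

Setting both weights to 0 reproduces the reference style exactly; setting both to 1 uses only the diffusion sample.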
|
|
| The 5 non-CoreML steps (espeak phonemize, ADPM2 sampler loop, mel extraction, alignment matrix, tail-shift) run host-side. See `manifest.json#non_coreml_pipeline_steps` for exact specs. |
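
Of these host-side steps, the alignment matrix is the easiest to show in isolation: predicted per-token durations are expanded into a 0/1 matrix `pred_aln_trg` of shape [T, F], and token features are spread over frames by a matmul against it. The following is a plain-Python sketch of that idea under stated assumptions (durations rounded and clamped to at least one frame, as in upstream StyleTTS2); the HiFi-GAN tail-shift is not shown, and the function names are hypothetical.

```python
def build_alignment(pred_dur):
    """Expand per-token frame counts into a [T][F] 0/1 alignment matrix."""
    durations = [max(1, int(round(d))) for d in pred_dur]  # at least one frame per token
    total_frames = sum(durations)
    aln = [[0] * total_frames for _ in durations]
    frame = 0
    for token, dur in enumerate(durations):
        for f in range(frame, frame + dur):
            aln[token][f] = 1
        frame += dur
    return aln


def expand_to_frames(features, aln):
    """Compute features^T @ aln: [T][C] token features -> [C][F] frame features.

    Each frame column of aln contains exactly one 1, so the matmul reduces
    to copying the owning token's feature vector into that frame.
    """
    channels = len(features[0])
    frames = len(aln[0])
    out = [[0.0] * frames for _ in range(channels)]
    for t, row in enumerate(aln):
        for f, hit in enumerate(row):
            if hit:
                for c in range(channels):
                    out[c][f] = features[t][c]
    return out
```

With durations `[2, 1, 3]`, the first token owns frames 0-1, the second frame 2, and the third frames 3-5, giving F = 6.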
|
|
| ## Voices |
|
|
| `voices/*.wav` are zero-shot reference clips. The `ref_encoder` stage reads a mel spectrogram of the chosen reference and produces a 256-dim style embedding that conditions every downstream stage. To use your own clip, supply any 3-10 s mono recording; any sample rate works because the audio is resampled to 24 kHz internally. Quality is sensitive to how clean the reference is: background noise transfers to the output. |
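
A quick pre-flight check on a custom reference clip can catch the most common problems before synthesis. This sketch uses only the standard-library `wave` module, which reads PCM WAV files only (not float or compressed formats); the helper names are hypothetical.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # range recommended above


def inspect_reference(path):
    """Return (duration_s, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate, wav.getnchannels()


def reference_problems(path):
    """List human-readable reasons a clip falls outside the stated guidance."""
    duration, rate, channels = inspect_reference(path)
    problems = []
    if channels != 1:
        problems.append(f"{channels} channels; downmix to mono")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"{duration:.1f} s; aim for {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s")
    return problems
```

An empty list means the clip matches the guidance; it says nothing about background noise, which still has to be judged by ear.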
|
|
| ## Quick demo (Swift) |
|
|
| A self-contained Swift demo exists that drives the last 4 stages directly from CoreML, given pre-computed inputs from the Python preprocessor. End-to-end Swift synthesis (no Python) requires porting espeak phonemize + mel + ADPM2 sampler + alignment, ~600 lines of Swift on top of these packages. |
|
|
| ## Quick demo (Python) |
|
|
| ```bash |
| git clone https://github.com/yl4579/StyleTTS2 # for the espeak/text frontend + checkpoint config |
| # Place this repo's packages/ as coreml/packages/ in StyleTTS2 working tree. |
| uv run python coreml/inference.py \ |
| --text "StyleTTS 2 is a text to speech model." \ |
| --reference voices/Yinghao.wav \ |
| --output out.wav |
| ``` |
|
|
| ## Conversion notes |
|
|
| - Source: PyTorch StyleTTS2 LibriTTS 2nd-stage checkpoint (yl4579/StyleTTS2 epoch 20). |
| - coremltools mlprogram, deployment target macOS15, fp16 compute precision. |
| - Mixed precision: 8 of the 9 packages are fp16 and 1 is fp32 (`har_source`). The upstream decoder is split into two packages for ANE compatibility (`decoder` → `decoder_pre` + `decoder_upsample`). |
| - Trace parity: all 9 stages mse=0 against eager PyTorch on the trace input. |
| - Quantization trials (linear int8 and 8-bit k-means palettization) were run on `decoder_upsample`, and both were rejected: int8 is slower than fp16 on CPU (no native ConvTranspose1d kernel), and palettization is too lossy (19 dB SNR). fp16 is the production setting. |
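
To see why `har_source` stays in fp32, consider the running sum inside sin(2π·cumsum(f0)). The sketch below emulates an fp16 accumulator with the standard-library `struct` module and compares it with double precision. It is a simplified illustration under stated assumptions: a constant 200 Hz pitch, f0 normalized by the 24 kHz sample rate, and no phase wrapping. It does not reproduce the actual CoreML kernel.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulated_cycles(f0_hz, sample_rate, num_samples, fp16=False):
    """Running sum of f0 / sample_rate: the phase in cycles fed to sin(2*pi*x)."""
    step = f0_hz / sample_rate
    total = 0.0
    for _ in range(num_samples):
        total += step
        if fp16:
            total = to_fp16(total)  # emulate storing the accumulator in fp16
    return total


exact = accumulated_cycles(200.0, 24000, 24000)            # about 200 cycles after 1 s
half = accumulated_cycles(200.0, 24000, 24000, fp16=True)  # stalls once the step is below half a ulp
print(f"fp64: {exact:.3f} cycles, fp16 accumulator: {half:.3f} cycles")
```

Once the accumulator grows large enough that one step is smaller than half of its fp16 spacing, further additions round away and the phase stops advancing, which would corrupt the harmonic source.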
|
|
| ## License |
|
|
| MIT (matches upstream yl4579/StyleTTS2). LibriTTS reference clips inherit their LibriTTS / Apache-2.0 licensing. |
|
|
| ## Citation |
|
|
| If you use this port, please cite the original StyleTTS2 paper: |
|
|
| ```bibtex |
| @article{li2023styletts, |
| title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, |
| author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima}, |
| journal={arXiv preprint arXiv:2306.07691}, |
| year={2023} |
| } |
| ``` |
|
|