metadata
language:
  - en
license: mit
library_name: coreml
tags:
  - text-to-speech
  - tts
  - styletts2
  - coreml
  - apple-silicon
  - voice-cloning
pipeline_tag: text-to-speech

StyleTTS2 LibriTTS - CoreML

Apple CoreML port of yl4579/StyleTTS2 (LibriTTS 2nd-stage checkpoint, epoch 20). 9-stage .mlpackage chain with mixed-precision and per-stage compute-unit assignments tuned for Apple Silicon (CPU + ANE + GPU).

24 kHz mono synthesis. Zero-shot voice cloning from a 3-10 second reference WAV.

Highlights

  • 9 stages, 258 MB on disk, all fp16 except har_source (fp32 required for sin(2π·cumsum(f0)) numerical stability)
  • ~390 ms warm CoreML predict per utterance (M-series, mixed CPU+ANE+GPU)
  • RTFx ~9.4× end-to-end (3.7 s of audio in ~390 ms)
  • ~13 s cold start (Apple anecompilerservice compiles ANE-targeted graphs on first call; fully cached afterwards)
  • Per-stage placement: text_encoder/duration_predictor/decoder_upsample on CPU, bert/ref_encoder/diffusion_unet/f0n_predictor/decoder_pre on ANE, har_source on GPU
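
The per-stage placement above can be written down as a small lookup table. The sketch below is illustrative only: the helper names (`STAGE_COMPUTE_UNITS`, `package_path`, `load_stage`) are hypothetical, and mapping "ANE" to `CPU_AND_NE` and "GPU" to `CPU_AND_GPU` is an assumption, since Core ML only offers compute-unit preferences that include the CPU. The file-name rule follows the package listing in this README.

```python
# Stage -> compute-unit name, transcribed from the placement bullet above.
# Values are member names of coremltools.ComputeUnit. Mapping "ANE" to
# CPU_AND_NE and "GPU" to CPU_AND_GPU is an assumption of this sketch.
STAGE_COMPUTE_UNITS = {
    "text_encoder": "CPU_ONLY",
    "duration_predictor": "CPU_ONLY",
    "decoder_upsample": "CPU_ONLY",
    "bert": "CPU_AND_NE",
    "ref_encoder": "CPU_AND_NE",
    "diffusion_unet": "CPU_AND_NE",
    "f0n_predictor": "CPU_AND_NE",
    "decoder_pre": "CPU_AND_NE",
    "har_source": "CPU_AND_GPU",
}


def package_path(stage, root="packages"):
    """Return the .mlpackage path for a stage; har_source is the only fp32 package."""
    if stage not in STAGE_COMPUTE_UNITS:
        raise KeyError(f"unknown stage: {stage}")
    suffix = "" if stage == "har_source" else "_fp16"
    return f"{root}/{stage}{suffix}.mlpackage"


def load_stage(stage, root="packages"):
    """Load one stage with its preferred compute units (needs macOS + coremltools)."""
    import coremltools as ct  # imported lazily so the table is usable without it

    units = getattr(ct.ComputeUnit, STAGE_COMPUTE_UNITS[stage])
    return ct.models.MLModel(package_path(stage, root), compute_units=units)
```

Keeping the table separate from the loader makes it easy to try a different placement for a single stage (for example, moving decoder_upsample to GPU) without touching the rest of the chain.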

Repository contents

packages/                                        9 mlpackages (258 MB)
  text_encoder_fp16.mlpackage           11 MB    text → 512-dim embedding (LSTM, RangeDim T)
  bert_fp16.mlpackage                   12 MB    Albert + bert_encoder (fixed T=57)
  ref_encoder_fp16.mlpackage            53 MB    reference mel → 256-dim style (CNN)
  diffusion_unet_fp16.mlpackage         48 MB    cross-attention U-Net (fixed T=57; ADPM2 sampler)
  duration_predictor_fp16.mlpackage     15 MB    LSTM + duration logits (RangeDim T)
  f0n_predictor_fp16.mlpackage          16 MB    F0 + noise prediction (RangeDim F)
  har_source.mlpackage                  12 KB    F0 → harmonic source (RangeDim F0_LEN, fp32)
  decoder_pre_fp16.mlpackage            64 MB    AdaIN encode/decode + F0/N convs (RangeDim F)
  decoder_upsample_fp16.mlpackage       40 MB    HiFi-GAN Generator (RangeDim F→audio)
voices/                                          17 reference clips (4 MB)
  Yinghao.wav, Nima.wav, Gavin.wav, Vinay.wav    Identity speakers
  amused.wav, anger.wav, disgusted.wav, sleepy.wav   Emotion clips
  *.wav                                          LibriTTS samples
samples/                                         End-to-end synthesis samples
  sample_swift.wav                               Produced by the Swift CoreML driver
  sample_python.wav                              Produced by the Python CoreML pipeline
manifest.json                                    Machine-readable spec for all stages
README.md                                        This file

Limits

  • Phoneme cap: 57. bert and diffusion_unet are pinned to a fixed token axis of 57 because the CoreML CPU MLProgram backend rejects RangeDim on their cross-attention shape ops. Inputs that phonemize to >57 tokens will fail. The other 7 stages support flexible token (1-512) and frame (1-2048) axes.
  • ANE compile fails for the HiFi-GAN ConvTranspose1d ups stack inside decoder_upsample. CPU is the most predictable placement; GPU has slightly lower warm latency but contends with har_source.
  • Apple Silicon recommended. Intel Macs have not been validated for CoreML mlprogram inference at scale.
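
Because of the 57-token cap, it can help to check the phonemized length before dispatching to bert and diffusion_unet. The following is a minimal sketch, not the driver's actual code: the function name, the padding id of 0, and the mask convention (1 for real tokens, 0 for padding) are all assumptions made for illustration.

```python
MAX_TOKENS = 57  # fixed token axis of bert and diffusion_unet (from this README)


def fit_to_fixed_axis(token_ids, cap=MAX_TOKENS, pad_id=0):
    """Pad token_ids to exactly `cap` entries, or fail early if it is too long.

    pad_id=0 and the 1/0 mask convention are assumptions of this sketch.
    """
    n = len(token_ids)
    if n == 0:
        raise ValueError("no tokens to synthesize")
    if n > cap:
        raise ValueError(
            f"{n} tokens exceeds the fixed axis of {cap}; "
            "split the text into shorter utterances"
        )
    padded = list(token_ids) + [pad_id] * (cap - n)
    mask = [1] * n + [0] * (cap - n)
    return padded, mask
```

Failing early on the host gives a clearer error than a shape mismatch raised from inside a CoreML predict call.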

Pipeline (per utterance)

text → espeak-ng IPA → tokenize → token_ids
                                       │
       ┌───────────────────────────────┼──────────────────────────────────┐
       │                               │                                  │
       ▼                               ▼                                  ▼
text_encoder              bert (fixed T=57)            reference WAV → mel → ref_encoder
   t_en [1,512,T]          bert_dur [1,57,768]                   ref_s [1,256]
                           d_en [1,512,57]
                                       │
                                       ▼
                         diffusion_unet × 5 ADPM2 steps (10 dispatches)
                                       │
                                       ▼
                                  s_pred [1,256]
                              ↓ blend(α, β, ref_s) ↓
                              ref [1,128]   s [1,128]
                                       │
                                       ▼
                                duration_predictor
                            d [1,T,640]   pred_dur → pred_aln_trg
                                       │
                                       ▼ (matmul + hifigan tail-shift)
                                   en [1,640,F]   asr [1,512,F]
                                       │
                                       ▼
                                f0n_predictor
                                f0_pred, n_pred [1, 2F]
                                       │
                                       ▼
                                  har_source
                                  har [1,1,600F]
                                       │
                                       ▼
                                  decoder_pre
                                  x_pre [1,512,2F]
                                       │
                                       ▼
                                decoder_upsample
                                  audio [1,1,72k+]
                                       │
                                       ▼
                              tail-trim 50 samples → WAV @ 24 kHz

The 5 non-CoreML steps (espeak phonemize, ADPM2 sampler loop, mel extraction, alignment matrix, tail-shift) run host-side. See manifest.json#non_coreml_pipeline_steps for exact specs.
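
As an illustration of one host-side step, the alignment matrix (pred_dur → pred_aln_trg) and the matmul that expands per-token features to per-frame features can be sketched in plain Python. This follows the construction used in the upstream StyleTTS2 inference code as far as I can tell; the function names are hypothetical, the tail-shift is not included, and manifest.json remains the authoritative spec.

```python
def alignment_matrix(pred_dur):
    """Build a T x F 0/1 matrix: row i is 1 over the frames assigned to token i.

    pred_dur holds one integer frame count per token; F = sum(pred_dur).
    """
    total_frames = sum(pred_dur)
    rows = []
    start = 0
    for dur in pred_dur:
        if dur < 0:
            raise ValueError("durations must be non-negative")
        row = [0] * total_frames
        for frame in range(start, start + dur):
            row[frame] = 1
        rows.append(row)
        start += dur
    return rows


def expand_to_frames(features, aln):
    """Multiply C x T features by the T x F alignment, giving C x F.

    With the diagram's shapes, `features` is d transposed to [640, T]
    and the result corresponds to en [640, F] before the tail-shift.
    """
    n_tokens = len(aln)
    n_frames = len(aln[0]) if aln else 0
    return [
        [sum(ch[t] * aln[t][f] for t in range(n_tokens)) for f in range(n_frames)]
        for ch in features
    ]
```

For example, durations [2, 1, 3] give six frames, and a single-channel feature row [10, 20, 30] expands to [10, 10, 20, 30, 30, 30]: each token's value is repeated for as many frames as its predicted duration.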

Voices

voices/*.wav are zero-shot reference clips. The ref_encoder stage reads a mel of the chosen reference and produces a 256-dim style embedding that conditions every downstream stage. You can bring your own clip: any 3-10 s mono recording at any sample rate works (it is resampled to 24 kHz internally). Quality is sensitive to how clean the reference is, because background noise transfers to the output.
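
The pipeline's blend(α, β, ref_s) step mixes the sampled style s_pred with the reference embedding ref_s to control how strongly the reference drives the output. The sketch below follows the convention of the upstream StyleTTS2 demo (first 128 dims blended with α, last 128 with β, defaults 0.3 and 0.7). Whether this port uses the same split and defaults is an assumption, and `blend_style` is a hypothetical name.

```python
def blend_style(s_pred, ref_s, alpha=0.3, beta=0.7):
    """Blend a sampled 256-dim style with the 256-dim reference style.

    Returns (ref, s), each 128-dim. alpha weights the sampled first half
    and beta the sampled second half; lower values stay closer to the
    reference clip. The half assignment and the defaults follow the
    upstream demo and are assumptions for this port.
    """
    if len(s_pred) != 256 or len(ref_s) != 256:
        raise ValueError("expected two 256-dim style vectors")
    half = 128
    ref = [alpha * p + (1 - alpha) * r for p, r in zip(s_pred[:half], ref_s[:half])]
    s = [beta * p + (1 - beta) * r for p, r in zip(s_pred[half:], ref_s[half:])]
    return ref, s
```

Setting both weights to 0 reproduces the reference embedding exactly, which is a convenient sanity check when debugging voice similarity.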

Quick demo (Swift)

A self-contained Swift demo drives the last 4 stages directly from CoreML, using pre-computed inputs from the Python preprocessor. End-to-end Swift synthesis (no Python) would require porting the espeak phonemizer, mel extraction, ADPM2 sampler, and alignment steps, roughly 600 lines of Swift on top of these packages.

Quick demo (Python)

git clone https://github.com/yl4579/StyleTTS2  # for the espeak/text frontend + checkpoint config
# Place this repo's packages/ directory at coreml/packages/ inside the StyleTTS2 working tree.
uv run python coreml/inference.py \
    --text "StyleTTS 2 is a text to speech model." \
    --reference voices/Yinghao.wav \
    --output out.wav
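
The last host-side step in the pipeline (tail-trim 50 samples, then write a 24 kHz mono WAV) can be sketched with the standard library alone. This is not the code in coreml/inference.py: the function names are hypothetical, and the float range of [-1, 1] and the 16-bit PCM output format are assumptions.

```python
import struct
import wave

SAMPLE_RATE = 24000  # 24 kHz mono, as stated in this README
TAIL_TRIM = 50       # samples dropped from the end of the decoder output


def to_pcm16(samples, tail_trim=TAIL_TRIM):
    """Drop the last `tail_trim` samples and pack the rest as little-endian int16.

    Assumes float samples in [-1, 1]; values outside that range are clipped.
    """
    trimmed = samples[:-tail_trim] if tail_trim > 0 else samples
    ints = [int(round(max(-1.0, min(1.0, x)) * 32767)) for x in trimmed]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write trimmed samples as a mono 16-bit WAV; returns the frame count."""
    pcm = to_pcm16(samples)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return len(pcm) // 2
```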

Conversion notes

  • Source: PyTorch StyleTTS2 LibriTTS 2nd-stage checkpoint (yl4579/StyleTTS2 epoch 20).
  • coremltools mlprogram, deployment target macOS15, fp16 compute precision.
  • Mixed-precision: 8 of the 9 packages are fp16 and har_source is fp32. The upstream decoder is split into two packages for ANE compatibility (decoder → decoder_pre + decoder_upsample).
  • Trace parity: all 9 stages mse=0 against eager PyTorch on the trace input.
  • Quantization trials (linear int8, 8-bit k-means palettization) were run on decoder_upsample, and both were rejected: int8 is slower than fp16 on CPU (no native ConvTranspose1d kernel), and palettization is too lossy (19 dB SNR). fp16 is the production setting.
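
For readers who want to reproduce this kind of quality check, a signal-to-noise ratio in dB between a reference waveform and a quantized model's output can be computed as below. This is the standard definition, 10·log10(signal power / error power). Exactly which signals were compared to obtain the 19 dB figure is not stated here, so treating the fp16 output as the reference is an assumption, and `snr_db` is a hypothetical helper.

```python
import math


def snr_db(reference, test):
    """SNR in dB of `test` against `reference` (equal-length sample lists).

    The error signal is reference - test; identical inputs return +inf.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    signal = sum(r * r for r in reference)
    if signal == 0:
        raise ValueError("reference signal has zero power")
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)
```

As a rough guide, a uniform 10% amplitude error corresponds to about 20 dB, so 19 dB indicates an error of roughly that magnitude relative to the signal.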

License

MIT (matches upstream yl4579/StyleTTS2). LibriTTS reference clips inherit their LibriTTS / Apache-2.0 licensing.

Citation

If you use this port, please cite the original StyleTTS2 paper:

@article{li2023styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}