StyleTTS2 → CoreML iteration_2

Production-ready fp32 mlpackages adopting Trials 4 + 6 + 8b from coreml/fusions.md.

Pipeline (8 stages, 8 dispatches)

text_encoder             → CPU_ONLY      fp32   21 MB
bert                     → ALL           fp32   23 MB
ref_encoder              → CPU_AND_GPU   fp32  106 MB
fused_diffusion_sampler  → ALL           fp32   94 MB   ← Trial 4 (replaces diffusion_unet × 8)
duration_predictor       → CPU_ONLY      fp32   30 MB
fused_f0n_har_source     → CPU_ONLY      fp32   32 MB   ← Trial 6 (replaces f0n_predictor + har_source)
decoder_pre              → CPU_AND_NE    fp32  128 MB
decoder_upsample         → CPU_ONLY      fp32   79 MB

Total: 514 MB, 8 mlpackages, 8 dispatches per utterance.
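
The placement table above could be written down as a small manifest and loaded with coremltools. This is a minimal sketch, not the contents of coreml/inference.py: the `STAGES` list, the `packages/<name>.mlpackage` layout, and the `load_stage` helper are assumptions made for illustration. Only the stage names, compute units, and sizes are taken from the table.

```python
from pathlib import Path

# (stage name, compute unit, size in MB), copied from the table above.
STAGES = [
    ("text_encoder", "CPU_ONLY", 21),
    ("bert", "ALL", 23),
    ("ref_encoder", "CPU_AND_GPU", 106),
    ("fused_diffusion_sampler", "ALL", 94),
    ("duration_predictor", "CPU_ONLY", 30),
    ("fused_f0n_har_source", "CPU_ONLY", 32),
    ("decoder_pre", "CPU_AND_NE", 128),
    ("decoder_upsample", "CPU_ONLY", 79),
]


def load_stage(name, compute_unit, package_dir="packages"):
    """Load one stage on the requested compute unit.

    Assumes a hypothetical layout of <package_dir>/<name>.mlpackage.
    """
    # Imported lazily so the manifest can be inspected without coremltools.
    import coremltools as ct

    path = Path(package_dir) / f"{name}.mlpackage"
    return ct.models.MLModel(str(path), compute_units=ct.ComputeUnit[compute_unit])


# One dispatch per stage, so the stage count is also the dispatch count.
total_mb = sum(size for _, _, size in STAGES)
print(len(STAGES), "stages,", total_mb, "MB (sum of the rounded per-stage sizes)")
```

The rounded per-stage sizes sum to 513 MB, 1 MB below the stated total, which is presumably a rounding effect in the individual figures.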

Performance

Warm latency on an M-series Mac, single process, with no other GPU/ANE workloads running:

Pipeline warm: ~480–565 ms (down from ~1030 ms baseline)
Stage count: 9 → 8 (Trials 4 + 6)
Dispatches per utterance: 16 → 8 (−50%)

See coreml/fusions.md for full trial history, latency tables, parity chains, and per-stage placement sweep results.
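
For reproducing warm numbers like the ones above, a timing harness along these lines may be useful. It is a sketch under assumptions: `warm_latency_ms` and its `run_once` callable are hypothetical names, and the warmup and iteration counts are arbitrary defaults, not the settings used for the figures in this document.

```python
import statistics
import time


def warm_latency_ms(run_once, warmup=3, iters=10):
    """Return (min, median, max) wall-clock latency in ms over warm runs.

    run_once is any zero-argument callable that synthesizes one utterance.
    """
    # Discard the first calls so model load and compilation are excluded.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    return min(samples), statistics.median(samples), max(samples)
```

Reporting the minimum and maximum alongside the median makes a bimodal stage (such as the one noted for Trial 8a) visible, where a single mean would hide it.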

Adopted trials

Trial	Change	Saving
4	fused 5-step ADPM2 sampler (8 dispatches → 1)	−437 ms warm
6	fused f0n_predictor + har_source	−42 ms warm
8b	bert→ALL, ref_encoder→CPU_AND_GPU, sampler→ALL	small but stable
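
As a quick consistency check, the per-trial savings can be set against the headline figures. The sketch below assumes the Trial 4 and Trial 6 savings add independently and leaves the Trial 8b contribution out, since it is not quantified here. The variable names are illustrative only.

```python
baseline_ms = 1030                       # warm baseline from the Performance section
savings_ms = {"trial_4": 437, "trial_6": 42}

# Projected warm latency if the two savings are simply additive (assumption).
projected_ms = baseline_ms - sum(savings_ms.values())
within_range = 480 <= projected_ms <= 565

# Trial 4 collapses 8 dispatches into 1; Trial 6 collapses 2 into 1.
dispatches_before = 16
dispatches_after = dispatches_before - (8 - 1) - (2 - 1)

print(projected_ms, within_range, dispatches_after)
```

Under that additivity assumption the projection is 551 ms, inside the reported 480–565 ms range, and the dispatch count comes out at 8.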

Skipped / dropped

Trial	Outcome
5	har + decoder_upsample fuse — partition tax (+290 ms)
7	ref_encoder + sampler fuse — partition tax (200 MB graph)
8a	aggressive `decoder_upsample → ALL` — bimodal 322–759 ms
9	`_hifigan_shift` fold — sub-1 ms saving, dominated by Trial 8

Usage

Drop `packages/` into `models/tts/styletts2/coreml/` (or symlink it there) and run `python -m coreml.inference` from the styletts2 root. The `_STAGE_COMPUTE` and `_STAGE_PRECISION` manifests in `coreml/inference.py` load these packages by default.
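
For the symlink option, something like the following could work. It is a sketch, assuming the directory should end up as `coreml/packages`. The `link_packages` helper is hypothetical and not part of the repository.

```python
from pathlib import Path


def link_packages(src, coreml_dir):
    """Symlink <coreml_dir>/packages to the directory at src."""
    src = Path(src).resolve()
    link = Path(coreml_dir) / "packages"

    if link.is_symlink():
        # Replace a stale link left over from an earlier iteration.
        link.unlink()
    elif link.exists():
        # Refuse to clobber a real directory of packages.
        raise FileExistsError(f"{link} exists and is not a symlink")

    link.symlink_to(src, target_is_directory=True)
    return link
```

A call might look like `link_packages("/path/to/packages", "models/tts/styletts2/coreml")`, with the first argument pointing at wherever the packages were unpacked.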

To compare against the legacy 9-package path:

python -m coreml.inference --no-fused