# StyleTTS2 → CoreML iteration_2

Production-ready fp32 mlpackages adopting Trials 4 + 6 + 8b from `coreml/fusions.md`.

## Pipeline (8 stages, 8 dispatches)

```
text_encoder            → CPU_ONLY     fp32   21 MB
bert                    → ALL          fp32   23 MB
ref_encoder             → CPU_AND_GPU  fp32  106 MB
fused_diffusion_sampler → ALL          fp32   94 MB   ← Trial 4 (replaces diffusion_unet × 8)
duration_predictor      → CPU_ONLY     fp32   30 MB
fused_f0n_har_source    → CPU_ONLY     fp32   32 MB   ← Trial 6 (replaces f0n_predictor + har_source)
decoder_pre             → CPU_AND_NE   fp32  128 MB
decoder_upsample        → CPU_ONLY     fp32   79 MB
```

Total: **514 MB**, 8 mlpackages, 8 dispatches per utterance.

## Performance

Warm latency on an M-series Mac, single process, no other GPU/ANE workloads:

* Pipeline warm: **~480–565 ms** (down from a ~1030 ms baseline)
* Stage count: 9 → 8 (Trials 4 + 6)
* Dispatches per utterance: 16 → 8 (−50%)

See `coreml/fusions.md` for the full trial history, latency tables, parity chains, and per-stage placement sweep results.

## Adopted trials

| Trial | Change                                          | Saving           |
|-------|-------------------------------------------------|------------------|
| 4     | fused 5-step ADPM2 sampler (8 dispatches → 1)   | −437 ms warm     |
| 6     | fused f0n_predictor + har_source                | −42 ms warm      |
| 8b    | bert→ALL, ref_encoder→CPU_AND_GPU, sampler→ALL  | small but stable |

## Skipped / dropped

| Trial | Outcome                                                       |
|-------|---------------------------------------------------------------|
| 5     | har + decoder_upsample fusion: partition tax (+290 ms)        |
| 7     | ref_encoder + sampler fusion: partition tax (200 MB graph)    |
| 8a    | aggressive `decoder_upsample → ALL`: bimodal, 322–759 ms      |
| 9     | `_hifigan_shift` fold: sub-1 ms saving, dominated by Trial 8  |

## Usage

Drop `packages/` into `models/tts/styletts2/coreml/` (or symlink it) and run `python -m coreml.inference` from the styletts2 root. The `_STAGE_COMPUTE` and `_STAGE_PRECISION` manifests in `coreml/inference.py` are wired to load these packages by default.

To compare against the legacy 9-package path:

```bash
python -m coreml.inference --no-fused
```
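
For readers who want to load a single stage by hand, the placements in the Pipeline listing can be written as a small manifest. The sketch below is illustrative only: `STAGE_COMPUTE`, `LEGACY_CALLS`, `dispatch_count`, and `load_stage` are hypothetical names, the real `_STAGE_COMPUTE` in `coreml/inference.py` may be shaped differently, and the `<stage>.mlpackage` file naming is an assumption. It also re-derives the 16 → 8 dispatch figure from the stage lists.

```python
from pathlib import Path

# Stage -> Core ML compute-unit name, copied from the Pipeline listing.
# Hypothetical stand-in for the _STAGE_COMPUTE manifest in coreml/inference.py.
STAGE_COMPUTE = {
    "text_encoder": "CPU_ONLY",
    "bert": "ALL",
    "ref_encoder": "CPU_AND_GPU",
    "fused_diffusion_sampler": "ALL",
    "duration_predictor": "CPU_ONLY",
    "fused_f0n_har_source": "CPU_ONLY",
    "decoder_pre": "CPU_AND_NE",
    "decoder_upsample": "CPU_ONLY",
}

# Calls per utterance on the legacy 9-package path: diffusion_unet runs 8 times,
# and f0n_predictor / har_source are still separate stages.
LEGACY_CALLS = {
    "text_encoder": 1,
    "bert": 1,
    "ref_encoder": 1,
    "diffusion_unet": 8,
    "duration_predictor": 1,
    "f0n_predictor": 1,
    "har_source": 1,
    "decoder_pre": 1,
    "decoder_upsample": 1,
}


def dispatch_count(calls):
    """Total model invocations per utterance for a stage -> call-count map."""
    return sum(calls.values())


def load_stage(name, packages_dir="packages"):
    """Load one stage with the compute units named in STAGE_COMPUTE."""
    # Imported lazily so the manifest can be inspected without coremltools.
    import coremltools as ct

    path = Path(packages_dir) / f"{name}.mlpackage"  # assumed file naming
    units = ct.ComputeUnit[STAGE_COMPUTE[name]]      # enum lookup by member name
    return ct.models.MLModel(str(path), compute_units=units)


if __name__ == "__main__":
    # One dispatch per fused-path stage, so the fused count is the stage count.
    print(dispatch_count(LEGACY_CALLS), "->", len(STAGE_COMPUTE))
```

Keeping placement in one dictionary means a placement sweep only has to edit data, not loading code. `load_stage` needs coremltools and macOS to actually run; the rest is plain Python.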