
# StyleTTS2: CoreML iteration_3

Mixed-precision build on top of iteration_2: 7 stages flipped to fp16
weight precision, 1 stage kept at fp32 to avoid an audible-quality
regression. Disk footprint nearly halved (514 MB to 274 MB, −47 %);
warm pipeline-stage sum cut 24–41 % on cool runs.


## Pipeline (8 stages, 8 dispatches)

```
text_encoder            → CPU_ONLY     fp16  11 MB
bert                    → ALL          fp16  12 MB
ref_encoder             → CPU_AND_GPU  fp16  53 MB
fused_diffusion_sampler → ALL          fp16  47 MB  ← Trial 4
duration_predictor      → CPU_ONLY     fp16  15 MB
fused_f0n_har_source    → CPU_ONLY     fp32  32 MB  ← Trial 6 (kept fp32: cumsum drift)
decoder_pre             → CPU_AND_NE   fp16  64 MB
decoder_upsample        → CPU_ONLY     fp16  40 MB
```

Total: **274 MB**, 8 mlpackages, 8 dispatches per utterance.


## Performance

Warm pipeline-stage sum (sum of per-stage timings reported by
`coreml.inference`), 3-iteration sweep with 8 s cooldown, M-series Mac:

| Build            | min (ms) | avg (ms) | max (ms)       |
|------------------|----------|----------|----------------|
| iteration_2 fp32 | 782      | 898      | 1075           |
| iteration_3      | **460**  | **683**  | 1110 (thermal) |

Cool-run delta: **−322 ms (−41 %)** at min, **−215 ms (−24 %)** at avg.
The max values bunch together (1075 vs 1110 ms) because pipeline-wide
variance dominates any config; the same pattern was observed in the
Trial 8b benches.

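
The cool-run deltas follow directly from the min and avg columns. A quick check, using only the numbers quoted in the table (the helper name `delta` is illustrative):

```python
# Warm pipeline-stage sums in ms, copied from the table above.
ITER2 = {"min": 782, "avg": 898}
ITER3 = {"min": 460, "avg": 683}


def delta(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in ms, relative change in percent)."""
    diff = after - before
    return diff, 100.0 * diff / before


for key in ("min", "avg"):
    diff_ms, pct = delta(ITER2[key], ITER3[key])
    print(f"{key}: {diff_ms:+.0f} ms ({pct:+.0f} %)")
```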

Per-stage savings observed end-to-end:

| stage                   | fp32 ms | fp16 ms | Δ           |
|-------------------------|---------|---------|-------------|
| fused_diffusion_sampler | 18.3    | 14.7    | −3.6 ms     |
| decoder_pre             | 35      | 7       | −28 ms      |
| decoder_upsample        | 593–638 | 284–325 | **−309 ms** |


## Mixed precision rationale

| Stage                   | fp16 verdict | Why |
|-------------------------|--------------|-----|
| text_encoder            | adopt        | clean A/B |
| bert                    | adopt        | clean A/B |
| ref_encoder             | adopt        | clean A/B |
| fused_diffusion_sampler | adopt        | parity 4.66e-3, A/B clean |
| duration_predictor      | adopt        | clean A/B |
| fused_f0n_har_source    | **drop**     | har computes sin(2π·cumsum(f0)) over 88 200 samples; fp16 cumsum drifts ~10 bits, causing audible phase distortion in the second half |
| decoder_pre             | adopt        | parity tight, A/B clean |
| decoder_upsample        | adopt        | A/B clean; the previously feared "+240 ms" regression on `ALL` did not reproduce on the `CPU_ONLY` placement (the Trial 8b-winning placement) |

Drift evidence comes from per-stage CoreML parity vs eager fp32 plus
direct A/B listening of four configurations:

```
sanity_fp16_mixed.wav        (5 fp16 / 3 fp32) → clean
sanity_fp16_plus_decpre.wav  (6 fp16 / 2 fp32) → clean
sanity_fp16_plus_decup.wav   (7 fp16 / 1 fp32) → clean  ← this build
sanity_fp16_plus_f0n.wav     (8 fp16)          → degraded second half
```

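
To see why a half-precision running sum is fragile over a window this long, the toy sketch below accumulates a phase (in cycles) one sample at a time and rounds to fp16 after every add. It is an illustration only, not the model's `har_source` code: the constant 200 Hz pitch, the strictly sequential accumulation order, and the helper names are assumptions. The 88 200-sample window and the 24 kHz rate are the figures quoted in this README.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_phase(f0_hz: float, sample_rate: int, n_samples: int, half: bool) -> float:
    """Running sum of f0 / sample_rate (phase in cycles), optionally kept in fp16."""
    step = f0_hz / sample_rate
    if half:
        step = to_fp16(step)
    phase = 0.0
    for _ in range(n_samples):
        phase += step
        if half:
            phase = to_fp16(phase)  # emulate an fp16 accumulator
    return phase


N_SAMPLES = 88_200     # window length quoted in the table above
SAMPLE_RATE = 24_000   # output rate quoted in this README
F0_HZ = 200.0          # assumed constant pitch, for illustration only

ref = accumulate_phase(F0_HZ, SAMPLE_RATE, N_SAMPLES, half=False)
phase16 = accumulate_phase(F0_HZ, SAMPLE_RATE, N_SAMPLES, half=True)
print(f"float64 phase: {ref:.3f} cycles")
print(f"fp16 phase:    {phase16:.3f} cycles")
print(f"drift:         {abs(ref - phase16):.3f} cycles")
```

In this toy, once the running sum is large enough that one increment is smaller than half the gap between adjacent fp16 values, further adds round away entirely, so the error grows with window length. The real graph may accumulate in a different order, but the same mechanism is consistent with the damage showing up late in the utterance.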

## Storage

| Artifact       | iteration_2        | iteration_3        |
|----------------|--------------------|--------------------|
| Total          | 514 MB             | **274 MB** (−47 %) |
| largest stage  | decoder_pre 128 MB | decoder_pre 64 MB  |
| smallest stage | text_encoder 21 MB | text_encoder 11 MB |


## Usage

Same wiring as iteration_2: `_STAGE_PRECISION` in `coreml/inference.py`
selects fp16 / fp32 per stage. No code changes are needed; only the
manifest values flip:

```python
_STAGE_PRECISION: dict[str, str] = {
    "text_encoder": "fp16",
    "bert": "fp16",
    "ref_encoder": "fp16",
    "fused_diffusion_sampler": "fp16",
    "diffusion_unet": "fp32",  # legacy fallback
    "duration_predictor": "fp16",
    "fused_f0n_har_source": "fp32",  # cumsum drift
    "f0n_predictor": "fp32",  # legacy fallback
    "har_source": "fp32",  # legacy fallback
    "decoder_pre": "fp16",
    "decoder_upsample": "fp16",
}
```

CLI overrides still work:

```bash
# Re-run any stage at fp32 to A/B
python -m coreml.inference --fp32 decoder_upsample

# Drop back to iteration_2 wholesale
python -m coreml.inference --fp32
```

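
How `coreml/inference.py` actually parses `--fp32` is not shown in this README. The sketch below is one plausible way to get the two behaviours described above (named stages, or a bare flag meaning "everything") with `argparse`; `DEFAULTS` and `resolve_precision` are hypothetical names, and the dict lists only the eight shipped stages.

```python
import argparse

# Per-stage defaults for the eight shipped stages (values from the manifest above).
DEFAULTS = {
    "text_encoder": "fp16",
    "bert": "fp16",
    "ref_encoder": "fp16",
    "fused_diffusion_sampler": "fp16",
    "duration_predictor": "fp16",
    "fused_f0n_har_source": "fp32",
    "decoder_pre": "fp16",
    "decoder_upsample": "fp16",
}


def resolve_precision(argv: list[str]) -> dict[str, str]:
    """Apply a hypothetical `--fp32 [STAGE ...]` override to the defaults."""
    parser = argparse.ArgumentParser()
    # nargs="*": flag absent -> None, bare `--fp32` -> [], `--fp32 a b` -> ["a", "b"].
    parser.add_argument("--fp32", nargs="*", default=None, metavar="STAGE")
    args = parser.parse_args(argv)

    precision = dict(DEFAULTS)
    if args.fp32 is None:
        return precision
    targets = args.fp32 or list(precision)  # bare flag: every stage
    for stage in targets:
        if stage not in precision:
            parser.error(f"unknown stage: {stage}")
        precision[stage] = "fp32"
    return precision


print(resolve_precision(["--fp32", "decoder_upsample"])["decoder_upsample"])
```

The `nargs="*"` with `default=None` combination is what lets a single flag distinguish "not given" from "given with no stages".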

## Skipped trials this iteration

| Stage                | Reason for staying fp32                           |
|----------------------|---------------------------------------------------|
| fused_f0n_har_source | har_source cumsum drift over 88 200-sample window |

Other quantization tiers (int8 weight-only, int4 palettization) are
deferred to a future iteration; fp16 already pays for itself on disk and
warm latency.


## Token-axis buckets (Trial 11)

The `bert` and `fused_diffusion_sampler` packages reject `ct.RangeDim`
on the token axis (HF Albert + cross-attn produce ops that MIL refuses
with "data-dependent shapes were disabled"). The default packages above
hard-code T = 57, which caps prompts at ~37 chars.

To support longer prompts without RangeDim, this iteration ships
**three additional fixed-T variants** of each constrained stage:

| File                                          | Compute | Size       |
|-----------------------------------------------|---------|------------|
| `bert_fp16_t64.mlpackage`                     | ALL     | 12 MB      |
| `bert_fp16_t128.mlpackage`                    | ALL     | 12 MB      |
| `bert_fp16_t256.mlpackage`                    | ALL     | 12 MB      |
| `fused_diffusion_sampler_fp16_t64.mlpackage`  | ALL     | 48 MB      |
| `fused_diffusion_sampler_fp16_t128.mlpackage` | ALL     | 48 MB      |
| `fused_diffusion_sampler_fp16_t256.mlpackage` | ALL     | 48 MB      |
| **Sub-total (extra over the 8 defaults)**     |         | **180 MB** |

The original `bert_fp16.mlpackage` / `fused_diffusion_sampler_fp16.mlpackage`
(T = 57) remain in the manifest as the default fast path: every
sentence that fits T = 57 should keep using them. The bucketed variants
are loaded on demand for longer prompts.

Loader policy (Swift / Python):

```
real_n = #espeak tokens
if   real_n <= 57:  use *_fp16.mlpackage (default)
elif real_n <= 64:  use *_fp16_t64.mlpackage
elif real_n <= 128: use *_fp16_t128.mlpackage
elif real_n <= 256: use *_fp16_t256.mlpackage
else:               error (extend the bucket ladder)
```

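
The ladder above translates almost line for line into Python. This is a minimal sketch, not the shipped loader: `BUCKETS` and `select_package` are hypothetical names, and the filenames follow the `*_fp16[_tN].mlpackage` pattern used in this README.

```python
# Bucket ladder from the loader policy above: (max tokens, package suffix).
BUCKETS = [
    (57, "_fp16"),        # default fast path
    (64, "_fp16_t64"),
    (128, "_fp16_t128"),
    (256, "_fp16_t256"),
]


def select_package(stage: str, real_n: int) -> tuple[str, int]:
    """Return (package filename, fixed T) for the smallest bucket that fits real_n tokens."""
    if real_n < 1:
        raise ValueError("real_n must be at least 1")
    for max_tokens, suffix in BUCKETS:
        if real_n <= max_tokens:
            return f"{stage}{suffix}.mlpackage", max_tokens
    raise ValueError(f"{real_n} tokens exceeds the largest bucket; extend the bucket ladder")


for stage in ("bert", "fused_diffusion_sampler"):
    print(select_package(stage, 154))
```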

Pad the token + `attention_mask` tensors with zeros to the chosen
bucket's T. `bert` honours `attention_mask`, so contamination at
padded positions is bounded; the sampler attends to bert output, so
it inherits the same masking.

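
A minimal sketch of that padding step, using plain lists in place of tensors. The right-padding direction, the 1 = real / 0 = padding mask convention, and the name `pad_to_bucket` are assumptions for illustration; the README only specifies zero padding to the bucket's T.

```python
def pad_to_bucket(token_ids: list[int], bucket_t: int, pad_id: int = 0) -> tuple[list[int], list[int]]:
    """Right-pad token ids to bucket_t; the mask is 1 for real tokens and 0 for padding."""
    real_n = len(token_ids)
    if real_n > bucket_t:
        raise ValueError(f"{real_n} tokens do not fit in a T = {bucket_t} bucket")
    pad = bucket_t - real_n
    tokens = token_ids + [pad_id] * pad
    attention_mask = [1] * real_n + [0] * pad
    return tokens, attention_mask


tokens, mask = pad_to_bucket([12, 7, 99, 4], bucket_t=8)
print(tokens)
print(mask)
```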

Per-bucket end-to-end inference verified by
`coreml/inference_buckets.py --all` (writes `coreml/out_t{64,128,256}.wav`):

| Bucket | Prompt                                        | Tokens | Audio  | Pipeline |
|--------|-----------------------------------------------|--------|--------|----------|
| 64     | "Hello there. How are you today?"             | 36     | 2.42 s | 494 ms   |
| 128    | "StyleTTS 2 is a text to speech model."       | 57     | 3.60 s | 414 ms   |
| 256    | longer paragraph (see `inference_buckets.py`) | 154    | 8.37 s | 4933 ms  |

T = 256 cost is dominated by `decoder_upsample`, which takes 4.5 s of
the 4.9 s total (a real-time factor of about 0.54 on CPU_ONLY for
24 kHz × 8.4 s of output). Bucket-swap cost itself is a few ms; the rest
of the pipeline scales with output frame count, not bucket size.

**Total iteration_3 footprint with buckets: 451 MB** (274 MB defaults
plus 180 MB buckets; these rounded figures sum to 454 MB), or skip the
T = 57 defaults entirely and ship only buckets to save ~60 MB.

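
The footprint figures can be re-derived from the per-package sizes listed earlier. The check below uses only those rounded MB values; the variable names are illustrative. Note that the rounded sizes add up to 454 MB, slightly above the 451 MB quoted.

```python
# Rounded package sizes in MB, copied from the pipeline and bucket tables.
DEFAULT_MB = {
    "text_encoder": 11,
    "bert": 12,
    "ref_encoder": 53,
    "fused_diffusion_sampler": 47,
    "duration_predictor": 15,
    "fused_f0n_har_source": 32,
    "decoder_pre": 64,
    "decoder_upsample": 40,
}
BUCKET_MB = {"bert": 12, "fused_diffusion_sampler": 48}  # size per fixed-T variant
BUCKET_SIZES = (64, 128, 256)

defaults_total = sum(DEFAULT_MB.values())
buckets_total = sum(BUCKET_MB.values()) * len(BUCKET_SIZES)
# Dropping the two T = 57 packages that the buckets can stand in for.
saved_without_t57 = DEFAULT_MB["bert"] + DEFAULT_MB["fused_diffusion_sampler"]

print(f"defaults: {defaults_total} MB")
print(f"buckets:  {buckets_total} MB")
print(f"combined: {defaults_total + buckets_total} MB")
print(f"saved by shipping buckets only: {saved_without_t57} MB")
```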

### Build / refresh the bucketed packages

```bash
cd models/tts/styletts2

# Build buckets (writes to coreml/packages/, run once)
uv run python coreml/build_buckets.py \
    --buckets 64,128,256 --stages bert,sampler --precision fp16

# Stage into iteration_3 + compile
for T in 64 128 256; do
    for stage in bert fused_diffusion_sampler; do
        cp -R "coreml/packages/${stage}_fp16_t${T}.mlpackage" \
            "iteration_3/packages/${stage}_fp16_t${T}.mlpackage"
        xcrun coremlcompiler compile \
            "iteration_3/packages/${stage}_fp16_t${T}.mlpackage" \
            "iteration_3/compiled/"
    done
done

# Validate
uv run python coreml/inference_buckets.py --all --output-dir coreml
```


### HuggingFace upload manifest

Upload the entire `iteration_3/packages/` tree (14 mlpackages):

```
iteration_3/packages/
├── text_encoder_fp16.mlpackage
├── bert_fp16.mlpackage                          ← T=57 default
├── bert_fp16_t64.mlpackage                      ← bucket
├── bert_fp16_t128.mlpackage                     ← bucket
├── bert_fp16_t256.mlpackage                     ← bucket
├── ref_encoder_fp16.mlpackage
├── fused_diffusion_sampler_fp16.mlpackage       ← T=57 default
├── fused_diffusion_sampler_fp16_t64.mlpackage   ← bucket
├── fused_diffusion_sampler_fp16_t128.mlpackage  ← bucket
├── fused_diffusion_sampler_fp16_t256.mlpackage  ← bucket
├── duration_predictor_fp16.mlpackage
├── fused_f0n_har_source.mlpackage               ← fp32 (cumsum drift)
├── decoder_pre_fp16.mlpackage
└── decoder_upsample_fp16.mlpackage
```

Total: **451 MB** (13 fp16 packages + 1 fp32 package, the
cumsum-sensitive `fused_f0n_har_source`). Compiled `.mlmodelc` siblings
live alongside the packages in `iteration_3/compiled/`, with the same
file count and the same total size.
