# StyleTTS2 → CoreML iteration_3

Mixed-precision build on top of iteration_2: 7 stages flipped to fp16
weight precision, 1 stage kept at fp32 to avoid an audible-quality
regression. Disk halved, pipeline-stage sum cut 24–41 % cool.

## Pipeline (8 stages, 8 dispatches)

```
text_encoder           → CPU_ONLY      fp16   11 MB
bert                   → ALL           fp16   12 MB
ref_encoder            → CPU_AND_GPU   fp16   53 MB
fused_diffusion_sampler → ALL          fp16   47 MB   ← Trial 4
duration_predictor     → CPU_ONLY      fp16   15 MB
fused_f0n_har_source   → CPU_ONLY      fp32   32 MB   ← Trial 6 (kept fp32: cumsum drift)
decoder_pre            → CPU_AND_NE    fp16   64 MB
decoder_upsample       → CPU_ONLY      fp16   40 MB
```

Total: **274 MB**, 8 mlpackages, 8 dispatches per utterance.

## Performance

Warm pipeline-stage sum (sum of per-stage timings reported by
`coreml.inference`), 3-iter sweep with 8 s cooldown, M-series Mac:

| Build           | min  | avg  | max   |
|-----------------|------|------|-------|
| iteration_2 fp32| 782  | 898  | 1075  |
| iteration_3     | **460** | **683** | 1110 (thermal) |

Cool-run delta: **−322 ms (−41 %)** at min, **−215 ms (−24 %)** at avg.
The max bucket bunches because pipeline-wide variance dominates any
config — same pattern observed in Trial 8b benches.

Per-stage savings observed end-to-end:

| stage                   | fp32 ms | fp16 ms | Δ        |
|-------------------------|---------|---------|----------|
| fused_diffusion_sampler | 18.3    | 14.7    | −3.6 ms  |
| decoder_pre             | 35      | 7       | −28 ms   |
| decoder_upsample        | 593–638 | 284–325 | **−309 ms** |

## Mixed precision rationale

| Stage                   | fp16 verdict        | Why                                     |
|-------------------------|---------------------|-----------------------------------------|
| text_encoder            | adopt               | clean A/B                               |
| bert                    | adopt               | clean A/B                               |
| ref_encoder             | adopt               | clean A/B                               |
| fused_diffusion_sampler | adopt               | parity 4.66e-3, A/B clean                |
| duration_predictor      | adopt               | clean A/B                               |
| fused_f0n_har_source    | **drop**            | har computes sin(2π·cumsum(f0)) over 88 200 samples; fp16 cumsum drifts ~10 bits, audible phase distortion in second half |
| decoder_pre             | adopt               | parity tight, A/B clean                  |
| decoder_upsample        | adopt               | A/B clean; previously feared "+240 ms" regression on `ALL` did not reproduce on `CPU_ONLY` placement (this is the 8b-winning placement) |

Drift evidence comes from per-stage CoreML parity vs eager fp32 plus
direct A/B listening of three configurations:

```
sanity_fp16_mixed.wav         (5 fp16 / 3 fp32)   — clean
sanity_fp16_plus_decpre.wav   (6 fp16 / 2 fp32)   — clean
sanity_fp16_plus_decup.wav    (7 fp16 / 1 fp32)   — clean   ← this build
sanity_fp16_plus_f0n.wav      (8 fp16)            — degraded second half
```

## Storage

| Artifact                             | iteration_2 | iteration_3 |
|--------------------------------------|-------------|-------------|
| Total                                | 514 MB      | **274 MB** (−47 %) |
| largest stage                        | decoder_pre 128 MB | decoder_pre 64 MB |
| smallest stage                       | text_encoder 21 MB | text_encoder 11 MB |

## Usage

Same wiring as iteration_2 — `_STAGE_PRECISION` in `coreml/inference.py`
selects fp16 / fp32 per stage. No code changes, only the manifest values
flip:

```python
_STAGE_PRECISION: dict[str, str] = {
    "text_encoder":             "fp16",
    "bert":                     "fp16",
    "ref_encoder":              "fp16",
    "fused_diffusion_sampler":  "fp16",
    "diffusion_unet":           "fp32",  # legacy fallback
    "duration_predictor":       "fp16",
    "fused_f0n_har_source":     "fp32",  # cumsum drift
    "f0n_predictor":            "fp32",  # legacy fallback
    "har_source":               "fp32",  # legacy fallback
    "decoder_pre":              "fp16",
    "decoder_upsample":         "fp16",
}
```

CLI overrides still work:

```bash
# Re-run any stage at fp32 to A/B
python -m coreml.inference --fp32 decoder_upsample

# Drop back to iteration_2 wholesale
python -m coreml.inference --fp32
```

## Skipped trials this iteration

| Stage                    | Reason for staying fp32                              |
|--------------------------|------------------------------------------------------|
| fused_f0n_har_source     | har_source cumsum drift over 88 200-sample window    |

Other quantization tiers (int8 weight-only, int4 palettization) deferred
to a future iteration — fp16 already pays for itself on disk and warm
latency.

## Token-axis buckets (Trial 11)

The `bert` and `fused_diffusion_sampler` packages reject `ct.RangeDim`
on the token axis (HF Albert + cross-attn produce ops MIL refuses with
"data-dependent shapes were disabled"). The default packages above
hard-code T = 57, which caps prompts at ~37 chars.

To support longer prompts without RangeDim, this iteration ships
**three additional fixed-T variants** of each constrained stage:

| File                                              | Compute      | Size  |
|---------------------------------------------------|--------------|-------|
| `bert_fp16_t64.mlpackage`                         | ALL          | 12 MB |
| `bert_fp16_t128.mlpackage`                        | ALL          | 12 MB |
| `bert_fp16_t256.mlpackage`                        | ALL          | 12 MB |
| `fused_diffusion_sampler_fp16_t64.mlpackage`      | ALL          | 48 MB |
| `fused_diffusion_sampler_fp16_t128.mlpackage`     | ALL          | 48 MB |
| `fused_diffusion_sampler_fp16_t256.mlpackage`     | ALL          | 48 MB |
| **Sub-total (extra over the 8 defaults)**         |              | **180 MB** |

The original `bert_fp16.mlpackage` / `fused_diffusion_sampler_fp16.mlpackage`
(T = 57) remain in the manifest as the default fast path — every
sentence that fits T = 57 should keep using them. The bucketed variants
are loaded on demand for longer prompts.

Loader policy (Swift / Python):

```
real_n = #espeak tokens
if   real_n <=  57: use *_fp16.mlpackage          (default)
elif real_n <=  64: use *_fp16_t64.mlpackage
elif real_n <= 128: use *_fp16_t128.mlpackage
elif real_n <= 256: use *_fp16_t256.mlpackage
else: error (extend the bucket ladder)
```

Pad the token + attention_mask tensors with zeros to the chosen
bucket's T. `bert` honours `attention_mask`, so contamination at
padded positions is bounded; the sampler attends to bert output, so
it inherits the same masking.

Per-bucket end-to-end inference verified by `coreml/inference_buckets.py
--all` (writes `coreml/out_t{64,128,256}.wav`):

| Bucket | Prompt                                     | Tokens | Audio  | Pipeline |
|--------|--------------------------------------------|--------|--------|----------|
| 64     | "Hello there. How are you today?"          | 36     | 2.42 s |  494 ms  |
| 128    | "StyleTTS 2 is a text to speech model."    | 57     | 3.60 s |  414 ms  |
| 256    | longer paragraph (see `inference_buckets.py`) | 154 | 8.37 s | 4933 ms  |

T = 256 cost is dominated by `decoder_upsample` at 4.5 s / 4.9 s
(real-time-ish CPU_ONLY at 24 kHz × 8.4 s output). Bucket-swap cost
itself is a few ms; the rest of the pipeline scales with output
frame count, not bucket size.

**Total iteration_3 footprint with buckets: 451 MB** (274 MB defaults
+ 180 MB buckets), or skip the T = 57 defaults entirely and ship only
buckets to save ~60 MB.

### Build / refresh the bucketed packages

```bash
cd models/tts/styletts2

# Build buckets (writes to coreml/packages/, run once)
uv run python coreml/build_buckets.py \
    --buckets 64,128,256 --stages bert,sampler --precision fp16

# Stage into iteration_3 + compile
for T in 64 128 256; do
  for stage in bert fused_diffusion_sampler; do
    cp -R "coreml/packages/${stage}_fp16_t${T}.mlpackage" \
          "iteration_3/packages/${stage}_fp16_t${T}.mlpackage"
    xcrun coremlcompiler compile \
      "iteration_3/packages/${stage}_fp16_t${T}.mlpackage" \
      "iteration_3/compiled/"
  done
done

# Validate
uv run python coreml/inference_buckets.py --all --output-dir coreml
```

### HuggingFace upload manifest

Upload the entire `iteration_3/packages/` tree (14 mlpackages):

```
iteration_3/packages/
├── text_encoder_fp16.mlpackage
├── bert_fp16.mlpackage                              ← T=57 default
├── bert_fp16_t64.mlpackage                          ← bucket
├── bert_fp16_t128.mlpackage                         ← bucket
├── bert_fp16_t256.mlpackage                         ← bucket
├── ref_encoder_fp16.mlpackage
├── fused_diffusion_sampler_fp16.mlpackage           ← T=57 default
├── fused_diffusion_sampler_fp16_t64.mlpackage       ← bucket
├── fused_diffusion_sampler_fp16_t128.mlpackage      ← bucket
├── fused_diffusion_sampler_fp16_t256.mlpackage      ← bucket
├── duration_predictor_fp16.mlpackage
├── fused_f0n_har_source.mlpackage                   ← fp32 (cumsum drift)
├── decoder_pre_fp16.mlpackage
└── decoder_upsample_fp16.mlpackage
```

Total: **451 MB** (12 fp16 stages + 1 fp32 stage + 1 cumsum-sensitive
stage). Compiled `.mlmodelc` siblings live next to the packages in
`iteration_3/compiled/` — same file count, same total size.