	# StyleTTS2 → CoreML iteration_3

	Mixed-precision build on top of iteration_2: 7 stages flipped to fp16
	weight precision, 1 stage kept at fp32 to avoid an audible-quality
	regression. Disk footprint drops 47 % (514 MB to 274 MB), and the warm
	pipeline-stage sum drops 24–41 % on cool runs.

	## Pipeline (8 stages, 8 dispatches)

	```
	text_encoder             → CPU_ONLY     fp16  11 MB
	bert                     → ALL          fp16  12 MB
	ref_encoder              → CPU_AND_GPU  fp16  53 MB
	fused_diffusion_sampler  → ALL          fp16  47 MB  ← Trial 4
	duration_predictor       → CPU_ONLY     fp16  15 MB
	fused_f0n_har_source     → CPU_ONLY     fp32  32 MB  ← Trial 6 (kept fp32: cumsum drift)
	decoder_pre              → CPU_AND_NE   fp16  64 MB
	decoder_upsample         → CPU_ONLY     fp16  40 MB
	```

	Total: 274 MB, 8 mlpackages, 8 dispatches per utterance.

	## Performance

	Warm pipeline-stage sum (sum of per-stage timings reported by
	`coreml.inference`), 3-iter sweep with 8 s cooldown, M-series Mac:

	| Build            | min (ms) | avg (ms) | max (ms)       |
	|------------------|----------|----------|----------------|
	| iteration_2 fp32 | 782      | 898      | 1075           |
	| iteration_3      | 460      | 683      | 1110 (thermal) |

	Cool-run delta: −322 ms (−41 %) at min, −215 ms (−24 %) at avg.
	The max column barely moves (1075 ms vs 1110 ms, thermally throttled)
	because pipeline-wide variance outweighs any per-config difference;
	the same pattern was observed in the Trial 8b benches.
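
The quoted deltas follow directly from the table. A small sketch of the arithmetic, using only the figures above (the helper name `delta` is illustrative and not part of the repo):

```python
# Warm pipeline-stage sums in ms, copied from the table above.
ITERATION_2 = {"min": 782, "avg": 898, "max": 1075}
ITERATION_3 = {"min": 460, "avg": 683, "max": 1110}


def delta(before_ms: float, after_ms: float) -> tuple[float, float]:
    """Return (absolute change in ms, relative change in percent)."""
    diff = after_ms - before_ms
    return diff, 100.0 * diff / before_ms


for column in ("min", "avg", "max"):
    diff, pct = delta(ITERATION_2[column], ITERATION_3[column])
    print(f"{column}: {diff:+.0f} ms ({pct:+.1f} %)")
```

The min and avg rows reproduce the −41 % and −24 % figures; the max row comes out slightly positive, which matches the thermal note.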

	Per-stage savings observed end-to-end:

	| stage                   | fp32 ms | fp16 ms | Δ       |
	|-------------------------|---------|---------|---------|
	| fused_diffusion_sampler | 18.3    | 14.7    | −3.6 ms |
	| decoder_pre             | 35      | 7       | −28 ms  |
	| decoder_upsample        | 593–638 | 284–325 | −309 ms |

	## Mixed precision rationale

	| Stage                   | fp16 verdict | Why |
	|-------------------------|--------------|-----|
	| text_encoder            | adopt        | clean A/B |
	| bert                    | adopt        | clean A/B |
	| ref_encoder             | adopt        | clean A/B |
	| fused_diffusion_sampler | adopt        | parity 4.66e-3, A/B clean |
	| duration_predictor      | adopt        | clean A/B |
	| fused_f0n_har_source    | drop         | har computes sin(2π·cumsum(f0)) over 88 200 samples; fp16 cumsum drifts ~10 bits, audible phase distortion in second half |
	| decoder_pre             | adopt        | parity tight, A/B clean |
	| decoder_upsample        | adopt        | A/B clean; previously feared "+240 ms" regression on `ALL` did not reproduce on `CPU_ONLY` placement (this is the 8b-winning placement) |

	Drift evidence comes from per-stage CoreML parity vs eager fp32 plus
	direct A/B listening of four configurations:

	```
	sanity_fp16_mixed.wav (5 fp16 / 3 fp32) — clean
	sanity_fp16_plus_decpre.wav (6 fp16 / 2 fp32) — clean
	sanity_fp16_plus_decup.wav (7 fp16 / 1 fp32) — clean ← this build
	sanity_fp16_plus_f0n.wav (8 fp16) — degraded second half
	```
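
The cumsum drift that keeps `fused_f0n_har_source` at fp32 can be illustrated in isolation. The sketch below is not the CoreML kernel: it assumes a constant 200 Hz pitch, a per-sample phase increment of f0 / 24 000, and a naive running sum that rounds to the target precision after every add. CoreML's actual fp16 cumsum may accumulate differently, so treat this as a demonstration of the mechanism only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def running_sum(increment: float, n: int, cast) -> float:
    """Add `increment` n times, rounding the accumulator after every add."""
    inc = cast(increment)
    acc = 0.0
    for _ in range(n):
        acc = cast(acc + inc)
    return acc


SAMPLE_RATE = 24_000  # Hz, output rate quoted in this README
N_SAMPLES = 88_200    # window length quoted in the table above
F0 = 200.0            # Hz, assumed constant pitch (illustration only)

exact_cycles = F0 / SAMPLE_RATE * N_SAMPLES
fp32_cycles = running_sum(F0 / SAMPLE_RATE, N_SAMPLES, to_fp32)
fp16_cycles = running_sum(F0 / SAMPLE_RATE, N_SAMPLES, to_fp16)

print(f"exact: {exact_cycles:.3f} cycles")
print(f"fp32 : {fp32_cycles:.3f} cycles (error {abs(fp32_cycles - exact_cycles):.3f})")
print(f"fp16 : {fp16_cycles:.3f} cycles (error {abs(fp16_cycles - exact_cycles):.3f})")
```

Under these assumptions the fp32 sum stays within a fraction of a percent of the exact phase, while the fp16 sum falls far short: once half the accumulator's spacing exceeds the per-sample increment, every further add rounds back to the same value.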

	## Storage

	| Artifact       | iteration_2         | iteration_3        |
	|----------------|---------------------|--------------------|
	| Total          | 514 MB              | 274 MB (−47 %)     |
	| Largest stage  | decoder_pre 128 MB  | decoder_pre 64 MB  |
	| Smallest stage | text_encoder 21 MB  | text_encoder 11 MB |

	## Usage

	Same wiring as iteration_2 — `_STAGE_PRECISION` in `coreml/inference.py`
	selects fp16 / fp32 per stage. No code changes, only the manifest values
	flip:

	```python
	_STAGE_PRECISION: dict[str, str] = {
	    "text_encoder": "fp16",
	    "bert": "fp16",
	    "ref_encoder": "fp16",
	    "fused_diffusion_sampler": "fp16",
	    "diffusion_unet": "fp32",  # legacy fallback
	    "duration_predictor": "fp16",
	    "fused_f0n_har_source": "fp32",  # cumsum drift
	    "f0n_predictor": "fp32",  # legacy fallback
	    "har_source": "fp32",  # legacy fallback
	    "decoder_pre": "fp16",
	    "decoder_upsample": "fp16",
	}
	```

	CLI overrides still work:

	```bash
	# Re-run any stage at fp32 to A/B
	python -m coreml.inference --fp32 decoder_upsample

	# Drop back to iteration_2 wholesale
	python -m coreml.inference --fp32
	```
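
How the `--fp32` flag is parsed inside `coreml/inference.py` is not shown here. A minimal sketch of one way to get the behaviour described above (bare `--fp32` forces every stage, `--fp32 STAGE ...` forces only the named ones), where `DEFAULT_PRECISION`, `build_parser` and `resolve_precision` are hypothetical names:

```python
import argparse
from typing import Optional

# Default precision for the eight active stages (from the manifest above).
DEFAULT_PRECISION: dict[str, str] = {
    "text_encoder": "fp16",
    "bert": "fp16",
    "ref_encoder": "fp16",
    "fused_diffusion_sampler": "fp16",
    "duration_predictor": "fp16",
    "fused_f0n_har_source": "fp32",
    "decoder_pre": "fp16",
    "decoder_upsample": "fp16",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="coreml.inference")
    # nargs="*": a bare --fp32 yields [], "--fp32 a b" yields ["a", "b"],
    # and omitting the flag leaves the value as None.
    parser.add_argument("--fp32", nargs="*", default=None, metavar="STAGE")
    return parser


def resolve_precision(fp32_arg: Optional[list[str]]) -> dict[str, str]:
    """Apply the --fp32 override to a copy of the default map."""
    precision = dict(DEFAULT_PRECISION)
    if fp32_arg is None:  # flag absent: keep the mixed-precision defaults
        return precision
    stages = fp32_arg or list(precision)  # bare flag: every stage
    for stage in stages:
        if stage not in precision:
            raise ValueError(f"unknown stage: {stage}")
        precision[stage] = "fp32"
    return precision


args = build_parser().parse_args(["--fp32", "decoder_upsample"])
print(resolve_precision(args.fp32))
```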

	## Skipped trials this iteration

	| Stage                | Reason for staying fp32                            |
	|----------------------|----------------------------------------------------|
	| fused_f0n_har_source | har_source cumsum drift over 88 200-sample window  |

	Other quantization tiers (int8 weight-only, int4 palettization) deferred
	to a future iteration — fp16 already pays for itself on disk and warm
	latency.

	## Token-axis buckets (Trial 11)

	The `bert` and `fused_diffusion_sampler` packages reject `ct.RangeDim`
	on the token axis (HF Albert + cross-attn produce ops MIL refuses with
	"data-dependent shapes were disabled"). The default packages above
	hard-code T = 57, which caps prompts at ~37 chars.

	To support longer prompts without RangeDim, this iteration ships
	three additional fixed-T variants of each constrained stage:

	| File                                          | Compute | Size   |
	|-----------------------------------------------|---------|--------|
	| `bert_fp16_t64.mlpackage`                     | ALL     | 12 MB  |
	| `bert_fp16_t128.mlpackage`                    | ALL     | 12 MB  |
	| `bert_fp16_t256.mlpackage`                    | ALL     | 12 MB  |
	| `fused_diffusion_sampler_fp16_t64.mlpackage`  | ALL     | 48 MB  |
	| `fused_diffusion_sampler_fp16_t128.mlpackage` | ALL     | 48 MB  |
	| `fused_diffusion_sampler_fp16_t256.mlpackage` | ALL     | 48 MB  |
	| Sub-total (extra over the 8 defaults)         |         | 180 MB |

	The original `bert_fp16.mlpackage` / `fused_diffusion_sampler_fp16.mlpackage`
	(T = 57) remain in the manifest as the default fast path — every
	sentence that fits T = 57 should keep using them. The bucketed variants
	are loaded on demand for longer prompts.

	Loader policy (Swift / Python):

	```
	real_n = #espeak tokens
	if real_n <= 57: use *_fp16.mlpackage (default)
	elif real_n <= 64: use *_fp16_t64.mlpackage
	elif real_n <= 128: use *_fp16_t128.mlpackage
	elif real_n <= 256: use *_fp16_t256.mlpackage
	else: error (extend the bucket ladder)
	```
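
The ladder above translates to a few lines of Python. This is a sketch rather than the loader shipped in the repo; `select_bucket` and `package_name` are hypothetical helpers, and the file-name pattern is taken from the package table.

```python
from typing import Optional

DEFAULT_T = 57              # token length baked into the default packages
BUCKETS = (64, 128, 256)    # fixed-T variants, smallest first


def select_bucket(real_n: int) -> Optional[int]:
    """Return the bucket T for a prompt, or None for the T = 57 default."""
    if real_n <= DEFAULT_T:
        return None
    for t in BUCKETS:
        if real_n <= t:
            return t
    raise ValueError(
        f"{real_n} tokens exceeds the largest bucket ({BUCKETS[-1]}); "
        "extend the bucket ladder"
    )


def package_name(stage: str, real_n: int) -> str:
    """Map a stage name and token count to the package file to load."""
    bucket = select_bucket(real_n)
    suffix = "" if bucket is None else f"_t{bucket}"
    return f"{stage}_fp16{suffix}.mlpackage"


print(package_name("bert", 40))
print(package_name("fused_diffusion_sampler", 154))
```

Keeping the default as a separate first check preserves the fast path for short prompts instead of folding T = 57 into the bucket list.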

	Pad the token + attention_mask tensors with zeros to the chosen
	bucket's T. `bert` honours `attention_mask`, so contamination at
	padded positions is bounded; the sampler attends to bert output, so
	it inherits the same masking.
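
A minimal padding sketch, using plain lists for clarity. It assumes right-padding, a pad id of 0, and a mask convention of 1 for real tokens and 0 for padding; the real pipeline would hand the model arrays, and the helper name `pad_to_bucket` is illustrative.

```python
def pad_to_bucket(
    token_ids: list[int], bucket_t: int, pad_id: int = 0
) -> tuple[list[int], list[int]]:
    """Right-pad token ids to bucket_t and build the matching attention mask."""
    real_n = len(token_ids)
    if real_n > bucket_t:
        raise ValueError(f"{real_n} tokens do not fit bucket T = {bucket_t}")
    n_pad = bucket_t - real_n
    tokens = list(token_ids) + [pad_id] * n_pad
    # Assumed convention: 1 marks a real token, 0 marks padding.
    attention_mask = [1] * real_n + [0] * n_pad
    return tokens, attention_mask


tokens, attention_mask = pad_to_bucket([12, 45, 7, 3], bucket_t=8)
print(tokens)
print(attention_mask)
```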

	Per-bucket end-to-end inference verified by
	`coreml/inference_buckets.py --all` (writes `coreml/out_t{64,128,256}.wav`):

	| Bucket | Prompt                                       | Tokens | Audio  | Pipeline |
	|--------|----------------------------------------------|--------|--------|----------|
	| 64     | "Hello there. How are you today?"            | 36     | 2.42 s | 494 ms   |
	| 128    | "StyleTTS 2 is a text to speech model."      | 57     | 3.60 s | 414 ms   |
	| 256    | longer paragraph (see `inference_buckets.py`) | 154    | 8.37 s | 4933 ms  |

	T = 256 cost is dominated by `decoder_upsample`, which takes about 4.5 s
	of the 4.9 s total (CPU_ONLY, roughly half of real time for 8.4 s of
	24 kHz output). Bucket-swap cost itself is a few ms; the rest of the
	pipeline scales with output frame count, not bucket size.
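
To compare the three runs on one scale, the pipeline time can be divided by the audio duration (a real-time factor, where values below 1.0 are faster than real time). A short sketch using the table figures; `real_time_factor` is an illustrative helper:

```python
# (bucket T, audio seconds, pipeline milliseconds) from the table above.
RUNS = [(64, 2.42, 494), (128, 3.60, 414), (256, 8.37, 4933)]


def real_time_factor(audio_s: float, pipeline_ms: float) -> float:
    """Processing time divided by audio duration."""
    return (pipeline_ms / 1000.0) / audio_s


for bucket, audio_s, pipeline_ms in RUNS:
    rtf = real_time_factor(audio_s, pipeline_ms)
    print(f"T={bucket}: RTF {rtf:.2f}")
```

All three runs stay below 1.0, with the T = 256 run the closest to real time because of the `decoder_upsample` share.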

	Total iteration_3 footprint with buckets: about 451 MB (274 MB defaults
	+ 180 MB buckets; the rounded per-package sizes sum to 454 MB). To save
	~60 MB, skip the T = 57 defaults entirely and ship only the buckets.
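
The footprint options can be checked from the rounded per-package sizes listed earlier. Because those sizes are rounded to whole MB, the sums here are approximate and land a few MB away from the quoted total:

```python
# Rounded package sizes in MB, copied from the tables above.
DEFAULTS_MB = {
    "text_encoder": 11,
    "bert": 12,
    "ref_encoder": 53,
    "fused_diffusion_sampler": 47,
    "duration_predictor": 15,
    "fused_f0n_har_source": 32,
    "decoder_pre": 64,
    "decoder_upsample": 40,
}
BUCKET_VARIANT_MB = {"bert": 12, "fused_diffusion_sampler": 48}
BUCKET_SIZES = (64, 128, 256)

defaults_total = sum(DEFAULTS_MB.values())
buckets_total = sum(BUCKET_VARIANT_MB.values()) * len(BUCKET_SIZES)
# Dropping the T = 57 defaults removes one bert and one sampler package.
t57_saving = DEFAULTS_MB["bert"] + DEFAULTS_MB["fused_diffusion_sampler"]

print(f"defaults only        : {defaults_total} MB")
print(f"defaults + buckets   : {defaults_total + buckets_total} MB")
print(f"buckets, no T=57 pair: {defaults_total + buckets_total - t57_saving} MB")
```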

	### Build / refresh the bucketed packages

	```bash
	cd models/tts/styletts2

	# Build buckets (writes to coreml/packages/, run once)
	uv run python coreml/build_buckets.py \
	    --buckets 64,128,256 --stages bert,sampler --precision fp16

	# Stage into iteration_3 + compile
	for T in 64 128 256; do
	    for stage in bert fused_diffusion_sampler; do
	        cp -R "coreml/packages/${stage}_fp16_t${T}.mlpackage" \
	            "iteration_3/packages/${stage}_fp16_t${T}.mlpackage"
	        xcrun coremlcompiler compile \
	            "iteration_3/packages/${stage}_fp16_t${T}.mlpackage" \
	            "iteration_3/compiled/"
	    done
	done

	# Validate
	uv run python coreml/inference_buckets.py --all --output-dir coreml
	```

	### HuggingFace upload manifest

	Upload the entire `iteration_3/packages/` tree (14 mlpackages):

	```
	iteration_3/packages/
	├── text_encoder_fp16.mlpackage
	├── bert_fp16.mlpackage ← T=57 default
	├── bert_fp16_t64.mlpackage ← bucket
	├── bert_fp16_t128.mlpackage ← bucket
	├── bert_fp16_t256.mlpackage ← bucket
	├── ref_encoder_fp16.mlpackage
	├── fused_diffusion_sampler_fp16.mlpackage ← T=57 default
	├── fused_diffusion_sampler_fp16_t64.mlpackage ← bucket
	├── fused_diffusion_sampler_fp16_t128.mlpackage ← bucket
	├── fused_diffusion_sampler_fp16_t256.mlpackage ← bucket
	├── duration_predictor_fp16.mlpackage
	├── fused_f0n_har_source.mlpackage ← fp32 (cumsum drift)
	├── decoder_pre_fp16.mlpackage
	└── decoder_upsample_fp16.mlpackage
	```

	Total: about 451 MB across 14 packages (13 fp16 + 1 fp32, the
	cumsum-sensitive `fused_f0n_har_source`). Compiled `.mlmodelc` siblings
	live next to the packages in `iteration_3/compiled/`, with the same
	file count and the same total size.
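
Before uploading, the tree can be checked against the manifest. The sketch below rebuilds the 14 expected names from the listing above and reports any that are missing; `missing_packages` is a hypothetical helper, and the check only tests that each `.mlpackage` directory exists, not that its contents are valid.

```python
from pathlib import Path

BUCKETED_STAGES = ("bert", "fused_diffusion_sampler")
BUCKET_SIZES = (64, 128, 256)

# Stages that ship as a single package.
EXPECTED_PACKAGES = [
    "text_encoder_fp16.mlpackage",
    "ref_encoder_fp16.mlpackage",
    "duration_predictor_fp16.mlpackage",
    "fused_f0n_har_source.mlpackage",  # fp32, no precision suffix
    "decoder_pre_fp16.mlpackage",
    "decoder_upsample_fp16.mlpackage",
]
# Bucketed stages ship the T = 57 default plus one package per bucket.
for stage in BUCKETED_STAGES:
    EXPECTED_PACKAGES.append(f"{stage}_fp16.mlpackage")
    EXPECTED_PACKAGES.extend(f"{stage}_fp16_t{t}.mlpackage" for t in BUCKET_SIZES)


def missing_packages(packages_dir: Path) -> list[str]:
    """Return expected package names not present as directories."""
    return [name for name in EXPECTED_PACKAGES if not (packages_dir / name).is_dir()]


missing = missing_packages(Path("iteration_3/packages"))
print("all packages present" if not missing else f"missing: {missing}")
```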