	# Astra-TTS Benchmark PRD: Architecture Comparison

	## Overview

	\| Field \| Value \|
	\|-------\|-------\|
	\| Project \| Astra-TTS Architecture Evaluation \|
	\| Goal \| Determine whether architectural improvements (Model B) outperform naive shrinking (Model A) at ~55M params \|
	\| Dataset \| LibriTTS (English) \|
	\| Baseline \| Original ZipVoice 123M (k2-fsa/ZipVoice) \|
	\| Models Under Test \| Model A (Slim 55M), Model B (Enhanced 55M) \|
	\| Evaluation Configurations \| 4 total (see below) \|

	---

	## Hypothesis

	> At the same parameter budget (~55M), architectural improvements (GQA, depthwise separable convolution, grouped parameter sharing, dilated ConvNeXt, RoPE, ConvNeXt text refinement, and removal of NLA) will yield quality equal to or better than naive shrinking, while allowing significantly faster inference through EPSS + Midpoint + SmoothCache.

	---

	## Models & Configurations to Evaluate

	\| ID \| Model \| Params \| Inference Mode \| Purpose \|
	\|----\|-------\|--------\|---------------\|---------\|
	\| Baseline \| ZipVoice Original \| 123M \| Euler 16 NFE uniform \| Reference — published numbers \|
	\| A-std \| Model A (Slim) \| 55M \| Euler 16 NFE uniform \| Naive shrink baseline \|
	\| B-std \| Model B (Enhanced) \| 55M \| Euler 16 NFE uniform \| Quality comparison (fair, same inference as A) \|
	\| B-opt \| Model B (Enhanced) \| 55M \| Midpoint 4-step + EPSS + SmoothCache \| Speed comparison (full optimized stack) \|

	### Why these 4?

	- Baseline vs A-std: How much quality does shrinking from 123M to 55M cost?
	- A-std vs B-std: Do the architectural improvements help at the same size and the same inference settings? (Quality ablation)
	- B-std vs B-opt: How much speed do the inference optimizations add? (Speed ablation)
	- A-std vs B-opt: The headline comparison. At the same parameter count, is B both faster and better?

	---

	## Training Protocol

	Both trained models (A and B) must be trained under identical conditions for a fair comparison:

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Dataset \| LibriTTS (train-clean-100 + train-clean-360 + train-other-500) \|
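	\| Total hours \| ~555 hours across the three training subsets (the full LibriTTS corpus, including dev and test, is ~585 hours) \|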
	\| Audio preprocessing \| Resample to 24kHz, trim silence, normalize volume \|
	\| Text preprocessing \| IPA phonemization via eSpeak-ng (same as ZipVoice) \|
	\| Optimizer \| ScaledAdam \|
	\| Learning rate \| 0.045 (linear warmup 5000 steps) \|
	\| Batch strategy \| Dynamic batching, max 300s total duration per batch \|
	\| Training steps \| 500,000 steps (both models, same count) \|
	\| Gradient clipping \| 1.0 \|
	\| EMA \| 0.9999 (for evaluation) \|
	\| Random seed \| Fixed (42) for reproducibility \|
	\| Mixed precision \| bf16 \|
	\| Checkpoint selection \| Either best validation loss or the step-500k checkpoint; apply the same rule to both models and state in the report which one was used \|
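
One way to keep the two runs identical is to hold every shared hyperparameter in a single frozen config and vary only the model name. The following is a minimal sketch: `TrainConfig`, its field names, and the model identifiers are hypothetical and are not taken from the ZipVoice training code.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Shared training settings from the table above (field names are illustrative)."""
    model: str
    optimizer: str = "ScaledAdam"
    lr: float = 0.045
    warmup_steps: int = 5000
    max_batch_seconds: float = 300.0
    total_steps: int = 500_000
    grad_clip: float = 1.0
    ema_decay: float = 0.9999
    seed: int = 42
    precision: str = "bf16"


def shared_fields(cfg: TrainConfig) -> dict:
    """Return every setting except the model name; these must match across runs."""
    fields = asdict(cfg)
    fields.pop("model")
    return fields


# Derive B's config from A's so that only the model name can differ.
config_a = TrainConfig(model="model_a_slim")
config_b = replace(config_a, model="model_b_enhanced")
assert shared_fields(config_a) == shared_fields(config_b)
```

Deriving one config from the other, rather than writing two by hand, makes accidental drift between the runs harder to introduce.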

	### Baseline Model

	The original ZipVoice 123M checkpoint from [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) is used directly. The `zipvoice_libritts/model.pt` variant (trained on LibriTTS) is the correct baseline since our models are also trained on LibriTTS.

	---

	## Evaluation Protocol

	### Test Sets

	\| Test Set \| Samples \| Purpose \|
	\|----------\|---------\|---------\|
	\| LibriSpeech-PC test-clean \| Standard partition \| Primary benchmark (matches ZipVoice paper) \|
	\| Seed-TTS test-en \| Standard partition \| Cross-domain zero-shot evaluation \|

	### Evaluation Procedure

	For each test utterance:
	1. Select a reference audio clip (3-10 seconds) from the same speaker
	2. Provide the reference audio + reference transcription + target text to the model
	3. Generate speech
	4. Measure metrics against ground truth

	### Metrics

	#### Quality Metrics

	\| Metric \| What it measures \| Tool \| Target range \|
	\|--------\|-----------------\|------\|-------------\|
	\| WER (Word Error Rate) \| Intelligibility — can you understand the words? \| Whisper-large-v3 transcription → WER vs ground truth text \| Lower is better. ZipVoice baseline: 1.64% \|
	\| SIM-o (Speaker Similarity - original) \| Voice cloning quality — does it sound like the target speaker? \| WavLM-TDNN speaker verification model, cosine similarity between generated and original target audio \| Higher is better. ZipVoice baseline: 0.668 \|
	\| UTMOS \| Naturalness/quality — does it sound like real speech? \| UTMOS predictor (pretrained MOS estimator) \| Higher is better. ZipVoice baseline: 3.98 \|
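
To make the WER and SIM-o definitions concrete, here is a small pure-Python sketch. It assumes the ASR transcripts and speaker embeddings have already been produced by the tools in the table; the text normalization (lowercasing, stripping punctuation) is an assumption, since this PRD does not specify one, and all function names are illustrative.

```python
import math
import re


def normalize(text):
    """Lowercase and drop punctuation before scoring (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """pairs: iterable of (reference_text, asr_transcript). Returns WER as a fraction."""
    pairs = list(pairs)
    errors = sum(edit_distance(normalize(r), normalize(h)) for r, h in pairs)
    words = sum(len(normalize(r)) for r, _ in pairs)
    return errors / words


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings given as equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = corpus_wer([("The cat sat on the mat.", "the cat sit on mat")])
```

Summing errors and reference words over the whole test set before dividing gives a corpus-level WER, which weights long utterances more than averaging per-utterance rates would.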

	#### Speed Metrics

	\| Metric \| What it measures \| How \|
	\|--------\|-----------------\|-----\|
	\| RTF (Real-Time Factor) \| Time to generate / duration of generated audio \| Measure wall-clock inference time, divide by audio length \|
	\| NFE (Number of Function Evaluations) \| Model forward passes per utterance \| Count \|
	\| Latency (s) \| Absolute time for a 10-second utterance \| Measure on fixed hardware \|
	\| Peak Memory (MB) \| Maximum GPU/CPU memory during inference \| GPU: torch.cuda.max_memory_allocated() (returns bytes; convert to MB). CPU: peak resident set size of the inference process \|

	#### Speed Evaluation Hardware

	All speed metrics measured on:
	- GPU: Single NVIDIA A100 80GB (for GPU RTF)
	- CPU: Single-threaded Intel Xeon (for CPU RTF)
	- Batch size: 1 (real-world latency scenario)
	- Warm-up: 10 utterances discarded before timing
	- Measurement: Mean of 50 utterances ± std dev
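
The timing protocol above (10 warm-up utterances, then mean ± std dev over 50) can be sketched as a small harness. This is an illustration under assumptions: `synthesize` is a hypothetical callable that returns the generated samples, and the 24 kHz sample rate is taken from the training preprocessing.

```python
import statistics
import time


def measure_rtf(synthesize, texts, warmup=10, timed=50, sample_rate=24000):
    """Return (mean, std dev) of per-utterance RTF at batch size 1.

    synthesize(text) is a hypothetical callable returning a sequence of audio
    samples. On GPU it should call torch.cuda.synchronize() before returning,
    so that queued kernels are included in the wall-clock time.
    """
    if len(texts) < warmup + timed:
        raise ValueError("need at least warmup + timed utterances")

    # Warm-up runs are executed but never timed.
    for text in texts[:warmup]:
        synthesize(text)

    rtfs = []
    for text in texts[warmup:warmup + timed]:
        start = time.perf_counter()
        audio = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(audio) / sample_rate
        rtfs.append(elapsed / audio_seconds)  # RTF = generation time / audio duration

    return statistics.mean(rtfs), statistics.stdev(rtfs)
```

Using the same list of texts, in the same order, for every configuration keeps the RTF figures comparable.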

	---

	## Success Criteria

	### Primary (Must achieve to validate Model B)

	\| Criterion \| Condition \| Rationale \|
	\|-----------\|-----------\|-----------\|
	\| B-std quality ≥ A-std quality \| B-std WER ≤ A-std WER AND B-std UTMOS ≥ A-std UTMOS \| Arch changes must not hurt quality \|
	\| B-opt quality ≈ B-std quality \| B-opt WER at most 0.3 percentage points (absolute) above B-std AND B-opt UTMOS at most 0.1 below B-std \| Inference optimizations must be near-lossless \|
	\| B-opt speed > A-std speed \| B-opt RTF < 0.5 × A-std RTF \| Must be at least 2× faster \|
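
The three primary criteria can be checked mechanically once results are in. The sketch below reads the 0.3% WER tolerance as absolute percentage points, which is an assumption; `Result` and `primary_criteria` are illustrative names, and the numbers in the usage lines are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Result:
    wer: float    # percent, lower is better
    utmos: float  # higher is better
    rtf: float    # lower is better


def primary_criteria(a_std, b_std, b_opt):
    """Evaluate each primary success criterion and return a name -> bool mapping."""
    return {
        # Architectural changes must not hurt quality.
        "b_std_quality_ge_a_std": b_std.wer <= a_std.wer and b_std.utmos >= a_std.utmos,
        # Inference optimizations must be near-lossless (0.3 points WER, 0.1 UTMOS).
        "b_opt_near_lossless": (b_opt.wer <= b_std.wer + 0.3
                                and b_opt.utmos >= b_std.utmos - 0.1),
        # B-opt must be at least 2x faster than A-std.
        "b_opt_2x_faster": b_opt.rtf < 0.5 * a_std.rtf,
    }


# Placeholder values for illustration only.
checks = primary_criteria(
    a_std=Result(wer=2.5, utmos=3.70, rtf=0.10),
    b_std=Result(wer=2.3, utmos=3.80, rtf=0.10),
    b_opt=Result(wer=2.5, utmos=3.75, rtf=0.04),
)
model_b_validated = all(checks.values())
```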

	### Stretch Goals

	\| Criterion \| Condition \| What it would prove \|
	\|-----------\|-----------\|---------------------\|
	\| B-std approaches Baseline quality \| B-std WER ≤ 2.0% AND UTMOS ≥ 3.8 \| Enhanced 55M achieves near-123M quality \|
	\| B-opt achieves 5×+ speedup \| B-opt RTF < 0.2 × A-std RTF \| Full optimization stack works at scale \|
	\| B-std WER lower than A-std WER by more than 0.3 percentage points \| Difference is statistically significant (p < 0.05) \| ConvNeXt/GQA/RoPE genuinely help alignment \|
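
This PRD does not name a significance test for the WER comparison. One common choice is a paired bootstrap over test utterances, sketched below under that assumption. The inputs are per-utterance word error counts for each system plus reference word counts; the function name and the default of 2000 resamples are illustrative.

```python
import random


def paired_bootstrap_p(errors_a, errors_b, ref_words, n_resamples=2000, seed=0):
    """Estimate a one-sided p-value for 'B has lower corpus WER than A'.

    errors_a[i], errors_b[i]: word errors of each system on utterance i.
    ref_words[i]: number of reference words in utterance i (must be positive).
    Both systems are scored on the same resampled utterances in every draw.
    """
    n = len(ref_words)
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample utterances with replacement
        words = sum(ref_words[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        if wer_b >= wer_a:
            not_better += 1
    # Fraction of resamples in which B failed to beat A.
    return not_better / n_resamples
```

A returned value below 0.05 would correspond to the p < 0.05 condition in the table. Fixing the seed keeps the reported value reproducible.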

	### Failure Criteria (Abort/revise)

	\| Condition \| Action \|
	\|-----------\|--------\|
	\| B-std quality < A-std on ALL metrics \| Arch changes hurt → revert to simpler model, investigate which change caused regression \|
	\| B-opt quality degrades by more than 10% (relative) vs B-std \| Inference optimizations too aggressive → relax cache schedule or increase NFE \|
	\| Both A-std and B-std WER > 4% \| 55M is too small for this task → increase param budget to 70-80M \|

	---

	## Ablation Matrix (Optional, if time allows)

	To understand which specific change helped or hurt, run these single-change ablations from Model B:

	\| Ablation \| Change from Model B \| What it tests \|
	\|----------\|--------------------\| --------------\|
	\| B - GQA \| Use 4 KV heads instead of 2 \| Is GQA actually free at this scale? \|
	\| B - DepSepConv \| Use standard Linear FFN \| Is depthwise sep conv as good as linear? \|
	\| B - ParamSharing \| No weight sharing between layers \| Does sharing work at 55M? \|
	\| B - DilatedConvNeXt \| Remove dilated conv, attention only \| Does local conv help? \|
	\| B - RoPE \| Use Zipformer native pos enc \| Does RoPE matter? \|
	\| B - ConvNeXt Text \| Remove text refinement, use 4 encoder layers \| Does ConvNeXt help WER? \|
	\| B + NLA \| Add NLA back \| Was removing NLA actually fine? \|

	Each ablation trains for the same 500k steps. Report WER, UTMOS, and SIM-o for each.

	---

	## Inference Optimization Ablation

	To understand which inference trick contributes most:

	\| Config \| Solver \| NFE \| Step schedule \| Cache \| Expected speed \|
	\|--------\|--------\|-----\|-------\|-------\|---------------\|
	\| B-std \| Euler \| 16 \| Uniform \| None \| 1× \|
	\| B + EPSS only \| Euler \| 8 \| EPSS \| None \| ~2× \|
	\| B + Midpoint only \| Midpoint \| 8 (4 steps) \| Uniform \| None \| ~2× \|
	\| B + Cache only \| Euler \| 16 \| Uniform \| SmoothCache \| ~1.4× \|
	\| B + EPSS + Midpoint \| Midpoint \| 8 (4 steps) \| EPSS \| None \| ~4× \|
	\| B-opt (all) \| Midpoint \| 8 (4 steps) \| EPSS \| SmoothCache \| ~6× \|

	Report quality (WER, UTMOS, SIM-o) AND speed (RTF) for each.
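
As a sanity check on the expected-speed column, the sketch below derives the speedup implied by the NFE counts alone. It assumes that inference cost is proportional to the number of forward passes and takes the ~1.4× SmoothCache gain from the "B + Cache only" row; both are simplifications, and the names are illustrative. Under this model every 8-NFE row comes out at 2× and B-opt at about 2.8×, so the ~4× and ~6× expectations rely on savings beyond the listed NFE counts and should be confirmed by measured RTF.

```python
BASELINE_NFE = 16
CACHE_GAIN = 1.4  # taken from the "B + Cache only" row; an estimate, not a measurement

# (NFE, uses SmoothCache) per configuration, copied from the table above.
CONFIGS = {
    "B-std": (16, False),
    "B + EPSS only": (8, False),
    "B + Midpoint only": (8, False),
    "B + Cache only": (16, True),
    "B + EPSS + Midpoint": (8, False),
    "B-opt (all)": (8, True),
}


def nfe_speedup(nfe, cached, baseline_nfe=BASELINE_NFE, cache_gain=CACHE_GAIN):
    """Speedup vs B-std if cost scales with forward passes, times any cache gain."""
    speedup = baseline_nfe / nfe
    return speedup * cache_gain if cached else speedup


estimates = {name: nfe_speedup(nfe, cached) for name, (nfe, cached) in CONFIGS.items()}

for name, speedup in estimates.items():
    print(f"{name:22s} ~{speedup:.1f}x")
```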

	---

	## Reporting Format

	### Results Table (Template)

	```markdown
	\| Model \| Config \| WER↓ \| SIM-o↑ \| UTMOS↑ \| RTF↓ \| NFE \| Mem (MB) \|
	\|-------\|--------\|------\|--------\|--------\|------\|-----\|----------\|
	\| ZipVoice (Baseline) \| Euler 16 \| 1.64 \| 0.668 \| 3.98 \| X.XX \| 16 \| XXX \|
	\| Model A (Slim) \| Euler 16 \| X.XX \| X.XXX \| X.XX \| X.XX \| 16 \| XXX \|
	\| Model B (Enhanced) \| Euler 16 \| X.XX \| X.XXX \| X.XX \| X.XX \| 16 \| XXX \|
	\| Model B (Enhanced) \| Optimized \| X.XX \| X.XXX \| X.XX \| X.XX \| 8 (4 steps) \| XXX \|
	```

	### Visualizations Required

	1. Bar chart: WER comparison (Baseline vs A vs B-std vs B-opt)
	2. Bar chart: UTMOS comparison (same)
	3. Scatter plot: Quality (UTMOS) vs Speed (RTF) — Pareto frontier
	4. Training curves: Loss vs steps for Model A and Model B (convergence comparison)
	5. Ablation heatmap: Each change → metric delta
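
For the quality-vs-speed scatter plot, the Pareto frontier is the set of configurations that no other configuration beats on both axes. The following is a minimal sketch of that selection; the function name is illustrative and the values in the usage lines are placeholders, not results.

```python
def pareto_frontier(points):
    """Return names of non-dominated configurations.

    points: list of (name, rtf, utmos). Lower RTF and higher UTMOS are better.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, rtf, utmos in points:
        dominated = any(
            (r <= rtf and u >= utmos) and (r < rtf or u > utmos)
            for _, r, u in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only.
example = [
    ("config_w", 0.10, 3.90),
    ("config_x", 0.05, 3.60),
    ("config_y", 0.06, 3.50),  # dominated by config_x (slower and lower UTMOS)
    ("config_z", 0.02, 3.55),
]
frontier = pareto_frontier(example)
```

Points on the frontier can then be highlighted or connected in the scatter plot, with dominated configurations drawn in a muted style.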

	---

	## Timeline

	\| Phase \| Duration \| Deliverable \|
	\|-------\|----------\|-------------\|
	\| 1. Implementation \| 1-2 weeks \| Model A and B code, training scripts \|
	\| 2. Training \| 2-3 weeks \| Both models trained for 500k steps \|
	\| 3. Evaluation \| 3-5 days \| All metrics computed \|
	\| 4. Ablations (optional) \| 1-2 weeks \| Per-change ablation results \|
	\| 5. Report \| 2-3 days \| Final comparison document with conclusions \|

	---

	## Decision Framework

	After results are in:

	```
	IF both A-std and B-std are unacceptably bad (WER > 4%):
	→ 55M is too small. Scale up to 70-80M.
	→ Or revisit training config (more steps, different lr).

	ELIF B-std > A-std (quality) AND B-opt >> A-std (speed):
	→ Model B wins. Proceed to Malayalam training with Model B architecture.

	ELIF B-std ≈ A-std (quality within noise) AND B-opt >> A-std (speed):
	→ Model B wins on speed alone. Still proceed with B.

	ELIF B-std < A-std (quality regression):
	→ Investigate via ablations. Remove harmful changes.
	→ Create Model B' with only beneficial changes.
	→ Re-evaluate B' vs A.
	```

	---

	## Dependencies

	\| Dependency \| Source \| Status \|
	\|-----------\|--------\|--------\|
	\| ZipVoice training code \| [github.com/k2-fsa/ZipVoice](https://github.com/k2-fsa/ZipVoice) \| Available \|
	\| LibriTTS dataset \| [OpenSLR](https://www.openslr.org/60/) \| Available \|
	\| ZipVoice LibriTTS checkpoint (baseline) \| [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) `zipvoice_libritts/model.pt` \| Available \|
	\| Whisper-large-v3 (WER eval) \| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) \| Available \|
	\| UTMOS predictor \| [sarulab-speech/UTMOS22](https://github.com/sarulab-speech/UTMOS22) \| Available \|
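	\| WavLM-TDNN (speaker similarity) \| [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large) (WavLM-large backbone; the fine-tuned TDNN speaker-verification checkpoint is distributed separately) \| Available \|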
	\| Vocos vocoder \| Bundled with ZipVoice \| Available \|

	---

	## References

	- ZipVoice: [arXiv:2506.13053](https://arxiv.org/abs/2506.13053)
	- Zipformer: [arXiv:2310.11230](https://arxiv.org/abs/2310.11230)
	- Fast F5-TTS / EPSS: [arXiv:2505.19931](https://arxiv.org/abs/2505.19931)
	- SmoothCache: [arXiv:2411.10510](https://arxiv.org/abs/2411.10510)
	- F5-TTS: [arXiv:2410.06885](https://arxiv.org/abs/2410.06885)
	- M3-TTS: [arXiv:2512.04720](https://arxiv.org/abs/2512.04720)
	- GQA: [arXiv:2305.13245](https://arxiv.org/abs/2305.13245)
	- FLY-TTS: [arXiv:2407.00753](https://arxiv.org/abs/2407.00753)
	- ResidualTransformer: [arXiv:2310.02489](https://arxiv.org/abs/2310.02489)
	- Supertonic 3: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)