# Astra-TTS Benchmark PRD: Architecture Comparison

## Overview

| Field | Value |
|-------|-------|
| **Project** | Astra-TTS Architecture Evaluation |
| **Goal** | Determine whether architectural improvements (Model B) outperform naive shrinking (Model A) at ~55M params |
| **Dataset** | LibriTTS (English) |
| **Baseline** | Original ZipVoice 123M (k2-fsa/ZipVoice) |
| **Models Under Test** | Model A (Slim 55M), Model B (Enhanced 55M) |
| **Evaluation Configurations** | 4 total (see below) |

---

## Hypothesis

> At the same parameter budget (~55M), architectural improvements (GQA, Depthwise Separable Conv, Grouped Parameter Sharing, Dilated ConvNeXt, RoPE, ConvNeXt text refinement, removed NLA) will yield **equal or better quality** than naive shrinking, while enabling **significantly faster inference** through EPSS + Midpoint + SmoothCache.

---

## Models & Configurations to Evaluate

| ID | Model | Params | Inference Mode | Purpose |
|----|-------|--------|----------------|---------|
| **Baseline** | ZipVoice Original | 123M | Euler 16 NFE uniform | Reference (published numbers) |
| **A-std** | Model A (Slim) | 55M | Euler 16 NFE uniform | Naive shrink baseline |
| **B-std** | Model B (Enhanced) | 55M | Euler 16 NFE uniform | **Quality comparison** (fair, same inference as A) |
| **B-opt** | Model B (Enhanced) | 55M | Midpoint 4-step + EPSS + SmoothCache | **Speed comparison** (full optimized stack) |

### Why these 4?

- **Baseline vs A-std**: How much does shrinking from 123M→55M cost in quality?
- **A-std vs B-std**: Do arch improvements help at same size and same inference? (Quality ablation)
- **B-std vs B-opt**: How much speed do inference optimizations add? (Speed ablation)
- **A-std vs B-opt**: The real comparison: same params, but B is faster AND better?

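
The four configurations above can also be written down as data, which makes it easy to confirm that each pairwise comparison varies only what it is meant to vary. The following is a minimal sketch and not part of the ZipVoice codebase: `EvalConfig`, `CONFIGS`, and `differing_fields` are hypothetical names, and the NFE arithmetic assumes one model evaluation per Euler step and two per midpoint step, so the 4-step midpoint schedule is nominally 8 NFE (ignoring any computation that caching may skip).

```python
from dataclasses import asdict, dataclass

# Model evaluations per solver step (assumption: standard Euler and midpoint methods).
EVALS_PER_STEP = {"euler": 1, "midpoint": 2}


@dataclass(frozen=True)
class EvalConfig:
    """One row of the configuration table (hypothetical structure)."""
    model: str
    params_m: int               # parameter count in millions
    solver: str                 # key into EVALS_PER_STEP
    steps: int                  # number of solver steps
    epss: bool = False          # EPSS step schedule instead of uniform steps
    smooth_cache: bool = False  # SmoothCache enabled

    @property
    def nfe(self) -> int:
        """Nominal number of function evaluations per utterance."""
        return self.steps * EVALS_PER_STEP[self.solver]


CONFIGS = {
    "Baseline": EvalConfig("ZipVoice Original", 123, "euler", 16),
    "A-std": EvalConfig("Model A (Slim)", 55, "euler", 16),
    "B-std": EvalConfig("Model B (Enhanced)", 55, "euler", 16),
    "B-opt": EvalConfig("Model B (Enhanced)", 55, "midpoint", 4,
                        epss=True, smooth_cache=True),
}


def differing_fields(a: str, b: str) -> set[str]:
    """Return the names of the fields on which two configurations differ."""
    fa, fb = asdict(CONFIGS[a]), asdict(CONFIGS[b])
    return {name for name in fa if fa[name] != fb[name]}


# Print what each of the four planned comparisons actually varies.
for a, b in [("Baseline", "A-std"), ("A-std", "B-std"),
             ("B-std", "B-opt"), ("A-std", "B-opt")]:
    print(f"{a} vs {b}: {sorted(differing_fields(a, b))}")
```

In this encoding, A-std vs B-std differs only in `model` and B-std vs B-opt differs only in inference settings, which is what the quality and speed ablations require.
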

---

## Training Protocol

Both models (A and B) must be trained under **identical conditions** for a fair comparison:

| Parameter | Value |
|-----------|-------|
| **Dataset** | LibriTTS (train-clean-100 + train-clean-360 + train-other-500) |
| **Total hours** | ~585 hours |
| **Audio preprocessing** | Resample to 24kHz, trim silence, normalize volume |
| **Text preprocessing** | IPA phonemization via eSpeak-ng (same as ZipVoice) |
| **Optimizer** | ScaledAdam |
| **Learning rate** | 0.045 (linear warmup 5000 steps) |
| **Batch strategy** | Dynamic batching, max 300s total duration per batch |
| **Training steps** | 500,000 steps (both models, same count) |
| **Gradient clipping** | 1.0 |
| **EMA** | 0.9999 (for evaluation) |
| **Random seed** | Fixed (42) for reproducibility |
| **Mixed precision** | bf16 |
| **Checkpoint selection** | Best validation loss OR step 500k (whichever is reported) |

### Baseline Model

The original ZipVoice 123M checkpoint from [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) is used directly. The `zipvoice_libritts/model.pt` variant (trained on LibriTTS) is the correct baseline since our models are also trained on LibriTTS.

---

## Evaluation Protocol

### Test Sets

| Test Set | Samples | Purpose |
|----------|---------|---------|
| **LibriSpeech-PC test-clean** | Standard partition | Primary benchmark (matches ZipVoice paper) |
| **Seed-TTS test-en** | Standard partition | Cross-domain zero-shot evaluation |

### Evaluation Procedure

For each test utterance:

1. Select a reference audio clip (3-10 seconds) from the same speaker
2. Provide the reference audio + reference transcription + target text to the model
3. Generate speech
4. Measure metrics against ground truth

### Metrics

#### Quality Metrics

| Metric | What it measures | Tool | Target range |
|--------|------------------|------|--------------|
| **WER** (Word Error Rate) | Intelligibility: can you understand the words? | Whisper-large-v3 transcription → WER vs ground truth text | Lower is better. ZipVoice baseline: 1.64% |
| **SIM-o** (Speaker Similarity - original) | Voice cloning quality: does it sound like the target speaker? | WavLM-TDNN speaker verification model, cosine similarity between generated and original target audio | Higher is better. ZipVoice baseline: 0.668 |
| **UTMOS** | Naturalness/quality: does it sound like real speech? | UTMOS predictor (pretrained MOS estimator) | Higher is better. ZipVoice baseline: 3.98 |

#### Speed Metrics

| Metric | What it measures | How |
|--------|------------------|-----|
| **RTF** (Real-Time Factor) | Time to generate / duration of generated audio | Measure wall-clock inference time, divide by audio length |
| **NFE** (Number of Function Evaluations) | Model forward passes per utterance | Count |
| **Latency (s)** | Absolute time for a 10-second utterance | Measure on fixed hardware |
| **Peak Memory (MB)** | Maximum GPU/CPU memory during inference | `torch.cuda.max_memory_allocated()` |

#### Speed Evaluation Hardware

All speed metrics are measured on:

- **GPU**: Single NVIDIA A100 80GB (for GPU RTF)
- **CPU**: Single-threaded Intel Xeon (for CPU RTF)
- **Batch size**: 1 (real-world latency scenario)
- **Warm-up**: 10 utterances discarded before timing
- **Measurement**: Mean of 50 utterances ± std dev

---

## Success Criteria

### Primary (Must achieve to validate Model B)

| Criterion | Condition | Rationale |
|-----------|-----------|-----------|
| **B-std quality ≥ A-std quality** | B-std WER ≤ A-std WER AND B-std UTMOS ≥ A-std UTMOS | Arch changes must not hurt quality |
| **B-opt quality ≈ B-std quality** | B-opt WER within +0.3% of B-std AND B-opt UTMOS within -0.1 of B-std | Inference optimizations must be near-lossless |
| **B-opt speed > A-std speed** | B-opt RTF < 0.5 × A-std RTF | Must be at least 2× faster |

### Stretch Goals

| Criterion | Condition | What it would prove |
|-----------|-----------|---------------------|
| B-std matches Baseline quality | B-std WER ≤ 2.0% AND UTMOS ≥ 3.8 | Enhanced 55M achieves near-123M quality |
| B-opt achieves 5×+ speedup | B-opt RTF < 0.2 × A-std RTF | Full optimization stack works at scale |
| B-std WER < A-std WER by >0.3% | Statistical significance (p<0.05) | ConvNeXt/GQA/RoPE genuinely help alignment |

### Failure Criteria (Abort/revise)

| Condition | Action |
|-----------|--------|
| B-std quality < A-std on ALL metrics | Arch changes hurt → revert to simpler model, investigate which change caused regression |
| B-opt quality degrades >10% vs B-std | Inference optimizations too aggressive → relax cache schedule or increase NFE |
| Both A-std and B-std WER > 4% | 55M is too small for this task → increase param budget to 70-80M |

---

## Ablation Matrix (Optional, if time allows)

To understand **which specific change** helped or hurt, run these single-change ablations from Model B:

| Ablation | Change from Model B | What it tests |
|----------|---------------------|---------------|
| B - GQA | Use 4 KV heads instead of 2 | Is GQA actually free at this scale? |
| B - DepSepConv | Use standard Linear FFN | Is depthwise sep conv as good as linear? |
| B - ParamSharing | No weight sharing between layers | Does sharing work at 55M? |
| B - DilatedConvNeXt | Remove dilated conv, attention only | Does local conv help? |
| B - RoPE | Use Zipformer native pos enc | Does RoPE matter? |
| B - ConvNeXt Text | Remove text refinement, use 4 encoder layers | Does ConvNeXt help WER? |
| B + NLA | Add NLA back | Was removing NLA actually fine? |

Each ablation trains for the same 500k steps. Report WER + UTMOS + SIM-o for each.

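
Both the primary success criteria and the ablation report reduce to comparing metric tuples between configurations. The sketch below shows one way to encode the three primary criteria and to compute per-ablation deltas against Model B. It assumes that "within +0.3%" means 0.3 absolute WER percentage points and that "within -0.1" means 0.1 absolute UTMOS points; `Metrics`, `delta`, and `primary_criteria` are hypothetical names, and the numbers in the usage example are placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Evaluation results for one configuration (hypothetical structure)."""
    wer: float    # word error rate in percent, lower is better
    utmos: float  # predicted MOS, higher is better
    sim_o: float  # speaker similarity, higher is better
    rtf: float    # real-time factor, lower is better


def delta(variant: Metrics, reference: Metrics) -> dict[str, float]:
    """Per-metric difference (variant minus reference), e.g. an ablation vs Model B."""
    return {
        "wer": variant.wer - reference.wer,
        "utmos": variant.utmos - reference.utmos,
        "sim_o": variant.sim_o - reference.sim_o,
        "rtf": variant.rtf - reference.rtf,
    }


def primary_criteria(a_std: Metrics, b_std: Metrics, b_opt: Metrics) -> dict[str, bool]:
    """Evaluate the three primary criteria; thresholds are read as absolute points."""
    return {
        # Architecture changes must not hurt quality.
        "b_std_quality_ge_a_std": b_std.wer <= a_std.wer and b_std.utmos >= a_std.utmos,
        # Inference optimizations must be near-lossless.
        "b_opt_near_lossless": (b_opt.wer - b_std.wer) <= 0.3
        and (b_std.utmos - b_opt.utmos) <= 0.1,
        # The optimized stack must be at least 2x faster than A-std.
        "b_opt_at_least_2x_faster": b_opt.rtf < 0.5 * a_std.rtf,
    }


# Placeholder numbers for illustration only; these are NOT measured results.
a_std = Metrics(wer=2.4, utmos=3.70, sim_o=0.60, rtf=0.10)
b_std = Metrics(wer=2.2, utmos=3.80, sim_o=0.62, rtf=0.10)
b_opt = Metrics(wer=2.4, utmos=3.75, sim_o=0.61, rtf=0.02)

for name, passed in primary_criteria(a_std, b_std, b_opt).items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

The same `delta` helper can be applied to each ablation row with Model B as the reference. The stretch goal on statistical significance needs per-utterance scores and is not covered by this sketch.
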

---

## Inference Optimization Ablation

To understand which inference trick contributes most:

| Config | Solver | NFE | Steps | Cache | Expected speed |
|--------|--------|-----|-------|-------|----------------|
| B-std | Euler | 16 | Uniform | None | 1× |
| B + EPSS only | Euler | 8 | EPSS | None | ~2× |
| B + Midpoint only | Midpoint | 8 (4 steps) | Uniform | None | ~2× |
| B + Cache only | Euler | 16 | Uniform | SmoothCache | ~1.4× |
| B + EPSS + Midpoint | Midpoint | 8 (4 steps) | EPSS | None | ~4× |
| B-opt (all) | Midpoint | 8 (4 steps) | EPSS | SmoothCache | ~6× |

Report quality (WER, UTMOS, SIM-o) AND speed (RTF) for each.

---

## Reporting Format

### Results Table (Template)

```markdown
| Model | Config | WER↓ | SIM-o↑ | UTMOS↑ | RTF↓ | NFE | Mem (MB) |
|-------|--------|------|--------|--------|------|-----|----------|
| ZipVoice (Baseline) | Euler 16 | 1.64 | 0.668 | 3.98 | X.XX | 16 | XXX |
| Model A (Slim) | Euler 16 | X.XX | X.XXX | X.XX | X.XX | 16 | XXX |
| Model B (Enhanced) | Euler 16 | X.XX | X.XXX | X.XX | X.XX | 16 | XXX |
| Model B (Enhanced) | Optimized | X.XX | X.XXX | X.XX | X.XX | 4eff | XXX |
```

### Visualizations Required

1. **Bar chart**: WER comparison (Baseline vs A vs B-std vs B-opt)
2. **Bar chart**: UTMOS comparison (same)
3. **Scatter plot**: Quality (UTMOS) vs Speed (RTF), Pareto frontier
4. **Training curves**: Loss vs steps for Model A and Model B (convergence comparison)
5. **Ablation heatmap**: Each change → metric delta

---

## Timeline

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| **1. Implementation** | 1-2 weeks | Model A and B code, training scripts |
| **2. Training** | 2-3 weeks | Both models trained for 500k steps |
| **3. Evaluation** | 3-5 days | All metrics computed |
| **4. Ablations** (optional) | 1-2 weeks | Per-change ablation results |
| **5. Report** | 2-3 days | Final comparison document with conclusions |

---

## Decision Framework

After results are in:

```
IF B-std > A-std (quality) AND B-opt >> A-std (speed):
    → Model B wins. Proceed to Malayalam training with Model B architecture.

ELIF B-std ≈ A-std (quality within noise) AND B-opt >> A-std (speed):
    → Model B wins on speed alone. Still proceed with B.

ELIF B-std < A-std (quality regression):
    → Investigate via ablations. Remove harmful changes.
    → Create Model B' with only beneficial changes.
    → Re-evaluate B' vs A.

ELIF both A and B unacceptably bad (WER > 4%):
    → 55M is too small. Scale up to 70-80M.
    → Or revisit training config (more steps, different lr).
```

---

## Dependencies

| Dependency | Source | Status |
|------------|--------|--------|
| ZipVoice training code | [github.com/k2-fsa/ZipVoice](https://github.com/k2-fsa/ZipVoice) | Available |
| LibriTTS dataset | [OpenSLR](https://www.openslr.org/60/) | Available |
| ZipVoice LibriTTS checkpoint (baseline) | [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) `zipvoice_libritts/model.pt` | Available |
| Whisper-large-v3 (WER eval) | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | Available |
| UTMOS predictor | [sarulab-speech/UTMOS](https://github.com/sarulab-speech/UTMOS22) | Available |
| WavLM-TDNN (speaker similarity) | [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large) | Available |
| Vocos vocoder | Bundled with ZipVoice | Available |

---

## References

- ZipVoice: [arXiv:2506.13053](https://arxiv.org/abs/2506.13053)
- Zipformer: [arXiv:2310.11230](https://arxiv.org/abs/2310.11230)
- Fast F5-TTS / EPSS: [arXiv:2505.19931](https://arxiv.org/abs/2505.19931)
- SmoothCache: [arXiv:2411.10510](https://arxiv.org/abs/2411.10510)
- F5-TTS: [arXiv:2410.06885](https://arxiv.org/abs/2410.06885)
- M3-TTS: [arXiv:2512.04720](https://arxiv.org/abs/2512.04720)
- GQA: [arXiv:2305.13245](https://arxiv.org/abs/2305.13245)
- FLY-TTS: [arXiv:2407.00753](https://arxiv.org/abs/2407.00753)
- ResidualTransformer: [arXiv:2310.02489](https://arxiv.org/abs/2310.02489)
- Supertonic 3: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)