# Astra-TTS Benchmark PRD: Architecture Comparison

## Overview

| Field | Value |
|-------|-------|
| **Project** | Astra-TTS Architecture Evaluation |
| **Goal** | Determine whether architectural improvements (Model B) outperform naive shrinking (Model A) at ~55M params |
| **Dataset** | LibriTTS (English) |
| **Baseline** | Original ZipVoice 123M (k2-fsa/ZipVoice) |
| **Models Under Test** | Model A (Slim 55M), Model B (Enhanced 55M) |
| **Evaluation Configurations** | 4 total (see below) |

---

## Hypothesis

> At the same parameter budget (~55M), architectural improvements (GQA, Depthwise Separable Conv, Grouped Parameter Sharing, Dilated ConvNeXt, RoPE, ConvNeXt text refinement, and removal of NLA) will yield **equal or better quality** than naive shrinking, while the inference stack (EPSS + Midpoint + SmoothCache) enables **significantly faster inference**.

---

## Models & Configurations to Evaluate

| ID | Model | Params | Inference Mode | Purpose |
|----|-------|--------|---------------|---------|
| **Baseline** | ZipVoice Original | 123M | Euler 16 NFE uniform | Reference — published numbers |
| **A-std** | Model A (Slim) | 55M | Euler 16 NFE uniform | Naive shrink baseline |
| **B-std** | Model B (Enhanced) | 55M | Euler 16 NFE uniform | **Quality comparison** (fair, same inference as A) |
| **B-opt** | Model B (Enhanced) | 55M | Midpoint 4-step + EPSS + SmoothCache | **Speed comparison** (full optimized stack) |

### Why these 4?

- **Baseline vs A-std**: How much does shrinking from 123M→55M cost in quality?
- **A-std vs B-std**: Do the architectural improvements help at the same size and the same inference settings? (Quality ablation)
- **B-std vs B-opt**: How much speed do inference optimizations add? (Speed ablation)
- **A-std vs B-opt**: The real comparison: at the same parameter count, is B both faster and better?

---

## Training Protocol

All models (A and B) must be trained under **identical conditions** for fair comparison:

| Parameter | Value |
|-----------|-------|
| **Dataset** | LibriTTS (train-clean-100 + train-clean-360 + train-other-500) |
| **Total hours** | ~585 hours |
| **Audio preprocessing** | Resample to 24kHz, trim silence, normalize volume |
| **Text preprocessing** | IPA phonemization via eSpeak-ng (same as ZipVoice) |
| **Optimizer** | ScaledAdam |
| **Learning rate** | 0.045 (linear warmup 5000 steps) |
| **Batch strategy** | Dynamic batching, max 300s total duration per batch |
| **Training steps** | 500,000 steps (both models, same count) |
| **Gradient clipping** | 1.0 |
| **EMA** | 0.9999 (for evaluation) |
| **Random seed** | Fixed (42) for reproducibility |
| **Mixed precision** | bf16 |
| **Checkpoint selection** | Best validation loss or the final step-500k checkpoint; apply the same rule to both models and state which one is reported |
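
One way to keep the two training runs identical is to pin the table above in a single immutable config that both models import. The sketch below is illustrative only: `SharedTrainConfig` and `warmup_lr` are hypothetical names, and the learning rate is assumed to stay constant after warmup because the table specifies only the warmup phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedTrainConfig:
    """Hypothetical container mirroring the Training Protocol table."""
    optimizer: str = "ScaledAdam"
    base_lr: float = 0.045
    warmup_steps: int = 5000
    max_batch_seconds: float = 300.0
    total_steps: int = 500_000
    grad_clip: float = 1.0
    ema_decay: float = 0.9999
    seed: int = 42
    precision: str = "bf16"


def warmup_lr(step: int, cfg: SharedTrainConfig) -> float:
    """Linear warmup from 0 to base_lr.

    ASSUMPTION: the rate is held constant after warmup; the table
    does not say what happens once warmup ends.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return cfg.base_lr * min(1.0, step / cfg.warmup_steps)


# Model A and Model B would both read from this one instance.
CFG = SharedTrainConfig()
```

Because the dataclass is frozen, an accidental per-model override raises an error instead of silently changing the comparison.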

### Baseline Model

The original ZipVoice 123M checkpoint from [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) is used directly. The `zipvoice_libritts/model.pt` variant (trained on LibriTTS) is the correct baseline since our models are also trained on LibriTTS.

---

## Evaluation Protocol

### Test Sets

| Test Set | Partition | Purpose |
|----------|---------|---------|
| **LibriSpeech-PC test-clean** | Standard partition | Primary benchmark (matches ZipVoice paper) |
| **Seed-TTS test-en** | Standard partition | Cross-domain zero-shot evaluation |

### Evaluation Procedure

For each test utterance:
1. Select a reference audio clip (3-10 seconds) from the same speaker
2. Provide the reference audio + reference transcription + target text to the model
3. Generate speech
4. Measure metrics against ground truth
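
Steps 1 and 2 above can be sketched as follows. The `Utterance` type, the function names, and the dictionary keys are hypothetical, and the sketch assumes the reference clip must be a different utterance from the target, which the procedure does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Utterance:
    utt_id: str
    speaker: str
    text: str
    duration_s: float


def pick_reference(
    target: Utterance,
    pool: Sequence[Utterance],
    min_s: float = 3.0,
    max_s: float = 10.0,
) -> Optional[Utterance]:
    """Return the first same-speaker clip of 3-10 s, or None if none exists."""
    for cand in pool:
        same_speaker = cand.speaker == target.speaker
        # ASSUMPTION: the prompt must not be the target utterance itself.
        is_other_clip = cand.utt_id != target.utt_id
        if same_speaker and is_other_clip and min_s <= cand.duration_s <= max_s:
            return cand
    return None


def build_request(target: Utterance, reference: Utterance) -> Dict[str, str]:
    """Bundle the three inputs from step 2 (key names are hypothetical)."""
    return {
        "prompt_audio_id": reference.utt_id,
        "prompt_text": reference.text,
        "target_text": target.text,
    }
```

Picking the first match in pool order keeps the choice deterministic, so every model is prompted with the same reference clip for a given target.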

### Metrics

#### Quality Metrics

| Metric | What it measures | Tool | Target range |
|--------|-----------------|------|-------------|
| **WER** (Word Error Rate) | Intelligibility — can you understand the words? | Whisper-large-v3 transcription → WER vs ground truth text | Lower is better. ZipVoice baseline: 1.64% |
| **SIM-o** (Speaker Similarity - original) | Voice cloning quality — does it sound like the target speaker? | WavLM-TDNN speaker verification model, cosine similarity between generated and original target audio | Higher is better. ZipVoice baseline: 0.668 |
| **UTMOS** | Naturalness/quality — does it sound like real speech? | UTMOS predictor (pretrained MOS estimator) | Higher is better. ZipVoice baseline: 3.98 |
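
For orientation, the arithmetic behind WER and SIM-o is small enough to sketch in plain Python. This is not the evaluation pipeline itself: the ASR transcript and the speaker embeddings are assumed to come from the tools in the table, and the text normalization here (lowercasing and whitespace splitting) is a deliberate simplification.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance with a rolling row over the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A corpus-level WER is usually computed as total errors over total reference words rather than as a mean of per-utterance rates, so long and short utterances are weighted by length.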

#### Speed Metrics

| Metric | What it measures | How |
|--------|-----------------|-----|
| **RTF** (Real-Time Factor) | Time to generate / duration of generated audio | Measure wall-clock inference time, divide by audio length |
| **NFE** (Number of Function Evaluations) | Model forward passes per utterance | Count |
| **Latency (s)** | Absolute time for a 10-second utterance | Measure on fixed hardware |
| **Peak Memory (MB)** | Maximum GPU/CPU memory during inference | `torch.cuda.max_memory_allocated()` on GPU; peak process RSS on CPU |

#### Speed Evaluation Hardware

All speed metrics measured on:
- **GPU**: Single NVIDIA A100 80GB (for GPU RTF)
- **CPU**: Single-threaded Intel Xeon (for CPU RTF)
- **Batch size**: 1 (real-world latency scenario)
- **Warm-up**: 10 utterances discarded before timing
- **Measurement**: Mean of 50 utterances ± std dev
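
The warm-up and averaging rules above can be captured in a small timing harness. This is a sketch under stated assumptions: `synthesize` is a hypothetical callable that generates one utterance and returns the duration of the generated audio in seconds, and the `clock` argument exists only so the harness can be tested with a fake timer.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple


def measure_rtf(
    synthesize: Callable[[str], float],
    texts: Sequence[str],
    warmup: int = 10,
    timed: int = 50,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (mean RTF, sample std dev) over `timed` utterances.

    `synthesize(text)` is hypothetical: it must block until generation
    finishes and return the generated audio duration in seconds. On GPU
    that means calling torch.cuda.synchronize() before it returns.
    """
    if len(texts) < warmup + timed:
        raise ValueError("not enough texts for warm-up plus timed runs")
    # Warm-up runs are executed but never timed.
    for text in texts[:warmup]:
        synthesize(text)
    rtfs = []
    for text in texts[warmup:warmup + timed]:
        start = clock()
        audio_seconds = synthesize(text)
        elapsed = clock() - start
        rtfs.append(elapsed / audio_seconds)
    return statistics.mean(rtfs), statistics.stdev(rtfs)
```

Averaging per-utterance RTFs, rather than dividing total time by total audio, matches the "mean ± std dev" wording; whichever convention is used should be the same for every configuration.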

---

## Success Criteria

### Primary (Must achieve to validate Model B)

| Criterion | Condition | Rationale |
|-----------|-----------|-----------|
| **B-std quality ≥ A-std quality** | B-std WER ≤ A-std WER AND B-std UTMOS ≥ A-std UTMOS | Arch changes must not hurt quality |
| **B-opt quality ≈ B-std quality** | B-opt WER at most 0.3 percentage points above B-std AND B-opt UTMOS at most 0.1 below B-std | Inference optimizations must be near-lossless |
| **B-opt speed > A-std speed** | B-opt RTF < 0.5 × A-std RTF | Must be at least 2× faster |
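
The three primary criteria are mechanical enough to check in code once the metrics exist. The sketch below uses a hypothetical `Result` record and assumes the 0.3 WER margin is in absolute percentage points.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Result:
    wer: float    # word error rate in percent
    utmos: float  # predicted MOS
    rtf: float    # real-time factor


def primary_criteria(a_std: Result, b_std: Result, b_opt: Result) -> Dict[str, bool]:
    """Evaluate each primary success criterion independently."""
    return {
        # Architecture changes must not hurt quality.
        "b_std_quality_ge_a_std": (
            b_std.wer <= a_std.wer and b_std.utmos >= a_std.utmos
        ),
        # ASSUMPTION: 0.3 is absolute percentage points of WER.
        "b_opt_near_lossless": (
            b_opt.wer - b_std.wer <= 0.3 and b_std.utmos - b_opt.utmos <= 0.1
        ),
        # B-opt must be at least twice as fast as A-std.
        "b_opt_2x_faster": b_opt.rtf < 0.5 * a_std.rtf,
    }


def model_b_validated(checks: Dict[str, bool]) -> bool:
    """Model B is validated only if every primary criterion holds."""
    return all(checks.values())
```

Returning the individual flags, not just the final verdict, makes it clear which criterion failed when Model B is not validated.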

### Stretch Goals

| Criterion | Condition | What it would prove |
|-----------|-----------|---------------------|
| B-std matches Baseline quality | B-std WER ≤ 2.0% AND UTMOS ≥ 3.8 | Enhanced 55M achieves near-123M quality |
| B-opt achieves 5×+ speedup | B-opt RTF < 0.2 × A-std RTF | Full optimization stack works at scale |
| B-std WER lower than A-std WER by more than 0.3 percentage points | Difference is statistically significant (p<0.05) | ConvNeXt/GQA/RoPE genuinely help alignment |
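
The significance stretch goal does not name a test. One common choice for paired per-utterance results is a paired bootstrap, sketched below as an assumption rather than the prescribed method. The function name and inputs are hypothetical: per-utterance error counts for each model plus the reference word count of each utterance.

```python
import random
from typing import Sequence


def paired_bootstrap_wer(
    errors_a: Sequence[int],
    errors_b: Sequence[int],
    ref_words: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 42,
) -> float:
    """Estimate a one-sided p-value for "B has lower corpus WER than A".

    Returns the fraction of resampled test sets on which B's WER is
    NOT lower than A's. Every utterance needs at least one reference word.
    """
    n = len(ref_words)
    if not (len(errors_a) == len(errors_b) == n) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping A/B pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_words[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        if wer_b >= wer_a:
            not_better += 1
    return not_better / n_resamples
```

Resampling utterances as pairs keeps the comparison within each test item, which matters because some utterances are hard for both models.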

### Failure Criteria (Abort/revise)

| Condition | Action |
|-----------|--------|
| B-std quality < A-std on ALL metrics | Arch changes hurt → revert to simpler model, investigate which change caused regression |
| B-opt WER or UTMOS degrades by more than 10% (relative) vs B-std | Inference optimizations too aggressive → relax cache schedule or increase NFE |
| Both A-std and B-std WER > 4% | 55M is too small for this task → increase param budget to 70-80M |

---

## Ablation Matrix (Optional, if time allows)

To understand **which specific change** helped or hurt, run these single-change ablations from Model B:

| Ablation | Change from Model B | What it tests |
|----------|--------------------| --------------|
| B - GQA | Use 4 KV heads instead of 2 | Is GQA actually free at this scale? |
| B - DepSepConv | Use standard Linear FFN | Is depthwise sep conv as good as linear? |
| B - ParamSharing | No weight sharing between layers | Does sharing work at 55M? |
| B - DilatedConvNeXt | Remove dilated conv, attention only | Does local conv help? |
| B - RoPE | Use Zipformer native pos enc | Does RoPE matter? |
| B - ConvNeXt Text | Remove text refinement, use 4 encoder layers | Does ConvNeXt help WER? |
| B + NLA | Add NLA back | Was removing NLA actually fine? |

Each ablation trains for the same 500k steps. Report WER, UTMOS, and SIM-o for each.
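
To read the ablations at a glance, each run can be reported as a per-metric delta against the unmodified Model B. A minimal sketch, with a hypothetical function name and metric keys:

```python
from typing import Dict

Metrics = Dict[str, float]


def metric_deltas(base: Metrics, ablations: Dict[str, Metrics]) -> Dict[str, Metrics]:
    """Return ablation-minus-base for every metric, rounded to 4 decimals.

    Sign convention: a positive WER delta means the ablation is worse,
    while a positive UTMOS or SIM-o delta means the ablation is better.
    """
    return {
        name: {metric: round(values[metric] - base[metric], 4) for metric in base}
        for name, values in ablations.items()
    }
```

The same dictionary can feed the ablation heatmap directly, with one row per ablation and one column per metric.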

---

## Inference Optimization Ablation

To understand which inference trick contributes most:

| Config | Solver | NFE | Step schedule | Cache | Expected speed |
|--------|--------|-----|-------|-------|---------------|
| B-std | Euler | 16 | Uniform | None | 1× |
| B + EPSS only | Euler | 8 | EPSS | None | ~2× |
| B + Midpoint only | Midpoint | 8 (4 steps) | Uniform | None | ~2× |
| B + Cache only | Euler | 16 | Uniform | SmoothCache | ~1.4× |
| B + EPSS + Midpoint | Midpoint | 8 (4 steps) | EPSS | None | ~4× |
| B-opt (all) | Midpoint | 8 (4 steps) | EPSS | SmoothCache | ~6× |

Report quality (WER, UTMOS, SIM-o) AND speed (RTF) for each.
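
The NFE column follows from how the two solvers work: Euler evaluates the model once per step, and the midpoint method evaluates it twice. The toy integrators below make that accounting explicit on a scalar ODE. They are illustrative only; the real model integrates tensors, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

VectorField = Callable[[float, float], float]  # f(t, x) -> dx/dt


def euler(f: VectorField, x: float, ts: Sequence[float]) -> Tuple[float, int]:
    """First-order Euler: one function evaluation per step."""
    nfe = 0
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * f(t0, x)
        nfe += 1
    return x, nfe


def midpoint(f: VectorField, x: float, ts: Sequence[float]) -> Tuple[float, int]:
    """Second-order midpoint: two function evaluations per step."""
    nfe = 0
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        x_mid = x + 0.5 * h * f(t0, x)       # half step to the midpoint
        x = x + h * f(t0 + 0.5 * h, x_mid)   # full step using the midpoint slope
        nfe += 2
    return x, nfe


def uniform_grid(n_steps: int) -> List[float]:
    """Uniform time grid on [0, 1] with n_steps intervals."""
    return [i / n_steps for i in range(n_steps + 1)]
```

Both solvers accept an arbitrary time grid, so a non-uniform schedule such as the EPSS rows in the table is just a different `ts` list. If wall-clock cost scaled only with NFE, 8 evaluations would be about 2× faster than 16, so any expected speedup beyond that has to come from somewhere other than fewer evaluations, such as caching.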

---

## Reporting Format

### Results Table (Template)

```markdown
| Model | Config | WER↓ | SIM-o↑ | UTMOS↑ | RTF↓ | NFE | Mem (MB) |
|-------|--------|------|--------|--------|------|-----|----------|
| ZipVoice (Baseline) | Euler 16 | 1.64 | 0.668 | 3.98 | X.XX | 16 | XXX |
| Model A (Slim) | Euler 16 | X.XX | X.XXX | X.XX | X.XX | 16 | XXX |
| Model B (Enhanced) | Euler 16 | X.XX | X.XXX | X.XX | X.XX | 16 | XXX |
| Model B (Enhanced) | Optimized | X.XX | X.XXX | X.XX | X.XX | 8 (4 steps) | XXX |
```

### Visualizations Required

1. **Bar chart**: WER comparison (Baseline vs A vs B-std vs B-opt)
2. **Bar chart**: UTMOS comparison (same)
3. **Scatter plot**: Quality (UTMOS) vs Speed (RTF) — Pareto frontier
4. **Training curves**: Loss vs steps for Model A and Model B (convergence comparison)
5. **Ablation heatmap**: Each change → metric delta
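
For the quality-vs-speed scatter plot, the Pareto frontier is the set of configurations that no other configuration beats on both axes. A minimal sketch, assuming each configuration is summarized as an `(RTF, UTMOS)` pair and using a hypothetical function name:

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of non-dominated configs.

    `points` maps a config name to (rtf, utmos). Lower RTF is better,
    higher UTMOS is better. A config is dominated if another one is at
    least as good on both axes and strictly better on at least one.
    """
    frontier = []
    for name, (rtf, utmos) in points.items():
        dominated = any(
            o_rtf <= rtf and o_utmos >= utmos and (o_rtf < rtf or o_utmos > utmos)
            for other, (o_rtf, o_utmos) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

The result preserves input order, so the frontier can be highlighted on the scatter plot without re-sorting the configurations.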

---

## Timeline

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| **1. Implementation** | 1-2 weeks | Model A and B code, training scripts |
| **2. Training** | 2-3 weeks | Both models trained for 500k steps |
| **3. Evaluation** | 3-5 days | All metrics computed |
| **4. Ablations** (optional) | 1-2 weeks | Per-change ablation results |
| **5. Report** | 2-3 days | Final comparison document with conclusions |

---

## Decision Framework

After results are in:

```
IF both A-std and B-std are unacceptably bad (WER > 4%):
    → 55M is too small. Scale up to 70-80M.
    → Or revisit training config (more steps, different lr).

ELIF B-std > A-std (quality) AND B-opt >> A-std (speed):
    → Model B wins. Proceed to Malayalam training with Model B architecture.

ELIF B-std ≈ A-std (quality within noise) AND B-opt >> A-std (speed):
    → Model B wins on speed alone. Still proceed with B.

ELIF B-std < A-std (quality regression):
    → Investigate via ablations. Remove harmful changes.
    → Create Model B' with only beneficial changes.
    → Re-evaluate B' vs A.
```
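
The framework can also be written as a small function so the decision is reproducible from the reported numbers. This is a sketch under several assumptions: `quality_delta` is a caller-defined scalar where positive means B-std is better than A-std, `noise` is the tolerance for "within noise", and ">>" on speed is read as at least the 2× factor from the primary criteria. The WER ceiling is checked first so that it cannot be shadowed by the quality comparison, and the final return covers a case the framework does not describe.

```python
def decide(
    a_std_wer: float,
    b_std_wer: float,
    quality_delta: float,
    speedup: float,
    noise: float = 0.05,
    min_speedup: float = 2.0,
    max_wer: float = 4.0,
) -> str:
    """Map benchmark results to one of the framework's outcomes.

    All thresholds are hypothetical defaults; WER values are in percent
    and `speedup` is A-std RTF divided by B-opt RTF.
    """
    # Both models too weak: the parameter budget is the problem.
    if a_std_wer > max_wer and b_std_wer > max_wer:
        return "scale_up_or_revisit_training"
    # Clear quality regression beyond the noise tolerance.
    if quality_delta < -noise:
        return "investigate_via_ablations"
    # Quality is better or within noise, and B-opt is fast enough.
    if speedup >= min_speedup:
        return "proceed_with_model_b"
    # Quality holds but the speed target is missed: not covered above.
    return "not_covered_by_framework"
```

For example, `decide(2.5, 2.3, quality_delta=0.1, speedup=3.0)` falls through to the "proceed with Model B" outcome.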

---

## Dependencies

| Dependency | Source | Status |
|-----------|--------|--------|
| ZipVoice training code | [github.com/k2-fsa/ZipVoice](https://github.com/k2-fsa/ZipVoice) | Available |
| LibriTTS dataset | [OpenSLR](https://www.openslr.org/60/) | Available |
| ZipVoice LibriTTS checkpoint (baseline) | [k2-fsa/ZipVoice](https://huggingface.co/k2-fsa/ZipVoice) `zipvoice_libritts/model.pt` | Available |
| Whisper-large-v3 (WER eval) | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | Available |
| UTMOS predictor | [sarulab-speech/UTMOS22](https://github.com/sarulab-speech/UTMOS22) | Available |
| WavLM-TDNN (speaker similarity) | [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large) | Available |
| Vocos vocoder | Bundled with ZipVoice | Available |

---

## References

- ZipVoice: [arXiv:2506.13053](https://arxiv.org/abs/2506.13053)
- Zipformer: [arXiv:2310.11230](https://arxiv.org/abs/2310.11230)
- Fast F5-TTS / EPSS: [arXiv:2505.19931](https://arxiv.org/abs/2505.19931)
- SmoothCache: [arXiv:2411.10510](https://arxiv.org/abs/2411.10510)
- F5-TTS: [arXiv:2410.06885](https://arxiv.org/abs/2410.06885)
- M3-TTS: [arXiv:2512.04720](https://arxiv.org/abs/2512.04720)
- GQA: [arXiv:2305.13245](https://arxiv.org/abs/2305.13245)
- FLY-TTS: [arXiv:2407.00753](https://arxiv.org/abs/2407.00753)
- ResidualTransformer: [arXiv:2310.02489](https://arxiv.org/abs/2310.02489)
- Supertonic 3: [Supertone/supertonic-3](https://huggingface.co/Supertone/supertonic-3)