---
title: TTS Eval Framework
emoji: 🎙
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---

# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for use in a K-12 speech coaching product. Built to help make data-driven decisions about which TTS engine to deploy in production.

**Author:** Aankit Das

---

## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:

| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real Time Factor: synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines; Chirp 3 HD-equivalent cost for free engines |
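
The three arithmetic metrics in the table can be sketched in a few lines. This is a minimal illustration, not the code in `app/evaluator.py`: the function names, the text normalization, and the assumption that the Whisper transcript arrives as a plain string are all hypothetical. UTMOS is omitted because it requires a trained model.

```python
import re

# Approximate Chirp 3 HD rate taken from the engines table in this README.
CHIRP_USD_PER_MILLION_CHARS = 16.0


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def _words(text: str) -> list[str]:
    """Lowercase and split into word tokens (assumed normalization, hypothetical)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def chirp_equivalent_cost(text: str) -> float:
    """Cost in USD if the same text were synthesized at the Chirp 3 HD rate."""
    return len(text) * CHIRP_USD_PER_MILLION_CHARS / 1_000_000
```

For example, a transcript that splits "today" into "to day" counts as one substitution plus one insertion against a four-word reference, giving a WER of 0.5.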

---

## Engines Evaluated

| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |

---

## Project Structure

```
├── app/                           # Gradio evaluation app
│   ├── app.py                     # main UI
│   ├── evaluator.py               # WER, UTMOS, RTF, cost metrics
│   ├── engines/                   # pluggable TTS engine implementations
│   │   ├── base.py                # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py        # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                   # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb           # Project B evaluation notebook
├── src/                           # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py             # 6 standardized coaching scripts
├── results/                       # notebook results (CSVs, charts)
└── voices/piper/                  # Piper ONNX voice models (download separately)
```

## Setup

### Prerequisites

- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)

### Install
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

### Install CUDA PyTorch (required for GPU inference)
```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

### Download Piper voices
```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```

### Environment variables

Create `app/.env`:
```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:
- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (when available)

---

## Running the App
```bash
cd app
uv run gradio app.py
```

Open `http://127.0.0.1:7860` in your browser.

---

## Adding a New Engine

1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`

The framework picks it up automatically; no other changes are needed.
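
As a rough sketch of steps 1 to 3, the example below defines a toy engine that writes a sine tone instead of speech. The `TTSEngine` class here is a stand-in so the snippet runs on its own; the real base class lives in `app/engines/base.py` and its exact interface may differ. The `ToneEngine` name, the `BAND_CONFIG` contents, and the keys of the returned dict are all assumptions. Registration (step 4) is not shown because it depends on how `app/engines/__init__.py` is organized.

```python
import math
import struct
import time
import wave
from abc import ABC, abstractmethod
from pathlib import Path


class TTSEngine(ABC):
    """Stand-in for the base class in app/engines/base.py (interface assumed)."""

    name = "base"
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        """Write audio for `text` to `output_path` and return run metadata."""


class ToneEngine(TTSEngine):
    """Hypothetical engine that emits a 220 Hz tone in place of speech."""

    name = "tone"
    # Hypothetical per-band tuning; a real engine would map bands to voices and speeds.
    BAND_CONFIG = {
        "K-2": {"speed": 0.85},
        "3-5": {"speed": 0.95},
        "6-8": {"speed": 1.0},
        "9-12": {"speed": 1.05},
    }
    SAMPLE_RATE = 16_000

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        speed = self.BAND_CONFIG[band]["speed"]
        start = time.perf_counter()
        # Pretend each character takes 60 ms of audio at speed 1.0.
        duration = max(0.1, len(text) * 0.06 / speed)
        n_samples = int(duration * self.SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * i / self.SAMPLE_RATE)))
            for i in range(n_samples)
        )
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        with wave.open(str(output_path), "wb") as wav:
            wav.setnchannels(1)   # mono
            wav.setsampwidth(2)   # 16-bit PCM
            wav.setframerate(self.SAMPLE_RATE)
            wav.writeframes(frames)
        # Assumed metadata keys; check base.py for what the evaluator expects.
        return {
            "output_path": str(output_path),
            "audio_seconds": n_samples / self.SAMPLE_RATE,
            "synthesis_seconds": time.perf_counter() - start,
        }
```

Calling `ToneEngine().synthesize("Nice work!", "K-2", "out/tone.wav")` writes a mono 16-bit WAV and returns the timing values needed to compute RTF.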

---

## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
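
One way to append a run's automated metrics to a CSV log is sketched below. The column names and the `append_eval_row` helper are hypothetical; the actual schema of `app/results/eval_log.csv` is not documented here and may differ.

```python
import csv
from pathlib import Path

# Assumed column set; adjust to match the real eval_log.csv schema.
FIELDS = ["engine", "script_id", "band", "utmos", "wer", "rtf", "cost_usd"]


def append_eval_row(log_path: str, row: dict) -> None:
    """Append one evaluation result, writing the header if the file is new or empty."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        # DictWriter raises ValueError if `row` contains keys outside FIELDS.
        writer.writerow(row)
```

Opening the file in append mode keeps earlier runs intact, so results from different engines accumulate in a single log that can be loaded later for comparison.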

---

## Key Findings

- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products: its output is robotic and has no warmth

---

## References

- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS: Text-to-Speech Distribution Score*. SSW.