---
title: TTS Eval Framework
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---
# Bantrly TTS Evaluation Framework
A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for use in a K-12 speech coaching product. It is built to support data-driven decisions about which TTS engine to deploy in production.
**Author:** Aankit Das
---
## Overview
This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:
| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real-Time Factor: synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines; for free engines, the equivalent cost at Chirp 3 HD pricing |
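
As a rough illustration of how two of these metrics are defined, the sketch below computes RTF from a measured synthesis time and WER as word-level edit distance divided by the number of reference words. The function names `real_time_factor` and `word_error_rate` and the simple lowercase/whitespace normalization are assumptions made for this example, not the code in `evaluator.py`; in the framework, the hypothesis transcript comes from Whisper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(real_time_factor(0.2, 5.0))
    print(word_error_rate("great job reading today", "great job reading to day"))
```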
---
## Engines Evaluated
| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | Yes | $0 |
| Piper (ONNX) | Neural OSS | Yes | $0 |
| Parler-TTS (mini) | Neural OSS | Yes | $0 |
| edge-tts (Microsoft) | Neural cloud | No | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | No | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | No | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | Yes | $0 |
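
The two paid engines bill in different units (per second versus per character), so comparing them means converting both to a cost per utterance. The sketch below shows one way to do that using the approximate rates from the table above. The function names are hypothetical, and whether RunPod "seconds" means compute time or audio duration is not specified here, so the billed seconds are simply passed in. Applying the per-character function to a free engine's input text would give a Chirp 3 HD equivalent cost, assuming that is how the equivalent is defined.

```python
RUNPOD_USD_PER_SECOND = 0.001        # approximate Chatterbox Turbo rate from the table
CHIRP_USD_PER_MILLION_CHARS = 16.0   # approximate Chirp 3 HD rate from the table


def per_second_cost(billed_seconds: float, rate: float = RUNPOD_USD_PER_SECOND) -> float:
    """Cost in USD for an engine billed by the second."""
    return billed_seconds * rate


def per_char_cost(text: str, rate_per_million: float = CHIRP_USD_PER_MILLION_CHARS) -> float:
    """Cost in USD for an engine billed per character of input text."""
    return len(text) * rate_per_million / 1_000_000


if __name__ == "__main__":
    script = "Nice work! Try reading that sentence one more time, a little slower."
    print(f"per-second pricing, 4 billed seconds: ${per_second_cost(4.0):.6f}")
    print(f"per-character pricing, {len(script)} chars: ${per_char_cost(script):.6f}")
```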
---
## Project Structure
```
├── app/                          # Gradio evaluation app
│   ├── app.py                    # main UI
│   ├── evaluator.py              # WER, UTMOS, RTF, cost metrics
│   ├── engines/                  # pluggable TTS engine implementations
│   │   ├── base.py               # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py       # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                  # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb          # Project B evaluation notebook
├── src/                          # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py            # 6 standardized coaching scripts
├── results/                      # notebook results (CSVs, charts)
└── voices/piper/                 # Piper ONNX voice models (download separately)
```
## Setup
### Prerequisites
- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)
### Install
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```
### Install CUDA PyTorch (required for GPU inference)
```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
### Download Piper voices
```bash
uv run python -c "
from pathlib import Path
import urllib.request
voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)
base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'
for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```
### Environment variables
Create `app/.env`:
```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```
Required for cloud engines:
- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (when available)
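
Before launching, it can help to confirm which cloud engines have credentials available. The sketch below is a hypothetical helper, not part of the repository: `missing_credentials` and the `REQUIRED_ENV` mapping are names invented for this example. It only inspects the process environment (or a mapping passed in) and does not read `app/.env` itself.

```python
import os

# Mapping taken from the list above: engine name -> required environment variable.
REQUIRED_ENV = {
    "Chatterbox Turbo": "RUNPOD_API_KEY",
    "Chirp 3 HD": "GOOGLE_APPLICATION_CREDENTIALS",
}


def missing_credentials(env=None) -> dict:
    """Return {engine: variable} for each cloud engine whose variable is unset or empty."""
    env = os.environ if env is None else env
    return {engine: var for engine, var in REQUIRED_ENV.items() if not env.get(var)}


if __name__ == "__main__":
    for engine, var in missing_credentials().items():
        print(f"{engine}: set {var} to enable this engine")
```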
---
## Running the App
```bash
cd app
uv run gradio app.py
```
Open `http://127.0.0.1:7860` in your browser.
---
## Adding a New Engine
1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`
The framework picks it up automatically; no other changes are needed.
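
The steps above can be sketched as follows. To keep the example self-contained, it defines a stand-in `TTSEngine` base class; the real one lives in `app/engines/base.py` and its interface may differ. `SilenceEngine`, the `speed` values, the assumed speaking rate, and the keys of the returned dict are all hypothetical and exist only to show the shape of an engine.

```python
import wave
from abc import ABC, abstractmethod


class TTSEngine(ABC):
    """Stand-in for the base class in app/engines/base.py (real interface may differ)."""

    name = "base"
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        """Write audio for `text` to `output_path` and return metadata."""


class SilenceEngine(TTSEngine):
    """Hypothetical engine that writes silence; useful only for wiring tests."""

    name = "silence"
    # Hypothetical per-band tuning, keyed by the four grade bands.
    BAND_CONFIG = {
        "K-2": {"speed": 0.85},
        "3-5": {"speed": 0.95},
        "6-8": {"speed": 1.0},
        "9-12": {"speed": 1.05},
    }
    SAMPLE_RATE = 16_000
    CHARS_PER_SECOND = 15  # assumed speaking rate, used to size the silent clip

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        speed = self.BAND_CONFIG[band]["speed"]
        duration = max(len(text) / (self.CHARS_PER_SECOND * speed), 0.1)
        n_frames = int(duration * self.SAMPLE_RATE)
        with wave.open(output_path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit PCM
            wav.setframerate(self.SAMPLE_RATE)
            wav.writeframes(b"\x00\x00" * n_frames)
        # Returned keys are assumptions; match whatever the evaluator expects.
        return {"path": output_path, "duration_s": n_frames / self.SAMPLE_RATE}
```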
---
## Evaluation Methodology
Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.
Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
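
A minimal sketch of per-run logging is shown below, assuming a flat CSV with one row per synthesis. The column names in `FIELDS` and the helper `append_result` are hypothetical; the actual schema of `eval_log.csv` may differ.

```python
import csv
from pathlib import Path

# Hypothetical column set; the real eval_log.csv schema may differ.
FIELDS = ["engine", "band", "utmos", "wer", "rtf", "cost_usd"]


def append_result(log_path, row: dict) -> None:
    """Append one evaluation row, writing the header only if the file is new or empty."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```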
---
## Key Findings
- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4), but its per-second pricing costs more than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products: zero warmth, robotic output
---
## References
- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS: Text-to-Speech Distribution Score*. SSW.