---
title: TTS Eval Framework
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---
# Bantrly TTS Evaluation Framework
A research-grade evaluation framework that compares Text-to-Speech (TTS) engines as candidates for a K-12 speech coaching product. Built to support a data-driven decision about which TTS engine to deploy in production.
**Author:** Aankit Das
---
## Overview
This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:
| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real-time factor: synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines, Chirp 3 HD equivalent for free engines |
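WER and RTF have exact definitions, so they can be sketched in a few lines of dependency-free Python. The helper names below are illustrative only, not the framework's actual API (in the app, the hypothesis text comes from a Whisper transcript of the synthesized audio):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the engine synthesizes faster than playback."""
    return synthesis_seconds / audio_seconds

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(real_time_factor(0.5, 10.0))                    # 0.05
```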
---
## Engines Evaluated
| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |
---
## Project Structure
```
├── app/                        # Gradio evaluation app
│   ├── app.py                  # main UI
│   ├── evaluator.py            # WER, UTMOS, RTF, cost metrics
│   ├── engines/                # pluggable TTS engine implementations
│   │   ├── base.py             # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py     # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb        # Project B evaluation notebook
├── src/                        # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py          # 6 standardized coaching scripts
├── results/                    # notebook results (CSVs, charts)
└── voices/piper/               # Piper ONNX voice models (download separately)
```
## Setup
### Prerequisites
- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)
### Install
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```
### Install CUDA PyTorch (required for GPU inference)
```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
### Download Piper voices
```bash
uv run python -c "
from pathlib import Path
import urllib.request
voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)
base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'
for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```
### Environment variables
Create `app/.env`:
```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```
Required for cloud engines:
- `RUNPOD_API_KEY` for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS` for Chirp 3 HD (when available)
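A startup check like the following can surface missing keys early. This is a hypothetical sketch, not code from the app; the variable names match the list above, but how the framework actually behaves when a key is absent may differ:

```python
import os

# Required keys mapped to the engine each one unlocks.
REQUIRED_KEYS = {
    "RUNPOD_API_KEY": "Chatterbox Turbo",
    "GOOGLE_APPLICATION_CREDENTIALS": "Chirp 3 HD",
}

def missing_engine_keys(env=os.environ):
    """Return required keys that are absent or empty in the environment."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

for key in missing_engine_keys():
    print(f"warning: {key} not set; {REQUIRED_KEYS[key]} will be unavailable")
```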
---
## Running the App
```bash
cd app
uv run gradio app.py
```
Open `http://127.0.0.1:7860` in your browser.
---
## Adding a New Engine
1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`
The framework picks it up automatically; no other changes are needed.
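The steps above can be sketched as follows. This assumes the `synthesize(text, band, output_path) -> dict` signature described in step 2; the exact base class in `app/engines/base.py`, the `BAND_CONFIG` keys, and the returned dict fields shown here are illustrative assumptions:

```python
from abc import ABC, abstractmethod

class TTSEngine(ABC):
    """Assumed shape of the abstract base in app/engines/base.py."""
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        ...

class YourEngine(TTSEngine):
    # Per-band voice/speed tuning; keys assumed to match the grade bands.
    BAND_CONFIG = {
        "K-2":  {"voice": "child_friendly", "speed": 0.9},
        "9-12": {"voice": "neutral", "speed": 1.0},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG.get(band, {"voice": "neutral", "speed": 1.0})
        # ... call your TTS backend here and write audio to output_path ...
        return {"output_path": output_path, "voice": cfg["voice"],
                "duration_sec": 0.0}

engine = YourEngine()
result = engine.synthesize("Great job!", "K-2", "out.wav")
print(result["voice"])  # child_friendly
```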
---
## Evaluation Methodology
Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.
Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
---
## Key Findings
- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products: zero warmth, robotic output
---
## References
- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS β€” Text-to-Speech Distribution Score*. SSW.