---
title: TTS Eval Framework
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: '3.12'
---
# Bantrly TTS Evaluation Framework
A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for use in a K-12 speech coaching product. Built to help make data-driven decisions about which TTS engine to deploy in production.
Author: Aankit Das
## Overview
This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:
| Metric | What it measures |
|---|---|
| UTMOS | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| WER | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| RTF | Real-Time Factor: synthesis time / audio duration (<1.0 = faster than real time) |
| Cost | Actual cost for paid engines, Chirp 3 HD equivalent for free engines |
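Of these, RTF is the simplest metric to reproduce: wall-clock synthesis time divided by the duration of the generated audio. A minimal illustrative timer follows (the framework's own implementation lives in `app/evaluator.py`; the lambda below is a stand-in engine, not a real one):

```python
import time

def real_time_factor(synthesize_fn, text: str) -> tuple[float, float]:
    """Time a synthesis call; return (rtf, audio_duration_seconds).

    synthesize_fn: any callable that takes text and returns the duration
    of the audio it generated, in seconds.
    """
    start = time.perf_counter()
    audio_duration = synthesize_fn(text)
    synthesis_time = time.perf_counter() - start
    return synthesis_time / audio_duration, audio_duration

# Stand-in "engine": returns instantly, pretending 0.5 s of audio per word.
rtf, duration = real_time_factor(
    lambda t: 0.5 * len(t.split()), "Great job sounding out that word"
)
print(rtf < 1.0)  # faster than real time
```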
## Engines Evaluated
| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✅ | $0 |
| Piper (ONNX) | Neural OSS | ✅ | $0 |
| Parler-TTS (mini) | Neural OSS | ✅ | $0 |
| edge-tts (Microsoft) | Neural cloud | ❌ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ❌ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ❌ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✅ | $0 |
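To put the cost column in context: Chirp 3 HD bills per character while Chatterbox Turbo bills per second of generated audio, so which is cheaper depends on how much audio a script produces. A back-of-the-envelope comparison using the rates above (utterance length and duration are assumed values, not measurements from this project):

```python
# Hypothetical workload: 1,000 coaching utterances.
CHARS_PER_UTTERANCE = 120      # assumed average script length
SECONDS_PER_UTTERANCE = 8.0    # assumed average audio duration

chirp_rate = 16 / 1_000_000    # ~$16 per 1M characters (Google)
chatterbox_rate = 0.001        # ~$0.001 per second of audio (RunPod)

chirp_cost = 1_000 * CHARS_PER_UTTERANCE * chirp_rate
chatterbox_cost = 1_000 * SECONDS_PER_UTTERANCE * chatterbox_rate

print(f"Chirp 3 HD:       ${chirp_cost:.2f}")       # $1.92
print(f"Chatterbox Turbo: ${chatterbox_cost:.2f}")  # $8.00
```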
## Project Structure

```
├── app/                          # Gradio evaluation app
│   ├── app.py                    # main UI
│   ├── evaluator.py              # WER, UTMOS, RTF, cost metrics
│   ├── engines/                  # pluggable TTS engine implementations
│   │   ├── base.py               # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py       # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                  # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb          # Project B evaluation notebook
├── src/                          # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py            # 6 standardized coaching scripts
├── results/                      # notebook results (CSVs, charts)
└── voices/piper/                 # Piper ONNX voice models (download separately)
```
## Setup

### Prerequisites
- Python 3.12
- uv package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)
### Install

```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

Install CUDA PyTorch (required for GPU inference):

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
### Download Piper voices

```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```
### Environment variables

Create `app/.env`:

```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:

- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (when available)
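The engines read these keys from the process environment at startup. For local experiments without an extra dependency, a stdlib-only loader like the hypothetical helper below handles the simple `KEY=value` format of `app/.env` (the app itself may load the file differently):

```python
import os
from pathlib import Path

def load_env(path: str = "app/.env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments, no quoting rules."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Existing environment variables win over .env values.
        os.environ.setdefault(key.strip(), value.strip())
```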
## Running the App

```bash
cd app
uv run gradio app.py
```

Open http://127.0.0.1:7860 in your browser.
## Adding a New Engine

- Create `app/engines/your_engine.py` subclassing `TTSEngine`
- Implement `synthesize(text, band, output_path) -> dict`
- Set `BAND_CONFIG` for grade-band voice/speed tuning
- Register in `app/engines/__init__.py`

The framework picks it up automatically; no other changes needed.
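As a sketch of those steps, here is a toy engine against a stand-in base class (the real `TTSEngine` interface lives in `app/engines/base.py` and may differ; the band names and config keys below are illustrative):

```python
from abc import ABC, abstractmethod

# Stand-in for app/engines/base.py -- the real TTSEngine may differ.
class TTSEngine(ABC):
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        ...

class EchoEngine(TTSEngine):
    """Toy engine: records what it would synthesize instead of making audio."""

    BAND_CONFIG = {
        "K-2":  {"voice": "child_friendly", "speed": 0.9},
        "9-12": {"voice": "neutral",        "speed": 1.0},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG.get(band, {"voice": "neutral", "speed": 1.0})
        # A real engine would write audio to output_path here.
        return {"path": output_path, "voice": cfg["voice"], "speed": cfg["speed"]}

result = EchoEngine().synthesize("Nice work!", "K-2", "out.wav")
print(result["voice"])  # child_friendly
```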
## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
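For reference, WER reduces to a word-level edit distance between the reference script and the Whisper transcript. A stdlib-only illustration (the framework's metric code is in `app/evaluator.py`; real pipelines also normalize punctuation and numerals before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("sound out each word", "sound out every word"))  # 0.25
```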
## Key Findings
- Kokoro achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- Piper is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- Parler-TTS supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- Chatterbox Turbo (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- pyttsx3 is unsuitable for child-facing products: zero warmth, robotic output
## References
- Saeki et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. INTERSPEECH.
- Radford et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. (Whisper)
- Minixhofer et al. (2024). TTSDS – Text-to-Speech Distribution Score. SSW.