---
title: TTS Eval Framework
emoji: 🎙
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: '3.12'
---

# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for a K-12 speech coaching product, built to support a data-driven decision about which engine to deploy in production.

**Author:** Aankit Das


## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:

| Metric | What it measures |
|--------|------------------|
| UTMOS  | Automated naturalness score (1–5), predicting human MOS (Saeki et al. 2022) |
| WER    | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| RTF    | Real Time Factor: synthesis time / audio duration (<1.0 = faster than real time) |
| Cost   | Actual cost for paid engines; Chirp 3 HD-equivalent cost for free engines |
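To make the WER row concrete, here is a minimal standalone word-level WER function. This is a sketch for illustration, not the repo's `evaluator.py` (which transcribes the synthesized audio with Whisper before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word sequences via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3: one substitution over three reference words.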

## Engines Evaluated

| Engine | Type | Local | Cost |
|--------|------|-------|------|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |

## Project Structure

```
├── app/                           # Gradio evaluation app
│   ├── app.py                     # main UI
│   ├── evaluator.py               # WER, UTMOS, RTF, cost metrics
│   ├── engines/                   # pluggable TTS engine implementations
│   │   ├── base.py                # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py        # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                   # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb           # Project B evaluation notebook
├── src/                           # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py             # 6 standardized coaching scripts
├── results/                       # notebook results (CSVs, charts)
└── voices/piper/                  # Piper ONNX voice models (download separately)
```

## Setup

### Prerequisites

- Python 3.12
- `uv` package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)

### Install

```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

### Install CUDA PyTorch (required for GPU inference)

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

### Download Piper voices

```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```

### Environment variables

Create `app/.env` from the example file:

```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:

- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (when available)
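A small startup check like the following can report which cloud engines are usable. This helper is hypothetical (it is not part of the repo); only the two environment variable names come from the list above:

```python
import os

def available_cloud_engines() -> dict:
    """Map each cloud engine to whether its required credential is set."""
    return {
        "chatterbox_turbo": bool(os.environ.get("RUNPOD_API_KEY")),
        "chirp_3_hd": bool(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")),
    }
```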

## Running the App

```bash
cd app
uv run gradio app.py
```

Open http://127.0.0.1:7860 in your browser.


## Adding a New Engine

1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register it in `app/engines/__init__.py`

The framework picks it up automatically; no other changes are needed.
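The steps above can be sketched as follows. The `TTSEngine` name, the `synthesize(text, band, output_path) -> dict` signature, and `BAND_CONFIG` come from the list above; the exact shape of the base class, the config keys, and the returned dict fields are assumptions for illustration:

```python
from abc import ABC, abstractmethod

# Assumed shape of the abstract base class in app/engines/base.py
class TTSEngine(ABC):
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        """Write audio to output_path and return metadata about the run."""

class MyEngine(TTSEngine):
    # Hypothetical per-band tuning; keys mirror the four grade bands
    BAND_CONFIG = {
        "K-2":  {"voice": "warm_child", "speed": 0.9},
        "3-5":  {"voice": "warm_child", "speed": 1.0},
        "6-8":  {"voice": "neutral",    "speed": 1.0},
        "9-12": {"voice": "neutral",    "speed": 1.05},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG[band]
        # ... call your TTS backend here and write a WAV to output_path ...
        return {"audio_path": output_path, "voice": cfg["voice"], "speed": cfg["speed"]}
```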


## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.
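A script definition might look roughly like this. The field names and example text are illustrative assumptions, not the exact schema of `config/tts_scripts.py`:

```python
# Illustrative shape of config/tts_scripts.py; field names are assumptions
SCRIPTS = [
    {
        "id": "praise_k2",
        "band": "K-2",          # one of: K-2, 3-5, 6-8, 9-12
        "scenario": "praise",   # praise, correction, instruction, SEL, out-of-scope
        "text": "You spoke so clearly that time. Great job!",
    },
    # ... five more scripts covering the remaining bands and scenarios ...
]
```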

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.


## Key Findings

- Kokoro achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- Piper is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- Parler-TTS supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- Chatterbox Turbo (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- pyttsx3 is unsuitable for child-facing products: zero warmth, robotic output

## References

- Saeki et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. INTERSPEECH.
- Radford et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. (Whisper)
- Minixhofer et al. (2024). TTSDS: Text-to-Speech Distribution Score. SSW.