---
title: TTS Eval Framework
emoji: 🎙
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: '3.12'
---

# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for a K-12 speech coaching product, built to support a data-driven decision about which engine to deploy in production.

**Author:** Aankit Das


## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:

| Metric | What it measures |
|--------|------------------|
| UTMOS  | Automated naturalness score (1–5), predicting human MOS (Saeki et al. 2022) |
| WER    | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| RTF    | Real Time Factor: synthesis time / audio duration (<1.0 = faster than real time) |
| Cost   | Actual cost for paid engines; Chirp 3 HD-equivalent cost for free engines |
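To make the WER row concrete, here is a minimal standalone word-level WER function. This is a sketch for illustration, not the repo's `evaluator.py` (which transcribes the synthesized audio with Whisper before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word sequences via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3: one substitution over three reference words.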

## Engines Evaluated

| Engine | Type | Local | Cost |
|--------|------|-------|------|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |

## Project Structure

```
├── app/                           # Gradio evaluation app
│   ├── app.py                     # main UI
│   ├── evaluator.py               # WER, UTMOS, RTF, cost metrics
│   ├── engines/                   # pluggable TTS engine implementations
│   │   ├── base.py                # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py        # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                   # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb           # Project B evaluation notebook
├── src/                           # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py             # 6 standardized coaching scripts
├── results/                       # notebook results (CSVs, charts)
└── voices/piper/                  # Piper ONNX voice models (download separately)
```

## Setup

### Prerequisites

- Python 3.12
- `uv` package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)

### Install

```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

### Install CUDA PyTorch (required for GPU inference)

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

### Download Piper voices

```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```

### Environment variables

Create `app/.env` from the example file:

```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:

- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (when available)
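A small startup check like the following can report which cloud engines are usable. This helper is hypothetical (it is not part of the repo); only the two environment variable names come from the list above:

```python
import os

def available_cloud_engines() -> dict:
    """Map each cloud engine to whether its required credential is set."""
    return {
        "chatterbox_turbo": bool(os.environ.get("RUNPOD_API_KEY")),
        "chirp_3_hd": bool(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")),
    }
```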

## Running the App

```bash
cd app
uv run gradio app.py
```

Open http://127.0.0.1:7860 in your browser.


## Adding a New Engine

1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register it in `app/engines/__init__.py`

The framework picks it up automatically; no other changes are needed.
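The steps above can be sketched as follows. The `TTSEngine` name, the `synthesize(text, band, output_path) -> dict` signature, and `BAND_CONFIG` come from the list above; the exact shape of the base class, the config keys, and the returned dict fields are assumptions for illustration:

```python
from abc import ABC, abstractmethod

# Assumed shape of the abstract base class in app/engines/base.py
class TTSEngine(ABC):
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        """Write audio to output_path and return metadata about the run."""

class MyEngine(TTSEngine):
    # Hypothetical per-band tuning; keys mirror the four grade bands
    BAND_CONFIG = {
        "K-2":  {"voice": "warm_child", "speed": 0.9},
        "3-5":  {"voice": "warm_child", "speed": 1.0},
        "6-8":  {"voice": "neutral",    "speed": 1.0},
        "9-12": {"voice": "neutral",    "speed": 1.05},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG[band]
        # ... call your TTS backend here and write a WAV to output_path ...
        return {"audio_path": output_path, "voice": cfg["voice"], "speed": cfg["speed"]}
```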


## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.
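A script definition might look roughly like this. The field names and example text are illustrative assumptions, not the exact schema of `config/tts_scripts.py`:

```python
# Illustrative shape of config/tts_scripts.py; field names are assumptions
SCRIPTS = [
    {
        "id": "praise_k2",
        "band": "K-2",          # one of: K-2, 3-5, 6-8, 9-12
        "scenario": "praise",   # praise, correction, instruction, SEL, out-of-scope
        "text": "You spoke so clearly that time. Great job!",
    },
    # ... five more scripts covering the remaining bands and scenarios ...
]
```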

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.


## Key Findings

- Kokoro achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- Piper is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- Parler-TTS supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- Chatterbox Turbo (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- pyttsx3 is unsuitable for child-facing products: zero warmth, robotic output

## References

- Saeki et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. INTERSPEECH.
- Radford et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. (Whisper)
- Minixhofer et al. (2024). TTSDS: Text-to-Speech Distribution Score. SSW.