---
title: TTS Eval Framework
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---
# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for use in a K-12 speech coaching product. Built to help make data-driven decisions about which TTS engine to deploy in production.

**Author:** Aankit Das

---

## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:
| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5); predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real Time Factor – synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines; Chirp 3 HD equivalent for free engines |
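
For concreteness, the RTF metric can be computed like this (a minimal sketch; `synthesize_fn` is a stand-in for any engine's synthesis call, assumed here to return samples plus a sample rate):

```python
import time

def measure_rtf(synthesize_fn, text: str) -> float:
    """Real Time Factor = wall-clock synthesis time / duration of produced audio."""
    start = time.perf_counter()
    samples, sample_rate = synthesize_fn(text)   # stand-in engine call
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate   # length of generated speech
    return elapsed / audio_seconds               # <1.0 means faster than real time
```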
---

## Engines Evaluated

| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✅ | $0 |
| Piper (ONNX) | Neural OSS | ✅ | $0 |
| Parler-TTS (mini) | Neural OSS | ✅ | $0 |
| edge-tts (Microsoft) | Neural cloud | ❌ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ❌ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ❌ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✅ | $0 |

---
## Project Structure

```
├── app/                              # Gradio evaluation app
│   ├── app.py                        # main UI
│   ├── evaluator.py                  # WER, UTMOS, RTF, cost metrics
│   ├── engines/                      # pluggable TTS engine implementations
│   │   ├── base.py                   # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py           # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                      # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb              # Project B evaluation notebook
├── src/                              # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py                # 6 standardized coaching scripts
├── results/                          # notebook results (CSVs, charts)
└── voices/piper/                     # Piper ONNX voice models (download separately)
```
| ## Setup | |
| ### Prerequisites | |
| - Python 3.12 | |
| - [uv](https://github.com/astral-sh/uv) package manager | |
| - NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS | |
| - CUDA-enabled PyTorch (see below) | |
| ### Install | |
| ```bash | |
| git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git | |
| cd YOUR_REPO_NAME | |
| uv sync | |
| ``` | |
| ### Install CUDA PyTorch (required for GPU inference) | |
| ```bash | |
| uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 | |
| ``` | |
| ### Download Piper voices | |
| ```bash | |
| uv run python -c " | |
| from pathlib import Path | |
| import urllib.request | |
| voices_dir = Path('voices/piper') | |
| voices_dir.mkdir(parents=True, exist_ok=True) | |
| base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium' | |
| base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium' | |
for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
| " | |
| ``` | |
| ### Environment variables | |
| Create `app/.env`: | |
| ```bash | |
| cp app/.env.example app/.env | |
| # then edit app/.env and fill in your API keys | |
| ``` | |
| Required for cloud engines: | |
- `RUNPOD_API_KEY` – for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS` – for Chirp 3 HD (when available)
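
If you want to read these variables outside the Gradio app (e.g. in a quick smoke test), a minimal stdlib-only `.env` loader looks like this (a sketch; the app itself may load its environment differently):

```python
import os
from pathlib import Path

def load_env(path: str) -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't clobber variables already set in the real environment
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env("app/.env")
```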
| --- | |
| ## Running the App | |
| ```bash | |
| cd app | |
| uv run gradio app.py | |
| ``` | |
| Open `http://127.0.0.1:7860` in your browser. | |
| --- | |
| ## Adding a New Engine | |
| 1. Create `app/engines/your_engine.py` subclassing `TTSEngine` | |
| 2. Implement `synthesize(text, band, output_path) -> dict` | |
| 3. Set `BAND_CONFIG` for grade-band voice/speed tuning | |
| 4. Register in `app/engines/__init__.py` | |
The framework picks it up automatically – no other changes needed.
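
The steps above can be sketched as follows. The `TTSEngine` stand-in mirrors what `app/engines/base.py` presumably defines, and the method signature comes from step 2; the `BAND_CONFIG` values and return fields are illustrative assumptions:

```python
from pathlib import Path

class TTSEngine:
    """Stand-in for the abstract base class in app/engines/base.py (interface assumed)."""
    BAND_CONFIG: dict = {}

    def synthesize(self, text: str, band: str, output_path: Path) -> dict:
        raise NotImplementedError

class YourEngine(TTSEngine):
    # Per-grade-band voice/speed tuning -- values here are illustrative
    BAND_CONFIG = {
        "K-2":  {"voice": "warm_child_friendly", "speed": 0.9},
        "3-5":  {"voice": "warm_child_friendly", "speed": 0.95},
        "6-8":  {"voice": "neutral", "speed": 1.0},
        "9-12": {"voice": "neutral", "speed": 1.0},
    }

    def synthesize(self, text: str, band: str, output_path: Path) -> dict:
        cfg = self.BAND_CONFIG[band]
        # ...call your TTS backend here and write the audio to output_path...
        return {"path": str(output_path), "voice": cfg["voice"], "speed": cfg["speed"]}
```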
| --- | |
| ## Evaluation Methodology | |
| Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`. | |
Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
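
For reference, the WER metric reduces to a word-level edit distance over the reference script and the Whisper transcript. A simplified, whitespace-tokenized version (production implementations also normalize punctuation and numerals first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]                             # cost of deleting all i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (r != h)))      # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)
```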
| --- | |
| ## Key Findings | |
| - **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally | |
| - **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback | |
| - **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching | |
| - **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models | |
- **pyttsx3** is unsuitable for child-facing products – zero warmth, robotic output
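
A back-of-the-envelope comparison of the two paid pricing models (rates from the engines table above; the ~15 characters per second of produced speech is an assumed average speaking rate, not a measured figure):

```python
# Cost to synthesize 1M characters under each pricing model
chars = 1_000_000
chirp_cost = chars / 1_000_000 * 16.00    # Chirp 3 HD: ~$16 per 1M chars (Google)

chars_per_sec = 15                        # ASSUMED average speaking rate
audio_seconds = chars / chars_per_sec
chatterbox_cost = audio_seconds * 0.001   # Chatterbox Turbo: ~$0.001/sec (RunPod)

print(f"Chirp 3 HD:       ${chirp_cost:.2f}")
print(f"Chatterbox Turbo: ${chatterbox_cost:.2f}")
```

Under this assumption, per-second pricing comes out several times more expensive than per-character pricing for the same text, which is why Chatterbox's quality advantage has to be weighed against its cost.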
| --- | |
| ## References | |
| - Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH. | |
| - Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper) | |
- Minixhofer et al. (2024). *TTSDS – Text-to-Speech Distribution Score*. SSW.