---
title: TTS Eval Framework
emoji: 🎙
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---

# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for use in a K-12 speech coaching product. Built to help make data-driven decisions about which TTS engine to deploy in production.

**Author:** Aankit Das

---

## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:

| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real Time Factor, i.e. synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines; the Chirp 3 HD-equivalent cost for free engines |

---

## Engines Evaluated

| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |

---

## Project Structure

```
├── app/                              # Gradio evaluation app
│   ├── app.py                        # main UI
│   ├── evaluator.py                  # WER, UTMOS, RTF, cost metrics
│   ├── engines/                      # pluggable TTS engine implementations
│   │   ├── base.py                   # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py           # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                      # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb              # Project B evaluation notebook
├── src/                              # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py                # 6 standardized coaching scripts
├── results/                          # notebook results (CSVs, charts)
└── voices/piper/                     # Piper ONNX voice models (download separately)
```

---

## Setup

### Prerequisites

- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)

### Install

```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

### Install CUDA PyTorch (required for GPU inference)

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

### Download Piper voices

```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```

### Environment variables

Create `app/.env`:

```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:

- `RUNPOD_API_KEY` — for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS` — for Chirp 3 HD (when available)

---

## Running the App

```bash
cd app
uv run gradio app.py
```

Open `http://127.0.0.1:7860` in your browser.

---

## Adding a New Engine

1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`

The framework picks it up automatically — no other changes needed. A minimal skeleton is sketched below.
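For orientation, here is a minimal sketch of steps 1–3. The base-class import path, the `BAND_CONFIG` shape, and the return-dict keys are assumptions; mirror an existing engine such as `piper_engine.py` for the exact contract.

```python
# app/engines/your_engine.py: hypothetical skeleton, not shipped code.
import time

from engines.base import TTSEngine  # assumed import path; check base.py


class YourEngine(TTSEngine):
    # Assumed shape: one voice/speed preset per grade band used by the
    # coaching scripts (K-2, 3-5, 6-8, 9-12). Voice names are placeholders.
    BAND_CONFIG = {
        "K-2":  {"voice": "placeholder-child-voice", "speed": 0.9},
        "3-5":  {"voice": "placeholder-child-voice", "speed": 1.0},
        "6-8":  {"voice": "placeholder-teen-voice",  "speed": 1.0},
        "9-12": {"voice": "placeholder-teen-voice",  "speed": 1.05},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG[band]
        start = time.perf_counter()
        # ... call your TTS backend here and write a WAV to output_path,
        # applying cfg["voice"] and cfg["speed"] ...
        elapsed = time.perf_counter() - start
        # Return-dict keys are an assumption; match the other engines.
        return {"audio_path": output_path, "synthesis_time_s": elapsed}
```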
---

## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.

---

## Key Findings

- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products — zero warmth, robotic output

---

## References

- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS — Text-to-Speech Distribution Score*. SSW.
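---

## Appendix: Metric Computation Sketch

The RTF and WER definitions used above reduce to a few lines of Python. This sketch is illustrative only (it is not the project's `evaluator.py`) and assumes the `openai-whisper`, `jiwer`, and `soundfile` packages; UTMOS needs a pretrained MOS predictor and is omitted.

```python
# Illustrative sketch of the two formula-based metrics; not evaluator.py.
import jiwer
import soundfile as sf
import whisper

_asr = whisper.load_model("base")  # Whisper model size is a free choice


def rtf(synthesis_time_s: float, audio_path: str) -> float:
    """Real Time Factor: synthesis time divided by audio duration."""
    return synthesis_time_s / sf.info(audio_path).duration


def wer(script_text: str, audio_path: str) -> float:
    """Word Error Rate of a Whisper transcript against the script text."""
    hypothesis = _asr.transcribe(audio_path)["text"]
    return jiwer.wer(script_text.lower(), hypothesis.lower())
```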