---
title: TTS Eval Framework
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---
# Bantrly TTS Evaluation Framework
A research-grade evaluation framework that compares Text-to-Speech (TTS) engines as candidates for a K-12 speech coaching product. Built to support a data-driven decision about which TTS engine to deploy in production.
**Author:** Aankit Das
---
## Overview
This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:
| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5), predicts human MOS (Saeki et al. 2022) |
| **WER** | Word Error Rate via Whisper transcription (Radford et al. 2023) |
| **RTF** | Real-time factor: synthesis time / audio duration (<1.0 = faster than real time) |
| **Cost** | Actual cost for paid engines, Chirp 3 HD equivalent for free engines |
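WER and RTF have exact definitions, so they can be sketched in a few lines of dependency-free Python. The helper names below are illustrative only, not the framework's actual API (in the app, the hypothesis text comes from a Whisper transcript of the synthesized audio):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the engine synthesizes faster than playback."""
    return synthesis_seconds / audio_seconds

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(real_time_factor(0.5, 10.0))                    # 0.05
```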
---
## Engines Evaluated
| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |
---
## Project Structure
```
├── app/                        # Gradio evaluation app
│   ├── app.py                  # main UI
│   ├── evaluator.py            # WER, UTMOS, RTF, cost metrics
│   ├── engines/                # pluggable TTS engine implementations
│   │   ├── base.py             # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py     # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb        # Project B evaluation notebook
├── src/                        # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py          # 6 standardized coaching scripts
├── results/                    # notebook results (CSVs, charts)
└── voices/piper/               # Piper ONNX voice models (download separately)
```
## Setup
### Prerequisites
- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see below)
### Install
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```
### Install CUDA PyTorch (required for GPU inference)
```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
### Download Piper voices
```bash
uv run python -c "
from pathlib import Path
import urllib.request
voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)
base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'
for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```
### Environment variables
Create `app/.env`:
```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```
Required for cloud engines:
- `RUNPOD_API_KEY` for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS` for Chirp 3 HD (when available)
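A startup check like the following can surface missing keys early. This is a hypothetical sketch, not code from the app; the variable names match the list above, but how the framework actually behaves when a key is absent may differ:

```python
import os

# Required keys mapped to the engine each one unlocks.
REQUIRED_KEYS = {
    "RUNPOD_API_KEY": "Chatterbox Turbo",
    "GOOGLE_APPLICATION_CREDENTIALS": "Chirp 3 HD",
}

def missing_engine_keys(env=os.environ):
    """Return required keys that are absent or empty in the environment."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

for key in missing_engine_keys():
    print(f"warning: {key} not set; {REQUIRED_KEYS[key]} will be unavailable")
```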
---
## Running the App
```bash
cd app
uv run gradio app.py
```
Open `http://127.0.0.1:7860` in your browser.
---
## Adding a New Engine
1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`
The framework picks it up automatically; no other changes are needed.
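The steps above can be sketched as follows. This assumes the `synthesize(text, band, output_path) -> dict` signature described in step 2; the exact base class in `app/engines/base.py`, the `BAND_CONFIG` keys, and the returned dict fields shown here are illustrative assumptions:

```python
from abc import ABC, abstractmethod

class TTSEngine(ABC):
    """Assumed shape of the abstract base in app/engines/base.py."""
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        ...

class YourEngine(TTSEngine):
    # Per-band voice/speed tuning; keys assumed to match the grade bands.
    BAND_CONFIG = {
        "K-2":  {"voice": "child_friendly", "speed": 0.9},
        "9-12": {"voice": "neutral", "speed": 1.0},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG.get(band, {"voice": "neutral", "speed": 1.0})
        # ... call your TTS backend here and write audio to output_path ...
        return {"output_path": output_path, "voice": cfg["voice"],
                "duration_sec": 0.0}

engine = YourEngine()
result = engine.synthesize("Great job!", "K-2", "out.wav")
print(result["voice"])  # child_friendly
```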
---
## Evaluation Methodology
Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.
Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.
---
## Key Findings
- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04x) with acceptable quality (UTMOS ~3.9), suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is too slow (RTF ~2.3x) for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4) but at higher cost than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products: zero warmth, robotic output
---
## References
- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS β€” Text-to-Speech Distribution Score*. SSW.