---
title: TTS Eval Framework
emoji: 🎙
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
python_version: "3.12"
---

# Bantrly TTS Evaluation Framework

A research-grade evaluation framework for comparing Text-to-Speech (TTS) engines for a K-12 speech coaching product. It is built to support data-driven decisions about which engine to deploy in production.

**Author:** Aankit Das

---

## Overview

This project evaluates TTS engines across four dimensions that matter for a real-time coaching product:

| Metric | What it measures |
|---|---|
| **UTMOS** | Automated naturalness score (1–5) that predicts human Mean Opinion Score (MOS) (Saeki et al. 2022) |
| **WER** | Word Error Rate of the Whisper transcription of the synthesized audio, measured against the input text (Radford et al. 2023) |
| **RTF** | Real-Time Factor: synthesis time / audio duration (<1.0 means faster than real time) |
| **Cost** | Actual cost for paid engines; for free engines, the equivalent cost at Chirp 3 HD pricing |

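As a rough illustration of the two ratio metrics above, the sketch below computes RTF and a word-level WER using only the standard library. It is not the code in `evaluator.py`: the function names are hypothetical, the Whisper transcription step is omitted (the hypothesis string stands in for its output), and text normalization is limited to lowercasing and whitespace splitting, which is an assumption.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```
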
---

## Engines Evaluated

| Engine | Type | Local | Cost |
|---|---|---|---|
| Kokoro (tuned) | Neural OSS | ✓ | $0 |
| Piper (ONNX) | Neural OSS | ✓ | $0 |
| Parler-TTS (mini) | Neural OSS | ✓ | $0 |
| edge-tts (Microsoft) | Neural cloud | ✗ | $0 (free tier) |
| Chatterbox Turbo | Neural cloud | ✗ | ~$0.001/sec (RunPod) |
| Chirp 3 HD | Neural cloud | ✗ | ~$16/1M chars (Google) |
| pyttsx3 | Rule-based | ✓ | $0 |

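The two paid pricing models in the table use different units, so comparing them means converting both to a per-utterance cost. The sketch below uses the approximate rates from the table; the function names are hypothetical, and it assumes RunPod bills per second of worker time, which may not match the actual billing terms.

```python
# Approximate rates taken from the table above.
CHIRP_USD_PER_MILLION_CHARS = 16.0
RUNPOD_USD_PER_SECOND = 0.001


def chirp_cost(text: str) -> float:
    """Character-based pricing: cost scales with the length of the input text."""
    return len(text) * CHIRP_USD_PER_MILLION_CHARS / 1_000_000


def runpod_cost(worker_seconds: float) -> float:
    """Time-based pricing: cost scales with billed worker time (assumed)."""
    if worker_seconds < 0:
        raise ValueError("worker_seconds must be non-negative")
    return worker_seconds * RUNPOD_USD_PER_SECOND
```
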
---

## Project Structure

```
├── app/                           # Gradio evaluation app
│   ├── app.py                     # main UI
│   ├── evaluator.py               # WER, UTMOS, RTF, cost metrics
│   ├── engines/                   # pluggable TTS engine implementations
│   │   ├── base.py                # abstract base class
│   │   ├── kokoro_engine.py
│   │   ├── piper_engine.py
│   │   ├── parler_engine.py
│   │   ├── edge_tts_engine.py
│   │   ├── chatterbox_runpod_engine.py
│   │   ├── chirp_engine.py        # stub, requires Google Cloud API key
│   │   └── pyttsx3_engine.py
│   └── results/                   # saved eval outputs
├── notebooks/
│   └── evaluation.ipynb           # Project B evaluation notebook
├── src/                           # TTS client wrappers (used by notebook)
├── config/
│   └── tts_scripts.py             # 6 standardized coaching scripts
├── results/                       # notebook results (CSVs, charts)
└── voices/piper/                  # Piper ONNX voice models (download separately)
```

## Setup

### Prerequisites

- Python 3.12
- [uv](https://github.com/astral-sh/uv) package manager
- NVIDIA GPU recommended (CUDA 12.4+) for Kokoro and Parler-TTS
- CUDA-enabled PyTorch (see "Install CUDA PyTorch" below)

### Install
```bash
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
uv sync
```

### Install CUDA PyTorch (required for GPU inference)
```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

### Download Piper voices
```bash
uv run python -c "
from pathlib import Path
import urllib.request

voices_dir = Path('voices/piper')
voices_dir.mkdir(parents=True, exist_ok=True)

base_female = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium'
base_male = 'https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium'

for base, prefix in [(base_female, 'en_US-amy-medium'), (base_male, 'en_US-lessac-medium')]:
    for ext in ['.onnx', '.onnx.json']:
        fname = prefix + ext
        out = voices_dir / fname
        if not out.exists():
            print(f'Downloading {fname}...')
            urllib.request.urlretrieve(f'{base}/{fname}', out)
print('Done.')
"
```

### Environment variables

Create `app/.env` from the example file:
```bash
cp app/.env.example app/.env
# then edit app/.env and fill in your API keys
```

Required for cloud engines:
- `RUNPOD_API_KEY`: for Chatterbox Turbo
- `GOOGLE_APPLICATION_CREDENTIALS`: for Chirp 3 HD (the engine is currently a stub)

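A quick way to see which cloud engines are usable is to check for these variables before launching the app. This is a minimal sketch, not part of the repository: `missing_credentials` is a hypothetical helper, and it reads the process environment directly, so it assumes `app/.env` has already been loaded by whatever mechanism the app uses.

```python
import os

# Engine name -> environment variable it needs (from the list above).
REQUIRED_BY_ENGINE = {
    "Chatterbox Turbo": "RUNPOD_API_KEY",
    "Chirp 3 HD": "GOOGLE_APPLICATION_CREDENTIALS",
}


def missing_credentials(env=None) -> dict[str, str]:
    """Return {engine: variable} for every engine whose variable is unset or empty."""
    env = os.environ if env is None else env
    return {
        engine: var
        for engine, var in REQUIRED_BY_ENGINE.items()
        if not env.get(var)
    }


if __name__ == "__main__":
    for engine, var in missing_credentials().items():
        print(f"{engine}: set {var} to enable this engine")
```
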
---

## Running the App
```bash
cd app
uv run gradio app.py
```

Open `http://127.0.0.1:7860` in your browser.

---

## Adding a New Engine

1. Create `app/engines/your_engine.py` subclassing `TTSEngine`
2. Implement `synthesize(text, band, output_path) -> dict`
3. Set `BAND_CONFIG` for grade-band voice/speed tuning
4. Register in `app/engines/__init__.py`

Once registered, the framework picks the engine up automatically; no other changes are needed.

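The steps above might look roughly like the sketch below. The `TTSEngine` class here is a stand-in so the example runs on its own; the real base class lives in `app/engines/base.py` and its exact interface may differ. `SilenceEngine`, the shape of `BAND_CONFIG`, and the keys of the returned dict are all hypothetical, and the engine writes silence rather than speech.

```python
import wave
from abc import ABC, abstractmethod


class TTSEngine(ABC):
    """Stand-in for the base class in app/engines/base.py (interface assumed)."""

    name: str = "base"
    BAND_CONFIG: dict = {}

    @abstractmethod
    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        """Write audio for `text` to `output_path` and return run metadata."""


class SilenceEngine(TTSEngine):
    """Toy engine: writes a silent WAV whose length depends on word count and speed."""

    name = "silence"
    SAMPLE_RATE = 22050
    # Hypothetical per-grade-band tuning: slower speech for younger students.
    BAND_CONFIG = {
        "K-2": {"voice": "demo", "speed": 0.85},
        "3-5": {"voice": "demo", "speed": 0.95},
        "6-8": {"voice": "demo", "speed": 1.0},
        "9-12": {"voice": "demo", "speed": 1.05},
    }

    def synthesize(self, text: str, band: str, output_path: str) -> dict:
        cfg = self.BAND_CONFIG[band]
        # Pretend each word takes 0.4 s at speed 1.0.
        seconds = max(1, len(text.split())) * 0.4 / cfg["speed"]
        n_frames = int(seconds * self.SAMPLE_RATE)
        with wave.open(str(output_path), "wb") as wav:
            wav.setnchannels(1)   # mono
            wav.setsampwidth(2)   # 16-bit samples
            wav.setframerate(self.SAMPLE_RATE)
            wav.writeframes(b"\x00\x00" * n_frames)
        return {
            "output_path": str(output_path),
            "audio_duration": n_frames / self.SAMPLE_RATE,
            "band": band,
            "voice": cfg["voice"],
        }
```
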
---

## Evaluation Methodology

Six standardized coaching scripts cover all four grade bands (K-2, 3-5, 6-8, 9-12) and five scenario types (praise, correction, instruction, SEL, out-of-scope). Scripts are defined in `config/tts_scripts.py`.

Rubric scores (1–3 anchored scale) were assigned manually after listening to all synthesized outputs. Automated metrics (UTMOS, WER, RTF) are computed programmatically for every run and logged to `app/results/eval_log.csv`.

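Per-run logging of the automated metrics can be sketched with the standard `csv` module as below. The column names and the `append_result` helper are assumptions for illustration; the actual schema of `app/results/eval_log.csv` is not documented here.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real log may use different fields.
FIELDS = ["engine", "band", "scenario", "utmos", "wer", "rtf", "cost_usd"]


def append_result(log_path, row: dict) -> None:
    """Append one evaluation run to the CSV log, writing the header on first use."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```
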
---

## Key Findings

- **Kokoro** achieves the highest quality (UTMOS ~4.5) at zero cost, running fully locally
- **Piper** is the fastest engine (RTF ~0.04) with acceptable quality (UTMOS ~3.9), making it suitable as a lightweight fallback
- **Parler-TTS** supports instruction-controlled voice descriptions but is slower than real time (RTF ~2.3), which rules it out for real-time coaching
- **Chatterbox Turbo** (via RunPod) achieves strong naturalness (UTMOS ~4.4), but its per-second pricing costs more than character-based pricing models
- **pyttsx3** is unsuitable for child-facing products: its output is robotic and lacks warmth

---

## References

- Saeki et al. (2022). *UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022*. INTERSPEECH.
- Radford et al. (2023). *Robust Speech Recognition via Large-Scale Weak Supervision*. ICML. (Whisper)
- Minixhofer et al. (2024). *TTSDS - Text-to-Speech Distribution Score*. IEEE SLT.