---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---

# 🎧 Plotweaver AI — TTS MOS Evaluation Platform

A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
listen to audio samples in the languages they are competent in, and rate each sample on
**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts — if it complains, the
> build log lists valid versions.

---

## What it does

- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
  200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
  samples in those languages. They can update their language set themselves.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
  profile, and the admin upload/results dropdowns — no code change, no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
  to reviewers** — they only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
  Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
  Audio Quality · Overall Quality — plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
  and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
  deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
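
A minimal sketch of the password scheme named in the first bullet (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt), using only the standard library. The function names, the 16-byte salt length, and the hex encoding are assumptions for illustration and may not match what `app.py` does.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # iteration count quoted in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex) derived with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh per-user salt; 16 bytes is an assumption
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, hash_hex: str, salt_hex: str) -> bool:
    """Re-derive the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, hash_hex)
```

The two returned strings map onto the `password_hash` and `salt` columns of the `users` table.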

---

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
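| `PORT` | `7860` | Port the app binds to. |
| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Keep `DELETE` on bucket/FUSE mounts; set `WAL` only on a genuine local disk. |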
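
For illustration, here is one way a session token could be signed with `MOS_SECRET` so that a refreshed page stays logged in. This is a sketch under assumptions: the token layout (base64 JSON payload plus an HMAC-SHA256 signature), the expiry field, and the names `make_token` / `read_token` are hypothetical, and the app's actual format may differ.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _sign(body: str, secret: str) -> str:
    return hmac.new(secret.encode(), body.encode(), hashlib.sha256).hexdigest()


def make_token(username: str, secret: str, ttl_seconds: int = 86400,
               now: Optional[float] = None) -> str:
    """Build '<base64 payload>.<hex signature>' for the given user."""
    issued = time.time() if now is None else now
    payload = json.dumps({"u": username, "exp": int(issued) + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode()).decode()
    return f"{body}.{_sign(body, secret)}"


def read_token(token: str, secret: str, now: Optional[float] = None) -> Optional[str]:
    """Return the username if the signature is valid and unexpired, else None."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    if not hmac.compare_digest(sig.encode(), _sign(body, secret).encode()):
        return None  # tampered with, or signed under a different secret
    data = json.loads(base64.urlsafe_b64decode(body.encode()))
    current = time.time() if now is None else now
    return data["u"] if data["exp"] >= current else None
```

Because the signature depends on the secret, changing `MOS_SECRET` between restarts invalidates every stored token, which is why a stable value matters in production.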

---

## Quick start (local)

```bash
pip install -r requirements.txt
python app.py
```

Open http://localhost:7860.

**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
admin code in the *Admin code* field. The active code is printed in the console on startup
(`plotweaver-admin` unless you override it). Override it with an environment variable:

```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
python app.py
```

Anyone who signs up **without** the code becomes a normal reviewer.

---

## Deploying on a Hugging Face Space

This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
for the older fixed `/data` storage tier).

1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
   **Private**. ZeroGPU is *not* needed — there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
   private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
   MOS clips are small, so this should stay within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
   - Variable `MOS_DATA_DIR` = `/data`
   - Secret `ADMIN_CODE` = `<your secret>`
   - (Do **not** set `MOS_JOURNAL_MODE` — the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
   the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab — the active admin code is printed on startup. Open the
   app, *Create account*, enter the admin code, and you're the admin.

Notes:
- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
  audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because
  object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
  appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
  local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
  after every change, then restored on startup. Audio files live on the bucket (static, write-once,
  no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page
  refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
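
The "live DB on local disk, backup on the bucket" pattern from the notes above can be sketched as follows. This is an illustration under assumptions, not the app's actual code: `backup_db`, `restore_db`, and the `.tmp` suffix are hypothetical names, and the sketch relies on `sqlite3`'s online backup API plus a rename to swap the new copy into place.

```python
import os
import shutil
import sqlite3


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB to a temp file, then rename it over the backup."""
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp_path = backup_path + ".tmp"  # hypothetical temp-file naming
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # clear leftovers from an interrupted backup
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(tmp_path)
    try:
        src.backup(dst)  # online backup API copies a consistent snapshot
    finally:
        dst.close()
        src.close()
    # os.replace is atomic on a local POSIX filesystem; whether a bucket
    # mount gives the same guarantee is an assumption to verify.
    os.replace(tmp_path, backup_path)


def restore_db(backup_path: str, live_path: str) -> bool:
    """Copy the backup to local disk on startup; False if no backup exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

In this sketch `backup_db` would run after each committed write with `backup_path` set to `MOS_DATA_DIR/mos.db`, and `restore_db` once at startup before the first connection is opened.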

### Self-hosted (AlmaLinux / AWS, behind nginx)

Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:

```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```

Point the `proxy_pass` in your nginx `location /` block at `http://127.0.0.1:7860`, with the usual
`proxy_set_header Upgrade` / `Connection` lines for websockets.

---

## How an evaluation round works

1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
   (e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
   Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
   player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
   tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).

### Final MOS per language

The summary line reports two headline numbers per language:

- **System MOS (Overall criterion)** — the mean of the dedicated *Overall Quality* ratings. This is
  the conventional single MOS number to quote.
- **System MOS (mean of all criteria)** — the mean across all 7 criteria, useful when you want a
  composite that weights pronunciation/prosody equally with overall impression.

Both are broken down per model so you can compare systems within a language directly
(e.g. F5-TTS vs XTTS-v2 for Yoruba).
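
As a rough sketch of how the two headline numbers relate, the function below computes both per model from a list of rating rows. The row shape (dicts keyed by the `ratings` column names plus `model_name`) and the function name are assumptions; the composite here pools every criterion score equally, which may differ from how the app aggregates (for example, averaging per sample first).

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ("naturalness", "intelligibility", "pronunciation",
            "prosody", "fluency", "audio_quality", "overall")


def headline_mos(ratings):
    """Return {model: {'mos_overall', 'mos_all_criteria', 'n_ratings'}}."""
    by_model = defaultdict(list)
    for row in ratings:
        by_model[row["model_name"]].append(row)

    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            # Mean of the dedicated Overall Quality ratings only.
            "mos_overall": mean(r["overall"] for r in rows),
            # Mean over all 7 criteria of every rating, equally weighted.
            "mos_all_criteria": mean(r[c] for r in rows for c in CRITERIA),
            "n_ratings": len(rows),
        }
    return summary
```

Reference samples would be filtered out before calling this, since they are excluded from the system MOS.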

---

## Data model (SQLite)

```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id)                  -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
        prosody, fluency, audio_quality, overall, comments, updated_at,
        UNIQUE(user_id, sample_id))
```
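
The `UNIQUE(user_id, sample_id)` constraint is what makes the resumable "one rating per reviewer per sample" behaviour possible. A minimal sketch of such an upsert with SQLite's `ON CONFLICT ... DO UPDATE` clause (available from SQLite 3.24) follows; the column types and the `save_rating` helper are assumptions, and only the column names come from the schema above.

```python
import sqlite3

# Column types are assumed; the names mirror the ratings table above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    naturalness INTEGER, intelligibility INTEGER, pronunciation INTEGER,
    prosody INTEGER, fluency INTEGER, audio_quality INTEGER, overall INTEGER,
    comments TEXT,
    updated_at TEXT,
    UNIQUE(user_id, sample_id)
)
"""

CRITERIA = ("naturalness", "intelligibility", "pronunciation",
            "prosody", "fluency", "audio_quality", "overall")


def save_rating(conn, user_id, sample_id, scores, comments=""):
    """Insert a rating, or overwrite this reviewer's earlier one for the sample."""
    cols = ", ".join(CRITERIA)
    params = ", ".join(f":{c}" for c in CRITERIA)
    updates = ", ".join(f"{c} = excluded.{c}" for c in CRITERIA)
    conn.execute(
        f"""
        INSERT INTO ratings (user_id, sample_id, {cols}, comments, updated_at)
        VALUES (:user_id, :sample_id, {params}, :comments, CURRENT_TIMESTAMP)
        ON CONFLICT(user_id, sample_id) DO UPDATE SET
            {updates},
            comments = excluded.comments,
            updated_at = excluded.updated_at
        """,
        {"user_id": user_id, "sample_id": sample_id, "comments": comments, **scores},
    )
    conn.commit()
```

Calling `save_rating` twice for the same reviewer and sample leaves a single row holding the latest scores.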

Persistent data lives under `MOS_DATA_DIR`: the `mos.db` backup, `audio/<language_id>/<file>`, and
`exports/`. The live database itself runs under `MOS_LOCAL_DIR`.

---

## Suggested additions (already built in where noted)

These come from standard MOS / listening-test practice and matter for a defensible result:

1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
   *reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
   and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
   the whole batch) is probably mis-calibrated and you can consider discounting them. This is one
   of the most important safeguards for credible MOS.
2. **Blind model names (built in).** Already enforced — reviewers never see which system produced
   a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
   deviation of the Overall score, so you can spot samples reviewers disagree on.
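
The anchor check in item 1 can be run per reviewer from the raw ratings. The sketch below is one possible approach, assuming each rating is reduced to a `(reviewer, is_reference, overall)` tuple; the function names and the 4.5 cut-off (taken from the "~4.5–5.0" guideline) are illustrative choices.

```python
from collections import defaultdict
from statistics import mean


def anchor_mos_by_reviewer(ratings):
    """Mean Overall score each reviewer gave to reference (human) clips."""
    scores = defaultdict(list)
    for reviewer, is_reference, overall in ratings:
        if is_reference:  # system clips are ignored here
            scores[reviewer].append(overall)
    return {reviewer: mean(vals) for reviewer, vals in scores.items()}


def flag_miscalibrated(ratings, threshold=4.5):
    """Reviewers whose anchor mean falls below the threshold, sorted by name."""
    anchors = anchor_mos_by_reviewer(ratings)
    return sorted(r for r, m in anchors.items() if m < threshold)
```

Reviewers who rated no reference clips are absent from the result, so they are neither flagged nor cleared.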

Worth considering for a v2 (not yet built):

4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
   and surface samples that are under-rated.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
   averages out, instead of always-ascending sample id.
6. **Listening-setup gate** — a one-time confirmation that the reviewer is on headphones in a quiet
   room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
   only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
    shown inline).
13. **Periodic DB snapshots** (recommended if you go to production scale on a Space): in addition
    to the built-in backup after every change, keep periodic snapshots of `mos.db` under a separate
    bucket path, so a bad or partially written backup can be rolled back.
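
For item 12, the standard error and an approximate 95% confidence interval can be computed from the Overall scores in the *Raw Ratings* sheet. This is a sketch using a normal approximation (`z = 1.96`); with only a handful of ratings, a t-distribution quantile would be more appropriate. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mos, standard_error, (ci_low, ci_high)) for a list of scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mos = mean(scores)
    se = stdev(scores) / math.sqrt(n)  # sample std dev over sqrt(n)
    return mos, se, (mos - z * se, mos + z * se)
```

For example, the scores `[4, 4, 5, 3, 4]` give a MOS of 4 with a standard error of about 0.32.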

---

## Notes & limitations

- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
  `MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
  for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer
  is isolated in a handful of functions and easy to swap.
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
  `launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
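
A sketch of how the journal mode from the first note might be applied when opening the database. The `open_db` helper and the fallback to `DELETE` for unrecognised values are assumptions; `PRAGMA journal_mode` itself is standard SQLite and returns the mode that is actually in effect.

```python
import os
import sqlite3

ALLOWED_MODES = {"DELETE", "WAL"}  # the two modes this README discusses


def open_db(path, journal_mode=None):
    """Open the DB and apply MOS_JOURNAL_MODE; return (conn, active_mode)."""
    requested = journal_mode or os.environ.get("MOS_JOURNAL_MODE", "DELETE")
    mode = requested.upper()
    if mode not in ALLOWED_MODES:
        mode = "DELETE"  # fall back to the safe default (assumed behaviour)
    conn = sqlite3.connect(path)
    # The pragma reports the resulting mode, which can differ from the
    # request if the filesystem cannot support it.
    active = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    return conn, active.upper()
```

Checking the returned mode at startup is a cheap way to confirm that WAL was accepted on the disk in use.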

---

*Generated for Afolabi Abeeb, Plotweaver AI.*