---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---
# 🎧 Plotweaver AI: TTS MOS Evaluation Platform
A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
listen to audio samples in the languages they are competent in, and rate each sample on
**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
per language and per model, with one-click export for your paper.
Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).
> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts; if it complains, the
> build log lists valid versions.
---
## What it does
- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
samples in those languages. They can update their language set themselves.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
profile, and the admin upload/results dropdowns, with no code change and no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
to reviewers**; they only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
Audio Quality · Overall Quality, plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
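
The password scheme named above (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt) can be sketched with the standard library. This is a minimal illustration, not the code in `app.py`: the function names `hash_password` and `verify_password`, the 16-byte salt length, and the hex encoding are assumptions.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # iteration count stated in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex) using PBKDF2-HMAC-SHA256 with a per-user salt."""
    if salt is None:
        salt = os.urandom(16)  # assumption: 16 random bytes per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, hash_hex: str, salt_hex: str) -> bool:
    """Re-derive the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, hash_hex)
```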
---
## Environment variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Keep the bucket-safe `DELETE` default on a Space; set `WAL` only on a genuine local disk. |
| `PORT` | `7860` | Port the app binds to. |
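
As a rough sketch of how these variables and defaults might be read at startup: the `Settings` class and `load_settings` function are hypothetical names, and the way `MOS_SECRET` is derived from `ADMIN_CODE` here (a SHA-256 digest) is an assumption, since the table only says it is derived. `MOS_LOCAL_DIR` is left as `None` when unset because its default path is not spelled out in full.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_dir: str
    local_dir: Optional[str]  # None means "use the app's own default under system temp"
    admin_code: str
    secret: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    admin_code = env.get("ADMIN_CODE", "plotweaver-admin")
    # Assumption: the real derivation of the session secret may differ.
    derived = hashlib.sha256(("mos-secret:" + admin_code).encode("utf-8")).hexdigest()
    return Settings(
        data_dir=env.get("MOS_DATA_DIR", "./data"),
        local_dir=env.get("MOS_LOCAL_DIR"),
        admin_code=admin_code,
        secret=env.get("MOS_SECRET", derived),
        port=int(env.get("PORT", "7860")),
    )
```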
---
## Quick start (local)
```bash
pip install -r requirements.txt
python app.py
```
Open http://localhost:7860.
**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
admin code in the *Admin code* field. The default code is printed in the console on startup
(`plotweaver-admin` unless you override it). Override it with an environment variable:
```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL # optional: a real local disk supports WAL
python app.py
```
Anyone who signs up **without** the code becomes a normal reviewer.
---
## Deploying on a Hugging Face Space
This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
for the older fixed `/data` storage tier).
1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
**Private**. ZeroGPU is *not* needed: there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
Your audio clips are tiny, so this stays within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
- Variable `MOS_DATA_DIR` = `/data`
- Secret `ADMIN_CODE` = `<your secret>`
- (Do **not** set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab; the active admin code is printed on startup. Open the
app, *Create account*, enter the admin code, and you're the admin.
Notes:
- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because
object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
after every change, then restored on startup. Audio files live on the bucket (static, write-once,
no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page
refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
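
The backup-and-restore pattern described in the notes above can be sketched as follows. This is an illustration under assumptions, not the app's actual code: `backup_db` and `restore_db` are hypothetical names, and the sketch takes a consistent snapshot on local disk with SQLite's backup API before copying it to the persistent directory, so SQLite itself never opens a file on the bucket.

```python
import os
import shutil
import sqlite3


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB locally, then publish it to the persistent directory."""
    snapshot = live_path + ".snapshot"
    if os.path.exists(snapshot):
        os.remove(snapshot)  # start from a clean file
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(snapshot)
    try:
        src.backup(dst)  # consistent copy even while the live DB is in use
    finally:
        dst.close()
        src.close()
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp = backup_path + ".tmp"
    shutil.copyfile(snapshot, tmp)
    os.replace(tmp, backup_path)  # rename over the old backup in one step
    os.remove(snapshot)


def restore_db(backup_path: str, live_path: str) -> bool:
    """Copy the backup onto local disk at startup; return False if none exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

With the documented layout, a call would look like `backup_db(<MOS_LOCAL_DIR>/mos.db, <MOS_DATA_DIR>/mos.db)` after each write. Whether the final rename is truly atomic depends on the mounted filesystem, so treat that as a best-effort property on a bucket mount.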
### Self-hosted (AlmaLinux / AWS, behind nginx)
Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:
```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```
Point your nginx `location /` block at `http://127.0.0.1:7860;` with the usual
`proxy_set_header Upgrade/Connection` lines for websockets.
---
## How an evaluation round works
1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
(e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).
### Final MOS per language
The summary line reports two headline numbers per language:
- **System MOS (Overall criterion):** the mean of the dedicated *Overall Quality* ratings. This is
the conventional single MOS number to quote.
- **System MOS (mean of all criteria):** the mean across all 7 criteria, useful when you want a
composite that weights pronunciation/prosody equally with overall impression.
Both are broken down per model so you can compare systems within a language directly
(e.g. F5-TTS vs XTTS-v2 for Yoruba).
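
A small worked sketch of the two headline numbers, computed per model from rating rows. The function name `system_mos` and the dict-based row shape are assumptions for illustration. The sketch averages over individual ratings and skips reference clips; the app may aggregate differently (for example, per sample first).

```python
from statistics import mean

CRITERIA = ["naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall"]


def system_mos(ratings):
    """ratings: list of dicts with 'model_name', 'is_reference', and the 7 criteria."""
    by_model = {}
    for r in ratings:
        if r["is_reference"]:
            continue  # anchors are reported separately, not in the system MOS
        by_model.setdefault(r["model_name"], []).append(r)
    result = {}
    for model, rows in by_model.items():
        result[model] = {
            # Headline 1: mean of the dedicated Overall Quality ratings.
            "mos_overall": mean(row["overall"] for row in rows),
            # Headline 2: mean across all 7 criteria, equally weighted.
            "mos_all_criteria": mean(row[c] for row in rows for c in CRITERIA),
            "n_ratings": len(rows),
        }
    return result
```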
---
## Data model (SQLite)
```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id) -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
prosody, fluency, audio_quality, overall, comments, updated_at,
UNIQUE(user_id, sample_id))
```
Persistent data lives under `MOS_DATA_DIR`: the `mos.db` backup, `audio/<language_id>/<file>`, and `exports/`. The live database itself runs from `MOS_LOCAL_DIR`.
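
The `UNIQUE(user_id, sample_id)` constraint is what makes a one-rating-per-reviewer-per-sample upsert possible. A minimal sketch for the `ratings` table only: the column types, the `save_rating` helper, and the use of SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer) are assumptions, not necessarily how `app.py` does it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings(
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    naturalness INTEGER, intelligibility INTEGER, pronunciation INTEGER,
    prosody INTEGER, fluency INTEGER, audio_quality INTEGER, overall INTEGER,
    comments TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, sample_id)
);
"""

UPSERT = """
INSERT INTO ratings(user_id, sample_id, naturalness, intelligibility, pronunciation,
                    prosody, fluency, audio_quality, overall, comments)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(user_id, sample_id) DO UPDATE SET
    naturalness = excluded.naturalness,
    intelligibility = excluded.intelligibility,
    pronunciation = excluded.pronunciation,
    prosody = excluded.prosody,
    fluency = excluded.fluency,
    audio_quality = excluded.audio_quality,
    overall = excluded.overall,
    comments = excluded.comments,
    updated_at = CURRENT_TIMESTAMP
"""


def save_rating(con, user_id, sample_id, scores, comments=""):
    """Insert or overwrite a rating; scores are 7 integers in column order."""
    con.execute(UPSERT, (user_id, sample_id, *scores, comments))
    con.commit()
```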
---
## Suggested additions (already built in where noted)
These come from standard MOS / listening-test practice and matter for a defensible result:
1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
*reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
the whole batch) is mis-calibrated and you can discount them. This is the single most important
safeguard for credible MOS.
2. **Blind model names (built in).** Already enforced: reviewers never see which system produced
a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
deviation of the Overall score, so you can spot samples reviewers disagree on.
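
The anchor check in item 1 can be applied per reviewer to the raw ratings. A minimal sketch, assuming rows shaped like dicts with `user_id`, `is_reference`, and `overall` keys; the function name `miscalibrated_reviewers` is hypothetical and the 4.5 threshold is taken from the range quoted above.

```python
from collections import defaultdict
from statistics import mean


def miscalibrated_reviewers(ratings, threshold=4.5):
    """Return {user_id: anchor_mean} for reviewers whose mean Overall score
    on reference (human) clips falls below the threshold."""
    anchor_scores = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:
            anchor_scores[r["user_id"]].append(r["overall"])
    flagged = {}
    for user_id, scores in anchor_scores.items():
        anchor_mean = mean(scores)
        if anchor_mean < threshold:
            flagged[user_id] = anchor_mean
    return flagged
```
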
Worth considering for a v2 (not yet built; happy to add):
4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
and surface samples that are under-rated.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
averages out, instead of always-ascending sample id.
6. **Listening-setup gate** β€” a one-time confirmation that the reviewer is on headphones in a quiet
room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
shown inline).
13. **Periodic DB snapshots** (recommended if you go to production scale on a Space): in addition
to the after-every-change backup at `MOS_DATA_DIR/mos.db`, keep periodic snapshots of `mos.db`
under a separate bucket path, with restore-on-startup, to reduce the risk of mid-write corruption.
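
For item 12, the standard error and a confidence interval can be computed from the exported raw ratings. A minimal sketch using a normal approximation (`z = 1.96` for roughly 95%); the function name `mos_with_ci` is hypothetical, and with only a handful of ratings a t-distribution interval would be wider and more appropriate.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for a list of 1-5 ratings."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    m = mean(scores)
    se = stdev(scores) / sqrt(n)  # sample standard deviation over sqrt(n)
    return m, se, (m - z * se, m + z * se)
```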
---
## Notes & limitations
- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
`MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres (the data layer
is isolated in a handful of functions and easy to swap).
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
`launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
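
A sketch of how a connection helper might apply the journal-mode setting described above. The `connect` function is a hypothetical name. Enabling `foreign_keys` is included on the assumption that the cascade delete is implemented with foreign-key constraints, which SQLite only enforces when that pragma is switched on per connection.

```python
import os
import sqlite3
from typing import Mapping, Optional


def connect(db_path: str, env: Optional[Mapping[str, str]] = None) -> sqlite3.Connection:
    """Open the DB with the configured journal mode, defaulting to DELETE."""
    env = os.environ if env is None else env
    mode = env.get("MOS_JOURNAL_MODE", "DELETE").upper()
    if mode not in {"DELETE", "WAL"}:
        mode = "DELETE"  # fall back to the bucket-safe default on unknown values
    con = sqlite3.connect(db_path)
    con.execute(f"PRAGMA journal_mode={mode}")
    con.execute("PRAGMA foreign_keys=ON")  # needed for ON DELETE CASCADE to apply
    return con
```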
---
*Generated for Afolabi Abeeb, Plotweaver AI.*