---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---
# 🎧 Plotweaver AI: TTS MOS Evaluation Platform
A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
listen to audio samples in the languages they are competent in, and rate each sample on
**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
Nigerian English, Akan, Swahili, Nigerian Pidgin, …; add more any time).
> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts; if it complains, the
> build log lists valid versions.
---
## What it does
- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
samples in those languages. They can update their language set themselves.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
profile, and the admin upload/results dropdowns, with no code change and no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
to reviewers**. They only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
Audio Quality · Overall Quality, plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
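
The hashing scheme in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the code in `app.py`: the function names `hash_password` and `verify_password` and the 16-byte salt length are assumptions; only the algorithm, the iteration count, and the per-user salt come from the description above.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # iteration count stated in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh per-user salt; 16 bytes is an assumption
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, hash_hex: str, salt_hex: str) -> bool:
    """Re-derive the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, hash_hex)
```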
---
## Environment variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
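| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Keep `DELETE` on bucket/FUSE mounts; set `WAL` only on a genuine local disk. |
| `PORT` | `7860` | Port the app binds to. |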
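
To make the role of `MOS_SECRET` concrete, here is a minimal sketch of signing and checking a session token with HMAC-SHA256. It is illustrative only: the token layout (`payload.signature`), the payload field names, the 7-day lifetime, and both function names are hypothetical, and the format the app actually uses may differ.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _signature(body: str, secret: str) -> str:
    """Hex HMAC-SHA256 of the encoded payload under the given secret."""
    return hmac.new(secret.encode("utf-8"), body.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_token(username: str, secret: str, ttl_seconds: int = 7 * 24 * 3600) -> str:
    """Build a 'payload.signature' token (hypothetical layout)."""
    payload = {"u": username, "exp": int(time.time()) + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return f"{body}.{_signature(body, secret)}"


def verify_token(token: str, secret: str) -> Optional[str]:
    """Return the username if the token is authentic and unexpired, else None."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so this is not one of our tokens
    expected = _signature(body, secret)
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body.encode("ascii")))
    if payload.get("exp", 0) < time.time():
        return None
    return payload.get("u")
```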
---
## Quick start (local)
```bash
pip install -r requirements.txt
python app.py
```
Open http://localhost:7860.
**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
admin code in the *Admin code* field. The default code is printed in the console on startup
(`plotweaver-admin` unless you override it). Override it with an environment variable:
```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL # optional: a real local disk supports WAL
python app.py
```
Anyone who signs up **without** the code becomes a normal reviewer.
---
## Deploying on a Hugging Face Space
This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
for the older fixed `/data` storage tier).
1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
**Private**. ZeroGPU is *not* needed: there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
Your audio clips are tiny, so this stays within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
   - Variable `MOS_DATA_DIR` = `/data`
   - Secret `ADMIN_CODE` = `<your secret>`
   - (Do **not** set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab: the active admin code is printed on startup. Open the
app, *Create account*, enter the admin code, and you're the admin.
Notes:
- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because
object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
after every change, then restored on startup. Audio files live on the bucket (static, write-once,
no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page
refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
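
The backup-and-restore cycle described in the database note can be sketched as follows. This is an assumption-laden illustration rather than the app's code: `backup_db` and `restore_db` are hypothetical names, and the sketch uses SQLite's online backup API plus a rename into place. Note that `os.replace` is atomic on ordinary local filesystems; whether a given bucket mount honours that is not guaranteed.

```python
import os
import shutil
import sqlite3


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB to a temp file, then rename it over the backup."""
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp_path = backup_path + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # discard leftovers from an interrupted backup
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(tmp_path)
    try:
        src.backup(dst)  # consistent copy via SQLite's online backup API
    finally:
        dst.close()
        src.close()
    os.replace(tmp_path, backup_path)  # readers see the old or the new file, never half


def restore_db(backup_path: str, live_path: str) -> bool:
    """Copy the backup onto local disk at startup; False if no backup exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```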
### Self-hosted (AlmaLinux / AWS, behind nginx)
Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:
```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```
Point your nginx `location /` block at `http://127.0.0.1:7860;` with the usual
`proxy_set_header Upgrade/Connection` lines for websockets.
---
## How an evaluation round works
1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
(e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).
### Final MOS per language
The summary line reports two headline numbers per language:
- **System MOS (Overall criterion)**: the mean of the dedicated *Overall Quality* ratings. This is
the conventional single MOS number to quote.
- **System MOS (mean of all criteria)**: the mean across all 7 criteria, useful when you want a
composite that weights pronunciation/prosody equally with overall impression.
Both are broken down per model so you can compare systems within a language directly
(e.g. F5-TTS vs XTTS-v2 for Yoruba).
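
The two headline numbers can be reproduced from the raw ratings roughly as below. This is a sketch under assumptions: the function name `system_mos` and the dict-per-rating input are hypothetical, column names follow the `ratings` / `samples` tables, and the means are pooled over individual ratings (the app may instead average per sample first). Reference anchors are skipped, as the text says they are excluded from the system MOS.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ["naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall"]


def system_mos(ratings):
    """ratings: iterable of dicts with 'model_name', 'is_reference' and the 7 criteria."""
    by_model = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:
            continue  # anchors are reported separately, not in the system MOS
        by_model[r["model_name"]].append(r)
    results = {}
    for model, rows in by_model.items():
        results[model] = {
            "mos_overall": mean(r["overall"] for r in rows),
            "mos_all_criteria": mean(r[c] for r in rows for c in CRITERIA),
            "n_ratings": len(rows),
        }
    return results
```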
---
## Data model (SQLite)
```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id) -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
prosody, fluency, audio_quality, overall, comments, updated_at,
UNIQUE(user_id, sample_id))
```
Persistent data lives under `MOS_DATA_DIR`: the `mos.db` backup, `audio/<language_id>/<file>`, and `exports/`. The live database itself runs from `MOS_LOCAL_DIR`.
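
The `UNIQUE(user_id, sample_id)` constraint is what makes "one rating per reviewer per sample" enforceable. One way to implement the upsert is SQLite's `ON CONFLICT ... DO UPDATE` clause, sketched here against a cut-down `ratings` table (only the `overall` criterion is kept for brevity). The helper name `upsert_rating` is hypothetical and the app's actual SQL may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ratings (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        sample_id INTEGER NOT NULL,
        overall INTEGER NOT NULL,
        comments TEXT,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(user_id, sample_id)
    )
""")


def upsert_rating(con, user_id, sample_id, overall, comments=""):
    """Insert a rating, or overwrite the reviewer's earlier rating of the same sample."""
    con.execute(
        """
        INSERT INTO ratings (user_id, sample_id, overall, comments)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(user_id, sample_id) DO UPDATE SET
            overall = excluded.overall,
            comments = excluded.comments,
            updated_at = CURRENT_TIMESTAMP
        """,
        (user_id, sample_id, overall, comments),
    )
    con.commit()


upsert_rating(con, 1, 10, 3)
upsert_rating(con, 1, 10, 5, "changed my mind")  # same reviewer, same sample
```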
---
## Suggested additions (beyond what you asked for, already built in where noted)
These come from standard MOS / listening-test practice and matter for a defensible result:
1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
*reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
the whole batch) is mis-calibrated and you can discount them. This is the single most important
safeguard for credible MOS.
2. **Blind model names (built in).** Already enforced: reviewers never see which system produced
a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
deviation of the Overall score, so you can spot samples reviewers disagree on.
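
The anchor check in item 1 can be applied per reviewer from the raw export. The sketch below is one possible way to do it, assuming you have extracted `(reviewer, overall)` pairs for reference clips only; the function name `flag_miscalibrated` is hypothetical and the 4.5 threshold is taken from the rule of thumb above.

```python
from collections import defaultdict
from statistics import mean


def flag_miscalibrated(anchor_ratings, threshold=4.5):
    """Return reviewers whose mean Overall score on reference clips is below threshold."""
    per_reviewer = defaultdict(list)
    for reviewer, overall in anchor_ratings:
        per_reviewer[reviewer].append(overall)
    return sorted(r for r, scores in per_reviewer.items() if mean(scores) < threshold)
```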
Worth considering for a v2 (not yet built):
4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
and surface samples that are under-rated.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
averages out, instead of always-ascending sample id.
6. **Listening-setup gate**: a one-time confirmation that the reviewer is on headphones in a quiet
room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
shown inline).
13. **Periodic DB snapshots to the bucket** (recommended if you go to production scale on a Space):
in addition to the per-change backup described above, a periodic snapshot of `mos.db` to a
separate bucket path, with restore-on-startup, to further reduce the mid-write corruption risk.
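
For item 12, a confidence interval can be computed from the *Raw Ratings* sheet with a few lines. This sketch uses a normal approximation (z = 1.96 for roughly 95%), which is an assumption on my part: with few ratings a t-distribution is more appropriate, and ratings from the same reviewer are not strictly independent. The function name `mos_with_ci` is hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mos, lower, upper) using the standard error of the mean."""
    n = len(scores)
    m = mean(scores)
    if n < 2:
        return m, float("nan"), float("nan")  # spread is undefined for one rating
    se = stdev(scores) / math.sqrt(n)  # sample standard deviation / sqrt(n)
    return m, m - z * se, m + z * se
```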
---
## Notes & limitations
- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
`MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer
is isolated in a handful of functions and easy to swap.
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
`launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
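
As a rough illustration of the journal-mode note, a connection helper might read `MOS_JOURNAL_MODE` and apply it with a `PRAGMA`. The helper name `connect` and the fallback for unrecognised values are assumptions. The `foreign_keys` pragma is included because SQLite only enforces `ON DELETE CASCADE` when it is switched on per connection; whether the app cascades through foreign keys or in application code is not stated here.

```python
import os
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open the DB with the configured journal mode (DELETE unless overridden)."""
    mode = os.environ.get("MOS_JOURNAL_MODE", "DELETE").upper()
    if mode not in {"DELETE", "WAL"}:
        mode = "DELETE"  # assumption: fall back to the bucket-safe default
    con = sqlite3.connect(db_path)
    con.execute(f"PRAGMA journal_mode={mode}")
    con.execute("PRAGMA foreign_keys=ON")  # needed for ON DELETE CASCADE to fire
    return con
```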
---
*Generated for Afolabi Abeeb, Plotweaver AI.*