---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---

# 🎧 Plotweaver AI: TTS MOS Evaluation Platform

A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
listen to audio samples in the languages they are competent in, and rate each sample on
**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts; if the build
> complains, the build log lists valid versions.

---

## What it does

- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
  200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
  samples in those languages. They can update their language set themselves.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
  profile, and the admin upload/results dropdowns, with no code change and no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
  to reviewers**; they only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
  Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
  Audio Quality · Overall Quality, plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
  and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
  deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
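
The password scheme named in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the code in `app.py`: the function names and the 16-byte salt length are assumptions, while the algorithm and the 200k iteration count come from the description above.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # "200k iterations", as stated in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex) computed with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user (length is an assumption)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, stored_hash: str, stored_salt: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(stored_salt))
    return hmac.compare_digest(candidate, stored_hash)
```

The two hex strings map naturally onto the `password_hash` and `salt` columns of the `users` table.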

---

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Set to `WAL` only on a genuine local disk; leave unset on bucket/FUSE mounts. |
| `PORT` | `7860` | Port the app binds to. |
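
As a rough sketch of how these variables and their defaults fit together, the loader below reads them from a mapping such as `os.environ`. The `Settings` class and `load_settings` function are hypothetical names, and the way the fallback secret is derived from `ADMIN_CODE` (a SHA-256 hash here) is an assumption; the table only says that it is derived.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    data_dir: str
    local_dir: str
    admin_code: str
    secret: str
    journal_mode: str
    port: int


def load_settings(env=os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    admin_code = env.get("ADMIN_CODE", "plotweaver-admin")
    # Assumption: the real derivation is not documented; hashing is one plausible choice.
    derived = hashlib.sha256(("mos-session:" + admin_code).encode("utf-8")).hexdigest()
    return Settings(
        data_dir=env.get("MOS_DATA_DIR", "./data"),
        local_dir=env.get("MOS_LOCAL_DIR", os.path.join(tempfile.gettempdir(), "mos_live")),
        admin_code=admin_code,
        secret=env.get("MOS_SECRET", derived),
        journal_mode=env.get("MOS_JOURNAL_MODE", "DELETE"),
        port=int(env.get("PORT", "7860")),
    )
```

Because a derived secret changes whenever `ADMIN_CODE` changes, rotating the admin code would invalidate existing sessions under this sketch, which is one reason to set `MOS_SECRET` explicitly.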

---

## Quick start (local)

```bash
pip install -r requirements.txt
python app.py
```

Open http://localhost:7860.

**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
admin code in the *Admin code* field. The default code is printed in the console on startup
(`plotweaver-admin` unless you override it). Override it with an environment variable:

```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
python app.py
```

Anyone who signs up **without** the code becomes a normal reviewer.

---

## Deploying on a Hugging Face Space

This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
for the older fixed `/data` storage tier).

1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
   **Private**. ZeroGPU is *not* needed: there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
   private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
   Your audio clips are tiny, so this stays within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
   - Variable `MOS_DATA_DIR` = `/data`
   - Secret `ADMIN_CODE` = `<your secret>`
   - (Do **not** set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
   the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab: the active admin code is printed on startup. Open the
   app, *Create account*, enter the admin code, and you're the admin.

Notes:

- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
  audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because
  object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
  appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
  local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
  after every change, then restored on startup. Audio files live on the bucket (static, write-once,
  no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page
  refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
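
The "live DB on local disk, backup on the bucket" pattern from the notes above can be sketched as follows. The function names are hypothetical and this is not the app's actual code. It snapshots with SQLite's online backup API on local disk, copies the snapshot to the bucket under a temporary name, and renames it into place. Whether that final rename is truly atomic depends on the bucket mount, which is an assumption here.

```python
import os
import shutil
import sqlite3


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB locally, then publish it to the bucket via rename."""
    snapshot = live_path + ".snapshot"
    if os.path.exists(snapshot):
        os.remove(snapshot)  # discard leftovers from an interrupted backup
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(snapshot)
    try:
        src.backup(dst)  # consistent copy even while the live DB is in use
    finally:
        dst.close()
        src.close()
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp_path = backup_path + ".tmp"
    shutil.copyfile(snapshot, tmp_path)  # plain file copy; SQLite never opens the bucket
    os.replace(tmp_path, backup_path)    # swap the finished file into place
    os.remove(snapshot)


def restore_db(live_path: str, backup_path: str) -> bool:
    """On startup, copy the bucket backup to local disk if one exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

In this sketch `backup_db` would run after each committed write and `restore_db` once before the first connection is opened.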

### Self-hosted (AlmaLinux / AWS, behind nginx)

Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:

```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```

In your nginx `location /` block, set `proxy_pass http://127.0.0.1:7860;` and add the usual
`proxy_set_header Upgrade` / `Connection` lines for websockets.

---

## How an evaluation round works

1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
   (e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
   Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
   player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
   tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).

### Final MOS per language

The summary line reports two headline numbers per language:

- **System MOS (Overall criterion)**: the mean of the dedicated *Overall Quality* ratings. This is
  the conventional single MOS number to quote.
- **System MOS (mean of all criteria)**: the mean across all 7 criteria, useful when you want a
  composite that weights pronunciation/prosody equally with overall impression.

Both are broken down per model so you can compare systems within a language directly
(e.g. F5-TTS vs XTTS-v2 for Yoruba).
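
To make the two headline numbers concrete, here is a small sketch of the per-model aggregation. It assumes ratings arrive as dictionaries keyed by the `ratings` and `samples` column names; the function name is hypothetical, and using the sample (n-1) standard deviation is an assumption about how the dashboard computes its spread.

```python
from collections import defaultdict
from statistics import mean, stdev

CRITERIA = ["naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall"]


def mos_per_model(ratings):
    """ratings: iterable of dicts with 'model_name', 'is_reference' and the 7 criteria."""
    by_model = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:
            continue  # reference anchors are reported separately, not in system MOS
        by_model[r["model_name"]].append(r)

    results = {}
    for model, rows in by_model.items():
        overall = [r["overall"] for r in rows]
        all_scores = [r[c] for r in rows for c in CRITERIA]
        results[model] = {
            "n_ratings": len(rows),
            "mos_overall": mean(overall),          # System MOS (Overall criterion)
            "mos_all_criteria": mean(all_scores),  # System MOS (mean of all criteria)
            "sd_overall": stdev(overall) if len(overall) > 1 else 0.0,
        }
    return results
```

Filtering the input to a single language before calling the function gives the per-language, per-model breakdown described above.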

---

## Data model (SQLite)

```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id)            -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
        prosody, fluency, audio_quality, overall, comments, updated_at,
        UNIQUE(user_id, sample_id))
```

Persistent data lives under `MOS_DATA_DIR`: `mos.db` (the backup copy; the live database runs
under `MOS_LOCAL_DIR`), `audio/<language_id>/<file>`, and `exports/`.
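
The "one rating per reviewer per sample" behaviour rests on the `UNIQUE(user_id, sample_id)` constraint. A minimal sketch of an upsert against that table is shown below; the `save_rating` helper and the column types in `SCHEMA` are assumptions for illustration, and the `ON CONFLICT ... DO UPDATE` syntax needs SQLite 3.24 or newer.

```python
import sqlite3

FIELDS = ["naturalness", "intelligibility", "pronunciation", "prosody",
          "fluency", "audio_quality", "overall", "comments"]

# Simplified stand-in for the ratings table (column types are assumptions).
SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings(
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    naturalness INTEGER, intelligibility INTEGER, pronunciation INTEGER,
    prosody INTEGER, fluency INTEGER, audio_quality INTEGER, overall INTEGER,
    comments TEXT, updated_at TEXT,
    UNIQUE(user_id, sample_id))
"""

UPSERT_SQL = (
    "INSERT INTO ratings (user_id, sample_id, {cols}, updated_at) "
    "VALUES (:user_id, :sample_id, {vals}, CURRENT_TIMESTAMP) "
    "ON CONFLICT(user_id, sample_id) DO UPDATE SET {sets}, updated_at = CURRENT_TIMESTAMP"
).format(
    cols=", ".join(FIELDS),
    vals=", ".join(":" + f for f in FIELDS),
    sets=", ".join(f"{f} = excluded.{f}" for f in FIELDS),
)


def save_rating(con: sqlite3.Connection, rating: dict) -> None:
    """Insert a rating, or overwrite the reviewer's earlier rating of the same sample."""
    con.execute(UPSERT_SQL, rating)
    con.commit()
```

Saving twice with the same `user_id` and `sample_id` leaves a single row holding the latest values, which is what lets reviewers revisit a sample.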

---

## Suggested additions (beyond what you asked for, already built in where noted)

These come from standard MOS / listening-test practice and matter for a defensible result:

1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
   *reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
   and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
   the whole batch) is mis-calibrated and you can discount them. This is the single most important
   safeguard for credible MOS.
2. **Blind model names (built in).** Already enforced; reviewers never see which system produced
   a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
   deviation of the Overall score, so you can spot samples reviewers disagree on.
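
The anchor check in item 1 can be run per reviewer on the raw ratings. The helper below is a hypothetical sketch, not a built-in feature: it assumes each rating is a dictionary carrying `user_id`, `is_reference` and `overall`, and uses 4.5 (the lower end of the range quoted above) as the cut-off.

```python
from collections import defaultdict
from statistics import mean


def flag_miscalibrated(ratings, threshold=4.5):
    """Return {user_id: anchor_mean} for reviewers whose reference clips average below threshold."""
    anchors = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:
            anchors[r["user_id"]].append(r["overall"])
    flagged = {}
    for user_id, scores in anchors.items():
        anchor_mean = mean(scores)
        if anchor_mean < threshold:
            flagged[user_id] = anchor_mean
    return flagged
```

Reviewers who have not yet rated any reference clip never appear in the result, so an empty dictionary does not by itself mean everyone is calibrated.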

Worth considering for a v2 (not yet built):

4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
   and surface samples that are under-rated.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
   averages out, instead of always-ascending sample id.
6. **Listening-setup gate**: a one-time confirmation that the reviewer is on headphones in a quiet
   room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
   only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
    shown inline).
13. **Periodic DB snapshots** (recommended if you go to production scale on a Space): in addition
    to the existing backup after every change, a periodic snapshot of `mos.db` to a separate
    bucket path, with restore-on-startup, as extra protection against a corrupted or half-written
    backup.
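
For item 12, the standard error and an approximate 95% confidence interval can be computed from the *Raw Ratings* sheet with a few lines. This sketch uses the normal approximation (z = 1.96); with only a handful of ratings per model a t-based interval would be wider and more appropriate. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mean, standard_error, ci_low, ci_high) for a list of 1-5 ratings."""
    n = len(scores)
    m = mean(scores)
    if n < 2:
        return m, None, None, None  # spread is undefined for a single rating
    se = stdev(scores) / math.sqrt(n)  # sample standard deviation over sqrt(n)
    return m, se, m - z * se, m + z * se
```

For example, `mos_with_ci([3, 4, 5, 4, 4])` gives a mean of 4 with an interval of roughly 3.38 to 4.62.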

---

## Notes & limitations

- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
  `MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. Comfortable for a few
  dozen concurrent reviewers; for larger crowdsourcing, move to Postgres; the data layer is
  isolated in a handful of functions and easy to swap.
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
  `launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
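
A connection helper along the following lines shows how the journal mode switch could be applied. The `connect` function and its allow-list are assumptions, not the app's actual code. The `foreign_keys` pragma is included because SQLite only enforces `ON DELETE CASCADE` when it is turned on per connection; whether the app relies on that or deletes ratings explicitly is not stated above.

```python
import os
import sqlite3

ALLOWED_MODES = {"DELETE", "WAL"}


def connect(db_path: str, env=os.environ):
    """Open the DB and apply MOS_JOURNAL_MODE; returns (connection, active_mode)."""
    mode = env.get("MOS_JOURNAL_MODE", "DELETE").upper()
    if mode not in ALLOWED_MODES:
        mode = "DELETE"  # fall back to the bucket-safe default
    con = sqlite3.connect(db_path)
    # PRAGMA values cannot be bound as parameters, hence the allow-list above.
    active = con.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    con.execute("PRAGMA foreign_keys=ON")  # needed for cascading deletes in SQLite
    return con, active
```

SQLite reports the mode it actually applied, so checking the returned value is a cheap way to notice a filesystem that refused WAL.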

---

*Generated for Afolabi Abeeb, Plotweaver AI.*