metadata
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false

🎧 Plotweaver AI: TTS MOS Evaluation Platform

A multi-user Gradio application for collecting Mean Opinion Score (MOS) ratings of synthesised speech across multiple languages. Reviewers create accounts, listen to audio samples in the languages they are competent in, and rate each sample on 7 criteria (1–5). An admin uploads audio per language and reads off aggregated MOS results per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo, Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

Note on the front-matter above: Hugging Face Spaces reads the YAML block at the top of this file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file. Set sdk_version to whatever current Gradio 5.x the Space build accepts; if it complains, the build log lists valid versions.


What it does

  • Accounts & roles. Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt). Two roles: reviewer and admin.
  • Per-language gating. A reviewer is assigned a set of languages and only ever sees / rates samples in those languages. They can update their language set themselves.
  • Add languages any time. New languages appear instantly in the signup form, the reviewer profile, and the admin upload/results dropdowns, with no code change and no redeploy.
  • Blind evaluation. The model / system name is stored with each sample but is never shown to reviewers: they only see "Sample N", so ratings aren't biased by branding.
  • 7-criterion MOS form matching TTS_Evaluation_Criteria.docx: Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency · Audio Quality · Overall Quality, plus a free-text comments box.
  • Resumable rating. Ratings are upserted (one per reviewer per sample); reviewers can revisit and change a rating, and a "Next unrated ▶" button walks them through the queue.
  • Results dashboard. Per-model MOS, per-sample MOS, reviewer/sample counts, standard deviation, and the separate reference-anchor MOS (see below). Export to .xlsx.
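
The password scheme named above (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt) can be sketched with Python's standard library. This is a minimal illustration, not the code in app.py: the function names hash_password and verify_password, the 16-byte salt length, and the hex encoding of the stored values are all assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # iteration count stated in this README


def hash_password(password, salt=None):
    """Return (hash_hex, salt_hex). A fresh random salt is drawn per user."""
    if salt is None:
        salt = os.urandom(16)  # ASSUMPTION: salt length is not given in the README
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return digest.hex(), salt.hex()


def verify_password(password, stored_hash, stored_salt):
    """Re-derive the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(stored_salt))
    return hmac.compare_digest(candidate, stored_hash)
```

The two returned strings would map onto the password_hash and salt columns of the users table. Comparing with hmac.compare_digest avoids leaking timing information about where two hashes differ.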

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| MOS_DATA_DIR | ./data | Persistent location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. /data). |
| MOS_LOCAL_DIR | system temp /…/mos_live | Fast local disk where the live SQLite DB runs. It is backed up to MOS_DATA_DIR after every change and restored on startup. Usually leave as default. |
| ADMIN_CODE | plotweaver-admin | Entered on the signup form to create an admin account. Override this. |
| MOS_SECRET | derived from ADMIN_CODE | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
| PORT | 7860 | Port the app binds to. |
| MOS_JOURNAL_MODE | DELETE | SQLite journal mode. DELETE is safe on bucket/FUSE mounts; set WAL only on a genuine local disk. |
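
As a rough sketch of how these variables and their defaults could be read at startup (the helper name load_config and the returned dictionary keys are hypothetical, and the way MOS_SECRET is derived from ADMIN_CODE is an assumption, since this README does not say how the app does it):

```python
import hashlib
import os


def load_config(env=None):
    """Read the documented environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    admin_code = env.get("ADMIN_CODE", "plotweaver-admin")
    # ASSUMPTION: the README only says the secret is "derived from ADMIN_CODE";
    # a SHA-256 of the code is one plausible derivation, not the app's actual one.
    derived_secret = hashlib.sha256(admin_code.encode("utf-8")).hexdigest()
    return {
        "data_dir": env.get("MOS_DATA_DIR", "./data"),
        # None here means "let the app pick its own default under the system temp dir".
        "local_dir": env.get("MOS_LOCAL_DIR"),
        "admin_code": admin_code,
        "secret": env.get("MOS_SECRET") or derived_secret,
        "port": int(env.get("PORT", "7860")),
    }
```

Passing a plain dictionary instead of os.environ makes the defaults easy to check without touching the real environment, for example load_config({"PORT": "8000"}).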

Quick start (local)

pip install -r requirements.txt
python app.py

Open http://localhost:7860.

Create the first admin: go to Create account, fill in a username/password, and enter the admin code in the Admin code field. The default code is printed in the console on startup (plotweaver-admin unless you override it). Override it with an environment variable:

export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
python app.py

Anyone who signs up without the code becomes a normal reviewer.


Deploying on a Hugging Face Space

This app writes its database and uploaded audio to disk, so it needs persistent storage. Space filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU tier sleeps. The current way to persist data on a Space is a Storage Bucket (HF's replacement for the older fixed /data storage tier).

  1. Create the Space. New → Space → SDK Gradio, hardware CPU basic (free), visibility Private. ZeroGPU is not needed: there's no inference here, only file serving and SQLite.
  2. Mount a Storage Bucket. Space → Settings → Storage Buckets → Mount a bucket. Create a private bucket (e.g. mos-eval-data), mode read-write, and set the mount path to /data. Your audio clips are tiny, so this stays within the free private-storage allowance.
  3. Set variables / secrets. Space → Settings → Variables and secrets:
    • Variable MOS_DATA_DIR = /data
    • Secret ADMIN_CODE = <your secret>
    • (Do not set MOS_JOURNAL_MODE; the bucket-safe DELETE default is what you want here.)
  4. Upload app.py, requirements.txt, README.md (this file, with its YAML front-matter) via the Files tab or git push.
  5. First boot. Watch the Logs tab; the active admin code is printed on startup. Open the app, Create account, enter the admin code, and you're the admin.

Notes:

  • The app already calls launch(allowed_paths=[MOS_DATA_DIR]), so Gradio is permitted to serve the audio files stored on the bucket. Without this, the audio player can't load clips from /data.
  • Database on buckets: the live SQLite database does not run directly on the bucket, because object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast local disk (MOS_LOCAL_DIR) and is atomically backed up to the bucket (MOS_DATA_DIR/mos.db) after every change, then restored on startup. Audio files live on the bucket (static, write-once, no locking issues). This survives Space restarts as long as the bucket stays mounted.
  • Staying logged in: a signed token is stored in the browser (gr.BrowserState) so a page refresh keeps you signed in. Set a stable MOS_SECRET in production.
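
The "live DB on local disk, backed up to the bucket" pattern described in the notes above could look roughly like the following. This is a sketch under stated assumptions, not the code in app.py: the function names backup_db and restore_db and the ".snapshot" / ".tmp" suffixes are hypothetical, and whether the final rename is truly atomic depends on the bucket mount.

```python
import os
import shutil
import sqlite3


def backup_db(live_path, backup_path):
    """Snapshot the live DB locally, then publish it to the bucket in one rename."""
    snapshot = live_path + ".snapshot"  # hypothetical scratch file on local disk
    if os.path.exists(snapshot):
        os.remove(snapshot)  # leftover from an interrupted backup
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(snapshot)
    try:
        src.backup(dst)  # SQLite online backup: a consistent copy of the live DB
    finally:
        dst.close()
        src.close()
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp_path = backup_path + ".tmp"
    shutil.copyfile(snapshot, tmp_path)  # plain file copy; no SQLite locking on the bucket
    os.replace(tmp_path, backup_path)    # readers see the old or the new file, never half
    os.remove(snapshot)


def restore_db(backup_path, live_path):
    """On startup, copy the bucket backup to local disk. Returns True if restored."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

SQLite only ever opens files on local disk in this sketch; the bucket sees ordinary file copies and a rename, which avoids relying on file locking over a FUSE mount.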

Self-hosted (AlmaLinux / AWS, behind nginx)

Run it under nohup and reverse-proxy 127.0.0.1:7860. A real local disk supports WAL:

export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &

Point your nginx location / block at http://127.0.0.1:7860, with the usual proxy_set_header Upgrade/Connection lines for websockets.


How an evaluation round works

  1. Admin → Languages. Add the languages in scope (e.g. yo / Yoruba, ha / Hausa).
  2. Admin → Upload audio samples. Pick a language, set the Model / system name (e.g. F5-TTS, XTTS-v2, MMS-TTS, or human), drag in the wav/mp3/flac/ogg files, upload. Tick reference / human anchor for ground-truth human recordings (see below).
  3. Reviewers sign up, choose their languages, and rate. They see a blind "Sample N", an audio player, the 7 radios, and a comments box.
  4. Admin → Results. Pick a language, click Compute MOS, read the per-model and per-sample tables, then Export XLSX (sheets: Per Model, Per Sample, Raw Ratings).

Final MOS per language

The summary line reports two headline numbers per language:

  • System MOS (Overall criterion): the mean of the dedicated Overall Quality ratings. This is the conventional single MOS number to quote.
  • System MOS (mean of all criteria): the mean across all 7 criteria, useful when you want a composite that weights pronunciation/prosody equally with overall impression.

Both are broken down per model so you can compare systems within a language directly (e.g. F5-TTS vs XTTS-v2 for Yoruba).
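
The two headline numbers can be illustrated with a short sketch. It assumes each rating is a dictionary carrying the seven criterion columns plus model_name and is_reference (joined in from the samples table), and that every rating is weighted equally; the function name system_mos is hypothetical and the app's own aggregation may differ in detail.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = [
    "naturalness", "intelligibility", "pronunciation",
    "prosody", "fluency", "audio_quality", "overall",
]


def system_mos(ratings):
    """Per-model MOS from rating dicts, skipping reference/human-anchor samples."""
    by_model = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:
            continue  # anchors are reported separately, not in the system MOS
        by_model[r["model_name"]].append(r)
    result = {}
    for model, rows in by_model.items():
        result[model] = {
            # Headline 1: mean of the dedicated Overall Quality ratings
            "overall_mos": mean(r["overall"] for r in rows),
            # Headline 2: pooled mean over all 7 criteria
            "all_criteria_mos": mean(r[c] for r in rows for c in CRITERIA),
            "n_ratings": len(rows),
        }
    return result
```

Because every rating contributes all seven criteria, the pooled mean equals the average of the seven per-criterion means, so the composite weights each criterion equally.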


Data model (SQLite)

users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id)                  -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
        prosody, fluency, audio_quality, overall, comments, updated_at,
        UNIQUE(user_id, sample_id))

Persistent data lives under MOS_DATA_DIR: mos.db (the backup of the live database, which itself runs in MOS_LOCAL_DIR), audio/<language_id>/<file>, and exports/.
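
The UNIQUE(user_id, sample_id) constraint is what makes "one rating per reviewer per sample" an upsert. A minimal sketch with Python's sqlite3 module follows; the column types, the helper name upsert_rating, and the exact SQL are assumptions for illustration, and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

CRITERIA = (
    "naturalness", "intelligibility", "pronunciation",
    "prosody", "fluency", "audio_quality", "overall",
)

# ASSUMPTION: column types are illustrative; only the column names and the
# UNIQUE constraint come from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    naturalness INTEGER, intelligibility INTEGER, pronunciation INTEGER,
    prosody INTEGER, fluency INTEGER, audio_quality INTEGER, overall INTEGER,
    comments TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, sample_id)
);
"""


def upsert_rating(con, user_id, sample_id, scores, comments=""):
    """Insert a rating, or overwrite the reviewer's earlier rating of the same sample."""
    cols = ", ".join(CRITERIA)
    marks = ", ".join("?" for _ in CRITERIA)
    updates = ", ".join(f"{c} = excluded.{c}" for c in CRITERIA)
    con.execute(
        f"INSERT INTO ratings (user_id, sample_id, {cols}, comments) "
        f"VALUES (?, ?, {marks}, ?) "
        f"ON CONFLICT(user_id, sample_id) DO UPDATE SET {updates}, "
        f"comments = excluded.comments, updated_at = CURRENT_TIMESTAMP",
        (user_id, sample_id, *[scores[c] for c in CRITERIA], comments),
    )
    con.commit()
```

Calling upsert_rating twice for the same reviewer and sample leaves a single row holding the latest scores, which is what lets reviewers revisit and change a rating.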


Suggested additions (beyond what you asked for, already built in where noted)

These come from standard MOS / listening-test practice and matter for a defensible result:

  1. Reference / human anchors (built in). Upload a few real human recordings flagged as reference. They are mixed blindly into the reviewer's queue but excluded from the system MOS, and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or the whole batch) is mis-calibrated and you can discount them. This is the single most important safeguard for credible MOS.
  2. Blind model names (built in). Already enforced: reviewers never see which system produced a clip.
  3. Inter-rater spread (built in). The per-model/per-sample tables include the standard deviation of the Overall score, so you can spot samples reviewers disagree on.
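
The anchor check in item 1 can be done by hand from the raw export. The sketch below is one possible way to do it, assuming rating dictionaries with user_id, is_reference, and overall keys; the function name flag_miscalibrated is hypothetical, the 4.5 threshold is taken from the "~4.5–5.0" guidance above, and using the Overall criterion alone is an assumption.

```python
from collections import defaultdict
from statistics import mean


def flag_miscalibrated(ratings, threshold=4.5):
    """Return {user_id: anchor_mean} for reviewers who score human anchors too low."""
    anchor_scores = defaultdict(list)
    for r in ratings:
        if r["is_reference"]:  # only ground-truth human recordings count here
            anchor_scores[r["user_id"]].append(r["overall"])
    flagged = {}
    for user_id, scores in anchor_scores.items():
        anchor_mean = mean(scores)
        if anchor_mean < threshold:
            flagged[user_id] = anchor_mean
    return flagged
```

Reviewers who have not yet rated any anchor clip never appear in the result, so an empty dictionary means "no evidence of mis-calibration", not "everyone is calibrated".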

Worth considering for a v2 (not yet built; happy to add):

  1. Minimum ratings per sample before a sample counts as "final" (e.g. require ≥5 reviewers), and surface samples that still have too few ratings.
  2. Randomised, balanced presentation order per reviewer (Latin-square style) so position bias averages out, instead of always-ascending sample id.
  3. Listening-setup gate β€” a one-time confirmation that the reviewer is on headphones in a quiet room, stored with their profile.
  4. Native-speaker / experience metadata on reviewers, so you can filter MOS to native speakers only for the paper.
  5. Attention-trap clips (e.g. "rate this one 1") to catch click-through reviewers.
  6. Krippendorff's α / ICC for formal inter-rater reliability, reported per language.
  7. Pairwise / MUSHRA / CMOS modes if you later want preference tests rather than absolute MOS.
  8. Per-reviewer rate limiting / session timing to flag implausibly fast ratings.
  9. CI / standard error on each MOS (the raw export already lets you compute this; could be shown inline).
  10. DB backup-to-bucket (recommended if you go to production scale on a Space): a periodic snapshot of mos.db to a separate bucket path, with restore-on-startup, in addition to the per-change backup to MOS_DATA_DIR/mos.db, to remove the mid-write corruption risk noted above.
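
For item 9, a confidence interval can already be computed from the raw export. The sketch below uses a normal approximation (z = 1.96 for roughly 95%), which is an assumption on my part; with only a handful of reviewers per sample a t-distribution interval would be wider and more appropriate. The function name mos_with_ci is hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mos, (low, high)) using a normal-approximation confidence interval."""
    n = len(scores)
    mos = mean(scores)
    if n < 2:
        return mos, None  # a spread cannot be estimated from a single rating
    standard_error = stdev(scores) / math.sqrt(n)  # sample SD over sqrt(n)
    return mos, (mos - z * standard_error, mos + z * standard_error)
```

Reporting the interval next to each MOS makes it clear when two systems' scores overlap too much to claim a difference.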

Notes & limitations

  • SQLite journal mode. Default is DELETE (safe on bucket/FUSE mounts). Set MOS_JOURNAL_MODE=WAL only on a genuine local disk for better concurrency. SQLite is comfortable for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer is isolated in a handful of functions and easy to swap.
  • Audio serving. Files are served from disk via Gradio's file mechanism, enabled by launch(allowed_paths=[MOS_DATA_DIR]). Keep clips short (a few seconds) as is normal for MOS.
  • Deactivating a user (Admin → Users → active = no) blocks login but keeps their ratings.
  • Deleting a sample also deletes its ratings (cascade) and the audio file on disk.

Generated for Afolabi Abeeb, Plotweaver AI.