---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---

# 🎧 Plotweaver AI: TTS MOS Evaluation Platform

A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)** ratings of synthesised speech across **multiple languages**. Reviewers create accounts, listen to audio samples in the languages they are competent in, and rate each sample on **7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo, Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts; if the build complains,
> the build log lists valid versions.

---

## What it does

- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees and rates samples in those languages. Reviewers can update their own language set.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer profile, and the admin upload/results dropdowns, with no code change and no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown to reviewers**; they only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`: Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency · Audio Quality · Overall Quality, plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.

---

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Keep the default on bucket/FUSE mounts; set `WAL` only on a genuine local disk. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
| `PORT` | `7860` | Port the app binds to. |

---

## Quick start (local)

```bash
pip install -r requirements.txt
python app.py
```

Open http://localhost:7860.

**Create the first admin:** go to *Create account*, fill in a username/password, and enter the admin code in the *Admin code* field. The active code is printed in the console on startup (`plotweaver-admin` unless you override it). Override it with an environment variable:

```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
python app.py
```

Anyone who signs up **without** the code becomes a normal reviewer.
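
The environment variables described above could be resolved at startup roughly as follows. This is a minimal sketch, not the code in `app.py`: the `Settings` class and `load_settings` function are hypothetical names, and because this README only says that `MOS_SECRET` is "derived from `ADMIN_CODE`", the SHA-256 derivation shown is a placeholder assumption rather than the app's actual scheme.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    data_dir: str
    local_dir: str
    journal_mode: str
    admin_code: str
    secret: str
    port: int


def load_settings(env=None) -> Settings:
    """Resolve settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    admin_code = env.get("ADMIN_CODE", "plotweaver-admin")
    # ASSUMPTION: placeholder derivation; the real app may derive the
    # fallback secret from ADMIN_CODE in a different way.
    derived = hashlib.sha256(("mos-session:" + admin_code).encode()).hexdigest()
    return Settings(
        data_dir=env.get("MOS_DATA_DIR", "./data"),
        local_dir=env.get(
            "MOS_LOCAL_DIR", os.path.join(tempfile.gettempdir(), "mos_live")
        ),
        journal_mode=env.get("MOS_JOURNAL_MODE", "DELETE"),
        admin_code=admin_code,
        secret=env.get("MOS_SECRET", derived),
        port=int(env.get("PORT", "7860")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation. A fallback secret that depends on `ADMIN_CODE` changes whenever the admin code changes, which is one reason to set a stable `MOS_SECRET` in production.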
---

## Deploying on a Hugging Face Space

This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement for the older fixed `/data` storage tier).

1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility **Private**. ZeroGPU is *not* needed: there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`. Your audio clips are tiny, so this stays within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
   - Variable `MOS_DATA_DIR` = `/data`
   - Secret `ADMIN_CODE` = a strong value of your own
   - (Do **not** set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab; the active admin code is printed on startup. Open the app, *Create account*, enter the admin code, and you're the admin.

Notes:

- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to appear on later reads (causing "my ratings/files disappeared").
  Instead the live DB runs on fast local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`) after every change, then restored on startup. Audio files live on the bucket (static, write-once, no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page refresh keeps you signed in. Set a stable `MOS_SECRET` in production.

### Self-hosted (AlmaLinux / AWS, behind nginx)

Run it under `nohup` and reverse-proxy to `127.0.0.1:7860`. A real local disk supports WAL:

```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```

Point the `proxy_pass` in your nginx `location /` block at `http://127.0.0.1:7860`, with the usual `proxy_set_header` `Upgrade`/`Connection` lines for websockets.

---

## How an evaluation round works

1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name** (e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, and upload. Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).

### Final MOS per language

The summary line reports two headline numbers per language:

- **System MOS (Overall criterion)**: the mean of the dedicated *Overall Quality* ratings. This is the conventional single MOS number to quote.
- **System MOS (mean of all criteria)**: the mean across all 7 criteria, useful when you want a composite that weights pronunciation/prosody equally with overall impression.


Both are broken down per model so you can compare systems within a language directly (e.g. F5-TTS vs XTTS-v2 for Yoruba).

---

## Data model (SQLite)

```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id)   -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation, prosody, fluency, audio_quality, overall, comments, updated_at, UNIQUE(user_id, sample_id))
```

Everything persistent lives under `MOS_DATA_DIR`: the `mos.db` backup, `audio//`, and `exports/`. The live database itself runs from `MOS_LOCAL_DIR`.

---

## Suggested additions (beyond what you asked for, already built in where noted)

These come from standard MOS / listening-test practice and matter for a defensible result:

1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as *reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS, and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or the whole batch) is probably mis-calibrated and you can discount them. This is the single most important safeguard for credible MOS.
2. **Blind model names (built in).** Already enforced: reviewers never see which system produced a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard deviation of the Overall score, so you can spot samples reviewers disagree on.

Worth considering for a v2 (not yet built; happy to add):

4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers), and surface samples that have too few ratings.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias averages out, instead of always-ascending sample id.
6. **Listening-setup gate**: a one-time confirmation that the reviewer is on headphones in a quiet room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be shown inline).
13. **Periodic DB snapshots** (recommended if you go to production scale on a Space): in addition to the per-change backup to `MOS_DATA_DIR`, take a periodic snapshot of `mos.db` to a separate bucket path, with restore-on-startup, as an extra guard against a backup that is interrupted mid-write.

---

## Notes & limitations

- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set `MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer is isolated in a handful of functions and easy to swap.
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by `launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds), as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.

---

*Generated for Afolabi Abeeb, Plotweaver AI.*
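
For the standard error and confidence interval suggested in item 12 of the suggested additions, a calculation over the *Raw Ratings* export could look like the sketch below. It is illustrative only: `mos_summary` is a hypothetical helper, the input is assumed to be `(model_name, is_reference, overall)` tuples, the sample standard deviation is an assumption about how spread should be measured, and the 1.96 multiplier is a normal approximation that is too narrow for small reviewer counts (a t-quantile would be wider).

```python
import math
import statistics
from collections import defaultdict


def mos_summary(rows):
    """Summarise Overall scores per model, keeping reference anchors separate.

    rows: iterable of (model_name, is_reference, overall) tuples.
    Returns {(kind, model_name): stats}, where kind is "system" or "reference".
    """
    groups = defaultdict(list)
    for model, is_ref, overall in rows:
        kind = "reference" if is_ref else "system"
        groups[(kind, model)].append(overall)

    summary = {}
    for key, scores in groups.items():
        n = len(scores)
        mean = statistics.fmean(scores)
        # Sample standard deviation; undefined for a single rating, so use 0.0.
        sd = statistics.stdev(scores) if n > 1 else 0.0
        se = sd / math.sqrt(n)
        summary[key] = {
            "n": n,
            "mos": mean,
            "sd": sd,
            "se": se,
            # Normal-approximation 95% interval (assumption, see lead-in).
            "ci95": (mean - 1.96 * se, mean + 1.96 * se),
        }
    return summary


if __name__ == "__main__":
    demo = [
        ("F5-TTS", False, 4), ("F5-TTS", False, 5), ("F5-TTS", False, 3),
        ("human", True, 5), ("human", True, 5),
    ]
    for key, stats in mos_summary(demo).items():
        print(key, stats)
```

Keeping the reference anchors under their own key mirrors the way the dashboard reports the anchor MOS separately from the system MOS.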