	---
	title: TTS MOS Evaluation
	emoji: 🎧
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	sdk_version: "5.49.1"
	app_file: app.py
	pinned: false
	---

	# 🎧 Plotweaver AI — TTS MOS Evaluation Platform

	A multi-user [Gradio](https://gradio.app) application for collecting Mean Opinion Score (MOS)
	ratings of synthesised speech across multiple languages. Reviewers create accounts,
	listen to audio samples in the languages they are competent in, and rate each sample on
	7 criteria using a 1–5 scale. An admin uploads audio per language and reads aggregated MOS
	results per language and per model, with one-click export for your paper.

	Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
	Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

	> Note on the front-matter above: Hugging Face Spaces reads the YAML block at the top of this
	> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
	> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts — if it complains, the
	> build log lists valid versions.

	---

	## What it does

	- Accounts & roles. Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
	200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
	- Per-language gating. A reviewer is assigned a set of languages and only ever sees / rates
	samples in those languages. They can update their language set themselves.
	- Add languages any time. New languages appear instantly in the signup form, the reviewer
	profile, and the admin upload/results dropdowns — no code change, no redeploy.
	- Blind evaluation. The model / system name is stored with each sample but is **never shown
	to reviewers** — they only see "Sample N", so ratings aren't biased by branding.
	- 7-criterion MOS form matching `TTS_Evaluation_Criteria.docx`:
	Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
	Audio Quality · Overall Quality — plus a free-text comments box.
	- Resumable rating. Ratings are upserted (one per reviewer per sample); reviewers can revisit
	and change a rating, and a "Next unrated ▶" button walks them through the queue.
	- Results dashboard. Per-model MOS, per-sample MOS, reviewer/sample counts, standard
	deviation, and the separate reference-anchor MOS (see below). Export to `.xlsx`.
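
The password scheme named in the first bullet (PBKDF2-HMAC-SHA256, 200k iterations, per-user salt) can be sketched with the standard library alone. This is a minimal illustration, not the code in `app.py`: the function names `hash_password` and `verify_password`, the 16-byte salt length, and the hex encoding are all assumptions.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # iteration count stated in this README


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex). A fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)  # assumption: 16-byte salt; the README does not give a length
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, hash_hex: str, salt_hex: str) -> bool:
    """Re-derive the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, hash_hex)
```

Storing the salt next to the hash (the `password_hash` and `salt` columns in the data model) is what lets `verify_password` repeat the derivation at login.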

	---

	## Environment variables

	| Variable | Default | Purpose |
	|----------|---------|---------|
	| `MOS_DATA_DIR` | `./data` | Persistent location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
	| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast local disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
	| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Keep the default on bucket/FUSE mounts; set `WAL` only on a genuine local disk. |
	| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. Override this. |
	| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
	| `PORT` | `7860` | Port the app binds to. |
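
As a rough sketch of how these variables and their defaults could be read, assuming a hypothetical `load_settings` helper (not taken from `app.py`): the SHA-256 derivation of the fallback secret is a stand-in, since this README only says the default is "derived from `ADMIN_CODE`". `MOS_LOCAL_DIR` is left out because its default path is not spelled out here, and `MOS_JOURNAL_MODE` uses the `DELETE` default described under Notes & limitations.

```python
import hashlib
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    data_dir: str
    journal_mode: str
    admin_code: str
    secret: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from an environment mapping, applying the documented defaults."""
    admin_code = env.get("ADMIN_CODE", "plotweaver-admin")
    # Assumption: SHA-256 of ADMIN_CODE stands in for the unspecified derivation.
    secret = env.get("MOS_SECRET") or hashlib.sha256(admin_code.encode("utf-8")).hexdigest()
    return Settings(
        data_dir=env.get("MOS_DATA_DIR", "./data"),
        journal_mode=env.get("MOS_JOURNAL_MODE", "DELETE"),
        admin_code=admin_code,
        secret=secret,
        port=int(env.get("PORT", "7860")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a test, for example `load_settings({}).port`.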

	---

	## Quick start (local)

	```bash
	pip install -r requirements.txt
	python app.py
	```

	Open http://localhost:7860.

	Create the first admin: go to Create account, fill in a username/password, and enter the
	admin code in the Admin code field. The default code is printed in the console on startup
	(`plotweaver-admin` unless you override it). Override it with an environment variable:

	```bash
	export ADMIN_CODE="something-only-you-know"
	export MOS_JOURNAL_MODE=WAL # optional: a real local disk supports WAL
	python app.py
	```

	Anyone who signs up without the code becomes a normal reviewer.

	---

	## Deploying on a Hugging Face Space

	This app writes its database and uploaded audio to disk, so it needs persistent storage. Space
	filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
	tier sleeps. The current way to persist data on a Space is a Storage Bucket (HF's replacement
	for the older fixed `/data` storage tier).

	1. Create the Space. New → Space → SDK Gradio, hardware CPU basic (free), visibility
	Private. ZeroGPU is not needed — there's no inference here, only file serving and SQLite.
	2. Mount a Storage Bucket. Space → Settings → Storage Buckets → Mount a bucket. Create a
	private bucket (e.g. `mos-eval-data`), mode read-write, and set the mount path to `/data`.
	Your audio clips are tiny, so this stays within the free private-storage allowance.
	3. Set variables / secrets. Space → Settings → Variables and secrets:
	   - Variable `MOS_DATA_DIR` = `/data`
	   - Secret `ADMIN_CODE` = `<your secret>`
	   - (Do not set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
	4. Upload `app.py`, `requirements.txt`, `README.md` (this file, with its YAML front-matter) via
	the Files tab or `git push`.
	5. First boot. Watch the Logs tab — the active admin code is printed on startup. Open the
	app, Create account, enter the admin code, and you're the admin.

	Notes:
	- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
	audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
	- Database on buckets: the live SQLite database does not run directly on the bucket, because
	object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
	appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
	local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
	after every change, then restored on startup. Audio files live on the bucket (static, write-once,
	no locking issues). This survives Space restarts as long as the bucket stays mounted.
	- Staying logged in: a signed token is stored in the browser (`gr.BrowserState`) so a page
	refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
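
To illustrate why a stable `MOS_SECRET` matters, here is a minimal sketch of an HMAC-signed session token. It is an assumption about the general technique, not the token format the app actually stores in `gr.BrowserState`; the names `sign_token` and `verify_token`, the JSON payload, and the 7-day lifetime are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _signature(body: str, secret: str) -> str:
    return hmac.new(secret.encode("utf-8"), body.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_token(username: str, secret: str, ttl_seconds: int = 7 * 24 * 3600,
               now: Optional[float] = None) -> str:
    """Return '<base64 payload>.<hex signature>' for the given user."""
    issued = time.time() if now is None else now
    payload = json.dumps({"u": username, "exp": int(issued) + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")
    return f"{body}.{_signature(body, secret)}"


def verify_token(token: str, secret: str, now: Optional[float] = None) -> Optional[str]:
    """Return the username if the signature is valid and the token is unexpired, else None."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so not one of our tokens
    expected = _signature(body, secret)
    # Compare as bytes so a tampered, non-ASCII signature cannot raise.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    data = json.loads(base64.urlsafe_b64decode(body.encode("ascii")))
    current = time.time() if now is None else now
    return data["u"] if current < data["exp"] else None
```

With a scheme like this, any change to the secret makes every previously issued token fail verification, which signs all reviewers out. That is the practical reason to pin `MOS_SECRET` rather than rely on a value derived from `ADMIN_CODE`.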

	### Self-hosted (AlmaLinux / AWS, behind nginx)

	Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:

	```bash
	export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
	nohup python app.py > mos.log 2>&1 &
	```

	In your nginx `location /` block, set `proxy_pass http://127.0.0.1:7860;` and add
	`proxy_http_version 1.1;` plus the usual `proxy_set_header Upgrade` / `Connection` lines for WebSockets.

	---

	## How an evaluation round works

	1. Admin → Languages. Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
	2. Admin → Upload audio samples. Pick a language, set the Model / system name
	(e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
	Tick reference / human anchor for ground-truth human recordings (see below).
	3. Reviewers sign up, choose their languages, and rate. They see a blind "Sample N", an audio
	player, the 7 radios, and a comments box.
	4. Admin → Results. Pick a language, click Compute MOS, read the per-model and per-sample
	tables, then Export XLSX (sheets: Per Model, Per Sample, Raw Ratings).

	### Final MOS per language

	The summary line reports two headline numbers per language:

	- System MOS (Overall criterion) — the mean of the dedicated Overall Quality ratings. This is
	the conventional single MOS number to quote.
	- System MOS (mean of all criteria) — the mean across all 7 criteria, useful when you want a
	composite that weights pronunciation/prosody equally with overall impression.

	Both are broken down per model so you can compare systems within a language directly
	(e.g. F5-TTS vs XTTS-v2 for Yoruba).
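
The two headline numbers can be reproduced from the Raw Ratings export. The sketch below assumes each rating is a dict keyed by the column names in the data model plus `model_name` and `is_reference` joined in from `samples`; the function name `summarise` and the output keys are hypothetical, and the app's own aggregation may differ in detail.

```python
from collections import defaultdict
from statistics import mean, stdev

CRITERIA = ("naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall")


def summarise(ratings):
    """Group ratings by (model, is_reference) and compute both MOS variants."""
    groups = defaultdict(list)
    for r in ratings:
        groups[(r["model_name"], bool(r["is_reference"]))].append(r)

    rows = []
    for (model, is_ref), rs in sorted(groups.items()):
        overall = [r["overall"] for r in rs]
        # Composite: mean of the 7 criteria per rating, then mean over ratings.
        composite = [mean(r[c] for c in CRITERIA) for r in rs]
        rows.append({
            "model": model,
            "is_reference": is_ref,  # reference anchors stay out of the system MOS
            "n_ratings": len(rs),
            "mos_overall": round(mean(overall), 2),
            "mos_all_criteria": round(mean(composite), 2),
            "sd_overall": round(stdev(overall), 2) if len(overall) > 1 else 0.0,
        })
    return rows
```

Rows with `is_reference` set correspond to the separately reported anchor MOS; filtering them out leaves the per-model system MOS.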

	---

	## Data model (SQLite)

	```
	users(id, username, email, password_hash, salt, role, is_active, created_at)
	languages(id, code, name, created_at)
	user_languages(user_id, language_id) -- which languages a reviewer may rate
	samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
	ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
	prosody, fluency, audio_quality, overall, comments, updated_at,
	UNIQUE(user_id, sample_id))
	```
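
The `UNIQUE(user_id, sample_id)` constraint is what makes resumable rating possible: saving a rating twice updates the existing row instead of adding a second one. A minimal sketch of that upsert follows, assuming SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`. The column types and the `save_rating` helper are illustrative guesses, not the app's actual DDL.

```python
import sqlite3

CRITERIA = ("naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall")

# Assumption: column types are not given in the README; INTEGER/TEXT are guesses.
SCHEMA = f"""
CREATE TABLE IF NOT EXISTS ratings (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    {", ".join(c + " INTEGER" for c in CRITERIA)},
    comments TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, sample_id)
);
"""

COLUMNS = ("user_id", "sample_id") + CRITERIA + ("comments",)
UPSERT = (
    f"INSERT INTO ratings ({', '.join(COLUMNS)}) "
    f"VALUES ({', '.join(':' + c for c in COLUMNS)}) "
    "ON CONFLICT(user_id, sample_id) DO UPDATE SET "
    + ", ".join(f"{c} = excluded.{c}" for c in CRITERIA + ("comments",))
    + ", updated_at = CURRENT_TIMESTAMP"
)


def save_rating(conn, user_id, sample_id, scores, comments=""):
    """Insert a rating, or overwrite the reviewer's earlier rating of the same sample."""
    params = {"user_id": user_id, "sample_id": sample_id, "comments": comments, **scores}
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, params)
```

Calling `save_rating` again for the same reviewer and sample changes the stored scores in place, so the results tables never count a reviewer twice for one clip.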

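	Persistent data lives under `MOS_DATA_DIR`: the `mos.db` backup, `audio/<language_id>/<file>`, and
	`exports/`. The live database itself runs from `MOS_LOCAL_DIR` and is restored from that backup on startup.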
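
The deployment notes describe the live database running on local disk, with an atomic backup to `MOS_DATA_DIR/mos.db` after every change and a restore on startup. One way that pattern could look is sketched below, using SQLite's online backup API and a rename into place. The function names are hypothetical, and whether `os.replace` is truly atomic on a given bucket/FUSE mount is an assumption to verify for your storage.

```python
import os
import shutil
import sqlite3
import tempfile


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB to a temp file, then swap it into place in one rename."""
    target_dir = os.path.dirname(backup_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(tmp_path)
    try:
        src.backup(dst)  # consistent copy even if the live DB is in use
    finally:
        dst.close()
        src.close()
    os.replace(tmp_path, backup_path)  # readers see the old file or the new one


def restore_db(backup_path: str, live_path: str) -> bool:
    """Copy the backup onto local disk at startup. Returns False if no backup exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

Writing to a temporary file first means an interrupted backup leaves the previous `mos.db` untouched rather than half-written.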

	---

	## Suggested additions (built in where noted)

	These come from standard MOS / listening-test practice and matter for a defensible result:

	1. Reference / human anchors (built in). Upload a few real human recordings flagged as
	reference. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
	and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
	the whole batch) is probably mis-calibrated and you can consider discounting those ratings. This
	is the single most important safeguard for credible MOS.
	2. Blind model names (built in). Already enforced — reviewers never see which system produced
	a clip.
	3. Inter-rater spread (built in). The per-model/per-sample tables include the standard
	deviation of the Overall score, so you can spot samples reviewers disagree on.

	Worth considering for a v2 (not yet built):

	4. Minimum ratings per sample before a sample counts as "final" (e.g. require ≥5 reviewers),
	and surface samples that are under-rated.
	5. Randomised, balanced presentation order per reviewer (Latin-square style) so position bias
	averages out, instead of always-ascending sample id.
	6. Listening-setup gate — a one-time confirmation that the reviewer is on headphones in a quiet
	room, stored with their profile.
	7. Native-speaker / experience metadata on reviewers, so you can filter MOS to native speakers
	only for the paper.
	8. Attention-trap clips (e.g. "rate this one 1") to catch click-through reviewers.
	9. Krippendorff's α / ICC for formal inter-rater reliability, reported per language.
	10. Pairwise / MUSHRA / CMOS modes if you later want preference tests rather than absolute MOS.
	11. Per-reviewer rate limiting / session timing to flag implausibly fast ratings.
	12. CI / standard error on each MOS (the raw export already lets you compute this; could be
	shown inline).
	13. Versioned DB snapshots (recommended if you go to production scale on a Space): a periodic
	snapshot of `mos.db` to a separate bucket path, in addition to the built-in backup after every
	change and restore on startup, to guard against a backup that is interrupted mid-write.
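
Item 12 can already be computed offline from the Raw Ratings sheet. A minimal sketch, assuming a normal approximation (z = 1.96 for roughly 95% coverage); with only a handful of ratings per group, a t-distribution quantile would be the more careful choice. The function name `mos_with_ci` is hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for a list of 1-5 ratings."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings to estimate a standard error")
    m = mean(scores)
    se = stdev(scores) / math.sqrt(n)  # sample standard deviation over sqrt(n)
    return m, se, (m - z * se, m + z * se)
```

For example, `mos_with_ci([4, 5, 4, 3, 4])` gives a mean of 4.0 with a standard error of about 0.32, so the interval runs from roughly 3.38 to 4.62.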

	---

	## Notes & limitations

	- SQLite journal mode. Default is `DELETE` (safe on bucket/FUSE mounts). Set
	`MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
	for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer
	is isolated in a handful of functions and easy to swap.
	- Audio serving. Files are served from disk via Gradio's file mechanism, enabled by
	`launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
	- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
	- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.

	---

	Generated for Afolabi Abeeb, Plotweaver AI.