---
title: TTS MOS Evaluation
emoji: 🎧
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.49.1"
app_file: app.py
pinned: false
---

# 🎧 Plotweaver AI: TTS MOS Evaluation Platform

A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
listen to audio samples in the languages they are competent in, and rate each sample on
**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
per language and per model, with one-click export for your paper.

Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).

> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
> Set `sdk_version` to a current Gradio 5.x release that the Space build accepts; if the build
> rejects it, the build log lists valid versions.

---

## What it does

- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
  200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
  samples in those languages. They can update their language set themselves.
- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
  profile, and the admin upload/results dropdowns, with no code change and no redeploy.
- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
  to reviewers**; they only see "Sample N", so ratings aren't biased by branding.
- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
  Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
  Audio Quality · Overall Quality, plus a free-text comments box.
- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
  and change a rating, and a "Next unrated ▶" button walks them through the queue.
- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
  deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
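
The password scheme named in the first bullet can be sketched with the standard library alone. This is a minimal illustration, assuming a 16-byte random salt and hex-encoded storage; the function names `hash_password` and `verify_password` are hypothetical and may not match what `app.py` actually uses.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # iteration count quoted in the feature list


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (hash_hex, salt_hex) using PBKDF2-HMAC-SHA256 with a per-user salt."""
    if salt is None:
        salt = os.urandom(16)  # assumption: 16 random bytes per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex(), salt.hex()


def verify_password(password: str, hash_hex: str, salt_hex: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate, _ = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, hash_hex)
```

Storing the salt next to the hash (as the `users` table does with `password_hash` and `salt`) lets each login recompute the digest without any shared secret.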

---

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `MOS_DATA_DIR` | `./data` | **Persistent** location for audio files, exports, and the DB backup. Point at your bucket mount (e.g. `/data`). |
| `MOS_LOCAL_DIR` | system temp `/…/mos_live` | Fast **local** disk where the live SQLite DB runs. It is backed up to `MOS_DATA_DIR` after every change and restored on startup. Usually leave as default. |
| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
| `MOS_SECRET` | derived from `ADMIN_CODE` | Secret used to sign browser session tokens (keeps a refreshed page logged in). Set a stable value in production. |
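| `PORT` | `7860` | Port the app binds to. |
| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Set to `WAL` only when the database runs on a genuine local disk. |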
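
As a rough illustration of how these variables might be consumed, the sketch below reads them with the documented defaults. The `Settings` class and `load_settings` function are hypothetical names, not taken from `app.py`; `MOS_LOCAL_DIR` and `MOS_SECRET` are left out because their defaults are derived rather than fixed strings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    data_dir: str
    admin_code: str
    journal_mode: str
    port: int


def load_settings(env=os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        data_dir=env.get("MOS_DATA_DIR", "./data"),
        admin_code=env.get("ADMIN_CODE", "plotweaver-admin"),
        journal_mode=env.get("MOS_JOURNAL_MODE", "DELETE"),
        port=int(env.get("PORT", "7860")),  # environment values are always strings
    )
```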

---

## Quick start (local)

```bash
pip install -r requirements.txt
python app.py
```

Open http://localhost:7860.

**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
admin code in the *Admin code* field. The default code is printed in the console on startup
(`plotweaver-admin` unless you override it). Override it with an environment variable:

```bash
export ADMIN_CODE="something-only-you-know"
export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
python app.py
```

Anyone who signs up **without** the code becomes a normal reviewer.

---

## Deploying on a Hugging Face Space

This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
for the older fixed `/data` storage tier).

1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
   **Private**. ZeroGPU is *not* needed: there's no inference here, only file serving and SQLite.
2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
   private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
   Your audio clips are tiny, so this stays within the free private-storage allowance.
3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
   - Variable `MOS_DATA_DIR` = `/data`
   - Secret `ADMIN_CODE` = `<your secret>`
   - (Do **not** set `MOS_JOURNAL_MODE`; the bucket-safe `DELETE` default is what you want here.)
4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
   the Files tab or `git push`.
5. **First boot.** Watch the **Logs** tab; the active admin code is printed on startup. Open the
   app, *Create account*, enter the admin code, and you're the admin.

Notes:
- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
  audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
- **Database on buckets:** the live SQLite database does **not** run directly on the bucket, because
  object-store/FUSE mounts don't provide reliable file locking and writes can silently fail to
  appear on later reads (causing "my ratings/files disappeared"). Instead the live DB runs on fast
  local disk (`MOS_LOCAL_DIR`) and is atomically backed up to the bucket (`MOS_DATA_DIR/mos.db`)
  after every change, then restored on startup. Audio files live on the bucket (static, write-once,
  no locking issues). This survives Space restarts as long as the bucket stays mounted.
- **Staying logged in:** a signed token is stored in the browser (`gr.BrowserState`) so a page
  refresh keeps you signed in. Set a stable `MOS_SECRET` in production.
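
The local-disk-plus-backup arrangement described in the database note can be sketched as follows. This is an illustration under assumptions, not the app's actual code: `backup_db` and `restore_db` are hypothetical names, the snapshot is taken with SQLite's online backup API, and the final `os.replace` is atomic on ordinary POSIX filesystems (its behaviour on a bucket mount is not verified here).

```python
import os
import shutil
import sqlite3


def backup_db(live_path: str, backup_path: str) -> None:
    """Snapshot the live DB locally, then publish it to backup_path in one rename."""
    snapshot = live_path + ".snapshot"
    if os.path.exists(snapshot):
        os.remove(snapshot)
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(snapshot)
    try:
        src.backup(dst)  # consistent copy even if the live DB is in use
    finally:
        dst.close()
        src.close()
    os.makedirs(os.path.dirname(backup_path) or ".", exist_ok=True)
    tmp = backup_path + ".tmp"
    shutil.copyfile(snapshot, tmp)   # plain file copy: no SQLite locking on the bucket
    os.replace(tmp, backup_path)     # readers see either the old or the new file
    os.remove(snapshot)


def restore_db(backup_path: str, live_path: str) -> bool:
    """On startup, seed the live DB from the backup if one exists."""
    if not os.path.exists(backup_path):
        return False
    os.makedirs(os.path.dirname(live_path) or ".", exist_ok=True)
    shutil.copyfile(backup_path, live_path)
    return True
```

Only plain file copies ever touch the backup location, which is the point of the design: SQLite's locking is exercised solely on the local disk.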

### Self-hosted (AlmaLinux / AWS, behind nginx)

Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:

```bash
export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
nohup python app.py > mos.log 2>&1 &
```

In your nginx `location /` block, set `proxy_pass http://127.0.0.1:7860;` and add the usual
`proxy_set_header Upgrade` / `Connection` lines for websockets.

---

## How an evaluation round works

1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
   (e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
   Tick **reference / human anchor** for ground-truth human recordings (see below).
3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
   player, the 7 radios, and a comments box.
4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
   tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).

### Final MOS per language

The summary line reports two headline numbers per language:

- **System MOS (Overall criterion):** the mean of the dedicated *Overall Quality* ratings. This is
  the conventional single MOS number to quote.
- **System MOS (mean of all criteria):** the mean across all 7 criteria, useful when you want a
  composite that weights pronunciation/prosody equally with overall impression.

Both are broken down per model so you can compare systems within a language directly
(e.g. F5-TTS vs XTTS-v2 for Yoruba).
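
The two headline numbers can be sketched as below. This assumes ratings are pooled across reviewers and samples before averaging, and that reference clips are skipped; the app may aggregate differently (for example per sample first). The `system_mos` function is a hypothetical name, and the dictionary keys follow the column names in the data model.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ["naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall"]


def system_mos(ratings):
    """Return {model: (overall_mos, all_criteria_mos)}, skipping reference clips."""
    by_model = defaultdict(list)
    for r in ratings:
        if not r["is_reference"]:
            by_model[r["model_name"]].append(r)
    result = {}
    for model, rows in by_model.items():
        overall = mean(r["overall"] for r in rows)               # Overall criterion only
        composite = mean(r[c] for r in rows for c in CRITERIA)   # all 7 criteria pooled
        result[model] = (round(overall, 2), round(composite, 2))
    return result
```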

---

## Data model (SQLite)

```
users(id, username, email, password_hash, salt, role, is_active, created_at)
languages(id, code, name, created_at)
user_languages(user_id, language_id)                  -- which languages a reviewer may rate
samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
        prosody, fluency, audio_quality, overall, comments, updated_at,
        UNIQUE(user_id, sample_id))
```

Persistent data lives under `MOS_DATA_DIR`: `mos.db` (the backup copy; the live database runs
under `MOS_LOCAL_DIR`), `audio/<language_id>/<file>`, and `exports/`.
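
The `UNIQUE(user_id, sample_id)` constraint is what makes the one-rating-per-reviewer-per-sample upsert possible. A minimal sketch, assuming SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`) and a cut-down table with a single criterion; `save_rating` is a hypothetical helper, not necessarily the app's own function:

```python
import sqlite3

# Cut-down version of the ratings table: only one criterion is kept for brevity.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    sample_id INTEGER NOT NULL,
    overall INTEGER NOT NULL,
    comments TEXT,
    updated_at TEXT,
    UNIQUE(user_id, sample_id)
)
"""

UPSERT = """
INSERT INTO ratings (user_id, sample_id, overall, comments, updated_at)
VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)
ON CONFLICT(user_id, sample_id) DO UPDATE SET
    overall = excluded.overall,
    comments = excluded.comments,
    updated_at = excluded.updated_at
"""


def save_rating(con, user_id, sample_id, overall, comments=""):
    """Insert a rating, or overwrite the reviewer's earlier rating for the same sample."""
    with con:  # commits on success, rolls back on error
        con.execute(UPSERT, (user_id, sample_id, overall, comments))
```

Calling `save_rating` twice for the same reviewer and sample leaves a single row holding the later values.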

---

## Suggested additions (beyond what you asked for, already built in where noted)

These come from standard MOS / listening-test practice and matter for a defensible result:

1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
   *reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
   and their mean is reported separately. If your anchors don't score ~4.5–5.0, that reviewer (or
   the whole batch) is mis-calibrated and you can discount them. This is the single most important
   safeguard for credible MOS.
2. **Blind model names (built in).** Already enforced: reviewers never see which system produced
   a clip.
3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
   deviation of the Overall score, so you can spot samples reviewers disagree on.
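
The anchor check in item 1 can be run on the raw export. The sketch below is one possible reading, assuming a reviewer is flagged when their mean Overall score on reference clips falls below 4.5; the function name and the exact cut-off are illustrative choices, not something the app enforces.

```python
from collections import defaultdict
from statistics import mean


def miscalibrated_reviewers(anchor_ratings, threshold=4.5):
    """Return reviewers whose mean Overall score on reference clips is below threshold.

    anchor_ratings is an iterable of (user_id, overall) pairs for reference samples only.
    """
    scores = defaultdict(list)
    for user_id, overall in anchor_ratings:
        scores[user_id].append(overall)
    return sorted(u for u, vals in scores.items() if mean(vals) < threshold)
```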

Worth considering for a v2 (not yet built, happy to add):

4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
   and surface samples that have too few ratings.
5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
   averages out, instead of always-ascending sample id.
6. **Listening-setup gate:** a one-time confirmation that the reviewer is on headphones in a quiet
   room, stored with their profile.
7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
   only for the paper.
8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
    shown inline).
13. **Periodic DB snapshots to a separate bucket path** (recommended if you go to production
    scale on a Space): a periodic snapshot of `mos.db`, kept apart from the per-change backup
    described in the deployment notes, with restore-on-startup, as a further guard against
    mid-write corruption.
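
Item 12 can already be computed from the *Raw Ratings* sheet. A minimal sketch, assuming a normal approximation with z = 1.96 for a 95% interval; with only a handful of ratings per sample, a t-distribution critical value would be the more careful choice. `mos_with_ci` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mos, standard_error, (low, high)) using a normal approximation."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    m = mean(scores)
    se = stdev(scores) / math.sqrt(n)  # sample standard deviation over sqrt(n)
    return m, se, (m - z * se, m + z * se)
```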

---

## Notes & limitations

- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
  `MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
  for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer
  is isolated in a handful of functions and easy to swap.
- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
  `launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
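
For reference, the journal-mode switch described in the first note might look like the sketch below. The `connect` helper and the fallback to `DELETE` for unrecognised values are assumptions. The `foreign_keys` pragma is included because SQLite only honours `ON DELETE CASCADE` when it is switched on per connection; whether the app relies on that mechanism or deletes ratings manually is not confirmed here.

```python
import os
import sqlite3

ALLOWED_MODES = {"DELETE", "WAL"}


def connect(db_path: str, env=os.environ) -> sqlite3.Connection:
    """Open the database with the journal mode selected by MOS_JOURNAL_MODE."""
    mode = env.get("MOS_JOURNAL_MODE", "DELETE").upper()
    if mode not in ALLOWED_MODES:
        mode = "DELETE"  # assumption: fall back to the bucket-safe default
    con = sqlite3.connect(db_path)
    con.execute(f"PRAGMA journal_mode={mode}")
    con.execute("PRAGMA foreign_keys=ON")  # needed for cascade deletes, if used
    return con
```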

---

*Generated for Afolabi Abeeb, Plotweaver AI.*