Space: PlotweaverModel/MOS_Evaluation

PlotweaverModel committed 9 days ago

Commit f601a65 (verified) · 1 parent: 4c27592

Upload 3 files


Files changed (3)

README.md +211 -6
app.py +783 -0
requirements.txt +3 -0

README.md CHANGED

@@ -1,13 +1,218 @@
 ---
-title: MOS Evaluation
-emoji: 🔥
-colorFrom: gray
 colorTo: indigo
 sdk: gradio
-sdk_version: 6.19.0
-python_version: '3.13'
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: TTS MOS Evaluation
+emoji: 🎧
+colorFrom: blue
 colorTo: indigo
 sdk: gradio
+sdk_version: "5.49.1"
 app_file: app.py
 pinned: false
 ---
+# 🎧 Plotweaver AI — TTS MOS Evaluation Platform
+A multi-user [Gradio](https://gradio.app) application for collecting **Mean Opinion Score (MOS)**
+ratings of synthesised speech across **multiple languages**. Reviewers create accounts,
+listen to audio samples in the languages they are competent in, and rate each sample on
+**7 criteria (1–5)**. An admin uploads audio per language and reads off aggregated MOS results
+per language and per model, with one-click export for your paper.
+Built for the Plotweaver AI African-language TTS validation workflow (Yoruba, Hausa, Igbo,
+Nigerian English, Akan, Swahili, Nigerian Pidgin, … add more any time).
+> **Note on the front-matter above:** Hugging Face Spaces reads the YAML block at the top of this
+> file to configure the Space (SDK, app file, etc.). Keep it as the very first thing in the file.
+> Set `sdk_version` to whatever current Gradio 5.x the Space build accepts — if it complains, the
+> build log lists valid versions.
+---
+## What it does
+- **Accounts & roles.** Reviewers self-register; passwords are hashed (PBKDF2-HMAC-SHA256,
+  200k iterations, per-user salt). Two roles: `reviewer` and `admin`.
+- **Per-language gating.** A reviewer is assigned a set of languages and only ever sees / rates
+  samples in those languages. They can update their language set themselves.
+- **Add languages any time.** New languages appear instantly in the signup form, the reviewer
+  profile, and the admin upload/results dropdowns — no code change, no redeploy.
+- **Blind evaluation.** The model / system name is stored with each sample but is **never shown
+  to reviewers** — they only see "Sample N", so ratings aren't biased by branding.
+- **7-criterion MOS form** matching `TTS_Evaluation_Criteria.docx`:
+  Naturalness · Intelligibility · Pronunciation Accuracy · Prosody & Expressiveness · Fluency ·
+  Audio Quality · Overall Quality — plus a free-text comments box.
+- **Resumable rating.** Ratings are upserted (one per reviewer per sample); reviewers can revisit
+  and change a rating, and a "Next unrated ▶" button walks them through the queue.
+- **Results dashboard.** Per-model MOS, per-sample MOS, reviewer/sample counts, standard
+  deviation, and the separate **reference-anchor MOS** (see below). Export to `.xlsx`.
+---
+## Environment variables
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `MOS_DATA_DIR` | `./data` | Where the SQLite DB, audio, and exports live. Point at your persistent mount (e.g. `/data` on a Space bucket). |
+| `ADMIN_CODE` | `plotweaver-admin` | Entered on the signup form to create an admin account. **Override this.** |
+| `MOS_JOURNAL_MODE` | `DELETE` | SQLite journal mode. Leave as `DELETE` on a bucket/network mount; set `WAL` only on a real local disk for extra concurrency. |
+| `PORT` | `7860` | Port the app binds to. |
+---
+## Quick start (local)
+```bash
+pip install -r requirements.txt
+python app.py
+```
+Open http://localhost:7860.
+**Create the first admin:** go to *Create account*, fill in a username/password, and enter the
+admin code in the *Admin code* field. The active code is printed in the console on startup
+(`plotweaver-admin` unless you override it). Override it with an environment variable:
+```bash
+export ADMIN_CODE="something-only-you-know"
+export MOS_JOURNAL_MODE=WAL   # optional: a real local disk supports WAL
+python app.py
+```
+Anyone who signs up **without** the code becomes a normal reviewer.
+---
+## Deploying on a Hugging Face Space
+This app writes its database and uploaded audio to disk, so it needs **persistent storage**. Space
+filesystems are otherwise ephemeral and get wiped on rebuilds, restarts, and after the free CPU
+tier sleeps. The current way to persist data on a Space is a **Storage Bucket** (HF's replacement
+for the older fixed `/data` storage tier).
+1. **Create the Space.** New → Space → SDK **Gradio**, hardware **CPU basic (free)**, visibility
+   **Private**. ZeroGPU is *not* needed — there's no inference here, only file serving and SQLite.
+2. **Mount a Storage Bucket.** Space → Settings → **Storage Buckets** → *Mount a bucket*. Create a
+   private bucket (e.g. `mos-eval-data`), mode read-write, and set the **mount path** to `/data`.
+   Your audio clips are tiny, so this stays within the free private-storage allowance.
+3. **Set variables / secrets.** Space → Settings → *Variables and secrets*:
+   - Variable `MOS_DATA_DIR` = `/data`
+   - Secret `ADMIN_CODE` = `<your secret>`
+   - (Do **not** set `MOS_JOURNAL_MODE` — the bucket-safe `DELETE` default is what you want here.)
+4. **Upload `app.py`, `requirements.txt`, `README.md`** (this file, with its YAML front-matter) via
+   the Files tab or `git push`.
+5. **First boot.** Watch the **Logs** tab — the active admin code is printed on startup. Open the
+   app, *Create account*, enter the admin code, and you're the admin.
+Notes:
+- The app already calls `launch(allowed_paths=[MOS_DATA_DIR])`, so Gradio is permitted to serve the
+  audio files stored on the bucket. Without this, the audio player can't load clips from `/data`.
+- Buckets are object storage mounted as a filesystem (FUSE). SQLite's **WAL** mode relies on shared
+  memory that doesn't work on such mounts, which is why the default journal mode is `DELETE`. At a
+  few dozen reviewers this is comfortable; the only risk is a small chance of DB corruption if the
+  Space is killed mid-write. For a bulletproof setup, keep the DB on local disk and sync periodic
+  backups to the bucket (not built in — easy to add if you need it).
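
The local-disk-plus-backup setup mentioned in the last note is not part of this commit. A minimal sketch of how such a snapshot could be taken, assuming the standard-library `sqlite3.Connection.backup` API is acceptable here; the helper name `snapshot_db` and both paths are hypothetical:

```python
import os
import sqlite3


def snapshot_db(live_path, backup_path):
    """Copy a live SQLite database to backup_path as a consistent snapshot.

    Hypothetical helper, not part of app.py. The copy is written to a
    temporary file first and then renamed, so an interrupted backup does
    not leave a half-written file under the final name.
    """
    tmp_path = backup_path + ".tmp"
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(tmp_path)
    try:
        # SQLite's online backup API copies the database page by page and
        # stays consistent even if other connections write during the copy.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    os.replace(tmp_path, backup_path)
    return backup_path
```

It might be called periodically as `snapshot_db("/tmp/mos/mos.db", "/data/backups/mos.db")`, with the live database on local disk and only the snapshot on the bucket. Whether the final rename is atomic on a bucket mount is an assumption worth checking before relying on it.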
+### Self-hosted (AlmaLinux / AWS, behind nginx)
+Run it under `nohup` and reverse-proxy `127.0.0.1:7860`. A real local disk supports WAL:
+```bash
+export MOS_DATA_DIR=/srv/mos/data ADMIN_CODE=... MOS_JOURNAL_MODE=WAL
+nohup python app.py > mos.log 2>&1 &
+```
+Point your nginx `location /` block at `http://127.0.0.1:7860;` with the usual
+`proxy_set_header Upgrade/Connection` lines for websockets.
+---
+## How an evaluation round works
+1. **Admin → Languages.** Add the languages in scope (e.g. `yo / Yoruba`, `ha / Hausa`).
+2. **Admin → Upload audio samples.** Pick a language, set the **Model / system name**
+   (e.g. `F5-TTS`, `XTTS-v2`, `MMS-TTS`, or `human`), drag in the wav/mp3/flac/ogg files, upload.
+   Tick **reference / human anchor** for ground-truth human recordings (see below).
+3. **Reviewers** sign up, choose their languages, and rate. They see a blind "Sample N", an audio
+   player, the 7 radios, and a comments box.
+4. **Admin → Results.** Pick a language, click **Compute MOS**, read the per-model and per-sample
+   tables, then **Export XLSX** (sheets: *Per Model*, *Per Sample*, *Raw Ratings*).
+### Final MOS per language
+The summary line reports two headline numbers per language:
+- **System MOS (Overall criterion)** — the mean of the dedicated *Overall Quality* ratings. This is
+  the conventional single MOS number to quote.
+- **System MOS (mean of all criteria)** — the mean across all 7 criteria, useful when you want a
+  composite that weights pronunciation/prosody equally with overall impression.
+Both are broken down per model so you can compare systems within a language directly
+(e.g. F5-TTS vs XTTS-v2 for Yoruba).
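
As a worked illustration of the two headline numbers, here is a small pure-Python sketch. The function name `headline_mos` and the two ratings are invented for the example; the app itself computes these values with pandas in `compute_results`.

```python
from statistics import mean

# Same seven criterion keys as the ratings table.
CRITERIA = ["naturalness", "intelligibility", "pronunciation", "prosody",
            "fluency", "audio_quality", "overall"]


def headline_mos(ratings):
    """Return (overall_mos, composite_mos) for a list of rating dicts.

    overall_mos   : mean of the 'overall' score across ratings.
    composite_mos : mean over ratings of each rating's 7-criterion average.
    """
    overall_mos = mean(r["overall"] for r in ratings)
    composite_mos = mean(mean(r[k] for k in CRITERIA) for r in ratings)
    return overall_mos, composite_mos


# Two made-up ratings of system clips in one language.
example = [
    dict(naturalness=4, intelligibility=5, pronunciation=4, prosody=3,
         fluency=4, audio_quality=5, overall=4),
    dict(naturalness=3, intelligibility=4, pronunciation=3, prosody=3,
         fluency=3, audio_quality=4, overall=3),
]
overall, composite = headline_mos(example)
print(f"Overall MOS: {overall:.3f} | mean of all criteria: {composite:.3f}")
```

With these two ratings the Overall MOS is 3.5 and the composite is 52/14, about 3.714, so the two numbers can differ noticeably even on the same data.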
+---
+## Data model (SQLite)
+```
+users(id, username, email, password_hash, salt, role, native_langs, is_active, created_at)
+languages(id, code, name, created_at)
+user_languages(user_id, language_id)                  -- which languages a reviewer may rate
+samples(id, language_id, sample_name, model_name, file_path, is_reference, transcript, created_at)
+ratings(id, user_id, sample_id, naturalness, intelligibility, pronunciation,
+        prosody, fluency, audio_quality, overall, comments, updated_at,
+        UNIQUE(user_id, sample_id))
+```
+Everything lives under `MOS_DATA_DIR`: `mos.db`, `audio/<language_id>/<file>`, and `exports/`.
+---
+## Suggested additions (already built in where noted)
+These come from standard MOS / listening-test practice and matter for a defensible result:
+1. **Reference / human anchors (built in).** Upload a few real human recordings flagged as
+   *reference*. They are mixed blindly into the reviewer's queue but excluded from the system MOS,
+   and their mean is reported separately. If a reviewer's anchors don't score around 4.5–5.0, that
+   reviewer (or the whole batch) is probably mis-calibrated and can be discounted. This is one of
+   the most important safeguards for credible MOS.
+2. **Blind model names (built in).** Already enforced — reviewers never see which system produced
+   a clip.
+3. **Inter-rater spread (built in).** The per-model/per-sample tables include the standard
+   deviation of the Overall score, so you can spot samples reviewers disagree on.
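
The anchor check in item 1 can also be run per reviewer from the exported raw ratings. A minimal sketch, assuming the input has already been filtered to reference clips; `flag_miscalibrated` and its `threshold` default are hypothetical and not part of the app:

```python
from collections import defaultdict
from statistics import mean


def flag_miscalibrated(anchor_ratings, threshold=4.5):
    """Return {reviewer: mean_anchor_score} for reviewers scoring anchors low.

    anchor_ratings : iterable of (reviewer, overall_score) pairs taken from
                     reference / human-anchor samples only.
    threshold      : assumed cut-off; reviewers whose mean anchor score falls
                     below it are flagged for manual review.
    """
    by_reviewer = defaultdict(list)
    for reviewer, score in anchor_ratings:
        by_reviewer[reviewer].append(score)
    return {
        reviewer: mean(scores)
        for reviewer, scores in by_reviewer.items()
        if mean(scores) < threshold
    }
```

A flagged reviewer is a prompt to look at their ratings, not an automatic exclusion; the right threshold depends on the quality of the anchor recordings.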
+Worth considering for a v2 (not yet built):
+4. **Minimum ratings per sample** before a sample counts as "final" (e.g. require ≥5 reviewers),
+   and surface samples that still have too few ratings.
+5. **Randomised, balanced presentation order** per reviewer (Latin-square style) so position bias
+   averages out, instead of always-ascending sample id.
+6. **Listening-setup gate** — a one-time confirmation that the reviewer is on headphones in a quiet
+   room, stored with their profile.
+7. **Native-speaker / experience metadata** on reviewers, so you can filter MOS to native speakers
+   only for the paper.
+8. **Attention-trap clips** (e.g. "rate this one 1") to catch click-through reviewers.
+9. **Krippendorff's α / ICC** for formal inter-rater reliability, reported per language.
+10. **Pairwise / MUSHRA / CMOS modes** if you later want preference tests rather than absolute MOS.
+11. **Per-reviewer rate limiting / session timing** to flag implausibly fast ratings.
+12. **CI / standard error** on each MOS (the raw export already lets you compute this; could be
+    shown inline).
+13. **DB backup-to-bucket** (recommended if you go to production scale on a Space) — periodic
+    snapshot of `mos.db` to a separate bucket path, with restore-on-startup, to remove the
+    mid-write corruption risk noted above.
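
For item 12, a confidence interval can be computed from the *Raw Ratings* sheet with the standard library alone. This is a sketch under a normal approximation; `mos_with_ci` is a hypothetical helper, and `z=1.96` corresponds to an approximate 95% interval.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (mos, half_width) for a list of 1-5 scores.

    half_width is z times the standard error of the mean, using the sample
    standard deviation. It is None when fewer than two scores are available,
    because the standard deviation is undefined in that case.
    """
    mos = mean(scores)
    if len(scores) < 2:
        return mos, None
    std_err = stdev(scores) / math.sqrt(len(scores))
    return mos, z * std_err
```

For example, `mos_with_ci([4, 4, 5, 3, 4])` gives a MOS of 4.0 with a half-width of roughly 0.62. With only a handful of ratings a t-distribution multiplier would be more appropriate than 1.96, and ratings from the same reviewer are not fully independent, so the interval is best read as a rough guide.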
+---
+## Notes & limitations
+- **SQLite journal mode.** Default is `DELETE` (safe on bucket/FUSE mounts). Set
+  `MOS_JOURNAL_MODE=WAL` only on a genuine local disk for better concurrency. SQLite is comfortable
+  for a few dozen concurrent reviewers; for larger crowdsourcing, move to Postgres. The data layer
+  is isolated in a handful of functions and easy to swap.
+- **Audio serving.** Files are served from disk via Gradio's file mechanism, enabled by
+  `launch(allowed_paths=[MOS_DATA_DIR])`. Keep clips short (a few seconds) as is normal for MOS.
+- Deactivating a user (`Admin → Users → active = no`) blocks login but keeps their ratings.
+- Deleting a sample also deletes its ratings (cascade) and the audio file on disk.
+---
+*Generated for Afolabi Abeeb, Plotweaver AI.*

app.py ADDED

	@@ -0,0 +1,783 @@

+"""
+Plotweaver AI — TTS MOS Evaluation Platform
+============================================
+A multi-user Gradio application for collecting Mean Opinion Score (MOS) ratings
+of synthesised speech across multiple languages.
+Roles
+-----
+- Reviewer: signs up, selects the language(s) they are competent in, listens to
+  audio samples and rates each on 7 criteria (1-5) plus free-text comments.
+- Admin:    uploads audio per language, manages languages/users, and views the
+  aggregated MOS results per language and per model, with XLSX export.
+Persistence
+-----------
+Everything (users, languages, samples, ratings) lives in a single SQLite
+database. Audio files are stored on disk under the data directory. Point
+MOS_DATA_DIR at a persistent location (on Hugging Face Spaces enable persistent
+storage and set MOS_DATA_DIR=/data) so data survives restarts.
+Bootstrapping the first admin
+-----------------------------
+On the signup form there is an optional "Admin code" field. If it matches the
+ADMIN_CODE environment variable, the new account is created as an admin.
+The effective admin code is printed to the logs on startup.
+"""
+import os
+import re
+import sqlite3
+import hashlib
+import secrets
+import datetime as dt
+import shutil
+import pandas as pd
+import gradio as gr
+# --------------------------------------------------------------------------- #
+# Configuration
+# --------------------------------------------------------------------------- #
+DATA_DIR = os.environ.get("MOS_DATA_DIR", os.path.join(os.path.dirname(os.path.abspath(__file__)), "data"))
+AUDIO_DIR = os.path.join(DATA_DIR, "audio")
+EXPORT_DIR = os.path.join(DATA_DIR, "exports")
+DB_PATH = os.path.join(DATA_DIR, "mos.db")
+ADMIN_CODE = os.environ.get("ADMIN_CODE", "plotweaver-admin")
+for d in (DATA_DIR, AUDIO_DIR, EXPORT_DIR):
+    os.makedirs(d, exist_ok=True)
+# The 7 MOS criteria, in the order they appear on the evaluation form.
+# (db_column, display_label, short_definition)
+CRITERIA = [
+    ("naturalness",   "Naturalness",                "How human-like and natural the speech sounds."),
+    ("intelligibility", "Intelligibility",          "How easy it is to understand the spoken content."),
+    ("pronunciation", "Pronunciation Accuracy",     "Whether words, phonemes, tones and language-specific sounds are correct."),
+    ("prosody",       "Prosody & Expressiveness",   "Rhythm, stress, pitch, intonation and speaking style."),
+    ("fluency",       "Fluency",                    "Smoothness without awkward pauses, repetitions or glitches."),
+    ("audio_quality", "Audio Quality",              "Technical quality: noise, distortion, clipping, artifacts."),
+    ("overall",       "Overall Quality",            "Overall impression considering all aspects of synthesis quality."),
+]
+CRITERIA_KEYS = [c[0] for c in CRITERIA]
+SCALE_HINT = "1 = Very Poor  ·  2 = Poor  ·  3 = Fair  ·  4 = Good  ·  5 = Excellent"
+# --------------------------------------------------------------------------- #
+# Database helpers
+# --------------------------------------------------------------------------- #
+def get_conn():
+    conn = sqlite3.connect(DB_PATH, timeout=30)
+    conn.row_factory = sqlite3.Row
+    conn.execute("PRAGMA foreign_keys = ON;")
+    return conn
+def init_db():
+    with get_conn() as conn:
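+        # Journal mode follows MOS_JOURNAL_MODE. DELETE is the default because WAL
+        # is unreliable on bucket/FUSE mounts (see README).
+        journal_mode = os.environ.get("MOS_JOURNAL_MODE", "DELETE").strip().upper()
+        if journal_mode not in ("DELETE", "WAL"):
+            journal_mode = "DELETE"
+        conn.execute(f"PRAGMA journal_mode = {journal_mode};")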
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS users (
+                id            INTEGER PRIMARY KEY AUTOINCREMENT,
+                username      TEXT UNIQUE NOT NULL,
+                email         TEXT,
+                password_hash TEXT NOT NULL,
+                salt          TEXT NOT NULL,
+                role          TEXT NOT NULL DEFAULT 'reviewer',
+                native_langs  TEXT DEFAULT '',
+                is_active     INTEGER NOT NULL DEFAULT 1,
+                created_at    TEXT NOT NULL
+            );
+        """)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS languages (
+                id         INTEGER PRIMARY KEY AUTOINCREMENT,
+                code       TEXT UNIQUE NOT NULL,
+                name       TEXT NOT NULL,
+                created_at TEXT NOT NULL
+            );
+        """)
+        # user <-> language assignment (which languages a reviewer is eligible for)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS user_languages (
+                user_id     INTEGER NOT NULL,
+                language_id INTEGER NOT NULL,
+                PRIMARY KEY (user_id, language_id),
+                FOREIGN KEY (user_id)     REFERENCES users(id)     ON DELETE CASCADE,
+                FOREIGN KEY (language_id) REFERENCES languages(id) ON DELETE CASCADE
+            );
+        """)
+        conn.execute("""
+            CREATE TABLE IF NOT EXISTS samples (
+                id           INTEGER PRIMARY KEY AUTOINCREMENT,
+                language_id  INTEGER NOT NULL,
+                sample_name  TEXT NOT NULL,
+                model_name   TEXT NOT NULL DEFAULT 'unspecified',
+                file_path    TEXT NOT NULL,
+                is_reference INTEGER NOT NULL DEFAULT 0,
+                transcript   TEXT DEFAULT '',
+                created_at   TEXT NOT NULL,
+                FOREIGN KEY (language_id) REFERENCES languages(id) ON DELETE CASCADE
+            );
+        """)
+        cols = ",\n".join(f"{k} INTEGER" for k in CRITERIA_KEYS)
+        conn.execute(f"""
+            CREATE TABLE IF NOT EXISTS ratings (
+                id         INTEGER PRIMARY KEY AUTOINCREMENT,
+                user_id    INTEGER NOT NULL,
+                sample_id  INTEGER NOT NULL,
+                {cols},
+                comments   TEXT DEFAULT '',
+                updated_at TEXT NOT NULL,
+                UNIQUE (user_id, sample_id),
+                FOREIGN KEY (user_id)   REFERENCES users(id)   ON DELETE CASCADE,
+                FOREIGN KEY (sample_id) REFERENCES samples(id) ON DELETE CASCADE
+            );
+        """)
+def now_iso():
+    return dt.datetime.utcnow().isoformat(timespec="seconds")
+# --------------------------------------------------------------------------- #
+# Auth
+# --------------------------------------------------------------------------- #
+def hash_password(password, salt=None):
+    if salt is None:
+        salt = secrets.token_hex(16)
+    h = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt.encode("utf-8"), 200_000)
+    return h.hex(), salt
+def create_user(username, email, password, role="reviewer", language_ids=None):
+    username = (username or "").strip()
+    if not re.fullmatch(r"[A-Za-z0-9_.\-]{3,32}", username):
+        raise ValueError("Username must be 3-32 chars: letters, numbers, _ . - only.")
+    if not password or len(password) < 6:
+        raise ValueError("Password must be at least 6 characters.")
+    pw_hash, salt = hash_password(password)
+    with get_conn() as conn:
+        try:
+            cur = conn.execute(
+                "INSERT INTO users (username, email, password_hash, salt, role, created_at) "
+                "VALUES (?,?,?,?,?,?)",
+                (username, (email or "").strip(), pw_hash, salt, role, now_iso()),
+            )
+        except sqlite3.IntegrityError:
+            raise ValueError(f"Username '{username}' is already taken.")
+        uid = cur.lastrowid
+        for lid in (language_ids or []):
+            conn.execute("INSERT OR IGNORE INTO user_languages (user_id, language_id) VALUES (?,?)", (uid, lid))
+    return uid
+def authenticate(username, password):
+    with get_conn() as conn:
+        row = conn.execute("SELECT * FROM users WHERE username = ?", ((username or "").strip(),)).fetchone()
+    if not row or not row["is_active"]:
+        return None
+    pw_hash, _ = hash_password(password, row["salt"])
+    if secrets.compare_digest(pw_hash, row["password_hash"]):
+        return dict(row)
+    return None
+def user_session(uid):
+    """Return a lightweight session dict used in gr.State."""
+    with get_conn() as conn:
+        row = conn.execute("SELECT * FROM users WHERE id = ?", (uid,)).fetchone()
+        langs = conn.execute(
+            "SELECT l.id, l.code, l.name FROM user_languages ul "
+            "JOIN languages l ON l.id = ul.language_id WHERE ul.user_id = ? ORDER BY l.name", (uid,)
+        ).fetchall()
+    return {
+        "id": row["id"],
+        "username": row["username"],
+        "role": row["role"],
+        "languages": [{"id": l["id"], "code": l["code"], "name": l["name"]} for l in langs],
+    }
+# --------------------------------------------------------------------------- #
+# Languages
+# --------------------------------------------------------------------------- #
+def add_language(code, name):
+    code = (code or "").strip().lower()
+    name = (name or "").strip()
+    if not code or not name:
+        raise ValueError("Both language code and name are required.")
+    with get_conn() as conn:
+        try:
+            conn.execute("INSERT INTO languages (code, name, created_at) VALUES (?,?,?)", (code, name, now_iso()))
+        except sqlite3.IntegrityError:
+            raise ValueError(f"Language code '{code}' already exists.")
+def list_languages():
+    with get_conn() as conn:
+        return [dict(r) for r in conn.execute("SELECT * FROM languages ORDER BY name").fetchall()]
+def set_user_languages(uid, language_ids):
+    with get_conn() as conn:
+        conn.execute("DELETE FROM user_languages WHERE user_id = ?", (uid,))
+        for lid in language_ids:
+            conn.execute("INSERT OR IGNORE INTO user_languages (user_id, language_id) VALUES (?,?)", (uid, lid))
+# --------------------------------------------------------------------------- #
+# Samples
+# --------------------------------------------------------------------------- #
+def add_sample(language_id, src_path, sample_name=None, model_name="unspecified",
+               is_reference=False, transcript=""):
+    if not src_path or not os.path.exists(src_path):
+        raise ValueError("Audio file not found.")
+    ext = os.path.splitext(src_path)[1].lower() or ".wav"
+    lang_dir = os.path.join(AUDIO_DIR, str(language_id))
+    os.makedirs(lang_dir, exist_ok=True)
+    fname = f"{secrets.token_hex(8)}{ext}"
+    dst = os.path.join(lang_dir, fname)
+    shutil.copyfile(src_path, dst)
+    sample_name = (sample_name or "").strip() or os.path.splitext(os.path.basename(src_path))[0]
+    with get_conn() as conn:
+        conn.execute(
+            "INSERT INTO samples (language_id, sample_name, model_name, file_path, is_reference, transcript, created_at) "
+            "VALUES (?,?,?,?,?,?,?)",
+            (language_id, sample_name, (model_name or "unspecified").strip(), dst,
+             1 if is_reference else 0, (transcript or "").strip(), now_iso()),
+        )
+def list_samples(language_id=None):
+    q = ("SELECT s.*, l.name AS language_name, l.code AS language_code "
+         "FROM samples s JOIN languages l ON l.id = s.language_id")
+    args = ()
+    if language_id:
+        q += " WHERE s.language_id = ?"
+        args = (language_id,)
+    q += " ORDER BY s.language_id, s.id"
+    with get_conn() as conn:
+        return [dict(r) for r in conn.execute(q, args).fetchall()]
+def delete_sample(sample_id):
+    with get_conn() as conn:
+        row = conn.execute("SELECT file_path FROM samples WHERE id = ?", (sample_id,)).fetchone()
+        if row and row["file_path"] and os.path.exists(row["file_path"]):
+            try:
+                os.remove(row["file_path"])
+            except OSError:
+                pass
+        conn.execute("DELETE FROM samples WHERE id = ?", (sample_id,))
+# --------------------------------------------------------------------------- #
+# Ratings
+# --------------------------------------------------------------------------- #
+def upsert_rating(user_id, sample_id, scores, comments=""):
+    cols = ", ".join(CRITERIA_KEYS)
+    placeholders = ", ".join("?" for _ in CRITERIA_KEYS)
+    updates = ", ".join(f"{k}=excluded.{k}" for k in CRITERIA_KEYS)
+    vals = [int(scores[k]) for k in CRITERIA_KEYS]
+    with get_conn() as conn:
+        conn.execute(
+            f"INSERT INTO ratings (user_id, sample_id, {cols}, comments, updated_at) "
+            f"VALUES (?,?,{placeholders},?,?) "
+            f"ON CONFLICT(user_id, sample_id) DO UPDATE SET {updates}, "
+            f"comments=excluded.comments, updated_at=excluded.updated_at",
+            [user_id, sample_id, *vals, (comments or "").strip(), now_iso()],
+        )
+def get_rating(user_id, sample_id):
+    with get_conn() as conn:
+        row = conn.execute("SELECT * FROM ratings WHERE user_id=? AND sample_id=?",
+                           (user_id, sample_id)).fetchone()
+    return dict(row) if row else None
+def samples_for_reviewer(user_id, language_id):
+    """Samples in a language with a 'rated' flag for this reviewer."""
+    with get_conn() as conn:
+        rows = conn.execute(
+            "SELECT s.id, s.sample_name, s.is_reference, "
+            "       CASE WHEN r.id IS NULL THEN 0 ELSE 1 END AS rated "
+            "FROM samples s "
+            "LEFT JOIN ratings r ON r.sample_id = s.id AND r.user_id = ? "
+            "WHERE s.language_id = ? ORDER BY s.id",
+            (user_id, language_id),
+        ).fetchall()
+    return [dict(r) for r in rows]
+# --------------------------------------------------------------------------- #
+# Aggregation
+# --------------------------------------------------------------------------- #
+def ratings_dataframe(language_id=None, include_reference=False):
+    q = (
+        "SELECT r.*, s.language_id, s.sample_name, s.model_name, s.is_reference, "
+        "       l.name AS language_name, l.code AS language_code "
+        "FROM ratings r "
+        "JOIN samples s ON s.id = r.sample_id "
+        "JOIN languages l ON l.id = s.language_id"
+    )
+    args = ()
+    if language_id:
+        q += " WHERE s.language_id = ?"
+        args = (language_id,)
+    with get_conn() as conn:
+        df = pd.read_sql_query(q, conn, params=args)
+    if not include_reference and not df.empty:
+        df = df[df["is_reference"] == 0]
+    return df
+def compute_results(language_id, by_model=True):
+    """Return (per_model_df, per_sample_df, summary_text) for a language."""
+    df = ratings_dataframe(language_id=language_id, include_reference=True)
+    if df.empty:
+        return pd.DataFrame(), pd.DataFrame(), "No ratings collected yet for this language."
+    # Split system vs reference (human anchor) samples.
+    sys_df = df[df["is_reference"] == 0]
+    ref_df = df[df["is_reference"] == 1]
+    def agg_block(frame, group_cols):
+        if frame.empty:
+            return pd.DataFrame()
+        g = frame.groupby(group_cols, dropna=False)
+        out = g[CRITERIA_KEYS].mean().round(3)
+        out["MOS (mean of criteria)"] = out[CRITERIA_KEYS].mean(axis=1).round(3)
+        out["overall_std"] = g["overall"].std().round(3)
+        out["n_ratings"] = g.size()
+        out["n_reviewers"] = g["user_id"].nunique()
+        out["n_samples"] = g["sample_id"].nunique()
+        return out.reset_index()
+    per_model = agg_block(sys_df, ["model_name"])
+    per_sample = agg_block(sys_df, ["model_name", "sample_id", "sample_name"])
+    # Friendly column labels.
+    label_map = {k: lbl for k, lbl, _ in CRITERIA}
+    per_model = per_model.rename(columns=label_map)
+    per_sample = per_sample.rename(columns=label_map)
+    lang_name = df["language_name"].iloc[0]
+    lines = [f"Language: {lang_name}",
+             f"Total ratings: {len(sys_df)}  |  Reviewers: {sys_df['user_id'].nunique()}  |  "
+             f"Samples: {sys_df['sample_id'].nunique()}  |  Models: {sys_df['model_name'].nunique()}"]
+    if not sys_df.empty:
+        overall_mos = sys_df["overall"].mean()
+        mean_mos = sys_df[CRITERIA_KEYS].mean(axis=1).mean()
+        lines.append(f"System MOS (Overall criterion): {overall_mos:.3f}   |   "
+                     f"System MOS (mean of all criteria): {mean_mos:.3f}")
+    if not ref_df.empty:
+        lines.append(f"Reference/anchor MOS (Overall): {ref_df['overall'].mean():.3f}  "
+                     f"({ref_df['sample_id'].nunique()} anchor samples) — sanity check on reviewer calibration.")
+    return per_model, per_sample, "\n".join(lines)
+def export_results(language_id):
+    per_model, per_sample, _ = compute_results(language_id)
+    raw = ratings_dataframe(language_id=language_id, include_reference=True)
+    langs = {l["id"]: l for l in list_languages()}
+    code = langs.get(language_id, {}).get("code", str(language_id))
+    path = os.path.join(EXPORT_DIR, f"mos_results_{code}_{dt.datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.xlsx")
+    with pd.ExcelWriter(path, engine="openpyxl") as xw:
+        (per_model if not per_model.empty else pd.DataFrame({"info": ["no data"]})).to_excel(xw, sheet_name="Per Model", index=False)
+        (per_sample if not per_sample.empty else pd.DataFrame({"info": ["no data"]})).to_excel(xw, sheet_name="Per Sample", index=False)
+        (raw if not raw.empty else pd.DataFrame({"info": ["no data"]})).to_excel(xw, sheet_name="Raw Ratings", index=False)
+    return path
+# --------------------------------------------------------------------------- #
+# Startup
+# --------------------------------------------------------------------------- #
+init_db()
+print("=" * 64)
+print("Plotweaver AI — TTS MOS Evaluation Platform")
+print(f"Data directory : {DATA_DIR}")
+print(f"Database       : {DB_PATH}")
+print(f"Admin code     : {ADMIN_CODE}  (use on signup to create an admin)")
+print("=" * 64)
+# ===========================================================================
+#  GRADIO UI
+# ===========================================================================
+def lang_choices():
+    return [(f"{l['name']} ({l['code']})", l["id"]) for l in list_languages()]
+def reviewer_lang_choices(session):
+    if not session:
+        return []
+    return [(l["name"], l["id"]) for l in session["languages"]]
+with gr.Blocks(title="Plotweaver AI — TTS MOS Evaluation", theme=gr.themes.Soft()) as demo:
+    session = gr.State(None)          # logged-in user session dict
+    current_sample = gr.State(None)   # sample id currently being rated
+    gr.Markdown("# 🎧 Plotweaver AI — TTS MOS Evaluation Platform")
+    # ----------------------------- AUTH ------------------------------------ #
+    with gr.Column(visible=True) as auth_col:
+        gr.Markdown("Sign in or create a reviewer account to begin.")
+        with gr.Tabs():
+            with gr.Tab("Sign in"):
+                li_user = gr.Textbox(label="Username")
+                li_pw = gr.Textbox(label="Password", type="password")
+                li_btn = gr.Button("Sign in", variant="primary")
+                li_msg = gr.Markdown()
+            with gr.Tab("Create account"):
+                su_user = gr.Textbox(label="Username", info="3-32 chars: letters, numbers, _ . -")
+                su_email = gr.Textbox(label="Email (optional)")
+                su_pw = gr.Textbox(label="Password", type="password", info="At least 6 characters")
+                su_langs = gr.CheckboxGroup(label="Languages you can evaluate", choices=lang_choices())
+                su_code = gr.Textbox(label="Admin code (optional)", type="password",
+                                     info="Leave blank for a normal reviewer account.")
+                su_btn = gr.Button("Create account", variant="primary")
+                su_msg = gr.Markdown()
+    # ----------------------------- APP ------------------------------------- #
+    with gr.Column(visible=False) as app_col:
+        with gr.Row():
+            greeting = gr.Markdown()
+            logout_btn = gr.Button("Log out", scale=0)
+        with gr.Tabs() as app_tabs:
+            # ---------- Rate tab ----------
+            with gr.Tab("Rate samples"):
+                gr.Markdown(
+                    "Please listen to **the whole clip** with headphones in a quiet room before rating. "
+                    f"Scale: {SCALE_HINT}"
+                )
+                with gr.Row():
+                    rate_lang = gr.Dropdown(label="Language", choices=[], interactive=True)
+                    rate_sample = gr.Dropdown(label="Sample (✓ = already rated)", choices=[], interactive=True)
+                    next_btn = gr.Button("Next unrated ▶", scale=0)
+                rate_audio = gr.Audio(label="Audio sample", type="filepath", interactive=False)
+                criterion_inputs = {}
+                with gr.Row():
+                    for key, label, definition in CRITERIA[:4]:
+                        criterion_inputs[key] = gr.Radio(
+                            choices=[1, 2, 3, 4, 5], label=label, info=definition, value=None
+                        )
+                with gr.Row():
+                    for key, label, definition in CRITERIA[4:]:
+                        criterion_inputs[key] = gr.Radio(
+                            choices=[1, 2, 3, 4, 5], label=label, info=definition, value=None
+                        )
+                rate_comments = gr.Textbox(label="Comments (pronunciation errors, artifacts, etc.)", lines=2)
+                with gr.Row():
+                    submit_btn = gr.Button("Submit / update rating", variant="primary")
+                rate_msg = gr.Markdown()
+            # ---------- Progress tab ----------
+            with gr.Tab("My progress"):
+                refresh_prog_btn = gr.Button("Refresh")
+                progress_md = gr.Markdown()
+                progress_tbl = gr.Dataframe(headers=["Language", "Rated", "Total", "Remaining"], interactive=False)
+            # ---------- Profile tab ----------
+            with gr.Tab("My languages"):
+                gr.Markdown("Update the set of languages you are eligible to evaluate.")
+                prof_langs = gr.CheckboxGroup(label="Languages", choices=lang_choices())
+                prof_save = gr.Button("Save", variant="primary")
+                prof_msg = gr.Markdown()
+            # ---------- Admin tab ----------
+            with gr.Tab("Admin", visible=False) as admin_tab:
+                gr.Markdown("### Languages")
+                with gr.Row():
+                    al_code = gr.Textbox(label="Code", info="e.g. yo, ha, ig, pcm, en-NG")
+                    al_name = gr.Textbox(label="Name", info="e.g. Yoruba")
+                    al_btn = gr.Button("Add language", scale=0)
+                al_msg = gr.Markdown()
+                langs_tbl = gr.Dataframe(headers=["id", "code", "name"], interactive=False, label="Existing languages")
+                gr.Markdown("### Upload audio samples")
+                with gr.Row():
+                    up_lang = gr.Dropdown(label="Language", choices=lang_choices())
+                    up_model = gr.Textbox(label="Model / system name", value="",
+                                          info="e.g. F5-TTS, XTTS-v2, MMS-TTS, human. Hidden from reviewers.")
+                up_files = gr.File(label="Audio files (wav/mp3/flac/ogg)", file_count="multiple",
+                                   file_types=["audio"])
+                with gr.Row():
+                    up_isref = gr.Checkbox(label="These are reference / human anchor samples", value=False)
+                    up_transcript = gr.Textbox(label="Transcript (optional, applies to all uploaded)", scale=2)
+                up_btn = gr.Button("Upload", variant="primary")
+                up_msg = gr.Markdown()
+                samples_tbl = gr.Dataframe(
+                    headers=["id", "language", "sample_name", "model", "reference"],
+                    interactive=False, label="Samples")
+                with gr.Row():
+                    del_sample_id = gr.Number(label="Delete sample by id", precision=0)
+                    del_btn = gr.Button("Delete", variant="stop", scale=0)
+                gr.Markdown("### Users")
+                refresh_users_btn = gr.Button("Refresh users")
+                users_tbl = gr.Dataframe(
+                    headers=["id", "username", "role", "active", "languages", "ratings"],
+                    interactive=False)
+                with gr.Row():
+                    promote_id = gr.Number(label="User id", precision=0)
+                    role_choice = gr.Dropdown(label="Set role", choices=["reviewer", "admin"], value="reviewer")
+                    active_choice = gr.Dropdown(label="Set active", choices=["yes", "no"], value="yes")
+                    update_user_btn = gr.Button("Apply", scale=0)
+                user_admin_msg = gr.Markdown()
+                gr.Markdown("### Results")
+                with gr.Row():
+                    res_lang = gr.Dropdown(label="Language", choices=lang_choices())
+                    res_btn = gr.Button("Compute MOS", variant="primary")
+                    export_btn = gr.Button("Export XLSX")
+                res_summary = gr.Markdown()
+                res_model_tbl = gr.Dataframe(label="MOS by model", interactive=False)
+                res_sample_tbl = gr.Dataframe(label="MOS by sample", interactive=False)
+                res_file = gr.File(label="Exported file")
+    # ===================================================================== #
+    #  Handlers
+    # ===================================================================== #
+    rate_outputs = [criterion_inputs[k] for k in CRITERIA_KEYS]
+    def do_login(username, password):
+        user = authenticate(username, password)
+        if not user:
+            return (gr.update(), gr.update(), None, "❌ Invalid username or password.",
+                    gr.update(), gr.update(), gr.update(), gr.update())
+        sess = user_session(user["id"])
+        is_admin = sess["role"] == "admin"
+        rl = reviewer_lang_choices(sess)
+        return (
+            gr.update(visible=False),                       # auth_col
+            gr.update(visible=True),                        # app_col
+            sess,                                           # session state
+            "",                                             # li_msg
+            gr.update(value=f"Signed in as **{sess['username']}** ({sess['role']})"),  # greeting
+            gr.update(visible=is_admin),                    # admin_tab
+            gr.update(choices=rl, value=(rl[0][1] if rl else None)),  # rate_lang
+            gr.update(choices=lang_choices(),
+                      value=[l["id"] for l in sess["languages"]]),    # prof_langs
+        )
+    li_btn.click(
+        do_login, [li_user, li_pw],
+        [auth_col, app_col, session, li_msg, greeting, admin_tab, rate_lang, prof_langs],
+    )
+    def do_signup(username, email, password, lang_ids, code):
+        role = "admin" if (code and code == ADMIN_CODE) else "reviewer"
+        try:
+            create_user(username, email, password, role=role, language_ids=lang_ids or [])
+        except ValueError as e:
+            return f"❌ {e}", gr.update()
+        note = " (admin)" if role == "admin" else ""
+        return f"✅ Account created{note}. Switch to **Sign in** to continue.", gr.update(value="")
+    su_btn.click(do_signup, [su_user, su_email, su_pw, su_langs, su_code], [su_msg, su_pw])
+    def do_logout():
+        return (gr.update(visible=True), gr.update(visible=False), None, "")
+    logout_btn.click(do_logout, None, [auth_col, app_col, session, greeting])
+    # ---- Rating flow ----
+    def load_samples_for_lang(sess, language_id):
+        if not sess or not language_id:
+            return gr.update(choices=[], value=None)
+        items = samples_for_reviewer(sess["id"], language_id)
+        choices = [(("✓ " if s["rated"] else "• ") + s["sample_name"], s["id"]) for s in items]
+        return gr.update(choices=choices, value=(choices[0][1] if choices else None))
+    rate_lang.change(load_samples_for_lang, [session, rate_lang], [rate_sample])
+    def load_sample(sess, sample_id):
+        """Load audio + any existing rating into the form."""
+        if not sample_id:
+            return (None, None, *[None] * len(CRITERIA_KEYS), "")
+        with get_conn() as conn:
+            row = conn.execute("SELECT * FROM samples WHERE id=?", (sample_id,)).fetchone()
+        if not row:
+            return (None, None, *[None] * len(CRITERIA_KEYS), "")
+        existing = get_rating(sess["id"], sample_id) if sess else None
+        scores = [existing[k] if existing else None for k in CRITERIA_KEYS]
+        comments = existing["comments"] if existing else ""
+        return (row["file_path"], sample_id, *scores, comments)
+    rate_sample.change(
+        load_sample, [session, rate_sample],
+        [rate_audio, current_sample, *rate_outputs, rate_comments],
+    )
+    def go_next_unrated(sess, language_id):
+        if not sess or not language_id:
+            return gr.update()
+        items = samples_for_reviewer(sess["id"], language_id)
+        nxt = next((s["id"] for s in items if not s["rated"]), None)
+        if nxt is None:
+            return gr.update()
+        return gr.update(value=nxt)
+    next_btn.click(go_next_unrated, [session, rate_lang], [rate_sample])
+    def submit_rating(sess, sample_id, comments, *scores):
+        if not sess:
+            return "❌ Not signed in.", gr.update()
+        if not sample_id:
+            return "❌ No sample selected.", gr.update()
+        score_map = dict(zip(CRITERIA_KEYS, scores))
+        missing = [lbl for (k, lbl, _) in CRITERIA if score_map.get(k) in (None, "")]
+        if missing:
+            return f"❌ Please rate every criterion. Missing: {', '.join(missing)}", gr.update()
+        upsert_rating(sess["id"], sample_id, score_map, comments)
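+        # Count progress for the confirmation message. The dropdown itself (and its ✓
+        # markers) is refreshed by the .then() step chained onto submit_btn.click.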
+        items = samples_for_reviewer(sess["id"], score_lang(sess, sample_id))
+        rated = sum(1 for s in items if s["rated"])
+        return (f"✅ Saved. You have rated {rated}/{len(items)} samples in this language.", gr.update())
+    def score_lang(sess, sample_id):
+        with get_conn() as conn:
+            row = conn.execute("SELECT language_id FROM samples WHERE id=?", (sample_id,)).fetchone()
+        return row["language_id"] if row else None
+    submit_btn.click(
+        submit_rating,
+        [session, current_sample, rate_comments, *rate_outputs],
+        [rate_msg, rate_sample],
+    ).then(
+        load_samples_for_lang, [session, rate_lang], [rate_sample]
+    )
+    # ---- Progress ----
+    def load_progress(sess):
+        if not sess:
+            return "Not signed in.", []
+        rows = []
+        for l in sess["languages"]:
+            items = samples_for_reviewer(sess["id"], l["id"])
+            rated = sum(1 for s in items if s["rated"])
+            rows.append([l["name"], rated, len(items), len(items) - rated])
+        if not rows:
+            return "You have no languages assigned yet. Add some under **My languages**.", []
+        return f"Progress for **{sess['username']}**:", rows
+    refresh_prog_btn.click(load_progress, [session], [progress_md, progress_tbl])
+    # ---- Profile ----
+    def save_profile(sess, lang_ids):
+        if not sess:
+            return "❌ Not signed in.", gr.update(), gr.update()
+        set_user_languages(sess["id"], lang_ids or [])
+        new_sess = user_session(sess["id"])
+        rl = reviewer_lang_choices(new_sess)
+        return ("✅ Languages updated.", new_sess,
+                gr.update(choices=rl, value=(rl[0][1] if rl else None)))
+    prof_save.click(save_profile, [session, prof_langs], [prof_msg, session, rate_lang])
+    # ---- Admin: languages ----
+    def admin_add_language(sess, code, name):
+        if not sess or sess["role"] != "admin":
+            return "❌ Admin only.", _languages_table(), *_lang_dropdown_updates()
+        try:
+            add_language(code, name)
+        except ValueError as e:
+            return f"❌ {e}", _languages_table(), *_lang_dropdown_updates()
+        return f"✅ Added {name} ({code}).", _languages_table(), *_lang_dropdown_updates()
+    def _languages_table():
+        return [[l["id"], l["code"], l["name"]] for l in list_languages()]
+    def _lang_dropdown_updates():
+        # One update per language selector wired to al_btn.click:
+        # up_lang, res_lang, su_langs, prof_langs.
+        ch = lang_choices()
+        return tuple(gr.update(choices=ch) for _ in range(4))
+    al_btn.click(
+        admin_add_language, [session, al_code, al_name],
+        [al_msg, langs_tbl, up_lang, res_lang, su_langs, prof_langs],
+    )
+    # ---- Admin: samples ----
+    def admin_upload(sess, language_id, files, model, is_ref, transcript):
+        if not sess or sess["role"] != "admin":
+            return "❌ Admin only.", _samples_table()
+        if not language_id:
+            return "❌ Choose a language first.", _samples_table()
+        if not files:
+            return "❌ No files selected.", _samples_table()
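+        count = 0
+        for f in files:
+            # Gradio may pass plain path strings or file objects exposing .name
+            path = f if isinstance(f, str) else getattr(f, "name", None)
+            if not path:
+                return f"❌ Could not read an uploaded file ({count} saved before it).", _samples_table()
+            try:
+                add_sample(language_id, path, model_name=model, is_reference=is_ref, transcript=transcript)
+                count += 1
+            except Exception as e:  # noqa
+                # Report which file failed and how many were already stored
+                return (f"❌ Error on {os.path.basename(path)}: {e} ({count} saved before it).",
+                        _samples_table())
+        return f"✅ Uploaded {count} sample(s).", _samples_table()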
+    def _samples_table():
+        return [[s["id"], f"{s['language_name']} ({s['language_code']})", s["sample_name"],
+                 s["model_name"], "yes" if s["is_reference"] else "no"] for s in list_samples()]
+    up_btn.click(admin_upload, [session, up_lang, up_files, up_model, up_isref, up_transcript],
+                 [up_msg, samples_tbl])
+    def admin_delete_sample(sess, sid):
+        if not sess or sess["role"] != "admin":
+            return "❌ Admin only.", _samples_table()
+        if not sid:
+            return "❌ Enter a sample id.", _samples_table()
+        delete_sample(int(sid))
+        return f"✅ Deleted sample {int(sid)}.", _samples_table()
+    del_btn.click(admin_delete_sample, [session, del_sample_id], [up_msg, samples_tbl])
+    # ---- Admin: users ----
+    def _users_table():
+        rows = []
+        with get_conn() as conn:
+            for u in conn.execute("SELECT * FROM users ORDER BY id").fetchall():
+                langs = conn.execute(
+                    "SELECT l.name FROM user_languages ul JOIN languages l ON l.id=ul.language_id "
+                    "WHERE ul.user_id=?", (u["id"],)).fetchall()
+                nratings = conn.execute("SELECT COUNT(*) c FROM ratings WHERE user_id=?", (u["id"],)).fetchone()["c"]
+                rows.append([u["id"], u["username"], u["role"], "yes" if u["is_active"] else "no",
+                             ", ".join(l["name"] for l in langs), nratings])
+        return rows
+    refresh_users_btn.click(lambda s: _users_table() if s and s["role"] == "admin" else [], [session], [users_tbl])
+    def admin_update_user(sess, uid, role, active):
+        if not sess or sess["role"] != "admin":
+            return "❌ Admin only.", _users_table()
+        if not uid:
+            return "❌ Enter a user id.", _users_table()
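+        with get_conn() as conn:
+            cur = conn.execute("UPDATE users SET role=?, is_active=? WHERE id=?",
+                               (role, 1 if active == "yes" else 0, int(uid)))
+            updated = cur.rowcount
+        # Do not report success when the id matched no row
+        if not updated:
+            return f"❌ No user with id {int(uid)}.", _users_table()
+        return f"✅ Updated user {int(uid)}.", _users_table()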
+    update_user_btn.click(admin_update_user, [session, promote_id, role_choice, active_choice],
+                          [user_admin_msg, users_tbl])
+    # ---- Admin: results ----
+    def admin_results(sess, language_id):
+        if not sess or sess["role"] != "admin":
+            return "❌ Admin only.", pd.DataFrame(), pd.DataFrame()
+        if not language_id:
+            return "Choose a language.", pd.DataFrame(), pd.DataFrame()
+        per_model, per_sample, summary = compute_results(language_id)
+        return summary, per_model, per_sample
+    res_btn.click(admin_results, [session, res_lang], [res_summary, res_model_tbl, res_sample_tbl])
+    def admin_export(sess, language_id):
+        if not sess or sess["role"] != "admin" or not language_id:
+            return None
+        return export_results(language_id)
+    export_btn.click(admin_export, [session, res_lang], [res_file])
+    # Populate (or clear) the admin tables whenever the session changes. A second
+    # li_btn.click listener is not chained after do_login, so it may read `session`
+    # before the login handler has written it.
+    session.change(lambda s: (_languages_table(), _samples_table(), _users_table())
+                   if s and s["role"] == "admin" else ([], [], []),
+                   [session], [langs_tbl, samples_tbl, users_tbl])
+if __name__ == "__main__":
+    demo.queue().launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", 7860)))
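
Review note on `do_signup` in app.py: the submitted admin code is compared with `==`, which can in principle leak timing information about the secret. Below is a minimal sketch of a constant-time check using the standard library's `hmac.compare_digest`. The helper name `resolve_role` and its parameters are hypothetical and not part of app.py. It assumes the configured admin code is a plain string (or unset).

```python
import hmac


def resolve_role(submitted_code, admin_code):
    """Return "admin" only when a non-empty admin code is configured and matches.

    Hypothetical helper for illustration; app.py computes the role inline.
    """
    # Blank input or an unset admin code always yields a normal reviewer account.
    if not submitted_code or not admin_code:
        return "reviewer"
    # compare_digest takes time independent of where the inputs differ.
    # Encoding to bytes avoids a TypeError on non-ASCII strings.
    matches = hmac.compare_digest(
        submitted_code.encode("utf-8"), admin_code.encode("utf-8")
    )
    return "admin" if matches else "reviewer"


if __name__ == "__main__":
    print(resolve_role("s3cret", "s3cret"))
    print(resolve_role("guess", "s3cret"))
```

Under these assumptions, `do_signup` could call `resolve_role(code, ADMIN_CODE)` in place of the inline comparison. Behaviour for blank codes stays the same as in the current handler.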

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+gradio>=5.0,<6.0
+pandas>=2.0
+openpyxl>=3.1
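
Review note on the Gradio pin: `gradio>=5.0,<6.0` has to agree with the `sdk_version` in the README front matter (`"5.49.1"` in this commit, replacing `6.19.0`). The sketch below is a rough sanity check of a version string against that range. The function names are hypothetical, and it only handles plain numeric versions like `5.49.1`. Pre-release suffixes such as `6.0.0rc1` are not handled, and the `packaging` library is the more robust tool for real specifier matching.

```python
def parse_version(text):
    """Turn '5.49.1' into (5, 49, 1), padding short versions with zeros."""
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies_pin(version, lower=(5, 0, 0), upper=(6, 0, 0)):
    """True when lower <= version < upper, mirroring `gradio>=5.0,<6.0`."""
    return lower <= parse_version(version) < upper


if __name__ == "__main__":
    for candidate in ("5.49.1", "6.19.0"):
        print(candidate, satisfies_pin(candidate))
```

With these bounds, the new `sdk_version` falls inside the pinned range and the previous `6.19.0` does not.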