Spaces: JunSiang26/OdioCheck-Backend

JunSiang26 committed on Apr 16

Commit

9f2b6db

0 Parent(s):

Pure production deploy

Files changed (16)

.gitattributes +4 -0
.gitignore +18 -0
.vscode/settings.json +4 -0
AI Project 2026.pdf +0 -0
Dockerfile +25 -0
Model_Training_(Odio).ipynb +0 -0
README.md +108 -0
backend/app.py +233 -0
backend/dataset.py +211 -0
backend/models.py +1019 -0
backend/preprocess.py +58 -0
backend/train.py +514 -0
frontend/index.html +75 -0
frontend/script.js +243 -0
frontend/style.css +34 -0
requirements.txt +17 -0

.gitattributes ADDED

	@@ -0,0 +1,4 @@

+backend/models/*.pth filter=lfs diff=lfs merge=lfs -text
+**/*.pth filter=lfs diff=lfs merge=lfs -text
+backend/models/**/*.pth filter=lfs diff=lfs merge=lfs -text
+backend/models/Ablation[[:space:]]models/*.pth filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,18 @@

+.venv
+MLAAD-tiny/
+data/
+*.pth
+Download_sample (Ignore)/
+__pycache__/
+project_requirements.txt
+generate_notebook.py
+backend/precomputed_features/
+# Models
+*.pth
+backend/models/*.pth
+# Environment
+.env
+.DS_Store
+node_modules/

.vscode/settings.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "python-envs.defaultEnvManager": "ms-python.python:conda",
+    "python-envs.defaultPackageManager": "ms-python.python:conda"
+}

AI Project 2026.pdf ADDED

Binary file (73.1 kB).

Dockerfile ADDED

	@@ -0,0 +1,25 @@

+# Use a slim Python image
+FROM python:3.10-slim
+# Install system dependencies for audio processing
+RUN apt-get update && apt-get install -y \
+    libsndfile1 \
+    ffmpeg \
+    && rm -rf /var/lib/apt/lists/*
+# Set working directory
+WORKDIR /app
+# Copy requirements and install
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy the backend code and models
+COPY backend/ ./backend/
+# Expose the port FastAPI will run on
+EXPOSE 7860
+# Command to run the application
+# Note: Hugging Face uses port 7860 by default
+CMD ["uvicorn", "backend.app:app", "--host", "0.0.0.0", "--port", "7860"]

Model_Training_(Odio).ipynb ADDED

The diff for this file is too large to render.

README.md ADDED

	@@ -0,0 +1,108 @@

+<img width="1080" height="324" alt="odiocheck" src="https://github.com/user-attachments/assets/4d7b573e-5b0b-4fc7-85de-da60bbb701c2" />
+# OdioCheck - Deepfake Voice Detection AI
+*50.021 Artificial Intelligence Project*
+## Theme
+**AI for Security & Social Good** (UN SDG #16: Peace, Justice, and Strong Institutions)
+OdioCheck tackles the rising threat of audio deepfakes used in scams and misdirection.
+## Requirements Checklist
+- [x] **Fully functioning code:** Complete end-to-end PyTorch implementation from dataset loading to real-time inference via a web UI.
+- [x] **Baseline models (×3):**
+  - **Wav2Vec2** — self-supervised transformer feature extractor (frozen) + attentive pooling classifier (`backend/models.py`)
+  - **AASIST** — graph-based SOTA baseline using sinc-filter frontend + spectro-temporal heterogeneous graph attention (`backend/models.py`)
+  - **CQCC Baseline** — standard CNN processing Constant-Q Cepstral Coefficients (`backend/models.py`)
+- [x] **SOTA Custom Model:** `ImprovedWav2Vec2CQCCDetector` — a novel fusion architecture combining Wav2Vec 2.0 and CQCC features via **bidirectional cross-attention**, followed by a **Graph Attention** backend (`backend/models.py`).
+- [x] **Ablation Study (×4):** Four ablation variants systematically isolate each architectural component to validate the custom model design:
+  - **Ablation 1** — Wav2Vec2 + Graph (no CQCC, no cross-attention)
+  - **Ablation 2** — CQCC + Graph (no Wav2Vec2, no cross-attention)
+  - **Ablation 3** — Wav2Vec2 + CQCC + Simple Concat + Graph (no cross-attention)
+  - **Ablation 4** — Wav2Vec2 + CQCC + Cross-Attention + Linear (no Graph Attention)
+- [x] **Fully Working Frontend:** Glassmorphic UI (Tailwind + Vanilla JS) served via FastAPI. Supports OGG/MP3/M4A/FLAC/WAV. Shows **side-by-side** predictions from all four primary models with real-time animated confidence bars and a per-window **temporal analysis timeline chart**.
+- [x] **Cross-lingual Dataset Split:** Trained on English audio (`MLAAD-tiny/en`), tested on unseen German audio (`MLAAD-tiny/de`) for out-of-distribution generalisation evaluation.
+- [x] **CQCC Feature Caching:** Pre-computed CQCC tensors are cached to disk to avoid redundant computation across training runs.
+---
+## Installation
+Ensure you have Python 3.9+ installed. Install all dependencies:
+```bash
+pip install -r requirements.txt
+```
+### Dataset Download
+We use the `MLAAD-tiny` dataset (multi-language audio deepfakes). Download it from Hugging Face before training:
+```bash
+pip install -U "huggingface_hub[cli]"
+huggingface-cli download mueller91/MLAAD-tiny --repo-type dataset --local-dir MLAAD-tiny
+```
+---
+## Running the Project
+### Step 1 — (Optional) Pre-compute CQCC Cache
+Pre-computing CQCC features once dramatically speeds up all subsequent training runs:
+```bash
+python backend/train.py --precompute-cqcc-only
+```
+### Step 2 — Train All Models
+Trains all 4 primary models + 4 ablation variants, evaluates on the German test set, and saves `.pth` weights to `backend/models/`:
+```bash
+python backend/train.py
+```
+#### Available Training Flags
+| Flag | Default | Description |
+|---|---|---|
+| `--val-split F` | `0.2` | Fraction of English data reserved for validation (0–0.5). |
+| `--data-dir PATH` | auto | Override dataset root (must contain `original/` and `fake/` folders). |
+| `--cqcc-cache-dir PATH` | `backend/precomputed_features/cqcc` | Where to read/write cached CQCC tensors. |
+| `--precompute-cqcc-only` | `False` | Build CQCC cache and exit without training. |
+| `--force-rebuild-cqcc` | `False` | Recompute CQCC cache even if files already exist. |
+| `--smoke-test` | `False` | Run one forward pass through every model and exit — useful for verifying setup. |
+#### Quick Smoke Test
+Verify all models initialise and run a forward pass correctly without full training:
+```bash
+python backend/train.py --smoke-test
+```
+### Step 3 — Start the Web Interface
+```bash
+uvicorn backend.app:app --reload
+```
+Open **http://127.0.0.1:8000** in your browser. Upload any audio file (WAV, MP3, OGG, FLAC, M4A) to see simultaneous predictions from all four primary models plus an animated temporal confidence chart.
+---
+## Project Architecture
+```
+AI Project/
+├── backend/
+│   ├── models.py          # All model architectures (3 baselines + custom + 4 ablations)
+│   ├── dataset.py         # AudioDataset with CQCC caching + data augmentation
+│   ├── train.py           # Full training + evaluation pipeline (CLI-driven)
+│   ├── app.py             # FastAPI inference server (windowed temporal analysis)
+│   ├── preprocess.py      # Standalone preprocessing utilities
+│   └── models/            # Saved .pth weight files (generated after training)
+├── frontend/
+│   ├── index.html         # Glassmorphic UI shell
+│   ├── script.js          # File upload, Chart.js timeline, model panel rendering
+│   └── style.css          # Custom glassmorphism styles
+├── MLAAD-tiny/            # Dataset (downloaded separately)
+├── requirements.txt       # Python dependencies
+└── Model_Training_(Odio).ipynb    # Google Colab training notebook
+```
+---
+## Working with Other Datasets
+To replace MLAAD-tiny with another dataset (e.g., ASVspoof):
+1. Place your `fake/` and `original/` (or `real/`) audio folders into a `data/` directory at the project root.
+2. The `AudioDataset` in `dataset.py` auto-detects and falls back to the `data/` directory if `MLAAD-tiny` is absent.
+3. Re-run `python backend/train.py`. The full pipeline runs identically.
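
Note on the training flags documented in the README above: `backend/train.py` is not shown in this part of the diff, so the following is only a minimal sketch of how the documented flags and defaults could be parsed with `argparse`. The function names `build_parser` and `parse_args` are hypothetical, and treating the `--val-split` range 0 to 0.5 as inclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flag table in README.md."""
    parser = argparse.ArgumentParser(description="OdioCheck training (sketch)")
    parser.add_argument("--val-split", type=float, default=0.2,
                        help="Fraction of English data reserved for validation.")
    parser.add_argument("--data-dir", default=None,
                        help="Dataset root containing original/ and fake/ folders.")
    parser.add_argument("--cqcc-cache-dir",
                        default="backend/precomputed_features/cqcc",
                        help="Where to read/write cached CQCC tensors.")
    parser.add_argument("--precompute-cqcc-only", action="store_true",
                        help="Build the CQCC cache and exit without training.")
    parser.add_argument("--force-rebuild-cqcc", action="store_true",
                        help="Recompute the CQCC cache even if files exist.")
    parser.add_argument("--smoke-test", action="store_true",
                        help="Run one forward pass through every model and exit.")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and enforce the documented --val-split range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: both ends of the documented 0 to 0.5 range are allowed.
    if not 0.0 <= args.val_split <= 0.5:
        parser.error("--val-split must be between 0 and 0.5")
    return args
```

Calling `parse_args(["--smoke-test"])` in this sketch yields a namespace with `smoke_test=True` and every other option at its README default.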

backend/app.py ADDED

	@@ -0,0 +1,233 @@

+import os
+import torch
+import numpy as np
+import torch.nn.functional as F
+from fastapi import FastAPI, UploadFile, File
+from fastapi.responses import JSONResponse
+from fastapi.staticfiles import StaticFiles
+from fastapi.middleware.cors import CORSMiddleware
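+import sys
+import librosa
+# Put this directory on sys.path before importing sibling modules, so the
+# imports also resolve when the app is launched as `uvicorn backend.app:app`.
+sys.path.append(os.path.dirname(__file__))
+from dataset import compute_cqcc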
+from models import (
+    Wav2Vec2SpoofDetector,
+    AASISTDetector,
+    CQCCBaselineDetector,
+    ImprovedWav2Vec2CQCCDetector
+)
+app = FastAPI(title="Deepfake Voice Detection")
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# -------------------------------------------------------
+# Load Models
+# -------------------------------------------------------
+models_dir = os.path.join(os.path.dirname(__file__), "models")
+def load_model(model, filename):
+    path = os.path.join(models_dir, filename)
+    if os.path.exists(path):
+        state_dict = torch.load(path, map_location=device)
+        # Handle weight_norm parametrization mismatch (common in Wav2Vec2 between versions)
+        # This converts the 'parametrizations' keys back to 'weight_g' and 'weight_v'
+        new_state_dict = {}
+        for k, v in state_dict.items():
+            if "pos_conv_embed.conv.parametrizations.weight.original0" in k:
+                new_key = k.replace("parametrizations.weight.original0", "weight_g")
+                new_state_dict[new_key] = v
+            elif "pos_conv_embed.conv.parametrizations.weight.original1" in k:
+                new_key = k.replace("parametrizations.weight.original1", "weight_v")
+                new_state_dict[new_key] = v
+            else:
+                new_state_dict[k] = v
+        model.load_state_dict(new_state_dict)
+        print(f"Loaded {filename}")
+    else:
+        print(f"WARNING: {filename} not found. Run train.py first!")
+    model.eval()
+    return model
+wav2vec_model = load_model(
+    Wav2Vec2SpoofDetector(num_classes=2).to(device),
+    "wav2vec2.pth"
+)
+aasist_model = load_model(
+    AASISTDetector(num_classes=2).to(device),
+    "aasist.pth"
+)
+cqcc_baseline_model = load_model(
+    CQCCBaselineDetector(num_classes=2).to(device),
+    "cqcc_baseline.pth"
+)
+custom_hybrid_model = load_model(
+    ImprovedWav2Vec2CQCCDetector(num_classes=2).to(device),
+    "custom_hybrid.pth"
+)
+# -------------------------------------------------------
+# Audio Preprocessing (mirrors dataset.py __getitem__)
+# -------------------------------------------------------
+TARGET_LEN = 64600  # AASIST standard: 64600 / 16000 = 4.0375 s at 16 kHz
+CQCC_N_BINS = 60    # Matches AudioDataset default
+# 50% overlap: each step is half a window (~2s), giving smooth temporal curves
+# without running 4x Wav2Vec2 passes per second.
+WINDOW_STEP = TARGET_LEN // 2
+def preprocess_window(wav_np: np.ndarray) -> tuple[torch.Tensor, torch.Tensor]:
+    """
+    Crop or pad a single audio window to TARGET_LEN, then compute waveform
+    and CQCC tensors — identical to AudioDataset.__getitem__ (non-augmented).
+    Returns:
+        wav  : (1, TARGET_LEN) float32 tensor
+        cqcc : (1, 20, T)      float32 tensor
+    """
+    # Center-crop or zero-pad to exactly TARGET_LEN (matches eval path in dataset.py)
+    if len(wav_np) > TARGET_LEN:
+        start = (len(wav_np) - TARGET_LEN) // 2
+        wav_np = wav_np[start : start + TARGET_LEN]
+    elif len(wav_np) < TARGET_LEN:
+        wav_np = np.pad(wav_np, (0, TARGET_LEN - len(wav_np)), mode='constant')
+    wav = torch.from_numpy(wav_np).unsqueeze(0).float()
+    cqcc = compute_cqcc(wav_np, n_bins=CQCC_N_BINS)   # → (1, 20, T)
+    return wav, cqcc
+def run_window(wav: torch.Tensor, cqcc: torch.Tensor) -> dict:
+    """
+    Run all four models on a single window and return fake probabilities (0–100).
+    """
+    wav  = wav.unsqueeze(0).to(device)    # (1, 1, TARGET_LEN)
+    cqcc = cqcc.unsqueeze(0).to(device)   # (1, 1, 20, T)
+    with torch.no_grad():
+        w2v_prob    = torch.softmax(wav2vec_model(wav),             dim=1)[0][1].item()
+        aasist_prob = torch.softmax(aasist_model(wav),              dim=1)[0][1].item()
+        cqcc_prob   = torch.softmax(cqcc_baseline_model(cqcc),      dim=1)[0][1].item()
+        custom_prob = torch.softmax(custom_hybrid_model(wav, cqcc), dim=1)[0][1].item()
+    return {
+        "wav2vec2":      round(w2v_prob    * 100, 2),
+        "aasist":        round(aasist_prob * 100, 2),
+        "cqcc_baseline": round(cqcc_prob   * 100, 2),
+        "custom_hybrid": round(custom_prob * 100, 2),
+    }
+def aggregate_prediction(fake_prob_pct: float) -> dict:
+    """Convert a mean fake probability into the standard prediction dict."""
+    return {
+        "prediction":       "FAKE" if fake_prob_pct > 50 else "REAL",
+        "fake_probability": fake_prob_pct,
+        "real_probability": round(100 - fake_prob_pct, 2),
+    }
+# -------------------------------------------------------
+# Prediction Endpoint
+# -------------------------------------------------------
+@app.post("/api/predict")
+async def predict(file: UploadFile = File(...)):
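+    # Drop any directory components from the client-supplied filename so the
+    # temporary file cannot be written outside the working directory.
+    temp_path = f"temp_{os.path.basename(file.filename or 'upload')}"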
+    try:
+        with open(temp_path, "wb") as f:
+            f.write(await file.read())
+        # Load at 16 kHz mono — identical to librosa.load call in dataset.py
+        wav_np, sr = librosa.load(temp_path, sr=16000, mono=True)
+        # -------------------------------------------------------
+        # Slice into overlapping windows of TARGET_LEN samples.
+        # Step = 50% overlap.  Very short clips produce a single window.
+        # -------------------------------------------------------
+        total_samples = len(wav_np)
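+        # Guarantee at least one window so an empty clip cannot cause a
+        # division by zero when the per-window results are averaged below.
+        starts = list(range(0, max(total_samples, 1), WINDOW_STEP))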
+        window_probs  = []   # per-window fake-probability dicts
+        window_labels = []   # x-axis: start of each window in seconds
+        for start in starts:
+            chunk = wav_np[start : start + TARGET_LEN]
+            wav_t, cqcc_t = preprocess_window(chunk)
+            probs = run_window(wav_t, cqcc_t)
+            window_probs.append(probs)
+            start_sec = round(start / sr, 2)
+            window_labels.append(start_sec)
+        # -------------------------------------------------------
+        # Overall prediction = mean fake probability across all windows
+        # -------------------------------------------------------
+        model_keys = ["wav2vec2", "aasist", "cqcc_baseline", "custom_hybrid"]
+        overall = {}
+        for key in model_keys:
+            mean_fake = round(
+                sum(w[key] for w in window_probs) / len(window_probs), 2
+            )
+            overall[key] = aggregate_prediction(mean_fake)
+        # -------------------------------------------------------
+        # Time-series data for the frontend chart
+        # -------------------------------------------------------
+        timeline = {
+            key: [w[key] for w in window_probs]
+            for key in model_keys
+        }
+        return JSONResponse({
+            "overall":       overall,         # {model: {prediction, fake_probability, real_probability}}
+            "timeline":      timeline,        # {model: [fake_prob_pct, ...]}  — one value per window
+            "window_labels": window_labels,   # [start_sec, ...]               — x-axis in seconds (starts at 0.0)
+        })
+    except Exception as e:
+        return JSONResponse({"error": str(e)}, status_code=500)
+    finally:
+        if os.path.exists(temp_path):
+            os.remove(temp_path)
+# -------------------------------------------------------
+# Serve frontend
+# -------------------------------------------------------
+frontend_dir = os.path.join(os.path.dirname(__file__), "..", "frontend")
+if os.path.exists(frontend_dir):
+    app.mount("/", StaticFiles(directory=frontend_dir, html=True), name="frontend")
+# -------------------------------------------------------
+# Run Server
+# -------------------------------------------------------
+if __name__ == "__main__":
+    import uvicorn
+    print("Starting server at http://127.0.0.1:8000")
+    uvicorn.run(app, host="127.0.0.1", port=8000)
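
Note on the windowed inference in `app.py` above: the slicing and averaging arithmetic can be checked in isolation. This is a minimal sketch that reuses the constants from the file; the helper names `window_starts` and `summarize` are hypothetical, and the model calls are replaced by whatever per-window fake percentages the caller passes in.

```python
TARGET_LEN = 64600              # samples per window, as in app.py
WINDOW_STEP = TARGET_LEN // 2   # 50% overlap between consecutive windows
SAMPLE_RATE = 16000


def window_starts(total_samples: int) -> list[int]:
    """Start offsets (in samples) of each analysis window."""
    return list(range(0, total_samples, WINDOW_STEP))


def summarize(fake_probs_pct: list[float]) -> dict:
    """Average per-window fake percentages and apply the 50% threshold."""
    mean_fake = round(sum(fake_probs_pct) / len(fake_probs_pct), 2)
    return {
        "prediction": "FAKE" if mean_fake > 50 else "REAL",
        "fake_probability": mean_fake,
        "real_probability": round(100 - mean_fake, 2),
    }


# A 10-second clip at 16 kHz yields five window starts; the last window is
# shorter than TARGET_LEN and would be zero-padded by preprocess_window.
starts = window_starts(10 * SAMPLE_RATE)
labels = [round(s / SAMPLE_RATE, 2) for s in starts]
```

Because the overall verdict is a plain mean over windows, a short spoofed segment inside a long genuine recording can be averaged away; the per-window timeline is what exposes it.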

backend/dataset.py ADDED

	@@ -0,0 +1,211 @@

+import os
+import hashlib
+import torch
+import torchaudio
+import numpy as np
+from torch.utils.data import Dataset
+import librosa
+from scipy.fftpack import dct
+def compute_cqcc(wav_np, n_bins, sample_rate=16000, hop_length=160, num_coeffs=20):
+    """Compute CQCC features from a mono waveform numpy array."""
+    try:
+        cqt = np.abs(
+            librosa.cqt(
+                wav_np,
+                sr=sample_rate,
+                n_bins=n_bins,
+                hop_length=hop_length,
+                fmin=librosa.note_to_hz('C1')
+            )
+        )
+        log_power = librosa.amplitude_to_db(cqt, ref=np.max)
+        cqcc = dct(log_power, type=2, axis=0, norm='ortho')[:num_coeffs]
+        return torch.from_numpy(cqcc).unsqueeze(0).float()
+    except Exception:
+        # Fallback for very short or invalid audio.
+        return torch.zeros((1, num_coeffs, 10), dtype=torch.float32)
+class AudioDataset(Dataset):
+    def __init__(self, data_dir=None, n_bins=60, augment=False, cqcc_cache_dir=None, target_lang=None):
+        if data_dir is None:
+            # Check if MLAAD-tiny exists, else fallback to 'data'
+            mlaad_dir = os.path.join(os.path.dirname(__file__), "..", "MLAAD-tiny")
+            if os.path.exists(mlaad_dir):
+                data_dir = mlaad_dir
+            else:
+                data_dir = os.path.join(os.path.dirname(__file__), "..", "data")
+        self.data_dir = data_dir
+        self.files = []
+        self.labels = []
+        self.n_bins = n_bins
+        self.augment = augment
+        self.cqcc_cache_dir = cqcc_cache_dir
+        self.target_lang = target_lang
+        real_path = os.path.join(data_dir, "original")
+        if not os.path.exists(real_path):
+            real_path = os.path.join(data_dir, "real")
+        fake_path = os.path.join(data_dir, "fake")
+        for root, dirs, files in os.walk(real_path):
+            dirs.sort()
+            files.sort()
+            for f in files:
+                if f.endswith('.wav') or f.endswith('.flac'):
+                    if self.target_lang:
+                        rel_root = os.path.relpath(root, real_path).replace('\\', '/')
+                        if not rel_root.startswith(self.target_lang):
+                            continue
+                    self.files.append(os.path.join(root, f))
+                    self.labels.append(0) # 0 = Real
+        for root, dirs, files in os.walk(fake_path):
+            dirs.sort()
+            files.sort()
+            for f in files:
+                if f.endswith('.wav') or f.endswith('.flac'):
+                    if self.target_lang:
+                        rel_root = os.path.relpath(root, fake_path).replace('\\', '/')
+                        if not rel_root.startswith(self.target_lang):
+                            continue
+                    self.files.append(os.path.join(root, f))
+                    self.labels.append(1) # 1 = Fake
+        if self.cqcc_cache_dir is not None:
+            os.makedirs(self.cqcc_cache_dir, exist_ok=True)
+    def __len__(self):
+        return len(self.files)
+    def _cqcc_cache_path(self, audio_path):
+        rel_path = os.path.relpath(audio_path, start=self.data_dir)
+        cache_key = hashlib.md5(audio_path.encode("utf-8")).hexdigest()
+        rel_stem = os.path.splitext(rel_path)[0]
+        safe_name = rel_stem.replace(os.sep, "__")
+        return os.path.join(self.cqcc_cache_dir, f"{safe_name}_{cache_key}.pt")
+    def _load_or_compute_cqcc(self, audio_path, wav_np, is_augmented=False):
+        if self.cqcc_cache_dir is None or is_augmented:
+            return compute_cqcc(wav_np, n_bins=self.n_bins)
+        cache_path = self._cqcc_cache_path(audio_path)
+        if os.path.exists(cache_path):
+            return torch.load(cache_path, map_location="cpu")
+        cqcc = compute_cqcc(wav_np, n_bins=self.n_bins)
+        torch.save(cqcc, cache_path)
+        return cqcc
+    def precompute_cqcc_cache(self, force=False):
+        """Materialize CQCC features to disk so training can reuse them."""
+        if self.cqcc_cache_dir is None:
+            raise ValueError("cqcc_cache_dir must be set to precompute CQCC features.")
+        try:
+            # tqdm is optional: import it only inside the try block so the
+            # plain-loop fallback below is actually reachable.
+            from tqdm.auto import tqdm
+            iterable_files = tqdm(self.files, desc="Precomputing CQCC Cache")
+        except ImportError:
+            iterable_files = self.files
+        total = len(self.files)
+        for idx, audio_path in enumerate(iterable_files):
+            cache_path = self._cqcc_cache_path(audio_path)
+            if not force and os.path.exists(cache_path):
+                continue
+            try:
+                wav_np, _ = librosa.load(audio_path, sr=16000, mono=True)
+                cqcc = compute_cqcc(wav_np, n_bins=self.n_bins)
+                torch.save(cqcc, cache_path)
+            except Exception as e:
+                print(f"Error precomputing CQCC for {audio_path}: {e}")
+            if (idx + 1) % 100 == 0 or idx + 1 == total:
+                print(f"Precomputed CQCC {idx + 1}/{total}")
+    def __getitem__(self, idx):
+        audio_path = self.files[idx]
+        wav_np, sr = librosa.load(audio_path, sr=16000, mono=True)
+        is_augmented = False
+        # Augmentation on raw audio (Data Augmentation for generalizability)
+        if self.augment and np.random.rand() < 0.3:
+            # Apply only ONE augmentation type per sample to avoid over-modification
+            aug_type = np.random.choice(['noise', 'speed', 'pitch'], p=[0.33, 0.33, 0.34])
+            if aug_type == 'noise':
+                # SNR-based noise addition (reverted to original robust method)
+                signal_power = np.mean(wav_np**2)
+                if signal_power > 1e-10:
+                    snr_db = np.random.uniform(10, 30)
+                    snr_linear = 10**(snr_db / 10)
+                    noise_power = signal_power / snr_linear
+                    noise = np.random.randn(len(wav_np)) * np.sqrt(noise_power)
+                    wav_np = wav_np + noise
+                is_augmented = True
+            elif aug_type == 'speed':
+                # Mild speed perturbation
+                speed_factor = np.random.uniform(0.95, 1.05)
+                wav_np = librosa.effects.time_stretch(wav_np, rate=speed_factor)
+                is_augmented = True
+            elif aug_type == 'pitch':
+                # Subtle pitch shift
+                n_steps = np.random.uniform(-1, 1)
+                wav_np = librosa.effects.pitch_shift(wav_np, sr=sr, n_steps=n_steps)
+                is_augmented = True
+        # Crop or pad to exactly 64600 samples (AASIST standard)
+        target_len = 64600
+        if len(wav_np) > target_len:
+            # Random crop when augmenting, center crop otherwise.
+            if self.augment:
+                # randint's upper bound is exclusive; +1 keeps the last valid start reachable.
+                start = np.random.randint(0, len(wav_np) - target_len + 1)
+            else:
+                start = (len(wav_np) - target_len) // 2
+            wav_np = wav_np[start:start+target_len]
+        elif len(wav_np) < target_len:
+            pad = target_len - len(wav_np)
+            wav_np = np.pad(wav_np, (0, pad), 'constant')
+        wav = torch.from_numpy(wav_np).unsqueeze(0).float()
+        cqcc = self._load_or_compute_cqcc(audio_path, wav_np, is_augmented=is_augmented)
+        return wav, cqcc, self.labels[idx]
+def collate_variable_length(batch):
+    wavs, cqccs, labels = zip(*batch)
+    labels = torch.tensor(labels)
+    # ---------- WAVE ----------
+    max_wav_len = max(w.shape[-1] for w in wavs)
+    wavs_padded = []
+    for w in wavs:
+        if w.shape[-1] < max_wav_len:
+            pad = max_wav_len - w.shape[-1]
+            w = torch.nn.functional.pad(w, (0, pad))
+        wavs_padded.append(w)
+    wavs = torch.stack(wavs_padded, dim=0)
+    # ---------- CQCC ----------
+    max_cqcc_len = max(c.shape[-1] for c in cqccs)
+    cqccs_padded = []
+    for c in cqccs:
+        if c.shape[-1] < max_cqcc_len:
+            pad = max_cqcc_len - c.shape[-1]
+            c = torch.nn.functional.pad(c, (0, pad))
+        cqccs_padded.append(c)
+    cqccs = torch.stack(cqccs_padded, dim=0)
+    return wavs, cqccs, labels
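
Note on the noise augmentation in `AudioDataset.__getitem__` above: the noise power is derived from the target SNR as `signal_power / 10**(snr_db / 10)`. The sketch below reproduces that arithmetic with the standard library only and measures the resulting SNR. The helper names and the 440 Hz test tone are illustrative assumptions, and the silent-signal guard (`signal_power > 1e-10`) from the source is omitted for brevity.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the output has the requested SNR."""
    signal_power = sum(x * x for x in signal) / len(signal)
    snr_linear = 10 ** (snr_db / 10)
    noise_power = signal_power / snr_linear
    sigma = math.sqrt(noise_power)  # standard deviation of the noise
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB, estimated from a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # seeded so the example is repeatable
# One second of a 440 Hz tone at 16 kHz stands in for real speech.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=rng)
snr = measured_snr_db(clean, noisy)
```

With this many samples the measured value lands close to the requested 20 dB, which is a quick way to confirm the scaling before training with the 10 to 30 dB range used in the dataset code.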

backend/models.py ADDED

	@@ -0,0 +1,1019 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import Wav2Vec2Model
+# ============================================================
+# 1. Wav2Vec2 Detector (Self-supervised Transformer Baseline)
+# ============================================================
+class AttentivePooling(nn.Module):
+    def __init__(self, dim):
+        super().__init__()
+        self.attn = nn.Sequential(
+            nn.Linear(dim, dim),
+            nn.Tanh(),
+            nn.Linear(dim, 1)
+        )
+    def forward(self, x):
+        w = torch.softmax(self.attn(x), dim=1)
+        return torch.sum(w * x, dim=1)
+class Wav2Vec2SpoofDetector(nn.Module):
+    def __init__(self, num_classes=2, model_name="facebook/wav2vec2-base"):
+        super().__init__()
+        self.wav2vec = Wav2Vec2Model.from_pretrained(model_name)
+        # Freeze the Wav2Vec2 backbone; only the pooling and classifier layers are trained
+        for param in self.wav2vec.parameters():
+            param.requires_grad = False
+        hidden = self.wav2vec.config.hidden_size
+        self.pool = AttentivePooling(hidden)
+        self.classifier = nn.Sequential(
+            nn.LayerNorm(hidden),
+            nn.Dropout(0.2),
+            nn.Linear(hidden, num_classes)
+        )
+    def forward(self, x):
+        if x.dim() == 3:
+            x = x.squeeze(1)
+        out = self.wav2vec(x).last_hidden_state
+        pooled = self.pool(out)
+        return self.classifier(pooled)
+# ============================================================
+# 2. AASIST (SOTA Graph-based Baseline)
+# ============================================================
+import random
+from typing import Union
+import numpy as np
+from torch import Tensor
+# Simplified GraphAttention/GraphBlock, kept because the custom model depends on them
+class GraphAttention(nn.Module):
+    def __init__(self, in_dim, out_dim):
+        super().__init__()
+        self.fc = nn.Linear(in_dim, out_dim)
+        self.attn = nn.Linear(out_dim * 2, 1)
+    def forward(self, x):
+        h = self.fc(x)
+        # The score for a pair (i, j) is attn([h_i ; h_j]) = W_1·h_i + W_2·h_j + b,
+        # so instead of materialising an O(N^2 * D) tensor of concatenated pairs
+        # we score every node once with W_1 and W_2 and combine them by broadcasting.
+        # Reported memory use drops from ~10GB to ~2MB at N=400.
+        W = self.attn.weight.squeeze()
+        D = h.shape[-1]
+        W_1 = W[:D]
+        W_2 = W[D:]
+        # Compute individual node scores: shape (B, N, 1)
+        score_i = torch.matmul(h, W_1).unsqueeze(-1)
+        score_j = torch.matmul(h, W_2).unsqueeze(-1)
+        # Broadcast (B, N, 1) + (B, 1, N) -> (B, N, N)
+        e = score_i + score_j.transpose(1, 2)
+        if self.attn.bias is not None:
+            e = e + self.attn.bias
+        alpha = F.softmax(e, dim=-1)
+        out = torch.matmul(alpha, h)
+        return out
+class GraphBlock(nn.Module):
+    def __init__(self, dim):
+        super().__init__()
+        self.gat = GraphAttention(dim, dim)
+        self.norm = nn.LayerNorm(dim)
+        self.dropout = nn.Dropout(0.2)
+    def forward(self, x):
+        res = x
+        x = self.gat(x)
+        x = self.dropout(x)
+        x = self.norm(x + res)
+        return x
+class GraphAttentionLayer(nn.Module):
+    def __init__(self, in_dim, out_dim, **kwargs):
+        super().__init__()
+        # attention map
+        self.att_proj = nn.Linear(in_dim, out_dim)
+        self.att_weight = self._init_new_params(out_dim, 1)
+        # project
+        self.proj_with_att = nn.Linear(in_dim, out_dim)
+        self.proj_without_att = nn.Linear(in_dim, out_dim)
+        # batch norm
+        self.bn = nn.BatchNorm1d(out_dim)
+        # dropout for inputs
+        self.input_drop = nn.Dropout(p=0.2)
+        # activate
+        self.act = nn.SELU(inplace=True)
+        # temperature
+        self.temp = 1.
+        if "temperature" in kwargs:
+            self.temp = kwargs["temperature"]
+    def forward(self, x):
+        '''
+        x   :(#bs, #node, #dim)
+        '''
+        # apply input dropout
+        x = self.input_drop(x)
+        # derive attention map
+        att_map = self._derive_att_map(x)
+        # projection
+        x = self._project(x, att_map)
+        # apply batch norm
+        x = self._apply_BN(x)
+        x = self.act(x)
+        return x
+    def _pairwise_mul_nodes(self, x):
+        '''
+        Calculates pairwise multiplication of nodes.
+        - for attention map
+        x           :(#bs, #node, #dim)
+        out_shape   :(#bs, #node, #node, #dim)
+        '''
+        nb_nodes = x.size(1)
+        x = x.unsqueeze(2).expand(-1, -1, nb_nodes, -1)
+        x_mirror = x.transpose(1, 2)
+        return x * x_mirror
+    def _derive_att_map(self, x):
+        '''
+        x           :(#bs, #node, #dim)
+        out_shape   :(#bs, #node, #node, 1)
+        '''
+        att_map = self._pairwise_mul_nodes(x)
+        # size: (#bs, #node, #node, #dim_out)
+        att_map = torch.tanh(self.att_proj(att_map))
+        # size: (#bs, #node, #node, 1)
+        att_map = torch.matmul(att_map, self.att_weight)
+        # apply temperature
+        att_map = att_map / self.temp
+        att_map = F.softmax(att_map, dim=-2)
+        return att_map
+    def _project(self, x, att_map):
+        x1 = self.proj_with_att(torch.matmul(att_map.squeeze(-1), x))
+        x2 = self.proj_without_att(x)
+        return x1 + x2
+    def _apply_BN(self, x):
+        org_size = x.size()
+        x = x.view(-1, org_size[-1])
+        x = self.bn(x)
+        x = x.view(org_size)
+        return x
+    def _init_new_params(self, *size):
+        out = nn.Parameter(torch.FloatTensor(*size))
+        nn.init.xavier_normal_(out)
+        return out
+class HtrgGraphAttentionLayer(nn.Module):
+    def __init__(self, in_dim, out_dim, **kwargs):
+        super().__init__()
+        self.proj_type1 = nn.Linear(in_dim, in_dim)
+        self.proj_type2 = nn.Linear(in_dim, in_dim)
+        # attention map
+        self.att_proj = nn.Linear(in_dim, out_dim)
+        self.att_projM = nn.Linear(in_dim, out_dim)
+        self.att_weight11 = self._init_new_params(out_dim, 1)
+        self.att_weight22 = self._init_new_params(out_dim, 1)
+        self.att_weight12 = self._init_new_params(out_dim, 1)
+        self.att_weightM = self._init_new_params(out_dim, 1)
+        # project
+        self.proj_with_att = nn.Linear(in_dim, out_dim)
+        self.proj_without_att = nn.Linear(in_dim, out_dim)
+        self.proj_with_attM = nn.Linear(in_dim, out_dim)
+        self.proj_without_attM = nn.Linear(in_dim, out_dim)
+        # batch norm
+        self.bn = nn.BatchNorm1d(out_dim)
+        # dropout for inputs
+        self.input_drop = nn.Dropout(p=0.2)
+        # activate
+        self.act = nn.SELU(inplace=True)
+        # temperature
+        self.temp = 1.
+        if "temperature" in kwargs:
+            self.temp = kwargs["temperature"]
+    def forward(self, x1, x2, master=None):
+        '''
+        x1  :(#bs, #node, #dim)
+        x2  :(#bs, #node, #dim)
+        '''
+        num_type1 = x1.size(1)
+        num_type2 = x2.size(1)
+        x1 = self.proj_type1(x1)
+        x2 = self.proj_type2(x2)
+        x = torch.cat([x1, x2], dim=1)
+        if master is None:
+            master = torch.mean(x, dim=1, keepdim=True)
+        # apply input dropout
+        x = self.input_drop(x)
+        # derive attention map
+        att_map = self._derive_att_map(x, num_type1, num_type2)
+        # directional edge for master node
+        master = self._update_master(x, master)
+        # projection
+        x = self._project(x, att_map)
+        # apply batch norm
+        x = self._apply_BN(x)
+        x = self.act(x)
+        x1 = x.narrow(1, 0, num_type1)
+        x2 = x.narrow(1, num_type1, num_type2)
+        return x1, x2, master
+    def _update_master(self, x, master):
+        att_map = self._derive_att_map_master(x, master)
+        master = self._project_master(x, master, att_map)
+        return master
+    def _pairwise_mul_nodes(self, x):
+        '''
+        Calculates pairwise multiplication of nodes.
+        - for attention map
+        x           :(#bs, #node, #dim)
+        out_shape   :(#bs, #node, #node, #dim)
+        '''
+        nb_nodes = x.size(1)
+        x = x.unsqueeze(2).expand(-1, -1, nb_nodes, -1)
+        x_mirror = x.transpose(1, 2)
+        return x * x_mirror
+    def _derive_att_map_master(self, x, master):
+        '''
+        x           :(#bs, #node, #dim)
+        out_shape   :(#bs, #node, 1)
+        '''
+        att_map = x * master
+        att_map = torch.tanh(self.att_projM(att_map))
+        att_map = torch.matmul(att_map, self.att_weightM)
+        # apply temperature
+        att_map = att_map / self.temp
+        att_map = F.softmax(att_map, dim=-2)
+        return att_map
+    def _derive_att_map(self, x, num_type1, num_type2):
+        '''
+        x           :(#bs, #node, #dim)
+        out_shape   :(#bs, #node, #node, 1)
+        '''
+        att_map = self._pairwise_mul_nodes(x)
+        # size: (#bs, #node, #node, #dim_out)
+        att_map = torch.tanh(self.att_proj(att_map))
+        # size: (#bs, #node, #node, 1)
+        att_board = torch.zeros_like(att_map[:, :, :, 0]).unsqueeze(-1)
+        att_board[:, :num_type1, :num_type1, :] = torch.matmul(
+            att_map[:, :num_type1, :num_type1, :], self.att_weight11)
+        att_board[:, num_type1:, num_type1:, :] = torch.matmul(
+            att_map[:, num_type1:, num_type1:, :], self.att_weight22)
+        att_board[:, :num_type1, num_type1:, :] = torch.matmul(
+            att_map[:, :num_type1, num_type1:, :], self.att_weight12)
+        att_board[:, num_type1:, :num_type1, :] = torch.matmul(
+            att_map[:, num_type1:, :num_type1, :], self.att_weight12)
+        att_map = att_board
+        # apply temperature
+        att_map = att_map / self.temp
+        att_map = F.softmax(att_map, dim=-2)
+        return att_map
+    def _project(self, x, att_map):
+        x1 = self.proj_with_att(torch.matmul(att_map.squeeze(-1), x))
+        x2 = self.proj_without_att(x)
+        return x1 + x2
+    def _project_master(self, x, master, att_map):
+        x1 = self.proj_with_attM(torch.matmul(
+            att_map.squeeze(-1).unsqueeze(1), x))
+        x2 = self.proj_without_attM(master)
+        return x1 + x2
+    def _apply_BN(self, x):
+        org_size = x.size()
+        x = x.view(-1, org_size[-1])
+        x = self.bn(x)
+        x = x.view(org_size)
+        return x
+    def _init_new_params(self, *size):
+        out = nn.Parameter(torch.FloatTensor(*size))
+        nn.init.xavier_normal_(out)
+        return out
+class GraphPool(nn.Module):
+    def __init__(self, k: float, in_dim: int, p: Union[float, int]):
+        super().__init__()
+        self.k = k
+        self.sigmoid = nn.Sigmoid()
+        self.proj = nn.Linear(in_dim, 1)
+        self.drop = nn.Dropout(p=p) if p > 0 else nn.Identity()
+        self.in_dim = in_dim
+    def forward(self, h):
+        Z = self.drop(h)
+        weights = self.proj(Z)
+        scores = self.sigmoid(weights)
+        new_h = self.top_k_graph(scores, h, self.k)
+        return new_h
+    def top_k_graph(self, scores, h, k):
+        _, n_nodes, n_feat = h.size()
+        n_nodes = max(int(n_nodes * k), 1)
+        _, idx = torch.topk(scores, n_nodes, dim=1)
+        idx = idx.expand(-1, -1, n_feat)
+        h = h * scores
+        h = torch.gather(h, 1, idx)
+        return h
+class CONV(nn.Module):
+    @staticmethod
+    def to_mel(hz):
+        return 2595 * np.log10(1 + hz / 700)
+    @staticmethod
+    def to_hz(mel):
+        return 700 * (10**(mel / 2595) - 1)
+    def __init__(self,
+                 out_channels,
+                 kernel_size,
+                 sample_rate=16000,
+                 in_channels=1,
+                 stride=1,
+                 padding=0,
+                 dilation=1,
+                 bias=False,
+                 groups=1,
+                 mask=False):
+        super().__init__()
+        if in_channels != 1:
+            msg = "SincConv only supports one input channel (here, in_channels = {%i})" % (in_channels)
+            raise ValueError(msg)
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.sample_rate = sample_rate
+        # Force the kernel size to be odd (i.e., a perfectly symmetric filter)
+        if kernel_size % 2 == 0:
+            self.kernel_size = self.kernel_size + 1
+        self.stride = stride
+        self.padding = padding
+        self.dilation = dilation
+        self.mask = mask
+        if bias:
+            raise ValueError('SincConv does not support bias.')
+        if groups > 1:
+            raise ValueError('SincConv does not support groups.')
+        NFFT = 512
+        f = int(self.sample_rate / 2) * np.linspace(0, 1, int(NFFT / 2) + 1)
+        fmel = self.to_mel(f)
+        fmelmax = np.max(fmel)
+        fmelmin = np.min(fmel)
+        filbandwidthsmel = np.linspace(fmelmin, fmelmax, self.out_channels + 1)
+        filbandwidthsf = self.to_hz(filbandwidthsmel)
+        self.mel = filbandwidthsf
+        self.hsupp = torch.arange(-(self.kernel_size - 1) / 2,
+                                  (self.kernel_size - 1) / 2 + 1)
+        self.band_pass = torch.zeros(self.out_channels, self.kernel_size)
+        for i in range(len(self.mel) - 1):
+            fmin = self.mel[i]
+            fmax = self.mel[i + 1]
+            hHigh = (2*fmax/self.sample_rate) * \
+                np.sinc(2*fmax*self.hsupp/self.sample_rate)
+            hLow = (2*fmin/self.sample_rate) * \
+                np.sinc(2*fmin*self.hsupp/self.sample_rate)
+            hideal = hHigh - hLow
+            self.band_pass[i, :] = Tensor(np.hamming(
+                self.kernel_size)) * Tensor(hideal)
+    def forward(self, x, mask=False):
+        band_pass_filter = self.band_pass.clone().to(x.device)
+        if mask:
+            # Frequency masking: zero out a random run of up to 19 adjacent filters.
+            A = int(np.random.uniform(0, 20))
+            A0 = random.randint(0, band_pass_filter.shape[0] - A)
+            band_pass_filter[A0:A0 + A, :] = 0
+        self.filters = (band_pass_filter).view(self.out_channels, 1,
+                                               self.kernel_size)
+        return F.conv1d(x,
+                        self.filters,
+                        stride=self.stride,
+                        padding=self.padding,
+                        dilation=self.dilation,
+                        bias=None,
+                        groups=1)
+class Residual_block(nn.Module):
+    def __init__(self, nb_filts, first=False):
+        super().__init__()
+        self.first = first
+        if not self.first:
+            self.bn1 = nn.BatchNorm2d(num_features=nb_filts[0])
+        self.conv1 = nn.Conv2d(in_channels=nb_filts[0],
+                               out_channels=nb_filts[1],
+                               kernel_size=(2, 3),
+                               padding=(1, 1),
+                               stride=1)
+        self.selu = nn.SELU(inplace=True)
+        self.bn2 = nn.BatchNorm2d(num_features=nb_filts[1])
+        self.conv2 = nn.Conv2d(in_channels=nb_filts[1],
+                               out_channels=nb_filts[1],
+                               kernel_size=(2, 3),
+                               padding=(0, 1),
+                               stride=1)
+        if nb_filts[0] != nb_filts[1]:
+            self.downsample = True
+            self.conv_downsample = nn.Conv2d(in_channels=nb_filts[0],
+                                             out_channels=nb_filts[1],
+                                             padding=(0, 1),
+                                             kernel_size=(1, 3),
+                                             stride=1)
+        else:
+            self.downsample = False
+        self.mp = nn.MaxPool2d((1, 3))
+    def forward(self, x):
+        identity = x
+        if not self.first:
+            out = self.bn1(x)
+            out = self.selu(out)
+        else:
+            out = x
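+        # NOTE: conv1 is applied to `x`, not `out`, so the bn1/SELU
+        # pre-activation above does not affect the result. Left as-is because
+        # changing it would alter the outputs of any checkpoint trained with
+        # this forward pass.
+        out = self.conv1(x)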
+        out = self.bn2(out)
+        out = self.selu(out)
+        out = self.conv2(out)
+        if self.downsample:
+            identity = self.conv_downsample(identity)
+        out += identity
+        out = self.mp(out)
+        return out
+class AASISTModel(nn.Module):
+    def __init__(self, d_args):
+        super().__init__()
+        self.d_args = d_args
+        filts = d_args["filts"]
+        gat_dims = d_args["gat_dims"]
+        pool_ratios = d_args["pool_ratios"]
+        temperatures = d_args["temperatures"]
+        self.conv_time = CONV(out_channels=filts[0],
+                              kernel_size=d_args["first_conv"],
+                              in_channels=1)
+        self.first_bn = nn.BatchNorm2d(num_features=1)
+        self.drop = nn.Dropout(0.5, inplace=True)
+        self.drop_way = nn.Dropout(0.2, inplace=True)
+        self.selu = nn.SELU(inplace=True)
+        self.encoder = nn.Sequential(
+            nn.Sequential(Residual_block(nb_filts=filts[1], first=True)),
+            nn.Sequential(Residual_block(nb_filts=filts[2])),
+            nn.Sequential(Residual_block(nb_filts=filts[3])),
+            nn.Sequential(Residual_block(nb_filts=filts[4])),
+            nn.Sequential(Residual_block(nb_filts=filts[4])),
+            nn.Sequential(Residual_block(nb_filts=filts[4])))
+        self.pos_S = nn.Parameter(torch.randn(1, 23, filts[-1][-1]))
+        self.master1 = nn.Parameter(torch.randn(1, 1, gat_dims[0]))
+        self.master2 = nn.Parameter(torch.randn(1, 1, gat_dims[0]))
+        self.GAT_layer_S = GraphAttentionLayer(filts[-1][-1],
+                                               gat_dims[0],
+                                               temperature=temperatures[0])
+        self.GAT_layer_T = GraphAttentionLayer(filts[-1][-1],
+                                               gat_dims[0],
+                                               temperature=temperatures[1])
+        self.HtrgGAT_layer_ST11 = HtrgGraphAttentionLayer(
+            gat_dims[0], gat_dims[1], temperature=temperatures[2])
+        self.HtrgGAT_layer_ST12 = HtrgGraphAttentionLayer(
+            gat_dims[1], gat_dims[1], temperature=temperatures[2])
+        self.HtrgGAT_layer_ST21 = HtrgGraphAttentionLayer(
+            gat_dims[0], gat_dims[1], temperature=temperatures[2])
+        self.HtrgGAT_layer_ST22 = HtrgGraphAttentionLayer(
+            gat_dims[1], gat_dims[1], temperature=temperatures[2])
+        self.pool_S = GraphPool(pool_ratios[0], gat_dims[0], 0.3)
+        self.pool_T = GraphPool(pool_ratios[1], gat_dims[0], 0.3)
+        self.pool_hS1 = GraphPool(pool_ratios[2], gat_dims[1], 0.3)
+        self.pool_hT1 = GraphPool(pool_ratios[2], gat_dims[1], 0.3)
+        self.pool_hS2 = GraphPool(pool_ratios[2], gat_dims[1], 0.3)
+        self.pool_hT2 = GraphPool(pool_ratios[2], gat_dims[1], 0.3)
+        self.out_layer = nn.Linear(5 * gat_dims[1], 2)
+    def forward(self, x, Freq_aug=False):
+        x = x.unsqueeze(1)
+        x = self.conv_time(x, mask=Freq_aug)
+        x = x.unsqueeze(dim=1)
+        x = F.max_pool2d(torch.abs(x), (3, 3))
+        x = self.first_bn(x)
+        x = self.selu(x)
+        e = self.encoder(x)
+        e_S, _ = torch.max(torch.abs(e), dim=3)
+        e_S = e_S.transpose(1, 2) + self.pos_S
+        gat_S = self.GAT_layer_S(e_S)
+        out_S = self.pool_S(gat_S)
+        e_T, _ = torch.max(torch.abs(e), dim=2)
+        e_T = e_T.transpose(1, 2)
+        gat_T = self.GAT_layer_T(e_T)
+        out_T = self.pool_T(gat_T)
+        master1 = self.master1.expand(x.size(0), -1, -1)
+        master2 = self.master2.expand(x.size(0), -1, -1)
+        out_T1, out_S1, master1 = self.HtrgGAT_layer_ST11(
+            out_T, out_S, master=self.master1)
+        out_S1 = self.pool_hS1(out_S1)
+        out_T1 = self.pool_hT1(out_T1)
+        out_T_aug, out_S_aug, master_aug = self.HtrgGAT_layer_ST12(
+            out_T1, out_S1, master=master1)
+        out_T1 = out_T1 + out_T_aug
+        out_S1 = out_S1 + out_S_aug
+        master1 = master1 + master_aug
+        out_T2, out_S2, master2 = self.HtrgGAT_layer_ST21(
+            out_T, out_S, master=self.master2)
+        out_S2 = self.pool_hS2(out_S2)
+        out_T2 = self.pool_hT2(out_T2)
+        out_T_aug, out_S_aug, master_aug = self.HtrgGAT_layer_ST22(
+            out_T2, out_S2, master=master2)
+        out_T2 = out_T2 + out_T_aug
+        out_S2 = out_S2 + out_S_aug
+        master2 = master2 + master_aug
+        out_T1 = self.drop_way(out_T1)
+        out_T2 = self.drop_way(out_T2)
+        out_S1 = self.drop_way(out_S1)
+        out_S2 = self.drop_way(out_S2)
+        master1 = self.drop_way(master1)
+        master2 = self.drop_way(master2)
+        out_T = torch.max(out_T1, out_T2)
+        out_S = torch.max(out_S1, out_S2)
+        master = torch.max(master1, master2)
+        T_max, _ = torch.max(torch.abs(out_T), dim=1)
+        T_avg = torch.mean(out_T, dim=1)
+        S_max, _ = torch.max(torch.abs(out_S), dim=1)
+        S_avg = torch.mean(out_S, dim=1)
+        last_hidden = torch.cat(
+            [T_max, T_avg, S_max, S_avg, master.squeeze(1)], dim=1)
+        last_hidden = self.drop(last_hidden)
+        output = self.out_layer(last_hidden)
+        return last_hidden, output
+class AASISTDetector(nn.Module):
+    def __init__(self, num_classes=2):
+        super().__init__()
+        d_args = {
+            "nb_samp": 64600,
+            "first_conv": 128,
+            "in_channels": 1,
+            "filts": [70, [1, 32], [32, 32], [32, 64], [64, 64]],
+            "gat_dims": [64, 32],
+            "pool_ratios": [0.5, 0.7, 0.5, 0.5],
+            "temperatures": [2.0, 2.0, 100.0]
+        }
+        self.model = AASISTModel(d_args)
+        # Override out_layer if not strictly 2 classes.
+        if num_classes != 2:
+            self.model.out_layer = nn.Linear(5 * d_args["gat_dims"][1], num_classes)
+    def forward(self, x):
+        # x is (B, 1, T) or (B, T)
+        if x.dim() == 3:
+            x = x.squeeze(1) # Convert to (B, T)
+        _, out = self.model(x)
+        return out
+# ============================================================
+# 3. CQCC Baseline Detector (Acoustic Feature Baseline)
+# ============================================================
+class CQCCBaselineDetector(nn.Module):
+    def __init__(self, num_classes=2):
+        super().__init__()
+        # Input shape expected: (B, 1, 20, T)
+        self.features = nn.Sequential(
+            nn.Conv2d(1, 16, 3, padding=1),
+            nn.BatchNorm2d(16),
+            nn.ReLU(),
+            nn.MaxPool2d(2),
+            nn.Conv2d(16, 32, 3, padding=1),
+            nn.BatchNorm2d(32),
+            nn.ReLU(),
+            nn.MaxPool2d(2),
+            nn.Conv2d(32, 64, 3, padding=1),
+            nn.BatchNorm2d(64),
+            nn.ReLU(),
+            nn.AdaptiveAvgPool2d(1)
+        )
+        self.classifier = nn.Sequential(
+            nn.Dropout(0.3),
+            nn.Linear(64, num_classes)
+        )
+    def forward(self, x):
+        x = self.features(x)
+        x = x.flatten(1)
+        return self.classifier(x)
+# ============================================================
+# 4. Custom Wav2Vec2 + CQCC Fusion with Cross-Attention + Graph
+# ============================================================
+class PositionalEncoding(nn.Module):
+    def __init__(self, dim, max_len=6000):
+        super().__init__()
+        self.pos_embed = nn.Parameter(torch.randn(1, max_len, dim))
+    def forward(self, x):
+        return x + self.pos_embed[:, :x.size(1)]
+class BidirectionalCrossAttention(nn.Module):
+    def __init__(self, dim, num_heads=4):
+        super().__init__()
+        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True, dropout=0.2)
+        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True, dropout=0.2)
+        self.norm_q = nn.LayerNorm(dim)
+        self.norm_kv = nn.LayerNorm(dim)
+    def forward(self, x1, x2):
+        # x1 attends to x2
+        q1 = self.norm_q(x1)
+        k2 = self.norm_kv(x2)
+        v2 = k2
+        out1, _ = self.attn1(q1, k2, v2)
+        # x2 attends to x1
+        q2 = self.norm_q(x2)
+        k1 = self.norm_kv(x1)
+        v1 = k1
+        out2, _ = self.attn2(q2, k1, v1)
+        return out1, out2
+def align_sequences(x, target_len):
+    """Linearly interpolate x of shape (B, T, D) along time to (B, target_len, D)."""
+    x = x.transpose(1, 2)
+    x = F.interpolate(x, size=target_len, mode='linear', align_corners=False)
+    return x.transpose(1, 2)
+class ImprovedWav2Vec2CQCCDetector(nn.Module):
+    def __init__(self, num_classes=2):
+        super().__init__()
+        # Wav2Vec2
+        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
+        # Freeze all Wav2Vec2 parameters so it acts purely as a feature extractor
+        for param in self.wav2vec.parameters():
+            param.requires_grad = False
+        dim = self.wav2vec.config.hidden_size
+        # CQCC encoder
+        self.cqcc_conv = nn.Sequential(
+            nn.Conv1d(20, 128, kernel_size=3, padding=1),
+            nn.BatchNorm1d(128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Conv1d(128, dim, kernel_size=3, padding=1),
+            nn.BatchNorm1d(dim),
+            nn.GELU()
+        )
+        # Positional Encoding
+        self.pos_enc = PositionalEncoding(dim)
+        # Bidirectional Cross Attention
+        self.cross_attn = BidirectionalCrossAttention(dim)
+        # True Graph Transformer Backend (using GAT blocks from AASIST)
+        self.graph_layers = nn.ModuleList([
+            GraphBlock(dim) for _ in range(3)
+        ])
+        # Classifier
+        self.classifier = nn.Sequential(
+            nn.Linear(dim, 128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Linear(128, num_classes)
+        )
+    def forward(self, wav, cqcc):
+        if wav.dim() == 3:
+            wav = wav.squeeze(1)
+        # Wav2Vec2 features
+        w2v = self.wav2vec(wav).last_hidden_state  # (B, T_w, D)
+        # CQCC features
+        if cqcc.dim() == 4:
+            cqcc = cqcc.squeeze(1)
+        cqcc_feat = self.cqcc_conv(cqcc).transpose(1, 2)  # (B, T_c, D)
+        # Align lengths
+        cqcc_feat = align_sequences(cqcc_feat, w2v.size(1))
+        # Add positional encoding
+        w2v = self.pos_enc(w2v)
+        cqcc_feat = self.pos_enc(cqcc_feat)
+        # Cross attention (bidirectional)
+        f1, f2 = self.cross_attn(cqcc_feat, w2v)
+        fused = f1 + f2
+        # Graph Transformer processing on node sequences
+        x = fused
+        for layer in self.graph_layers:
+            x = layer(x)
+        # Global average pooling on the nodes
+        pooled = x.mean(dim=1)
+        return self.classifier(pooled)
+# ============================================================
+# 5. Ablation Models
+# ============================================================
+class AblationWav2Vec2GraphDetector(nn.Module):
+    """Ablation 1: Wav2Vec2 only + Graph Backend (No CQCC, No Cross-Attention)"""
+    def __init__(self, num_classes=2):
+        super().__init__()
+        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
+        for param in self.wav2vec.parameters():
+            param.requires_grad = False
+        dim = self.wav2vec.config.hidden_size
+        self.pos_enc = PositionalEncoding(dim)
+        self.graph_layers = nn.ModuleList([GraphBlock(dim) for _ in range(3)])
+        self.classifier = nn.Sequential(
+            nn.Linear(dim, 128), nn.GELU(), nn.Dropout(0.2), nn.Linear(128, num_classes)
+        )
+    def forward(self, wav, cqcc=None): # Accept both but ignore CQCC
+        if wav.dim() == 3:
+            wav = wav.squeeze(1)
+        w2v = self.wav2vec(wav).last_hidden_state
+        w2v = self.pos_enc(w2v)
+        x = w2v
+        for layer in self.graph_layers:
+            x = layer(x)
+        pooled = x.mean(dim=1)
+        return self.classifier(pooled)
+class AblationCQCCGraphDetector(nn.Module):
+    """Ablation 2: CQCC only + Graph Backend (No Wav2Vec2, No Cross-Attention)"""
+    def __init__(self, num_classes=2):
+        super().__init__()
+        dim = 768 # Match Wav2Vec2 hidden size for fair comparison
+        self.cqcc_conv = nn.Sequential(
+            nn.Conv1d(20, 128, kernel_size=3, padding=1),
+            nn.BatchNorm1d(128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Conv1d(128, dim, kernel_size=3, padding=1),
+            nn.BatchNorm1d(dim),
+            nn.GELU()
+        )
+        self.pos_enc = PositionalEncoding(dim)
+        self.graph_layers = nn.ModuleList([GraphBlock(dim) for _ in range(3)])
+        self.classifier = nn.Sequential(
+            nn.Linear(dim, 128), nn.GELU(), nn.Dropout(0.2), nn.Linear(128, num_classes)
+        )
+    def forward(self, cqcc):
+        if cqcc.dim() == 4:
+            cqcc = cqcc.squeeze(1)
+        cqcc_feat = self.cqcc_conv(cqcc).transpose(1, 2)
+        cqcc_feat = self.pos_enc(cqcc_feat)
+        x = cqcc_feat
+        for layer in self.graph_layers:
+            x = layer(x)
+        pooled = x.mean(dim=1)
+        return self.classifier(pooled)
+class AblationConcatGraphDetector(nn.Module):
+    """Ablation 3: Wav2Vec2 + CQCC + Simple Concat Fusion + Graph Backend (No Cross-Attention)"""
+    def __init__(self, num_classes=2):
+        super().__init__()
+        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
+        for param in self.wav2vec.parameters():
+            param.requires_grad = False
+        dim = self.wav2vec.config.hidden_size
+        self.cqcc_conv = nn.Sequential(
+            nn.Conv1d(20, 128, kernel_size=3, padding=1),
+            nn.BatchNorm1d(128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Conv1d(128, dim, kernel_size=3, padding=1),
+            nn.BatchNorm1d(dim),
+            nn.GELU()
+        )
+        self.fusion_proj = nn.Linear(dim * 2, dim) # Project concatenated features back to dim
+        self.pos_enc = PositionalEncoding(dim)
+        self.graph_layers = nn.ModuleList([GraphBlock(dim) for _ in range(3)])
+        self.classifier = nn.Sequential(
+            nn.Linear(dim, 128), nn.GELU(), nn.Dropout(0.2), nn.Linear(128, num_classes)
+        )
+    def forward(self, wav, cqcc):
+        if wav.dim() == 3:
+            wav = wav.squeeze(1)
+        w2v = self.wav2vec(wav).last_hidden_state
+        if cqcc.dim() == 4:
+            cqcc = cqcc.squeeze(1)
+        cqcc_feat = self.cqcc_conv(cqcc).transpose(1, 2)
+        cqcc_feat = align_sequences(cqcc_feat, w2v.size(1))
+        # Simple concat over feature dimension instead of cross-attention
+        fused = torch.cat([w2v, cqcc_feat], dim=-1)
+        fused = self.fusion_proj(fused)
+        fused = self.pos_enc(fused)
+        x = fused
+        for layer in self.graph_layers:
+            x = layer(x)
+        pooled = x.mean(dim=1)
+        return self.classifier(pooled)
+class AblationCrossAttnLinearDetector(nn.Module):
+    """Ablation 4: Wav2Vec2 + CQCC + Cross-Attention + Linear Backend (No Graph Transformer)"""
+    def __init__(self, num_classes=2):
+        super().__init__()
+        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
+        for param in self.wav2vec.parameters():
+            param.requires_grad = False
+        dim = self.wav2vec.config.hidden_size
+        self.cqcc_conv = nn.Sequential(
+            nn.Conv1d(20, 128, kernel_size=3, padding=1),
+            nn.BatchNorm1d(128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Conv1d(128, dim, kernel_size=3, padding=1),
+            nn.BatchNorm1d(dim),
+            nn.GELU()
+        )
+        self.pos_enc = PositionalEncoding(dim)
+        self.cross_attn = BidirectionalCrossAttention(dim)
+        # Richer MLP classifier since graph is missing
+        self.classifier = nn.Sequential(
+            nn.Linear(dim, 256),
+            nn.GELU(),
+            nn.Dropout(0.3),
+            nn.Linear(256, 128),
+            nn.GELU(),
+            nn.Dropout(0.2),
+            nn.Linear(128, num_classes)
+        )
+    def forward(self, wav, cqcc):
+        if wav.dim() == 3:
+            wav = wav.squeeze(1)
+        w2v = self.wav2vec(wav).last_hidden_state
+        if cqcc.dim() == 4:
+            cqcc = cqcc.squeeze(1)
+        cqcc_feat = self.cqcc_conv(cqcc).transpose(1, 2)
+        cqcc_feat = align_sequences(cqcc_feat, w2v.size(1))
+        w2v = self.pos_enc(w2v)
+        cqcc_feat = self.pos_enc(cqcc_feat)
+        f1, f2 = self.cross_attn(cqcc_feat, w2v)
+        fused = f1 + f2
+        # No graph layer, straight to global average pooling
+        pooled = fused.mean(dim=1)
+        return self.classifier(pooled)
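
As a rough illustration of how the `CONV` front-end in backend/models.py places its band-pass filters, the sketch below recomputes the band edges in plain Python. It assumes the 16 kHz default sample rate and the 70 filters configured in `AASISTDetector`; `mel_band_edges` is a hypothetical helper for illustration, not part of the repository.

```python
import math
from typing import List


def to_mel(hz: float) -> float:
    """Hz -> mel, same formula as CONV.to_mel."""
    return 2595 * math.log10(1 + hz / 700)


def to_hz(mel: float) -> float:
    """mel -> Hz, same formula as CONV.to_hz."""
    return 700 * (10 ** (mel / 2595) - 1)


def mel_band_edges(n_filters: int, sample_rate: int = 16000) -> List[float]:
    """Hypothetical helper: n_filters + 1 edges in Hz, evenly spaced in mel."""
    mel_max = to_mel(sample_rate / 2)  # Nyquist frequency on the mel scale
    step = mel_max / n_filters         # to_mel(0) is 0, so the range starts at 0
    return [to_hz(i * step) for i in range(n_filters + 1)]


edges = mel_band_edges(n_filters=70)
print(f"{len(edges)} edges, first band {edges[0]:.1f}-{edges[1]:.1f} Hz, "
      f"last band {edges[-2]:.1f}-{edges[-1]:.1f} Hz")
```

Each adjacent pair of edges plays the role of `fmin`/`fmax` for one filter, so the low-frequency filters come out narrow and the high-frequency ones wide.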

backend/preprocess.py ADDED

	@@ -0,0 +1,58 @@

+import subprocess
+import sys
+import os
+import argparse
+from dataset import AudioDataset
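+def run_command(cmd):
+    """Run a command; on failure, print its captured stderr and exit."""
+    try:
+        subprocess.run(cmd, check=True, text=True, capture_output=True)
+    except subprocess.CalledProcessError as e:
+        # Output is captured, so surface it; otherwise the failure is silent.
+        print(f"Command failed: {' '.join(cmd)}\n{e.stderr}", file=sys.stderr)
+        sys.exit(1)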
+def download_dataset():
+    run_command(["git", "lfs", "install"])
+    dataset_dir = "MLAAD-tiny"
+    if not os.path.exists(dataset_dir):
+        print("=== Cloning MLAAD-tiny dataset ===")
+        run_command(["git", "clone", "https://huggingface.co/datasets/mueller91/MLAAD-tiny"])
+    else:
+        print(f"Dataset directory '{dataset_dir}' already exists. Skipping clone.")
+def precompute_cqcc(data_dir, cqcc_cache_dir, force=False):
+    for lang in ["en", "de"]:
+        print(f"\n--- Precomputing CQCC for language: {lang} ---")
+        dataset = AudioDataset(
+            data_dir=data_dir,
+            augment=False,
+            cqcc_cache_dir=cqcc_cache_dir,
+            target_lang=lang
+        )
+        dataset.precompute_cqcc_cache(force=force)
+    print("\nFinished all CQCC preprocessing.")
+def parse_args():
+    parser = argparse.ArgumentParser(description="Download dataset and precompute CQCC features.")
+    parser.add_argument("--data-dir", default="MLAAD-tiny")
+    parser.add_argument(
+        "--cqcc-cache-dir",
+        default=os.path.join(os.path.dirname(__file__), "precomputed_features", "cqcc")
+    )
+    parser.add_argument("--force", action="store_true")
+    parser.add_argument("--skip-download", action="store_true")
+    parser.add_argument("--skip-cqcc", action="store_true")
+    return parser.parse_args()
+if __name__ == "__main__":
+    args = parse_args()
+    if not args.skip_download:
+        download_dataset()
+    if not args.skip_cqcc:
+        precompute_cqcc(args.data_dir, args.cqcc_cache_dir, args.force)
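
To make the interaction of the command-line flags in backend/preprocess.py easier to see, here is a minimal sketch that rebuilds the same flags and reports which steps would run. `build_parser` and `planned_steps` are hypothetical names introduced for illustration, and the cache-directory default is simplified to a relative path; the real script derives it from `__file__`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the flags defined in parse_args()."""
    parser = argparse.ArgumentParser(
        description="Download dataset and precompute CQCC features.")
    parser.add_argument("--data-dir", default="MLAAD-tiny")
    # Assumption: simplified default; the script builds this path from __file__.
    parser.add_argument("--cqcc-cache-dir", default="precomputed_features/cqcc")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--skip-download", action="store_true")
    parser.add_argument("--skip-cqcc", action="store_true")
    return parser


def planned_steps(argv: Optional[List[str]] = None) -> List[str]:
    """Return the steps the __main__ block would run for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = []
    if not args.skip_download:
        steps.append("download")
    if not args.skip_cqcc:
        steps.append("cqcc (force)" if args.force else "cqcc")
    return steps


print(planned_steps([]))
print(planned_steps(["--skip-download", "--force"]))
```

With no flags both steps run; `--skip-download --force` recomputes the CQCC cache without cloning the dataset again.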

backend/train.py ADDED

	@@ -0,0 +1,514 @@

+import os
+import argparse
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+from dataset import AudioDataset, collate_variable_length
+from models import (
+    AASISTDetector,
+    Wav2Vec2SpoofDetector,
+    CQCCBaselineDetector,
+    ImprovedWav2Vec2CQCCDetector,
+    AblationWav2Vec2GraphDetector,
+    AblationCQCCGraphDetector,
+    AblationConcatGraphDetector,
+    AblationCrossAttnLinearDetector
+)
+from sklearn.metrics import roc_curve, auc
+import numpy as np
+import random
+from tqdm import tqdm
+def train_model(model, train_dataloader, criterion, optimizer, epochs=5, input_type='wav', device=None, val_dataloader=None, eval_interval=1, patience=2, model_save_path=None):
+    if device is None:
+        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    model.to(device)
+    loss_history = []
+    best_val_metric = float('inf') # For min_dcf, lower is better
+    patience_counter = 0
+    best_epoch = 0
+    for epoch in range(epochs):
+        model.train()
+        epoch_loss = 0
+        correct = 0
+        total = 0
+        # Wrap the dataloader with tqdm for a progress bar
+        for batch_idx, batch in enumerate(tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{epochs} - Training")):
+            wavs, cqccs, labels = batch
+            wavs = wavs.to(device)
+            cqccs = cqccs.to(device)
+            labels = labels.to(device)
+            optimizer.zero_grad()
+            if input_type == 'wav':
+                outputs = model(wavs)
+            elif input_type == 'cqcc':
+                outputs = model(cqccs)
+            elif input_type == 'wav_and_cqcc':
+                outputs = model(wavs, cqccs)
+            else:
+                raise ValueError("invalid input_type")
+            loss = criterion(outputs, labels)
+            loss.backward()
+            optimizer.step()
+            epoch_loss += loss.item()
+            _, predicted = torch.max(outputs.data, 1)
+            total += labels.size(0)
+            correct += (predicted == labels).sum().item()
+            # Print intermediate progress within the epoch
+            if batch_idx % 500 == 0 and batch_idx > 0: # Report every 500 batches
+                current_acc = 100 * correct / total
+                current_loss = epoch_loss / (batch_idx + 1)
+                print(f"  Batch {batch_idx}/{len(train_dataloader)} | Loss: {current_loss:.4f} | Acc: {current_acc:.2f}%")
+        acc = 100 * correct / total if total > 0 else 0
+        avg_loss = epoch_loss / len(train_dataloader)
+        loss_history.append(avg_loss)
+        print(f"Epoch {epoch+1}/{epochs} | Training Loss: {avg_loss:.4f} | Training Acc: {acc:.2f}%")
+        # Validation and Early Stopping
+        if val_dataloader is not None and (epoch + 1) % eval_interval == 0:
+            print(f"Epoch {epoch+1}/{epochs} - Evaluating on Validation Set...")
+            _, _, _, val_eer, val_min_dcf, val_accuracy = evaluate_model(
+                model, val_dataloader, input_type=input_type, device=device
+            )
+            print(f"  Validation | EER={val_eer*100:.2f}% | minDCF={val_min_dcf:.4f} | Accuracy={val_accuracy:.2f}")
+            if val_min_dcf < best_val_metric:
+                best_val_metric = val_min_dcf
+                patience_counter = 0
+                best_epoch = epoch + 1
+                if model_save_path:
+                    torch.save(model.state_dict(), model_save_path)
+                    print(f"  Saved best model to {model_save_path} (minDCF: {best_val_metric:.4f})")
+            else:
+                patience_counter += 1
+                print(f"  Validation minDCF did not improve. Patience: {patience_counter}/{patience}")
+                if patience_counter >= patience:
+                    print(f"Early stopping triggered after {epoch+1} epochs. Best minDCF: {best_val_metric:.4f} at epoch {best_epoch}")
+                    if model_save_path:
+                        print(f"Loading best model from {model_save_path}")
+                        model.load_state_dict(torch.load(model_save_path, map_location=device))
+                    return loss_history # Stop training
+    # Without a validation set no "best" checkpoint is saved during training, so save the final weights.
+    if val_dataloader is None and model_save_path is not None:
+        torch.save(model.state_dict(), model_save_path)
+        print(f"  Saved final model to {model_save_path}")
+    return loss_history
+def evaluate_model(model, dataloader, input_type='wav', device=None):
+    if device is None:
+        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    model.eval()
+    all_labels = []
+    all_probs = []
+    with torch.no_grad():
+        for batch in tqdm(dataloader, desc="Evaluating"):
+            wavs, cqccs, labels = batch
+            wavs = wavs.to(device)
+            cqccs = cqccs.to(device)
+            labels = labels.to(device)
+            if input_type == 'wav':
+                outputs = model(wavs)
+            elif input_type == 'cqcc':
+                outputs = model(cqccs)
+            elif input_type == 'wav_and_cqcc':
+                outputs = model(wavs, cqccs)
+            else:
+                raise ValueError(f"invalid input_type: {input_type!r}")
+            probs = torch.softmax(outputs, dim=1)[:, 1]
+            all_labels.extend(labels.tolist())
+            all_probs.extend(probs.tolist())
+    fpr, tpr, thresholds = roc_curve(all_labels, all_probs)
+    roc_auc = auc(fpr, tpr)
+    # ------------------
+    # EER (Equal Error Rate)
+    # ------------------
+    fnr = 1 - tpr
+    eer_index = np.nanargmin(np.absolute(fnr - fpr))
+    eer = fpr[eer_index]
+    # ------------------
+    # minDCF (Minimum Detection Cost Function)
+    # Parameters according to ASVspoof 5 Evaluation Plan (Track 1)
+    # ------------------
+    P_spoof = 0.05      # Prior probability of a spoofing attack (\pi_{spf})
+    P_bonafide = 0.95   # Prior probability of a real/bonafide utterance (1 - \pi_{spf})
+    C_miss = 1          # Cost of falsely rejecting a real voice (Miss)
+    C_fa = 10           # Cost of falsely accepting a spoof (False Alarm)
+    # In the dataset, 0 = real (bonafide), 1 = fake (spoof)
+    # fpr (False Positive Rate) = predicted fake (1) when true is real (0). This is a "miss" in ASVspoof.
+    # fnr (False Negative Rate) = predicted real (0) when true is fake (1). This is a "false alarm" in ASVspoof.
+    P_miss = fpr
+    P_fa = fnr
+    # Raw DCF = C_miss * P_bonafide * P_miss + C_fa * P_spoof * P_fa
+    # Normalized by the default DCF (min cost of predicting all bonafide vs all spoof)
+    dcf_default = min(C_miss * P_bonafide, C_fa * P_spoof)
+    dcf_array = (C_miss * P_bonafide * P_miss + C_fa * P_spoof * P_fa) / dcf_default
+    min_dcf = np.min(dcf_array)
+    # Overall Accuracy (using 0.5 threshold)
+    preds = [1 if p > 0.5 else 0 for p in all_probs]
+    correct = sum(1 for p, l in zip(preds, all_labels) if p == l)
+    accuracy = correct / len(all_labels) if len(all_labels) > 0 else 0
+    return fpr, tpr, roc_auc, eer, min_dcf, accuracy
+def parse_args():
+    parser = argparse.ArgumentParser(description="Train spoof-detection models with optional CQCC caching.")
+    parser.add_argument(
+        "--data-dir",
+        default=None,
+        help="Path to dataset root containing original/ and fake/ folders."
+    )
+    parser.add_argument(
+        "--cqcc-cache-dir", # this is where cqcc is stored
+        default=os.path.join(os.path.dirname(__file__), "precomputed_features", "cqcc"),
+        help="Directory used to store and reuse precomputed CQCC tensors."
+    )
+    parser.add_argument(
+        "--precompute-cqcc-only",
+        action="store_true",
+        help="Only build the CQCC cache and exit without training."
+    )
+    parser.add_argument(
+        "--val-split",
+        type=float,
+        default=0.2,
+        help="Fraction of English training data to reserve for validation (clamped to the range 0.0-0.5)."
+    )
+    parser.add_argument(
+        "--force-rebuild-cqcc",
+        action="store_true",
+        help="Recompute cached CQCC files even if they already exist."
+    )
+    parser.add_argument(
+        "--smoke-test",
+        action="store_true",
+        help="Load one batch, run a forward pass through each model, and exit without training."
+    )
+    return parser.parse_args()
+def run_smoke_test(dataloader, device):
+    print("\n--- Running Smoke Test ---")
+    batch = next(iter(dataloader))
+    wavs, cqccs, labels = batch
+    models_to_test = [
+        ("Wav2Vec2 Baseline", Wav2Vec2SpoofDetector(num_classes=2).to(device), "wav"),
+        ("AASIST Baseline", AASISTDetector(num_classes=2).to(device), "wav"),
+        ("CQCC Baseline", CQCCBaselineDetector(num_classes=2).to(device), "cqcc"),
+        ("Custom Fusion Model", ImprovedWav2Vec2CQCCDetector(num_classes=2).to(device), "wav_and_cqcc"),
+        ("Ablation W2V2+Graph", AblationWav2Vec2GraphDetector(num_classes=2).to(device), "wav"),
+        ("Ablation CQCC+Graph", AblationCQCCGraphDetector(num_classes=2).to(device), "cqcc"),
+        ("Ablation Concat+Graph", AblationConcatGraphDetector(num_classes=2).to(device), "wav_and_cqcc"),
+        ("Ablation CrossAttn+Linear", AblationCrossAttnLinearDetector(num_classes=2).to(device), "wav_and_cqcc"),
+    ]
+    with torch.no_grad():
+        for name, model, input_type in models_to_test:
+            model.eval()
+            if input_type == "wav":
+                outputs = model(wavs.to(device))
+            elif input_type == "cqcc":
+                outputs = model(cqccs.to(device))
+            elif input_type == "wav_and_cqcc":
+                outputs = model(wavs.to(device), cqccs.to(device))
+            else:
+                raise ValueError(f"invalid input_type: {input_type!r}")
+            print(f"{name}: input OK, output shape = {tuple(outputs.shape)}")
+    print(f"Labels shape = {tuple(labels.shape)}")
+    print("Smoke test complete. Cached CQCC loading and model forward passes succeeded.")
+def main():
+    args = parse_args()
+    print(args)
+    SEED = 42
+    random.seed(SEED)
+    np.random.seed(SEED)
+    torch.manual_seed(SEED)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(SEED)
+    g = torch.Generator()
+    g.manual_seed(SEED)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    print(f"Using device: {device}")
+    print("Loading English Dataset for training/validation...")
+    full_en_dataset = AudioDataset(data_dir=args.data_dir, augment=False, cqcc_cache_dir=args.cqcc_cache_dir, target_lang="en")
+    total_en = len(full_en_dataset)
+    if total_en == 0:
+        raise ValueError("No English data found for target_lang='en'. Check data_dir and directory layout.")
+    val_split = min(max(args.val_split, 0.0), 0.5)
+    train_size = int((1.0 - val_split) * total_en)
+    val_size = total_en - train_size
+    indices = torch.randperm(total_en, generator=g).tolist()
+    train_indices = indices[:train_size]
+    val_indices = indices[train_size:]
+    train_dataset = torch.utils.data.Subset(
+        AudioDataset(data_dir=args.data_dir, augment=True, cqcc_cache_dir=args.cqcc_cache_dir, target_lang="en"),
+        train_indices
+    )
+    val_dataset = torch.utils.data.Subset(
+        AudioDataset(data_dir=args.data_dir, augment=False, cqcc_cache_dir=args.cqcc_cache_dir, target_lang="en"),
+        val_indices
+    )
+    print("Loading German Dataset for Testing...")
+    test_dataset = AudioDataset(data_dir=args.data_dir, augment=False, cqcc_cache_dir=args.cqcc_cache_dir, target_lang="de")
+    if args.precompute_cqcc_only:
+        print("\n--- Starting CQCC Precomputation ---")
+        print(f"Dataset: {full_en_dataset.data_dir}")
+        print("Precomputing CQCC cache for English (train/val) and German (test) data...")
+        full_en_dataset.precompute_cqcc_cache(force=args.force_rebuild_cqcc)
+        test_dataset.precompute_cqcc_cache(force=args.force_rebuild_cqcc)
+        print("CQCC preprocessing complete. Exiting.")
+        return
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=8,
+        shuffle=True,
+        collate_fn=collate_variable_length,
+        num_workers=2,
+        pin_memory=True,
+        generator=g, # ensure reproducible shuffling
+    )
+    val_loader = DataLoader(
+        val_dataset,
+        batch_size=8,
+        shuffle=False,
+        collate_fn=collate_variable_length,
+        num_workers=2,
+        pin_memory=True
+    )
+    test_loader = DataLoader(
+        test_dataset,
+        batch_size=8,
+        shuffle=False,
+        collate_fn=collate_variable_length,
+        num_workers=2,
+        pin_memory=True
+    )
+    if args.smoke_test:
+        run_smoke_test(train_loader, device)
+        return
+    models_dir = os.path.join(os.path.dirname(__file__), "models")
+    os.makedirs(models_dir, exist_ok=True)
+    criterion = nn.CrossEntropyLoss()
+    # ============================================================
+    # 1 Wav2Vec2 Baseline
+    # ============================================================
+    print("\n--- Training Wav2Vec2 Baseline ---")
+    wav2vec_model = Wav2Vec2SpoofDetector(num_classes=2).to(device)
+    optimizer_wav2vec = torch.optim.Adam(wav2vec_model.parameters(), lr=1e-4)
+    wav2vec_loss = train_model(
+        wav2vec_model,
+        train_loader,
+        criterion,
+        optimizer_wav2vec,
+        input_type='wav',
+        device=device,
+        val_dataloader=val_loader,
+        model_save_path=os.path.join(models_dir, "wav2vec2.pth")
+    )
+    del wav2vec_model, optimizer_wav2vec
+    torch.cuda.empty_cache()
+    # ============================================================
+    # 2 AASIST Baseline
+    # ============================================================
+    print("\n--- Training AASIST Baseline ---")
+    aasist_model = AASISTDetector(num_classes=2).to(device)
+    optimizer_aasist = torch.optim.Adam(aasist_model.parameters(), lr=5e-4)
+    aasist_loss = train_model(
+        aasist_model,
+        train_loader,
+        criterion,
+        optimizer_aasist,
+        input_type='wav',
+        device=device,
+        val_dataloader=val_loader,
+        model_save_path=os.path.join(models_dir, "aasist.pth")
+    )
+    del aasist_model, optimizer_aasist
+    torch.cuda.empty_cache()
+    # ============================================================
+    # 3 CQCC Baseline
+    # ============================================================
+    print("\n--- Training CQCC Baseline ---")
+    cqcc_baseline = CQCCBaselineDetector(num_classes=2).to(device)
+    optimizer_cqcc = torch.optim.Adam(cqcc_baseline.parameters(), lr=1e-4)
+    cqcc_loss = train_model(
+        cqcc_baseline,
+        train_loader,
+        criterion,
+        optimizer_cqcc,
+        input_type='cqcc',
+        device=device,
+        val_dataloader=val_loader,
+        model_save_path=os.path.join(models_dir, "cqcc_baseline.pth")
+    )
+    del cqcc_baseline, optimizer_cqcc
+    torch.cuda.empty_cache()
+    # ============================================================
+    # 4 Custom Fusion: Wav2Vec2 + CQCC with Cross-Attention + Graph
+    # ============================================================
+    print("\n--- Training Custom Fusion Detector ---")
+    custom_model = ImprovedWav2Vec2CQCCDetector(num_classes=2).to(device)
+    optimizer_custom = torch.optim.Adam(custom_model.parameters(), lr=1e-4)
+    custom_loss = train_model(
+        custom_model,
+        train_loader,
+        criterion,
+        optimizer_custom,
+        input_type='wav_and_cqcc',
+        device=device,
+        val_dataloader=val_loader,
+        model_save_path=os.path.join(models_dir, "custom_hybrid.pth")
+    )
+    del custom_model, optimizer_custom
+    torch.cuda.empty_cache()
+    # ============================================================
+    # 5 Ablation Models
+    # ============================================================
+    print("\n--- Training Ablation 1 (Wav2Vec2 + Graph) ---")
+    ab1_model = AblationWav2Vec2GraphDetector(num_classes=2).to(device)
+    optimizer_ab1 = torch.optim.Adam(ab1_model.parameters(), lr=1e-4) # learning rate for wav2vec2-based
+    ab1_loss = train_model(ab1_model, train_loader, criterion, optimizer_ab1, input_type='wav', device=device, val_dataloader=val_loader, model_save_path=os.path.join(models_dir, "ablation_w2v2_graph.pth"))
+    del ab1_model, optimizer_ab1
+    torch.cuda.empty_cache()
+    print("\n--- Training Ablation 2 (CQCC + Graph) ---")
+    ab2_model = AblationCQCCGraphDetector(num_classes=2).to(device)
+    optimizer_ab2 = torch.optim.Adam(ab2_model.parameters(), lr=1e-4) # learning rate for CQCC-based
+    ab2_loss = train_model(ab2_model, train_loader, criterion, optimizer_ab2, input_type='cqcc', device=device, val_dataloader=val_loader, model_save_path=os.path.join(models_dir, "ablation_cqcc_graph.pth"))
+    del ab2_model, optimizer_ab2
+    torch.cuda.empty_cache()
+    print("\n--- Training Ablation 3 (Wav2Vec2 + CQCC + Concat + Graph) ---")
+    ab3_model = AblationConcatGraphDetector(num_classes=2).to(device)
+    optimizer_ab3 = torch.optim.Adam(ab3_model.parameters(), lr=1e-4)
+    ab3_loss = train_model(ab3_model, train_loader, criterion, optimizer_ab3, input_type='wav_and_cqcc', device=device, val_dataloader=val_loader, model_save_path=os.path.join(models_dir, "ablation_concat_graph.pth"))
+    del ab3_model, optimizer_ab3
+    torch.cuda.empty_cache()
+    print("\n--- Training Ablation 4 (Wav2Vec2 + CQCC + Cross-Attn + Linear) ---")
+    ab4_model = AblationCrossAttnLinearDetector(num_classes=2).to(device)
+    optimizer_ab4 = torch.optim.Adam(ab4_model.parameters(), lr=1e-4)
+    ab4_loss = train_model(ab4_model, train_loader, criterion, optimizer_ab4, input_type='wav_and_cqcc', device=device, val_dataloader=val_loader, model_save_path=os.path.join(models_dir, "ablation_crossattn_linear.pth"))
+    del ab4_model, optimizer_ab4
+    torch.cuda.empty_cache()
+    # ============================================================
+    # Evaluation — reload one at a time
+    # ============================================================
+    print("\n--- Evaluating Models ---")
+    evals = []
+    models_to_eval = [
+        ("Wav2Vec2 Baseline", Wav2Vec2SpoofDetector, "wav2vec2.pth", 'wav'),
+        ("AASIST Baseline", AASISTDetector, "aasist.pth", 'wav'),
+        ("CQCC Baseline", CQCCBaselineDetector, "cqcc_baseline.pth", 'cqcc'),
+        ("Custom Fusion Model", ImprovedWav2Vec2CQCCDetector, "custom_hybrid.pth", 'wav_and_cqcc'),
+        ("Ablation 1 (W2V2+Graph)", AblationWav2Vec2GraphDetector, "ablation_w2v2_graph.pth", 'wav'),
+        ("Ablation 2 (CQCC+Graph)", AblationCQCCGraphDetector, "ablation_cqcc_graph.pth", 'cqcc'),
+        ("Ablation 3 (Concat+Graph)", AblationConcatGraphDetector, "ablation_concat_graph.pth", 'wav_and_cqcc'),
+        ("Ablation 4 (CrossAttn+Linear)", AblationCrossAttnLinearDetector, "ablation_crossattn_linear.pth", 'wav_and_cqcc'),
+    ]
+    for name, model_class, filename, inp in models_to_eval:
+        model_path = os.path.join(models_dir, filename)
+        if not os.path.exists(model_path):
+            print(f"Skipping evaluation for {name} (Model weights not found at {model_path})")
+            continue
+        model_obj = model_class(num_classes=2).to(device)
+        model_obj.load_state_dict(torch.load(model_path, map_location=device))
+        model_obj.eval()
+        print(f"\n--- Metrics for {name} ---")
+        # 1. EVAL ON TRAIN SET (train_loader still shuffles and applies augmentation)
+        train_fpr, train_tpr, train_auc, train_eer, train_min_dcf, train_acc = evaluate_model(
+            model_obj, train_loader, input_type=inp, device=device
+        )
+        print(f"[Train] Acc={train_acc*100:.2f}% | EER={train_eer*100:.2f}% | minDCF={train_min_dcf:.4f}")
+        # 2. EVAL ON TEST SET
+        test_fpr, test_tpr, test_auc, test_eer, test_min_dcf, test_acc = evaluate_model(
+            model_obj, test_loader, input_type=inp, device=device
+        )
+        print(f"[Test ] Acc={test_acc*100:.2f}% | EER={test_eer*100:.2f}% | minDCF={test_min_dcf:.4f}")
+        del model_obj
+        torch.cuda.empty_cache()
+if __name__ == "__main__":
+    main()
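
The EER and normalized minDCF computed in `evaluate_model` can be checked on toy scores with a minimal NumPy-only sketch. It assumes the same conventions as the diff above (label 0 = bonafide, 1 = spoof, higher score = more likely spoof, `P_spoof = 0.05`, `C_miss = 1`, `C_fa = 10`). The function name `eer_and_min_dcf` and the explicit threshold sweep are illustrative and not part of the repository, which uses `sklearn.metrics.roc_curve` instead.

```python
import numpy as np


def eer_and_min_dcf(labels, scores, p_spoof=0.05, c_miss=1.0, c_fa=10.0):
    """Hypothetical helper: labels are 0 (bonafide) or 1 (spoof).

    Assumes both classes are present; scores are spoof probabilities.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    bonafide = scores[labels == 0]
    spoof = scores[labels == 1]

    # Candidate thresholds: +inf ("flag nothing") followed by each distinct score.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))

    # A sample is flagged as spoof when score >= threshold.
    p_miss = np.array([(bonafide >= t).mean() for t in thresholds])  # bonafide rejected
    p_fa = np.array([(spoof < t).mean() for t in thresholds])        # spoof accepted

    # EER: operating point where the two error rates are closest
    # (the miss rate at that point is reported, as in the diff above).
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = p_miss[idx]

    # Normalized DCF: raw cost divided by the cheaper trivial system
    # (reject everything vs. accept everything).
    p_bonafide = 1.0 - p_spoof
    dcf = c_miss * p_bonafide * p_miss + c_fa * p_spoof * p_fa
    dcf_default = min(c_miss * p_bonafide, c_fa * p_spoof)
    return float(eer), float(dcf.min() / dcf_default)


if __name__ == "__main__":
    print(eer_and_min_dcf([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

With these priors and costs, accepting everything costs `10 * 0.05 = 0.5` and rejecting everything costs `1 * 0.95 = 0.95`, so a detector that carries no information lands at a normalized minDCF of 1.0.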

frontend/index.html ADDED

	@@ -0,0 +1,75 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>OdioCheck | Deepfake Voice Detection</title>
+    <!-- Tailwind CSS -->
+    <script src="https://cdn.tailwindcss.com"></script>
+    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700&display=swap" rel="stylesheet">
+    <link rel="stylesheet" href="style.css">
+    <!-- Chart.js -->
+    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+</head>
+<body
+    class="bg-slate-900 text-slate-100 font-sans min-h-screen flex flex-col items-center justify-center p-6 subtle-bg">
+    <div class="glass-card max-w-2xl w-full rounded-3xl p-8 relative overflow-hidden transition-all duration-300">
+        <!-- Glowing Orb Background -->
+        <div
+            class="absolute -top-32 -left-32 w-64 h-64 bg-indigo-600 rounded-full mix-blend-multiply filter blur-3xl opacity-30 animate-pulse">
+        </div>
+        <div class="absolute -bottom-32 -right-32 w-64 h-64 bg-fuchsia-600 rounded-full mix-blend-multiply filter blur-3xl opacity-30 animate-pulse"
+            style="animation-delay: 2s;"></div>
+        <div class="relative z-10">
+            <h1
+                class="text-4xl font-bold mb-2 text-transparent bg-clip-text bg-gradient-to-r from-indigo-400 to-cyan-300">
+                OdioCheck
+            </h1>
+            <p class="text-slate-400 mb-8 font-light">Advanced Deepfake Voice Detection powered by SOTA Graph
+                architecture.</p>
+            <div id="drop-zone"
+                class="border-2 border-dashed border-slate-600 rounded-2xl p-10 flex flex-col items-center justify-center cursor-pointer hover:border-indigo-400 hover:bg-slate-800/50 transition-all duration-300 group">
+                <svg xmlns="http://www.w3.org/2000/svg" fill="none" viewBox="0 0 24 24" stroke-width="1.5"
+                    stroke="currentColor"
+                    class="w-12 h-12 text-slate-500 group-hover:text-indigo-400 mb-4 transition-colors">
+                    <path stroke-linecap="round" stroke-linejoin="round"
+                        d="M12 18.75a6 6 0 006-6v-1.5m-6 7.5a6 6 0 01-6-6v-1.5m6 7.5v3.75m-3.75 0h7.5M12 15.75a3 3 0 01-3-3V4.5a3 3 0 116 0v8.25a3 3 0 01-3 3z" />
+                </svg>
+                <p class="text-lg text-slate-300 font-medium">Click to upload or drag &amp; drop</p>
+                <p class="text-sm text-slate-500 mt-1">Supports WAV, OGG, MP3, FLAC, M4A &amp; more</p>
+                <input type="file" id="file-input" class="hidden" accept="audio/*">
+            </div>
+            <!-- Analysis Section -->
+            <div id="analysis-section" class="mt-8 hidden opacity-0 transition-opacity duration-500">
+                <div class="flex items-center space-x-4 mb-6">
+                    <div id="loading-spinner" class="hidden">
+                        <div class="animate-spin rounded-full h-8 w-8 border-b-2 border-indigo-400"></div>
+                    </div>
+                    <h2 id="status-text" class="text-xl font-semibold text-slate-300">Analyzing Spectrogram...</h2>
+                </div>
+                <!-- Results: panels will be inserted via JavaScript based on response keys -->
+                <div id="results" class="hidden">
+                    <div id="model-panels" class="grid grid-cols-2 gap-6"></div>
+                </div>
+            </div>
+        </div>
+    </div>
+    <!-- Additional Graph Section for wow factor -->
+    <div id="chart-card"
+        class="glass-card max-w-2xl w-full rounded-3xl p-8 mt-6 hidden opacity-0 transition-opacity duration-500">
+        <h3 class="text-lg font-semibold mb-4 text-slate-300">Timeline Analysis</h3>
+        <canvas id="audioChart" height="100"></canvas>
+    </div>
+    <script src="script.js"></script>
+</body>
+</html>

frontend/script.js ADDED

	@@ -0,0 +1,243 @@

+const dropZone = document.getElementById('drop-zone');
+const fileInput = document.getElementById('file-input');
+const analysisSection = document.getElementById('analysis-section');
+const statusText = document.getElementById('status-text');
+const results = document.getElementById('results');
+const loadingSpinner = document.getElementById('loading-spinner');
+const chartCard = document.getElementById('chart-card');
+// -------------------------------------------------------
+// Chart setup
+// -------------------------------------------------------
+const ctx = document.getElementById('audioChart').getContext('2d');
+let audioChart = new Chart(ctx, {
+    type: 'line',
+    data: { labels: [], datasets: [] },
+    options: {
+        responsive: true,
+        animation: { duration: 600, easing: 'easeInOutQuart' },
+        plugins: {
+            legend: { display: true, labels: { color: '#94a3b8', font: { size: 12 } } },
+            tooltip: {
+                callbacks: {
+                    label: ctx => ` ${ctx.dataset.label}: ${ctx.parsed.y.toFixed(1)}% fake`,
+                    title: items => `Segment @ ${items[0].label}s`
+                }
+            }
+        },
+        scales: {
+            y: {
+                beginAtZero: true,
+                max: 100,
+                ticks: { color: '#94a3b8', callback: v => v + '%' },
+                grid: { color: 'rgba(148,163,184,0.1)' },
+                title: { display: true, text: 'Fake Probability (%)', color: '#64748b' }
+            },
+            x: {
+                ticks: {
+                    color: '#94a3b8', callback: (_, i, ticks) => {
+                        // Show fewer labels when there are many windows
+                        const step = Math.max(1, Math.floor(ticks.length / 8));
+                        return i % step === 0 ? audioChart.data.labels[i] + 's' : '';
+                    }
+                },
+                grid: { color: 'rgba(148,163,184,0.05)' },
+                title: { display: true, text: 'Time (seconds)', color: '#64748b' }
+            }
+        }
+    }
+});
+// Palette and display names for the four models
+const MODEL_META = {
+    wav2vec2: { label: 'Wav2Vec2', color: '#3b82f6' },
+    aasist: { label: 'AASIST', color: '#f43f5e' },
+    cqcc_baseline: { label: 'CQCC Baseline', color: '#fbbf24' },
+    custom_hybrid: { label: 'Proposed Custom Hybrid', color: '#10b981' },
+};
+// -------------------------------------------------------
+// File handling
+// -------------------------------------------------------
+function handleFile(file) {
+    if (!file) return;
+    // Show sections
+    analysisSection.classList.remove('hidden');
+    chartCard.classList.remove('hidden');
+    setTimeout(() => {
+        analysisSection.classList.remove('opacity-0');
+        chartCard.classList.remove('opacity-0');
+    }, 50);
+    results.classList.add('hidden');
+    loadingSpinner.classList.remove('hidden');
+    statusText.innerText = `Analyzing "${file.name}"…`;
+    // Clear previous state
+    document.getElementById('model-panels').innerHTML = '';
+    audioChart.data.labels = [];
+    audioChart.data.datasets = [];
+    audioChart.update();
+    // Animated placeholder while waiting: a single pulsing dataset
+    const placeholder = {
+        label: 'Analyzing…',
+        data: Array.from({ length: 20 }, (_, i) => 45 + Math.sin(i / 2) * 10),
+        borderColor: 'rgba(99,102,241,0.5)',
+        backgroundColor: 'rgba(99,102,241,0.05)',
+        borderDash: [4, 4],
+        fill: true,
+        tension: 0.4,
+        pointRadius: 0,
+    };
+    audioChart.data.labels = Array.from({ length: 20 }, (_, i) => i);
+    audioChart.data.datasets = [placeholder];
+    audioChart.update();
+    let tick = 0;
+    const loadingAnim = setInterval(() => {
+        tick++;
+        placeholder.data = Array.from({ length: 20 }, (_, i) =>
+            45 + Math.sin((i + tick) / 2) * 10
+        );
+        audioChart.update('none'); // skip animation for perf
+    }, 80);
+    const formData = new FormData();
+    formData.append('file', file);
+    const HF_API_URL = window.location.hostname === '127.0.0.1' || window.location.hostname === 'localhost'
+        ? '/api/predict'
+        : 'https://junsiang26-odiocheck-backend.hf.space/api/predict';
+    fetch(HF_API_URL, { method: 'POST', body: formData })
+        .then(r => r.json())
+        .then(data => {
+            clearInterval(loadingAnim);
+            loadingSpinner.classList.add('hidden');
+            if (data.error) {
+                statusText.innerText = 'Error analyzing file.';
+                console.error(data.error);
+                return;
+            }
+            renderResults(data);
+        })
+        .catch(err => {
+            console.error(err);
+            clearInterval(loadingAnim);
+            loadingSpinner.classList.add('hidden');
+            statusText.innerText = 'Connection error. Is the backend running?';
+        });
+}
+// -------------------------------------------------------
+// Render results from the new response shape:
+//   data.overall   → { model: { prediction, fake_probability, real_probability } }
+//   data.timeline  → { model: [fake_prob_pct, ...] }
+//   data.window_labels → [segment_start_sec, ...]
+// -------------------------------------------------------
+function renderResults(data) {
+    const { overall, timeline, window_labels } = data;
+    statusText.innerText = 'Analysis Complete';
+    results.classList.remove('hidden');
+    // --- Model panels (overall verdict) ---
+    const panelsEl = document.getElementById('model-panels');
+    panelsEl.innerHTML = '';
+    for (const [key, info] of Object.entries(overall)) {
+        const meta = MODEL_META[key] || { label: key, color: '#94a3b8' };
+        const isFake = info.prediction === 'FAKE';
+        const barColor = isFake ? 'from-rose-500 to-rose-400' : 'from-emerald-400 to-emerald-500';
+        const displayPct = isFake ? info.fake_probability : info.real_probability;
+        panelsEl.insertAdjacentHTML('beforeend', `
+            <div>
+                <div class="flex justify-between items-end mb-2">
+                    <span class="text-sm text-slate-400 uppercase tracking-widest font-semibold"
+                          style="color:${meta.color}">${meta.label}</span>
+                    <span class="text-3xl font-bold tracking-wider ${isFake ? 'text-rose-500' : 'text-emerald-500'}">
+                        ${info.prediction}
+                    </span>
+                </div>
+                <div class="text-xs text-slate-500 mb-2">
+                    Fake: <span class="text-slate-300">${info.fake_probability}%</span>
+                    &nbsp;·&nbsp;
+                    Real: <span class="text-slate-300">${info.real_probability}%</span>
+                </div>
+                <div class="w-full bg-slate-700 h-4 rounded-full overflow-hidden mb-6 mt-1">
+                    <div class="prob-bar h-full bg-gradient-to-r transition-all duration-1000 ease-out ${barColor}"
+                         style="width:0%"
+                         data-width="${displayPct}">
+                    </div>
+                </div>
+            </div>`);
+    }
+    // Animate bars
+    requestAnimationFrame(() => {
+        document.querySelectorAll('.prob-bar').forEach(bar => {
+            bar.style.width = bar.dataset.width + '%';
+        });
+    });
+    // --- Timeline chart (real data) ---
+    // window_labels are now start-of-segment times (0, 2, 4 ...)
+    // For short audio with a single window, we pad with the audio-end label
+    // so the chart shows a line rather than a lonely dot.
+    let labels = [...window_labels];
+    let timelineValues = {};
+    Object.entries(timeline).forEach(([k, v]) => { timelineValues[k] = [...v]; });
+    if (labels.length === 1) {
+        // Estimate audio duration: single window = TARGET_LEN / 16000 ≈ 4.025s
+        const audioEnd = parseFloat((labels[0] + 4.025).toFixed(2));
+        labels.push(audioEnd);
+        Object.keys(timelineValues).forEach(k => timelineValues[k].push(timelineValues[k][0]));
+    }
+    audioChart.data.labels = labels;
+    audioChart.data.datasets = Object.entries(timelineValues).map(([key, values]) => {
+        const meta = MODEL_META[key] || { label: key, color: '#94a3b8' };
+        const hex = meta.color;
+        const rgb = hex.match(/[0-9a-fA-F]{2}/g).map(h => parseInt(h, 16)).join(',');
+        return {
+            label: meta.label,
+            data: values,
+            borderColor: hex,
+            backgroundColor: `rgba(${rgb},0.08)`,
+            fill: true,
+            tension: 0.4,
+            pointRadius: values.length <= 20 ? 4 : 2,
+            pointHoverRadius: 6,
+        };
+    });
+    // Add a 50% threshold reference line
+    audioChart.data.datasets.push({
+        label: 'Decision threshold (50%)',
+        data: Array(labels.length).fill(50),
+        borderColor: 'rgba(255,255,255,0.2)',
+        borderDash: [6, 4],
+        borderWidth: 1,
+        pointRadius: 0,
+        fill: false,
+        tension: 0,
+    });
+    audioChart.update();
+}
+// -------------------------------------------------------
+// Drop zone wiring
+// -------------------------------------------------------
+dropZone.addEventListener('click', () => fileInput.click());
+fileInput.addEventListener('change', e => handleFile(e.target.files[0]));
+['dragenter', 'dragover', 'dragleave', 'drop'].forEach(name => {
+    dropZone.addEventListener(name, e => { e.preventDefault(); e.stopPropagation(); });
+});
+dropZone.addEventListener('drop', e => handleFile(e.dataTransfer.files[0]));
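
For reference, the response shape that `renderResults` consumes (`overall`, `timeline`, `window_labels`) can be sketched on the Python side as follows. This is a hedged illustration only: `build_response` is a hypothetical helper, averaging the window probabilities to get the overall verdict is an assumption, and the `"REAL"` label is assumed because the script only tests for `"FAKE"`. The actual `backend/app.py` may assemble the payload differently.

```python
from statistics import mean


def build_response(window_probs, window_starts):
    """Hypothetical payload builder matching the keys read by script.js.

    window_probs:  {model_key: [fake probability in 0..1, one per window]}
    window_starts: start time of each window in seconds
    """
    overall, timeline = {}, {}
    for key, probs in window_probs.items():
        # Assumption: the overall score is the mean over windows.
        fake_pct = round(100 * mean(probs), 2)
        overall[key] = {
            # 50% matches the decision threshold drawn on the chart.
            "prediction": "FAKE" if fake_pct > 50 else "REAL",
            "fake_probability": fake_pct,
            "real_probability": round(100 - fake_pct, 2),
        }
        # Timeline values are percentages, one per window.
        timeline[key] = [round(100 * p, 2) for p in probs]
    return {
        "overall": overall,
        "timeline": timeline,
        "window_labels": list(window_starts),
    }


if __name__ == "__main__":
    print(build_response({"aasist": [0.9, 0.7]}, [0, 2]))
```

Keys that are missing from `MODEL_META` still render, because the script falls back to the raw key as the label and a neutral grey as the color.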

frontend/style.css ADDED

	@@ -0,0 +1,34 @@

+/* Glassmorphism utility classes */
+.glass-card {
+    background: rgba(30, 41, 59, 0.7);
+    backdrop-filter: blur(12px);
+    -webkit-backdrop-filter: blur(12px);
+    border: 1px solid rgba(255, 255, 255, 0.1);
+    box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
+}
+.subtle-bg {
+    background-color: #0f172a;
+    background-image:
+        radial-gradient(at 0% 0%, hsla(253, 16%, 7%, 1) 0, transparent 50%),
+        radial-gradient(at 50% 0%, hsla(225, 39%, 30%, 1) 0, transparent 50%),
+        radial-gradient(at 100% 0%, hsla(339, 49%, 30%, 1) 0, transparent 50%);
+}
+.animate-pulse {
+    animation: pulse 4s cubic-bezier(0.4, 0, 0.6, 1) infinite;
+}
+@keyframes pulse {
+    0%,
+    100% {
+        opacity: 0.3;
+        transform: scale(1);
+    }
+    50% {
+        opacity: 0.5;
+        transform: scale(1.05);
+    }
+}

requirements.txt ADDED

	@@ -0,0 +1,17 @@

+datasets == 2.21.0
+fastapi
+librosa
+matplotlib
+numpy
+python-multipart
+python-pptx
+scikit-learn
+scipy
+seaborn
+soundfile
+torch>=2.6.0
+torchaudio
+torchvision
+tqdm
+transformers
+uvicorn