license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - whisper
  - asvspoof

Whisper-MFCC-MesoNet

EER (%) by dataset: ASVspoof2021_DF 0.46, ASVspoof2019_LA 5.83, ASVspoof2021_LA 15.96, CD-ADD 18.9, InTheWild 26.72, SONAR 49.64, LibriSeVoc 15.36, CFAD 31.93, CVoiceFake_small 30.19, ASVspoof5 22.55.

The (Whisper + MFCC) MesoNet audio anti-spoofing (voice-deepfake detection) countermeasure from "Improved DeepFake Detection Using Whisper Features" (Kawa et al., INTERSPEECH 2023) — the best-performing MesoNet configuration in that paper. The model takes a raw speech waveform and returns a score where higher = more bona fide.

This repo is self-contained for inference: the network definition is in _net.py, and the exact wrapper used to produce the Arena scores is in whispermfccmesonet.py.

Architecture

The model fuses two front-ends, each producing a 384×3000 feature map, stacked as 2 channels and fed to a MesoInception4 classifier:

  1. Whisper tiny.en encoder (fine-tuned) — the audio is turned into a log-Mel spectrogram and passed through the Whisper encoder; its output is reshaped/tiled to 384×3000.
  2. MFCC front-end — 128 MFCCs + Δ + ΔΔ (torchaudio MFCC, n_fft=512, win=400, hop=160), stacked to 384 features and cropped to 3000 frames.
  3. MesoInception4 — two Inception-style blocks + conv stack + a small linear classifier (bona fide vs. spoof), output as a single logit.
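The shape bookkeeping behind this fusion can be sketched with placeholder arrays. This is an illustration only, not the code in _net.py. It assumes the Whisper tiny.en encoder emits 1500 frames of 384 dimensions for a 30 s input, that "reshaped/tiled" means transposing and repeating twice along the time axis, that the MFCC transform yields 3001 frames before cropping, and that the Whisper map is channel 0. The helper name fuse_front_ends and the channel order are hypothetical.

```python
import numpy as np

N_SAMPLES = 480_000   # 30 s at 16 kHz
HOP = 160             # MFCC hop length from the list above
N_FRAMES = 3000       # target time dimension
N_FEATS = 384         # target feature dimension


def fuse_front_ends(whisper_out, mfcc, delta, delta2):
    """Stack a Whisper map and an MFCC map into a (2, 384, 3000) array.

    whisper_out: (1500, 384) encoder output (assumed layout).
    mfcc, delta, delta2: (128, T) arrays with T >= 3000.
    """
    # (1500, 384) -> (384, 1500) -> tile twice along time -> (384, 3000)
    whisper_map = np.tile(whisper_out.T, (1, 2))
    # 3 x 128 coefficients -> 384 features, then crop to 3000 frames
    mfcc_map = np.concatenate([mfcc, delta, delta2], axis=0)[:, :N_FRAMES]
    # Channel order is an assumption made for this sketch
    return np.stack([whisper_map, mfcc_map], axis=0)


whisper_out = np.zeros((1500, N_FEATS), dtype=np.float32)
n_mfcc_frames = 1 + N_SAMPLES // HOP          # 3001, assuming a centered STFT
mfcc = np.zeros((128, n_mfcc_frames), dtype=np.float32)
fused = fuse_front_ends(whisper_out, mfcc, mfcc, mfcc)
print(fused.shape)  # (2, 384, 3000)
```

The one-frame crop on the MFCC side is what lets both maps share the same 384×3000 grid.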

How it was trained & evaluated

  • The Whisper encoder is fine-tuned end-to-end together with the MFCC-fed MesoNet head (the paper's strongest MesoNet variant). Training targets the ASVspoof2021 DF deepfake task; In-the-Wild is the paper's headline cross-domain evaluation.
  • Input length: raw 16 kHz audio windowed to 480,000 samples (30 s) — the fixed size the Whisper encoder expects (repeat-pad if shorter, truncate if longer).
  • Preprocessing (reproduced from upstream): sox silence-trim (silence 1 0.2 1% -1 0.2 1%) before the 30 s window.
  • Paper's reported result: In-the-Wild EER = 26.72 % — reproduced here to within 0.01 pp (see below).
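The repeat-pad / truncate rule for the 30 s window can be sketched as follows. This is a minimal version under stated assumptions, not the wrapper's actual code: the function name fix_length is hypothetical, and keeping the first 480,000 samples when truncating is an assumption (the text above only says "truncate if longer").

```python
import numpy as np

TARGET_SAMPLES = 480_000  # 30 s at 16 kHz


def fix_length(wave, target=TARGET_SAMPLES):
    """Repeat-pad a short mono waveform, or truncate a long one, to `target` samples."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        # An empty signal (e.g. everything trimmed as silence) cannot be repeated
        raise ValueError("expected a non-empty mono waveform")
    if wave.size >= target:
        return wave[:target]                  # assumption: keep the first 30 s
    repeats = -(-target // wave.size)         # ceiling division
    return np.tile(wave, repeats)[:target]    # loop the clip, then cut to size


print(fix_length(np.arange(3), target=7))     # [0. 1. 2. 0. 1. 2. 0.]
```

A 3 s clip (48,000 samples) is therefore looped ten times to fill the window.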

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the Speech Anti-Spoofing Arena. The EERs below can be recomputed exactly from the pinned score files.

| Dataset | Split | EER (%) | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2021_DF | test | 0.46 | 611,829 | 0 | in-domain (deepfake task the model targets) |
| ASVspoof2019_LA | test | 5.83 | 71,237 | 0 | logical-access spoofing |
| ASVspoof2021_LA | test | 15.96 | 181,566 | 0 | logical-access (with codec/channel variation) |
| CD-ADD | test | 18.90 | 20,786 | 0 | modern neural-TTS deepfakes |
| InTheWild | test | 26.72 | 31,779 | 0 | real-world deepfakes (reproduces paper: 26.72 %) |
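For readers unfamiliar with the metric, an equal error rate can be computed from per-trial scores roughly as below. This is a generic sketch, not the Arena's scoring code: the function name compute_eer is hypothetical, and the Arena may use a different threshold sweep or interpolation, so small numerical differences are possible. It follows the convention above that a higher score means more bona fide.

```python
import numpy as np


def compute_eer(bona_fide_scores, spoof_scores):
    """Return the equal error rate as a fraction in [0, 1]."""
    bona = np.sort(np.asarray(bona_fide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one score per class")
    # Candidate thresholds: every observed score, plus one above the maximum
    thresholds = np.sort(np.concatenate([bona, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    # A trial is accepted as bona fide when score >= threshold
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size          # bona fide rejected
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size  # spoof accepted
    idx = np.argmin(np.abs(far - frr))        # threshold where the two rates are closest
    return float((far[idx] + frr[idx]) / 2.0)


print(100 * compute_eer([2.0, 3.0], [0.0, 1.0]))  # 0.0 (perfectly separated toy scores)
```

Multiply the result by 100 to compare it with the percentages in the table.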

Usage

Requires libsox for the sox silence-trim. torchaudio 2.x ships the sox bindings but not the shared library, so install it separately (e.g. conda install -c conda-forge sox) and make sure the directory containing libsox.so is on LD_LIBRARY_PATH.
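A quick way to check for the library before loading the model is sketched below. It uses only the standard library; the helper name libsox_available is hypothetical. ctypes.util.find_library is a heuristic, so a negative result here does not prove that torchaudio will fail to load libsox, nor does a positive one guarantee it will succeed.

```python
import ctypes.util


def libsox_available():
    """Return True if the standard library lookup can locate a sox shared library."""
    return ctypes.util.find_library("sox") is not None


if not libsox_available():
    print("libsox not found; install it, e.g. conda install -c conda-forge sox")
```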

import numpy as np
from whispermfccmesonet import WhisperMFCCMesoNet   # _net.py must be importable alongside

m = WhisperMFCCMesoNet()
m.load()
audio = np.random.randn(48000).astype(np.float32)   # 3 s of noise as a stand-in: float32 mono at 16 kHz
print(m.score_batch([audio], [16000]))               # 16000 = sample rate in Hz; higher = more bona fide
m.unload()

Internally the wrapper applies the sox silence-trim, repeat-pads to 480,000 samples, runs the Whisper+MFCC MesoNet, and returns the raw logit (logits[:, 0]). whispermfccmesonet.py is the exact code that produced the Arena scores.txt.
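Because the wrapper returns a raw logit, any decision rule is left to the caller. The sketch below shows one way to post-process a score. Both helper names are hypothetical, the default threshold of 0.0 is a placeholder rather than a tuned operating point, and the sigmoid output should not be read as a calibrated probability, since this README does not state how the logit was calibrated.

```python
import math


def logit_to_unit_interval(logit):
    """Numerically stable logistic sigmoid, mapping a raw logit to [0, 1]."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)            # avoids overflow for large negative logits
    return z / (1.0 + z)


def decide(logit, threshold=0.0):
    """Label a trial; higher logits mean more bona fide. The threshold is a placeholder."""
    return "bona fide" if logit >= threshold else "spoof"


for raw in (1.2, -0.3):
    print(raw, round(logit_to_unit_interval(raw), 3), decide(raw))
```

In practice the threshold would be chosen on held-out data, for example at the EER operating point of a development set.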

Citation

This model / paper:

@inproceedings{kawa23b_interspeech,
  title     = {Improved DeepFake Detection Using Whisper Features},
  author    = {Piotr Kawa and Marcin Plata and Micha{\l} Czuba and Piotr Szyma{\'n}ski and Piotr Syga},
  year      = {2023},
  booktitle = {Proc. INTERSPEECH 2023},
  pages     = {4009--4013},
  doi       = {10.21437/Interspeech.2023-1537},
}

License

MIT — see the source repository.