license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - whisper
  - asvspoof

Whisper-MFCC-MesoNet

The (Whisper + MFCC) MesoNet audio anti-spoofing (voice-deepfake detection) countermeasure from "Improved DeepFake Detection Using Whisper Features" (Kawa et al., INTERSPEECH 2023), the best-performing MesoNet configuration in that paper. The model takes a raw speech waveform and returns a single score; higher scores indicate bona fide speech.

Code: https://github.com/piotrkawa/deepfake-whisper-features
Paper: https://arxiv.org/abs/2306.01428 (INTERSPEECH 2023)
Parameters: 7,660,881 (7.66 M)
Checkpoint: whisper_mfcc_mesonet_finetuned.pth

This repo is self-contained for inference: the network definition is in _net.py, and the exact wrapper used to produce the Arena scores is in whispermfccmesonet.py.

Architecture

The model fuses two front-ends, each producing a 384×3000 feature map, stacked as 2 channels and fed to a MesoInception4 classifier:

Whisper tiny.en encoder (fine-tuned): the audio is converted to a log-Mel spectrogram and passed through the Whisper encoder; the encoder output is reshaped and tiled to 384×3000.
MFCC front-end: 128 MFCCs + Δ + ΔΔ (torchaudio MFCC, n_fft=512, win_length=400, hop_length=160), stacked to 384 features and cropped to 3000 frames.
MesoInception4: two Inception-style blocks, a conv stack, and a small linear classifier (bona fide vs. spoof) that outputs a single logit.
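The 384×3000 shape of both branches can be checked with a little arithmetic. The sketch below is illustrative only: it assumes centre-padded STFT framing for the MFCC branch, and it infers the tiling factor from the Whisper tiny encoder's known output size (1500 frames of width 384 for a 30 s input). The variable names are made up for this example and do not come from _net.py.

```python
WINDOW_SAMPLES = 480_000   # 30 s of 16 kHz audio, the model's fixed input length
HOP = 160                  # hop length of the MFCC front-end (Whisper's log-Mel uses the same hop)
TARGET_FRAMES = 3000       # time dimension of each feature map

# MFCC branch: 128 coefficients plus delta and delta-delta.
mfcc_features = 128 * 3
# Assumption: centre-padded framing, which gives 1 + n // hop frames.
mfcc_frames_raw = 1 + WINDOW_SAMPLES // HOP
mfcc_frames = min(mfcc_frames_raw, TARGET_FRAMES)   # cropped to 3000 frames

# Whisper branch: the tiny encoder has width 384, and its stride-2 conv stem
# halves the 3000 log-Mel frames.
whisper_width = 384
whisper_frames = (WINDOW_SAMPLES // HOP) // 2
# Inferred, not stated in the card: repeats needed to reach 3000 frames.
tile_factor = TARGET_FRAMES // whisper_frames

# Two feature maps stacked as channels form the MesoInception4 input.
fused_shape = (2, mfcc_features, mfcc_frames)

print("MFCC branch:   ", (mfcc_features, mfcc_frames), "from", mfcc_frames_raw, "raw frames")
print("Whisper branch:", (whisper_width, whisper_frames), "tiled x", tile_factor)
print("Fused input:   ", fused_shape)
```

Under these assumptions the MFCC branch produces one frame more than needed (hence the crop), while the Whisper branch produces half as many frames as needed (hence the tiling).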

How it was trained & evaluated

The Whisper encoder is fine-tuned end-to-end together with the MFCC-fed MesoNet head (the paper's strongest MesoNet variant). Training targets the ASVspoof2021 DF deepfake task; In-the-Wild is the paper's headline cross-domain evaluation.
Input length: raw 16 kHz audio windowed to 480,000 samples (30 s) — the fixed size the Whisper encoder expects (repeat-pad if shorter, truncate if longer).
Preprocessing (reproduced from upstream): sox silence-trim (silence 1 0.2 1% -1 0.2 1%) before the 30 s window.
Paper's reported result: In-the-Wild EER = 26.72 % — reproduced here to within 0.01 pp (see below).
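The fixed-length windowing described above can be sketched as follows. This is a minimal NumPy illustration of "repeat-pad if shorter, truncate if longer", not the wrapper's actual code; the function name `repeat_pad_or_truncate` is hypothetical, and the sox silence-trim step is assumed to have been applied already.

```python
import numpy as np

TARGET_SAMPLES = 480_000  # 30 s at 16 kHz


def repeat_pad_or_truncate(wave, target=TARGET_SAMPLES):
    """Return exactly `target` samples: tile short inputs, cut long ones."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if wave.size >= target:
        return wave[:target]                 # truncate
    repeats = -(-target // wave.size)        # ceiling division
    return np.tile(wave, repeats)[:target]   # repeat, then crop to the target


short = np.arange(48_000, dtype=np.float32)   # 3 s clip
padded = repeat_pad_or_truncate(short)
print(padded.shape)                           # (480000,)
```

Repeating the signal rather than zero-padding keeps speech-like content across the whole window.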

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the Speech Anti-Spoofing Arena. The EERs below can be recomputed exactly from the pinned score files.

Dataset	Split	EER %	Trials	Notes
ASVspoof2021_DF	test	0.46	611,829	in-domain (deepfake task the model targets)
ASVspoof2019_LA	test	5.83	71,237	logical-access spoofing
ASVspoof2021_LA	test	15.96	181,566	logical-access (with codec/channel variation)
CD-ADD	test	18.90	20,786	modern neural-TTS deepfakes
InTheWild	test	26.72	31,779	real-world deepfakes (reproduces paper: 26.72 %)
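For readers unfamiliar with the metric: the equal error rate (EER) is the operating point at which the fraction of spoofed trials accepted as bona fide equals the fraction of bona fide trials rejected. The Arena's own EER routine is not shown here, so the following is only a minimal sketch under the "higher = more bona fide" convention; `compute_eer` and the toy scores are hypothetical.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over every observed score."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A trial is accepted as bona fide when its score is >= t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)   # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)        # spoof accepted
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


bonafide = [0.9, 0.8, 0.3, 0.7]
spoof = [0.1, 0.2, 0.75, 0.4]
print(f"EER = {100 * compute_eer(bonafide, spoof):.2f} %")   # EER = 25.00 %
```

This brute-force sweep is quadratic in the number of trials, so it suits toy data only; for test sets of the size listed in the table, a sorted, cumulative-count implementation would be the practical choice.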

Usage

Requires libsox for the sox silence-trim. torchaudio 2.x ships the sox bindings but not the shared library itself; install it (e.g. conda install -c conda-forge sox) and make sure the directory containing libsox.so is on LD_LIBRARY_PATH.
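One quick way to check whether a sox shared library is discoverable before loading the model is Python's standard `ctypes.util.find_library`. This is only a rough pre-flight check with a hypothetical helper name; it does not guarantee that torchaudio will load the same library.

```python
import ctypes.util


def libsox_visible():
    """Return True if a shared library named 'sox' can be located on this system."""
    return ctypes.util.find_library("sox") is not None


if libsox_visible():
    print("libsox found")
else:
    print("libsox not found: install sox and check LD_LIBRARY_PATH")
```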

import numpy as np
from whispermfccmesonet import WhisperMFCCMesoNet   # _net.py must be importable alongside

m = WhisperMFCCMesoNet()
m.load()                                             # load the model weights
audio = np.random.randn(48000).astype(np.float32)   # placeholder input: 3 s of float32 mono noise at 16 kHz
print(m.score_batch([audio], [16000]))               # waveforms and their sample rates; higher = more bona fide
m.unload()                                           # release the model

Internally the wrapper applies the sox silence-trim, repeat-pads to 480,000 samples, runs the Whisper+MFCC MesoNet, and returns the raw logit (logits[:, 0]). whispermfccmesonet.py is the exact code that produced the Arena scores.txt.
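Because the returned value is a raw logit, it is unbounded. If a value in (0, 1) or a hard decision is more convenient, a sketch like the one below may help. It rests on assumptions: the sigmoid output is not claimed to be a calibrated probability, and the threshold of 0.0 is a placeholder that would normally be tuned on held-out data. Both function names are hypothetical.

```python
import math


def logit_to_unit_interval(logit):
    """Numerically stable sigmoid; monotonic, so it preserves score ranking (and EER)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def decide(logit, threshold=0.0):
    """Label a trial; the default threshold is a placeholder, not a tuned value."""
    return "bonafide" if logit >= threshold else "spoof"


for raw in (-3.2, 0.0, 2.5):   # made-up logits for illustration
    print(raw, round(logit_to_unit_interval(raw), 3), decide(raw))
```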

Citation

This model / paper:

@inproceedings{kawa23b_interspeech,
  title     = {Improved DeepFake Detection Using Whisper Features},
  author    = {Piotr Kawa and Marcin Plata and Micha{\l} Czuba and Piotr Szyma{\'n}ski and Piotr Syga},
  year      = {2023},
  booktitle = {Proc. INTERSPEECH 2023},
  pages     = {4009--4013},
  doi       = {10.21437/Interspeech.2023-1537},
}

License

MIT — see the source repository.