Whisper-MFCC-MesoNet
The (Whisper + MFCC) MesoNet audio anti-spoofing (voice-deepfake detection) countermeasure from "Improved DeepFake Detection Using Whisper Features" (Kawa et al., INTERSPEECH 2023), the best-performing MesoNet configuration in that paper. The model takes a raw speech waveform and returns a score where higher = more bona fide.
- Code: https://github.com/piotrkawa/deepfake-whisper-features
- Paper: https://arxiv.org/abs/2306.01428 (INTERSPEECH 2023)
- Parameters: 7,660,881 (7.66 M)
- Checkpoint: whisper_mfcc_mesonet_finetuned.pth
This repo is self-contained for inference: the network definition is in
_net.py, and the exact wrapper used to produce the Arena scores is in
whispermfccmesonet.py.
Architecture
The model fuses two front-ends, each producing a 384×3000 feature map, stacked as 2 channels and fed to a MesoInception4 classifier:
- Whisper tiny.en encoder (fine-tuned): the audio is turned into a log-Mel spectrogram and passed through the Whisper encoder; its output is reshaped/tiled to 384×3000.
- MFCC front-end: 128 MFCCs + Δ + ΔΔ (torchaudio MFCC, n_fft=512, win=400, hop=160), stacked to 384 features and cropped to 3000 frames.
- MesoInception4: two Inception-style blocks + conv stack + a small linear classifier (bona fide vs. spoof), output as a single logit.
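The shape bookkeeping behind this fusion can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the code in _net.py: the helper names `stack_mfcc_features` and `fuse_front_ends` are hypothetical, the arrays are random placeholders rather than real features, and the channel order (Whisper first) is assumed.

```python
import numpy as np

N_MFCC = 128     # MFCC coefficients per frame (from the model card)
N_FRAMES = 3000  # frames kept after cropping (from the model card)


def stack_mfcc_features(mfcc, delta, delta2, n_frames=N_FRAMES):
    """Concatenate MFCC, delta and delta-delta on the feature axis, then crop frames."""
    feats = np.concatenate([mfcc, delta, delta2], axis=0)  # (3 * 128, T) = (384, T)
    return feats[:, :n_frames]


def fuse_front_ends(whisper_map, mfcc_map):
    """Stack two equally shaped feature maps as channels: (2, 384, 3000)."""
    if whisper_map.shape != mfcc_map.shape:
        raise ValueError(f"shape mismatch: {whisper_map.shape} vs {mfcc_map.shape}")
    # Channel order (Whisper first) is an assumption of this sketch.
    return np.stack([whisper_map, mfcc_map], axis=0)


# Random placeholders standing in for real features (3001 frames before cropping).
rng = np.random.default_rng(0)
mfcc, delta, delta2 = (
    rng.standard_normal((N_MFCC, 3001), dtype=np.float32) for _ in range(3)
)
whisper_map = rng.standard_normal((3 * N_MFCC, N_FRAMES), dtype=np.float32)

mfcc_map = stack_mfcc_features(mfcc, delta, delta2)
fused = fuse_front_ends(whisper_map, mfcc_map)
print(mfcc_map.shape, fused.shape)
```

With a batch dimension in front, the fused array has the 2-channel image layout that a 2D convolutional classifier such as MesoInception4 consumes.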
How it was trained & evaluated
- The Whisper encoder is fine-tuned end-to-end together with the MFCC-fed MesoNet head (the paper's strongest MesoNet variant). Training targets the ASVspoof2021 DF deepfake task; In-the-Wild is the paper's headline cross-domain evaluation.
- Input length: raw 16 kHz audio windowed to 480,000 samples (30 s), the fixed size the Whisper encoder expects (repeat-pad if shorter, truncate if longer).
- Preprocessing (reproduced from upstream): sox silence-trim (silence 1 0.2 1% -1 0.2 1%) before the 30 s window.
- Paper's reported result: In-the-Wild EER = 26.72 %, reproduced here to within 0.01 pp (see below).
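The fixed-length windowing described above (repeat-pad if shorter, truncate if longer) might look like the following minimal sketch. The function name `fit_to_window` is hypothetical, and keeping the beginning of a long clip when truncating is an assumption; the wrapper in whispermfccmesonet.py is the authoritative version.

```python
import numpy as np

SAMPLE_RATE = 16_000
TARGET_LEN = 480_000  # 30 s at 16 kHz
HOP = 160             # hop length of the MFCC front-end


def fit_to_window(audio, target_len=TARGET_LEN):
    """Return exactly target_len samples: tile short clips, cut long ones."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1 or audio.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if audio.size >= target_len:
        return audio[:target_len]              # assumption: keep the first 30 s
    repeats = -(-target_len // audio.size)     # ceiling division
    return np.tile(audio, repeats)[:target_len]


short_clip = np.arange(48_000, dtype=np.float32)   # 3 s placeholder signal
long_clip = np.zeros(600_000, dtype=np.float32)    # 37.5 s placeholder signal

padded = fit_to_window(short_clip)
cut = fit_to_window(long_clip)
print(padded.shape, cut.shape, TARGET_LEN // HOP)
```

The last printed value shows where the 3000-frame width comes from: 480,000 samples at a hop of 160 samples gives 3000 frames.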
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores are exactly reproducible from the pinned score files.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2021_DF | test | 0.46 | 611,829 | 0 | in-domain (deepfake task the model targets) |
| ASVspoof2019_LA | test | 5.83 | 71,237 | 0 | logical-access spoofing |
| ASVspoof2021_LA | test | 15.96 | 181,566 | 0 | logical-access (with codec/channel variation) |
| CD-ADD | test | 18.90 | 20,786 | 0 | modern neural-TTS deepfakes |
| InTheWild | test | 26.72 | 31,779 | 0 | real-world deepfakes (reproduces paper: 26.72 %) |
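For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the rate of rejected bona fide trials equals the rate of accepted spoof trials. The sketch below is one simple way to estimate it from two score arrays, assuming higher = more bona fide as in this model; the function name `compute_eer` is hypothetical, and the Arena's own implementation may interpolate differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER and its threshold by sweeping over all observed scores."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # A trial is accepted as bona fide when score >= threshold.
    # FRR: fraction of bona fide trials with score < threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # FAR: fraction of spoof trials with score >= threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0), float(thresholds[idx])


# Toy scores, not taken from any dataset in the table.
eer, threshold = compute_eer([1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.5, 2.5])
print(f"EER = {100 * eer:.2f} %, threshold = {threshold}")
```

In the toy example one of four bona fide trials falls below the chosen threshold and one of four spoof trials lies above it, so both error rates meet at 25 %.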
Usage
Requires libsox (for the sox silence-trim). torchaudio 2.x ships the sox bindings but not the shared library; install it, e.g.
conda install -c conda-forge sox, and make sure libsox.so is on LD_LIBRARY_PATH.
import numpy as np
from whispermfccmesonet import WhisperMFCCMesoNet # _net.py must be importable alongside
m = WhisperMFCCMesoNet()
m.load()
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])) # higher = more bona fide
m.unload()
Internally the wrapper applies the sox silence-trim, repeat-pads to 480,000 samples,
runs the Whisper+MFCC MesoNet, and returns the raw logit (logits[:, 0]).
whispermfccmesonet.py is the exact code that produced the
Arena scores.txt.
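Because the returned score is a raw logit, it is unbounded. If a value in (0, 1) is more convenient for display or thresholding, a sigmoid can be applied afterwards; it is monotonic, so the ranking of trials (and therefore the EER) is unchanged. The helper below is a hypothetical post-processing step, not part of the wrapper, and its output should not be read as a calibrated probability.

```python
import math


def logit_to_unit_interval(logit):
    """Numerically stable sigmoid: maps a raw logit to the open interval (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # safe: logit < 0, so exp() cannot overflow
    return z / (1.0 + z)


example_logits = [-3.2, 0.0, 4.1]  # placeholder values, not real model outputs
squashed = [logit_to_unit_interval(x) for x in example_logits]
print(squashed)
```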
Citation
This model / paper:
@inproceedings{kawa23b_interspeech,
title = {Improved DeepFake Detection Using Whisper Features},
author = {Piotr Kawa and Marcin Plata and Micha{\l} Czuba and Piotr Szyma{\'n}ski and Piotr Syga},
year = {2023},
booktitle = {Proc. INTERSPEECH 2023},
pages = {4009--4013},
doi = {10.21437/Interspeech.2023-1537},
}
License
MIT; see the source repository.