---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- whisper
- asvspoof
---

# Whisper-MFCC-MesoNet

[![EER% 0.46 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-0.46%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 5.83 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-5.83%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 15.96 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-15.96%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 18.9 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-18.9%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 26.72 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-26.72%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 49.64 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-49.64%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 15.36 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-15.36%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 31.93 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-31.93%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 30.19 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-30.19%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 22.55 on ASVspoof5](https://img.shields.io/badge/EER%25%20on%20ASVspoof5-22.55%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/whisper-mfcc-mesonet/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/whisper-mfcc-mesonet/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)

The **(Whisper + MFCC) MesoNet** audio anti-spoofing (voice-deepfake detection) countermeasure from *"Improved DeepFake Detection Using Whisper Features"* (Kawa et al., INTERSPEECH 2023), the best-performing MesoNet configuration in that paper. The model takes a raw speech waveform and returns a score where **higher = more bona fide**.

- **Code:** https://github.com/piotrkawa/deepfake-whisper-features
- **Paper:** https://arxiv.org/abs/2306.01428 (INTERSPEECH 2023)
- **Parameters:** 7,660,881 (7.66 M)
- **Checkpoint:** [`whisper_mfcc_mesonet_finetuned.pth`](./whisper_mfcc_mesonet_finetuned.pth)

This repo is self-contained for inference: the network definition is in [`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in [`whispermfccmesonet.py`](./whispermfccmesonet.py).

## Architecture

The model fuses two front-ends, each producing a 384×3000 feature map, stacked as 2 channels and fed to a MesoInception4 classifier:

1. **Whisper tiny.en encoder** (fine-tuned): the audio is turned into a log-Mel spectrogram and passed through the Whisper encoder; its output is reshaped/tiled to 384×3000.
2. **MFCC front-end**: 128 MFCCs + Δ + ΔΔ (`torchaudio` MFCC, n_fft=512, win=400, hop=160), stacked to 384 features and cropped to 3000 frames.
3. **MesoInception4**: two Inception-style blocks + conv stack + a small linear classifier (bona fide vs. spoof), output as a single logit.

## How it was trained & evaluated

- The Whisper encoder is fine-tuned end-to-end together with the MFCC-fed MesoNet head (the paper's strongest MesoNet variant). Training targets the ASVspoof2021 DF deepfake task; In-the-Wild is the paper's headline cross-domain evaluation.
- **Input length:** raw 16 kHz audio windowed to **480,000 samples (30 s)**, the fixed size the Whisper encoder expects (repeat-pad if shorter, truncate if longer).
- **Preprocessing (reproduced from upstream):** sox silence-trim (`silence 1 0.2 1% -1 0.2 1%`) before the 30 s window.
- **Paper's reported result:** In-the-Wild EER = **26.72 %**, reproduced here to within 0.01 pp (see below).

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet). Scores are exactly reproducible from the pinned score files.
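
For readers who want to check an EER figure in this section themselves, the metric can be recomputed from per-trial scores and labels roughly as follows. This is a minimal sketch, not the Arena's own scoring code: the function name `compute_eer`, the label convention (1 = bona fide, 0 = spoof), and the in-memory arrays are assumptions made for illustration. The format of the pinned score files is not described here, so loading them is left out.

```python
import numpy as np


def compute_eer(scores, labels):
    """Equal error rate (as a fraction) for 'higher = more bona fide' scores.

    labels uses an assumed convention: 1 = bona fide, 0 = spoof.
    Tied scores are handled only approximately in this sketch.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    n_bona = int(labels.sum())
    n_spoof = labels.size - n_bona

    # Sort trials by ascending score and sweep a threshold through the gaps.
    order = np.argsort(scores, kind="mergesort")
    sorted_labels = labels[order]

    # With the threshold just above the i-th sorted score:
    #   FRR = fraction of bona fide trials at or below the threshold (rejected)
    #   FAR = fraction of spoof trials above the threshold (accepted)
    frr = np.concatenate(([0.0], np.cumsum(sorted_labels) / n_bona))
    far = np.concatenate(([1.0], 1.0 - np.cumsum(1 - sorted_labels) / n_spoof))

    # EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Toy example: one spoof trial (0.4) outscores one bona fide trial (0.35).
eer = compute_eer([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(f"{100 * eer:.2f} %")  # prints "50.00 %"
```

Because the model's score is a raw logit, only the ranking of trials matters for the EER; no fixed decision threshold is needed.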

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2021_DF | test | **0.46** | 611,829 | 0 | in-domain (deepfake task the model targets) |
| ASVspoof2019_LA | test | **5.83** | 71,237 | 0 | logical-access spoofing |
| ASVspoof2021_LA | test | **15.96** | 181,566 | 0 | logical-access (with codec/channel variation) |
| CD-ADD | test | **18.90** | 20,786 | 0 | modern neural-TTS deepfakes |
| InTheWild | test | **26.72** | 31,779 | 0 | real-world deepfakes (reproduces paper: 26.72 %) |

## Usage

> **Requires libsox** (for the sox silence-trim). torchaudio 2.x ships the sox bindings
> but not the shared library; install it, e.g. `conda install -c conda-forge sox`, and
> make sure `libsox.so` is on `LD_LIBRARY_PATH`.

```python
import numpy as np
from whispermfccmesonet import WhisperMFCCMesoNet  # _net.py must be importable alongside

m = WhisperMFCCMesoNet()
m.load()
audio = np.random.randn(48000).astype(np.float32)  # float32 mono 16 kHz
print(m.score_batch([audio], [16000]))  # higher = more bona fide
m.unload()
```

Internally the wrapper applies the sox silence-trim, repeat-pads to 480,000 samples, runs the Whisper+MFCC MesoNet, and returns the raw logit (`logits[:, 0]`). [`whispermfccmesonet.py`](./whispermfccmesonet.py) is the exact code that produced the Arena `scores.txt`.

## Citation

**This model / paper:**

```bibtex
@inproceedings{kawa23b_interspeech,
  title     = {Improved DeepFake Detection Using Whisper Features},
  author    = {Piotr Kawa and Marcin Plata and Micha{\l} Czuba and Piotr Szyma{\'n}ski and Piotr Syga},
  year      = {2023},
  booktitle = {Proc. INTERSPEECH 2023},
  pages     = {4009--4013},
  doi       = {10.21437/Interspeech.2023-1537},
}
```

## License

MIT; see the [source repository](https://github.com/piotrkawa/deepfake-whisper-features).