---
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - whisper
  - asvspoof
---

# Whisper-MFCC-MesoNet

[![EER% 0.46 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-0.46%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 5.83 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-5.83%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 15.96 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-15.96%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 18.9 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-18.9%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 26.72 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-26.72%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 49.64 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-49.64%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 15.36 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-15.36%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 31.93 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-31.93%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 30.19 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-30.19%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![EER% 22.55 on ASVspoof5](https://img.shields.io/badge/EER%25%20on%20ASVspoof5-22.55%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/whisper-mfcc-mesonet/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/whisper-mfcc-mesonet/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet)

This is the **(Whisper + MFCC) MesoNet** audio anti-spoofing (voice-deepfake detection)
countermeasure from *"Improved DeepFake Detection Using Whisper Features"* (Kawa et al.,
INTERSPEECH 2023), the best-performing MesoNet configuration in that paper. The model
takes a raw speech waveform and returns a score where **higher = more bona fide**.

- **Code:** https://github.com/piotrkawa/deepfake-whisper-features
- **Paper:** https://arxiv.org/abs/2306.01428 (INTERSPEECH 2023)
- **Parameters:** 7,660,881 (7.66 M)
- **Checkpoint:** [`whisper_mfcc_mesonet_finetuned.pth`](./whisper_mfcc_mesonet_finetuned.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in
[`whispermfccmesonet.py`](./whispermfccmesonet.py).

## Architecture

The model fuses two front-ends, each producing a 384×3000 feature map, stacked as
2 channels and fed to a MesoInception4 classifier:

1. **Whisper tiny.en encoder** (fine-tuned) — the audio is turned into a log-Mel
   spectrogram and passed through the Whisper encoder; its output is reshaped/tiled to
   384×3000.
2. **MFCC front-end** — 128 MFCCs + Δ + ΔΔ (`torchaudio` MFCC, n_fft=512, win=400,
   hop=160), stacked to 384 features and cropped to 3000 frames.
3. **MesoInception4** — two Inception-style blocks + conv stack + a small linear
   classifier (bona fide vs. spoof), output as a single logit.
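
The shape bookkeeping above can be sketched with NumPy placeholders. This is only an illustration under stated assumptions: random arrays stand in for the real Whisper encoder output and MFCC features, the 1500×384 encoder output is the Whisper tiny shape for a 30 s window, and both the tiling along the time axis and the channel order are guesses about the details (the actual code is in `_net.py`). The function names are hypothetical.

```python
import numpy as np

N_FRAMES = 3000  # target number of time frames per feature map

def whisper_branch(encoder_out: np.ndarray) -> np.ndarray:
    """(1500, 384) encoder output -> (384, 3000) feature map.

    Assumption: the time axis is doubled by repeating the sequence (tiling).
    """
    feats = encoder_out.T              # (384, 1500): features x frames
    return np.tile(feats, (1, 2))      # (384, 3000)

def mfcc_branch(mfcc: np.ndarray, delta: np.ndarray, delta2: np.ndarray) -> np.ndarray:
    """Three (128, T) arrays -> (384, 3000) map, cropping surplus frames."""
    stacked = np.concatenate([mfcc, delta, delta2], axis=0)   # (384, T)
    return stacked[:, :N_FRAMES]

def fuse(whisper_map: np.ndarray, mfcc_map: np.ndarray) -> np.ndarray:
    """Stack both maps as channels (channel order is an assumption)."""
    return np.stack([whisper_map, mfcc_map], axis=0)          # (2, 384, 3000)

rng = np.random.default_rng(0)
enc = rng.standard_normal((1500, 384)).astype(np.float32)     # placeholder encoder output
# 480,000 samples at hop=160 give 3001 centred frames, hence the crop to 3000.
mfcc, d1, d2 = (rng.standard_normal((128, 3001)).astype(np.float32) for _ in range(3))
x = fuse(whisper_branch(enc), mfcc_branch(mfcc, d1, d2))
print(x.shape)
```

A batch dimension would be added in front of the two channels before the classifier sees the tensor.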

## How it was trained & evaluated

- The Whisper encoder is fine-tuned end-to-end together with the MesoNet classifier,
  which receives the Whisper and MFCC feature maps as two input channels (the paper's
  strongest MesoNet variant). Training targets the ASVspoof2021 DF deepfake task;
  In-the-Wild is the paper's headline cross-domain evaluation.
- **Input length:** raw 16 kHz audio windowed to **480,000 samples (30 s)** — the fixed
  size the Whisper encoder expects (repeat-pad if shorter, truncate if longer).
- **Preprocessing (reproduced from upstream):** sox silence-trim
  (`silence 1 0.2 1% -1 0.2 1%`) before the 30 s window.
- **Paper's reported result:** In-the-Wild EER = **26.72 %** — reproduced here to within
  0.01 pp (see below).
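
The fixed-length windowing described above can be sketched as follows. This is a minimal NumPy illustration, not the wrapper's code: `fix_length` is a hypothetical helper, and keeping the first 480,000 samples when truncating is an assumption. The sox silence-trim is not included.

```python
import numpy as np

TARGET_LEN = 480_000  # 30 s at 16 kHz

def fix_length(wave: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Repeat-pad a 1-D waveform if it is short, truncate it if it is long."""
    if wave.ndim != 1 or wave.shape[0] == 0:
        raise ValueError("expected a non-empty 1-D waveform")
    if wave.shape[0] >= target_len:
        return wave[:target_len]                 # assumption: keep the beginning
    repeats = -(-target_len // wave.shape[0])    # ceiling division
    return np.tile(wave, repeats)[:target_len]   # repeat, then cut to size

short = np.arange(48_000, dtype=np.float32)      # 3 s placeholder signal
padded = fix_length(short)
print(padded.shape)
```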

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=whisper-mfcc-mesonet).
The EERs in the table are exactly reproducible from the pinned score files.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2021_DF | test | **0.46** | 611,829 | 0 | in-domain (deepfake task the model targets) |
| ASVspoof2019_LA | test | **5.83** | 71,237 | 0 | logical-access spoofing |
| ASVspoof2021_LA | test | **15.96** | 181,566 | 0 | logical-access (with codec/channel variation) |
| CD-ADD | test | **18.90** | 20,786 | 0 | modern neural-TTS deepfakes |
| InTheWild | test | **26.72** | 31,779 | 0 | real-world deepfakes (reproduces paper: 26.72 %) |
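
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate on spoofs equals the false-rejection rate on bona fide trials. The sketch below is a simple threshold sweep written for illustration; `compute_eer` is a hypothetical helper, and the Arena's own routine may differ in details such as interpolation between thresholds.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores) -> float:
    """Return the EER as a fraction in [0, 1]; higher score = more bona fide."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # FRR: fraction of bona fide trials scored below the threshold (rejected).
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # FAR: fraction of spoof trials scored at or above the threshold (accepted).
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(frr - far))           # closest crossing point
    return float((frr[idx] + far[idx]) / 2.0)

# Toy scores: perfectly separated classes give an EER of zero.
eer = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
print(f"EER = {100 * eer:.2f} %")
```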

## Usage

> **Requires libsox** (for the sox silence-trim). torchaudio 2.x ships the sox bindings
> but not the shared library; install it, e.g. `conda install -c conda-forge sox`, and
> make sure the directory containing `libsox.so` is on `LD_LIBRARY_PATH`.

```python
import numpy as np
from whispermfccmesonet import WhisperMFCCMesoNet   # _net.py must be importable alongside

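m = WhisperMFCCMesoNet()
m.load()                                             # load the model before scoring
audio = np.random.randn(48000).astype(np.float32)   # 3 s of float32 mono audio at 16 kHz (random placeholder)
print(m.score_batch([audio], [16000]))               # waveforms + their sample rates; higher = more bona fide
m.unload()                                           # release the model when done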
```

Internally the wrapper applies the sox silence-trim, repeat-pads or truncates to
480,000 samples, runs the Whisper+MFCC MesoNet, and returns the raw logit (`logits[:, 0]`).
[`whispermfccmesonet.py`](./whispermfccmesonet.py) is the exact code that produced the
Arena `scores.txt`.
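
Because the returned value is a raw logit, turning it into a hard decision requires a threshold. The sketch below is one possible post-processing step, not part of the wrapper: `logits_to_labels` is a hypothetical helper, the default threshold of 0.0 is an assumption (a threshold tuned on held-out data, such as the EER threshold, will usually be more appropriate), and the sigmoid value should not be read as a calibrated probability.

```python
import numpy as np

def logits_to_labels(logits, threshold: float = 0.0):
    """Map raw logits to sigmoid values and bona fide / spoof labels."""
    logits = np.asarray(logits, dtype=np.float64)
    prob = 1.0 / (1.0 + np.exp(-logits))                       # squashed to (0, 1)
    labels = np.where(logits >= threshold, "bonafide", "spoof")
    return prob, labels

# Made-up logits, for illustration only.
prob, labels = logits_to_labels([-2.0, 0.0, 3.0])
for p, lab in zip(prob, labels):
    print(f"{p:.3f} -> {lab}")
```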

## Citation

**This model / paper:**

```bibtex
@inproceedings{kawa23b_interspeech,
  title     = {Improved DeepFake Detection Using Whisper Features},
  author    = {Piotr Kawa and Marcin Plata and Micha{\l} Czuba and Piotr Szyma{\'n}ski and Piotr Syga},
  year      = {2023},
  booktitle = {Proc. INTERSPEECH 2023},
  pages     = {4009--4013},
  doi       = {10.21437/Interspeech.2023-1537},
}
```

## License

MIT — see the [source repository](https://github.com/piotrkawa/deepfake-whisper-features).