---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
---

# XLSR-SLS

A **wav2vec 2.0 (XLS-R 300M) + SLS** audio-deepfake-detection model, from
*"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"*
(Zhang, Wen & Hu, **ACM MM 2024**). A self-supervised XLS-R front-end is paired
with the **SLS (Sensitive Layer Selection)** classifier, which treats the 24
XLS-R transformer layers as a feature pyramid and learns to weight them. The
model takes a raw speech waveform and returns a score where **higher = more
bona fide**.

- **Code:** https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
- **Paper:** https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
- **Parameters:** 340,790,000 (340.79 M)
- **Checkpoint:** [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)

The exact wrapper used to produce the Arena scores is in
[`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).

## Architecture

1. **wav2vec 2.0 XLS-R (300M) front-end**: a self-supervised transformer
   (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
   transformer layers**.
2. **SLS (Sensitive Layer Selection) back-end**: every layer's hidden state is
   average-pooled over time to a 1024-d descriptor and gated by a per-layer
   **sigmoid attention** (`fc0` → sigmoid); the gates re-weight the full
   per-layer feature stack, which is summed across layers. The fused feature
   passes through BatchNorm + SELU + `3×3` max-pool, is flattened, and goes
   through a two-layer MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
3. The 2-class **log-softmax** output is read at **index 1 = bona fide**.

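The layer-fusion step in item 2 can be sketched in NumPy as follows. This is a minimal illustration, not the code in `_net.py`: the function name `sls_fuse`, the shape of the gate (a single 1024→1 linear layer shared by all layers) and the random inputs are assumptions made for the example.

```python
import numpy as np

def sls_fuse(layer_feats, fc0_weight, fc0_bias):
    """Fuse per-layer features with sigmoid gates (illustrative sketch).

    layer_feats: (num_layers, frames, dim) hidden states, one slice per layer.
    fc0_weight:  (dim,) weights of an assumed dim -> 1 linear gate.
    fc0_bias:    scalar bias of that gate.
    """
    # Average-pool each layer over time: (num_layers, dim).
    descriptors = layer_feats.mean(axis=1)
    # One scalar gate per layer: linear projection followed by a sigmoid.
    logits = descriptors @ fc0_weight + fc0_bias
    gates = 1.0 / (1.0 + np.exp(-logits))  # (num_layers,)
    # Re-weight the full feature stack and sum across layers: (frames, dim).
    fused = (gates[:, None, None] * layer_feats).sum(axis=0)
    return fused, gates

rng = np.random.default_rng(0)
feats = rng.standard_normal((24, 201, 1024)).astype(np.float32)  # dummy features
w = (rng.standard_normal(1024) * 0.01).astype(np.float32)        # dummy gate weights
fused, gates = sls_fuse(feats, w, 0.0)
print(fused.shape, gates.shape)  # (201, 1024) (24,)
```

Each gate is a single scalar per layer, so the fusion keeps the full time-by-feature map and only changes how much each layer contributes to it.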
## How it was trained

- **Data:** ASVspoof 2019 **Logical Access (LA)**.
- **Input length:** raw audio at 16 kHz cropped/padded to **64,600 samples**
  (~4.04 s). The window length is **fixed**: `fc1` expects a 22,847-d flattened
  input, so the 64,600-sample window is mandatory at inference.
- **Output:** 2-class log-softmax; the bona-fide log-prob (index 1) is the score.

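The link between the 64,600-sample window and the 22,847-d input of `fc1` can be checked with a few lines of arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional encoder (kernel/stride pairs as listed) and a `3×3` max-pool with stride 3, no padding and floor rounding; the helper names are hypothetical.

```python
# Assumed wav2vec 2.0 convolutional feature encoder: (kernel, stride) per layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples, conv_layers=CONV_LAYERS):
    """Number of output frames of an unpadded 1-D conv stack."""
    n = num_samples
    for kernel, stride in conv_layers:
        n = (n - kernel) // stride + 1
    return n

def flattened_size(num_samples, feat_dim=1024, pool=3):
    """Size of the (frames x feat_dim) map after pool x pool max-pooling."""
    frames = num_frames(num_samples)
    return (frames // pool) * (feat_dim // pool)

print(num_frames(64600))      # 201
print(flattened_size(64600))  # 22847
```

Under these assumptions 64,600 samples give 201 frames, the pooled map is 67 × 341, and 67 × 341 = 22,847. Any other input length changes the frame count and no longer matches `fc1`.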
See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
**Arena standing: 🥇 gold tier, rank #1 of 10.**

| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST EER % † | Notes |
|---|---|---|---|---|---|---|
| | ASVspoof2019_LA | test | **0.23** | 71,237 | 0 | 0.22 | in-domain (training data) | |
| | ASVspoof2021_LA | test | **7.39** | 181,566 | 0 | 8.11 | cross-dataset generalization | |
| | ASVspoof2021_DF | test | **3.93** | 611,829 | 0 | 8.32 | cross-dataset generalization | |
| | InTheWild | test | **7.46** | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) | |
| CD-ADD | test | **9.81** | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |

† Same benchmark, the other XLS-R-based system (XLS-R 300M + AASIST). XLSR-SLS's
multi-layer SLS fusion gives the lower EER on **every set outside the training
data**, most strikingly on **ASVspoof2021_DF (3.93 vs 8.32)** and
**CD-ADD (9.81 vs 38.57)**, and is on par in-domain (0.23 vs 0.22). The
benchmark's ASVspoof2021 LA/DF use curated trial sets, so the absolute DF EER
here (3.93 %) differs from the paper's official-keys number (1.92 %), while the
paper's InTheWild EER (7.46 %) is matched exactly; the relative ordering is the
meaningful comparison.

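For readers unfamiliar with the metric, the EER in the table is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. The sketch below is a simple threshold sweep under the "higher = more bona fide" convention. It is not the Arena's scorer, which may interpolate differently, and the function name `compute_eer` and the toy scores are made up for illustration.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate equal error rate by sweeping every observed score as threshold."""
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    thresholds = np.unique(np.concatenate([bona, spoof]))
    # A trial is accepted as bona fide when its score is >= the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

# Toy scores: one bona fide trial scores low, one spoof trial scores high.
bona = [0.9, 0.8, 0.7, 0.2]
spoof = [0.1, 0.3, 0.4, 0.75]
print(compute_eer(bona, spoof))  # 0.25
```

The sweep is quadratic in the number of trials, so it suits small examples only; a sorted cumulative-count implementation would be needed for sets the size of ASVspoof2021_DF.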
## Usage

The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** (only used to build the wav2vec 2.0 architecture;
every weight is then overwritten by `MMpaper_model.pth`):

```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```

The input **must** be exactly 64,600 samples at 16 kHz mono. Window the waveform
with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).

```python
import numpy as np
from xlsr_sls import XLSRSLS  # _net.py + xlsr_sls.py are in this repo

m = XLSRSLS()
m.load()  # loads MMpaper_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32)  # dummy input: 3 s of float32 mono noise at 16 kHz
print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
m.unload()
```

Internally the wrapper windows the input, runs the network, and returns
`output[:, 1]` (class 1 = bona fide; source `main.py`:
`batch_score = batch_out[:, 1]`). [`xlsr_sls.py`](./xlsr_sls.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.

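The windowing and score read-out described above can be sketched as follows. This is an assumption-laden illustration rather than the code in `xlsr_sls.py`: `pad_fixed_sketch` and `bonafide_log_prob` are hypothetical names, and the real `pad_fixed` may differ in detail.

```python
import numpy as np

WINDOW = 64600  # samples at 16 kHz (about 4.04 s)

def pad_fixed_sketch(wave, length=WINDOW):
    """Keep the first `length` samples; tile-repeat the clip if it is shorter."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if wave.size >= length:
        return wave[:length]
    repeats = -(-length // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:length]

def bonafide_log_prob(logits):
    """Log-softmax over 2 classes, returning column 1 (assumed bona fide)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[:, 1]

short = np.arange(48000, dtype=np.float32)  # 3 s dummy clip
window = pad_fixed_sketch(short)
print(window.shape)  # (64600,)
```

Because the window always starts at sample 0 and never uses a random crop, the same audio file yields the same score on every run.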
## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={6765--6773},
  year={2024},
  doi={10.1145/3664647.3681345}
}
```

## License

MIT; see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).