---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
---
# XLSR-SLS
A **wav2vec 2.0 (XLS-R 300M) + SLS** audio-deepfake-detection model, from
*"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"*
(Zhang, Wen & Hu, **ACM MM 2024**). A self-supervised XLS-R front-end is paired
with the **SLS (Sensitive Layer Selection)** classifier, which treats the 24
XLS-R transformer layers as a feature pyramid and learns to weight them. The
model takes a raw speech waveform and returns a score where **higher = more
bona fide**.
- **Code:** https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
- **Paper:** https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
- **Parameters:** 340,790,000 (340.79 M)
- **Checkpoint:** [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)

The exact wrapper used to produce the Arena scores is in
[`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).
## Architecture
1. **wav2vec 2.0 XLS-R (300M) front-end** — a self-supervised transformer
(`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
transformer layers**.
2. **SLS (Sensitive Layer Selection) back-end** — every layer's hidden state is
average-pooled to a 1024-d descriptor and gated by a per-layer **sigmoid
attention** (`fc0` → sigmoid); the gates re-weight the full per-layer feature
stack, which is summed across layers. The fused feature passes through
BatchNorm + SELU + `3×3` max-pool, is flattened, and goes through a two-layer
MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
3. The 2-class **log-softmax** output is read at **index 1 = bona fide**.
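
The fusion and pooling steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code in `_net.py`: random arrays stand in for the XLS-R layer outputs, BatchNorm is omitted, `fc0` is assumed to map each 1024-d descriptor to a single scalar gate, the max-pool is assumed to use stride 3 with no padding, 201 frames is the assumed frame count for a 64,600-sample window, and the names `sls_fuse_sketch`, `selu`, and `pool_and_flatten_sketch` are hypothetical.

```python
import numpy as np

def sls_fuse_sketch(layer_feats, w, b):
    """Gate and sum per-layer features.

    layer_feats: (layers, frames, dim) stack of hidden states.
    w, b: weights of an assumed dim -> 1 linear map standing in for fc0.
    """
    pooled = layer_feats.mean(axis=1)                # (layers, dim): average over frames
    gates = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))  # (layers,): one sigmoid gate per layer
    fused = (gates[:, None, None] * layer_feats).sum(axis=0)  # (frames, dim)
    return fused, gates

def selu(x):
    """SELU activation with its standard constants."""
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def pool_and_flatten_sketch(fused, k=3):
    """Non-overlapping k x k max-pool (stride k, no padding), then flatten."""
    t, d = fused.shape[0] // k, fused.shape[1] // k
    trimmed = fused[: t * k, : d * k]                # drop rows/columns that do not fill a window
    return trimmed.reshape(t, k, d, k).max(axis=(1, 3)).reshape(-1)

rng = np.random.default_rng(0)
# Stand-in for 24 layers x 201 frames x 1024 dims of XLS-R hidden states.
feats = rng.standard_normal((24, 201, 1024)).astype(np.float32)
w = rng.standard_normal(1024).astype(np.float32) * 0.01
fused, gates = sls_fuse_sketch(feats, w, 0.0)
flat = pool_and_flatten_sketch(selu(fused))
print(gates.shape, fused.shape, flat.shape)  # (24,) (201, 1024) (22847,)
```

Under these assumptions the flattened vector has 67 × 341 = 22,847 entries, which matches the input width of `fc1`.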
## How it was trained
- **Data:** ASVspoof 2019 **Logical Access (LA)**.
- **Input length:** raw audio at 16 kHz cropped/padded to **64,600 samples**
(~4.04 s). The window length is **fixed** — `fc1` expects a 22,847-d flatten,
so the 64,600-sample window is mandatory at inference.
- **Output:** 2-class log-softmax; the bona-fide log-prob (index 1) is the score.

See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
for the full training and evaluation code.
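
Why the window length is tied to `fc1` can be checked with a little shape arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional feature extractor (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2, no padding) and a `3×3` max-pool with stride 3 and no padding. Neither detail is stated in this card, and the helper names are hypothetical.

```python
# Assumed wav2vec 2.0 feature-extractor layout: (kernel, stride) per conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples):
    """Number of feature frames produced for a waveform of num_samples samples."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1  # unpadded 1-D convolution output length
    return n

def flatten_size(num_samples, dim=1024, pool=3):
    """Size of the flattened feature after a pool x pool max-pool (stride = pool)."""
    frames = num_frames(num_samples)
    return (frames // pool) * (dim // pool)

print(num_frames(64600))    # 201
print(flatten_size(64600))  # 22847, the input width of fc1
print(flatten_size(48000))  # 16709, which would not fit fc1
```

Under these assumptions, any window length that yields a different frame count produces a flatten size other than 22,847, which is why shorter or longer inputs have to be cropped or padded first.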
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

**Arena standing: 🥇 gold tier, rank #1 of 10.**

| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST† | Notes |
|---|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **0.23** | 71,237 | 0 | 0.22 | in-domain (same corpus as the training data) |
| ASVspoof2021_LA | test | **7.39** | 181,566 | 0 | 8.11 | cross-dataset generalization |
| ASVspoof2021_DF | test | **3.93** | 611,829 | 0 | 8.32 | cross-dataset generalization |
| InTheWild | test | **7.46** | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **9.81** | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |

† Same benchmark, the other XLS-R-based system (XLS-R 300M + AASIST). XLSR-SLS's
multi-layer SLS fusion achieves a lower EER on **every cross-dataset and
out-of-domain set**, most strikingly on **ASVspoof2021_DF (3.93 vs 8.32)** and
**CD-ADD (9.81 vs 38.57)**, and is on par in-domain (0.23 vs 0.22). The benchmark's
ASVspoof2021 LA/DF use curated trial sets, so absolute EER differs from the paper's
official-keys numbers (1.92 % DF, 7.46 % InTheWild; the latter is matched here
exactly). The relative ordering is the meaningful comparison.
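
For readers unfamiliar with the metric, the EER in the table is the operating point where the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is a simple threshold sweep under the score convention higher = more bona fide. It is not the Arena's scoring code, the function name `eer_sketch` is hypothetical, and the toy scores are made up for illustration.

```python
import numpy as np

def eer_sketch(bona_scores, spoof_scores):
    """Approximate equal error rate, assuming higher score = more bona fide."""
    bona = np.asarray(bona_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # A trial is accepted as bona fide when its score is >= the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    i = int(np.argmin(np.abs(frr - far)))                      # closest crossing point
    return (frr[i] + far[i]) / 2.0

# Toy scores: one bona fide trial (1.0) scores below one spoof trial (2.0).
print(eer_sketch([1.0, 3.0, 4.0, 5.0], [-1.0, 0.0, 0.5, 2.0]))  # 0.25
```

The quadratic sweep is kept for readability; with hundreds of thousands of trials, a sort-and-cumulative-sum formulation would be the more practical choice.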
## Usage
The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** (only used to build the wav2vec 2.0 architecture;
every weight is then overwritten by `MMpaper_model.pth`):
```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```
The input **must** be exactly 64,600 samples at 16 kHz mono — window the waveform
with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).
```python
import numpy as np

from xlsr_sls import XLSRSLS  # _net.py + xlsr_sls.py are in this repo

m = XLSRSLS()
m.load()  # loads MMpaper_model.pth (+ xlsr2_300m.pt)

# 3 s of float32 mono audio at 16 kHz; shorter than 64,600 samples, so it gets tile-repeated
audio = np.random.randn(48000).astype(np.float32)
print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
m.unload()
```
Internally the wrapper windows the input, runs the network, and returns
`output[:, 1]` (class 1 = bona fide; source `main.py`: `batch_score =
batch_out[:, 1]`). [`xlsr_sls.py`](./xlsr_sls.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.
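
The windowing rule described above (first 64,600 samples, tile-repeat if shorter) could look like the following. This is a sketch of the behaviour as described, not a copy of the `pad_fixed` implementation in `xlsr_sls.py`; the name `pad_fixed_sketch` and its signature are assumptions.

```python
import numpy as np

def pad_fixed_sketch(x, target_len=64600):
    """Return exactly target_len samples: crop from the start, or tile-repeat if shorter."""
    x = np.asarray(x, dtype=np.float32)
    if x.ndim != 1 or x.shape[0] == 0:
        raise ValueError("expected a non-empty mono waveform")
    if x.shape[0] >= target_len:
        return x[:target_len]                # deterministic first-window crop
    reps = -(-target_len // x.shape[0])      # ceiling division
    return np.tile(x, reps)[:target_len]     # repeat the clip, then trim

short = np.arange(48000, dtype=np.float32)   # 3 s at 16 kHz
long = np.arange(100000, dtype=np.float32)   # 6.25 s at 16 kHz
print(pad_fixed_sketch(short).shape, pad_fixed_sketch(long).shape)  # (64600,) (64600,)
```

Because the crop always starts at sample 0, the same file always yields the same window, which is consistent with the deterministic scoring used for the benchmark numbers.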
## Citation
```bibtex
@inproceedings{zhang2024audio,
title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
pages={6765--6773},
year={2024},
doi={10.1145/3664647.3681345}
}
```
## License
MIT — see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).