---
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof
  - wav2vec2
---

# XLSR-SLS

[![EER% 0.23 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.23%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![EER% 7.39 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-7.39%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![EER% 3.93 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-3.93%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![EER% 7.46 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-7.46%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![EER% 9.81 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD-ADD-9.81%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)

A **wav2vec 2.0 (XLS-R 300M) + SLS** audio-deepfake-detection model, from
*"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"*
(Zhang, Wen & Hu, **ACM MM 2024**). A self-supervised XLS-R front-end is paired
with the **SLS (Sensitive Layer Selection)** classifier, which treats the 24
XLS-R transformer layers as a feature pyramid and learns to weight them. The
model takes a raw speech waveform and returns a score where **higher = more
bona fide**.

- **Code:** https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
- **Paper:** https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
- **Parameters:** 340,790,000 (340.79 M)
- **Checkpoint:** [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)

The exact wrapper used to produce the Arena scores is in
[`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).

## Architecture

1. **wav2vec 2.0 XLS-R (300M) front-end** — a self-supervised transformer
   (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
   transformer layers**.
2. **SLS (Sensitive Layer Selection) back-end** — every layer's hidden state is
   average-pooled to a 1024-d descriptor and gated by a per-layer **sigmoid
   attention** (`fc0` → sigmoid); the gates re-weight the full per-layer feature
   stack, which is summed across layers. The fused feature passes through
   BatchNorm + SELU + `3×3` max-pool, is flattened, and goes through a two-layer
   MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
3. The 2-class **log-softmax** output is read at **index 1 = bona fide**.
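
The gating and summation in step 2 can be sketched in NumPy as follows. This is a minimal illustration, not the code in `_net.py`: it assumes `fc0` maps each pooled 1024-d descriptor to a single scalar gate per layer, and the names `sls_fuse`, `w`, and `b` are hypothetical, with random values standing in for the trained weights.

```python
import numpy as np

def sls_fuse(layers, w, b):
    """Fuse per-layer features with sigmoid gates (illustrative sketch).

    layers: (L, T, D) stack of hidden states, one (T, D) matrix per layer
    w, b:   assumed fc0 parameters mapping a D-d descriptor to one scalar
    returns (T, D) fused feature map
    """
    pooled = layers.mean(axis=1)              # (L, D): average-pool over time
    logits = pooled @ w + b                   # (L,):   one logit per layer
    gates = 1.0 / (1.0 + np.exp(-logits))     # (L,):   sigmoid gate per layer
    # Re-weight the full per-layer stack, then sum across layers.
    return (layers * gates[:, None, None]).sum(axis=0)

rng = np.random.default_rng(0)
layers = rng.standard_normal((24, 201, 1024)).astype(np.float32)  # 24 layers, 201 frames
w = (0.01 * rng.standard_normal(1024)).astype(np.float32)         # random stand-in weights
fused = sls_fuse(layers, w, 0.0)
print(fused.shape)  # (201, 1024)
```

The fused `(201, 1024)` map is what the BatchNorm, SELU, and max-pool stages would then consume.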

## How it was trained

- **Data:** ASVspoof 2019 **Logical Access (LA)**.
- **Input length:** raw audio at 16 kHz cropped/padded to **64,600 samples**
  (~4.04 s). The window length is **fixed**: `fc1` expects a 22,847-d flattened
  input, so the 64,600-sample window is mandatory at inference.
- **Output:** 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
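
The link between the 64,600-sample window and the 22,847-d input of `fc1` can be checked with a few lines of arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional feature extractor (kernel/stride pairs as listed) and a `3×3` max-pool with stride 3 and no padding; the helper names `num_frames` and `flatten_dim` are illustrative and do not appear in the repository.

```python
# (kernel, stride) of each conv layer in the standard wav2vec 2.0 feature
# extractor (assumed here; check the fairseq config for the authoritative values).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(n_samples):
    """Number of frames produced for n_samples of 16 kHz audio."""
    n = n_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1   # unpadded 1-D convolution
    return n

def flatten_dim(n_samples, feat_dim=1024, pool=3):
    """Size of the flattened map after a pool x pool max-pool (stride = pool)."""
    return (num_frames(n_samples) // pool) * (feat_dim // pool)

print(num_frames(64600))   # 201
print(flatten_dim(64600))  # 22847
```

Under the same assumptions, a 48,000-sample input would yield 149 frames and a 16,709-d flatten, which no longer matches `fc1`.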

See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
**Arena standing: 🥇 gold tier, rank #1 of 10.**

| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST† | Notes |
|---|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **0.23** | 71,237 | 0 | 0.22 | in-domain (training data) |
| ASVspoof2021_LA | test | **7.39** | 181,566 | 0 | 8.11 | cross-dataset generalization |
| ASVspoof2021_DF | test | **3.93** | 611,829 | 0 | 8.32 | cross-dataset generalization |
| InTheWild | test | **7.46** | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **9.81** | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |

† Same benchmark, the other XLS-R-based system (XLS-R 300M + AASIST). XLSR-SLS's
multi-layer SLS fusion wins on **all four cross-dataset and out-of-domain sets**,
most strikingly on **ASVspoof2021_DF (3.93 vs 8.32)** and **CD-ADD (9.81 vs 38.57)**,
and is on par in-domain (0.23 vs 0.22). The benchmark's ASVspoof2021 LA/DF splits use
curated trial sets, so absolute EER differs from the paper's official-keys number
(1.92 % on DF); the paper's 7.46 % on InTheWild is matched here exactly. The relative
ordering is the meaningful comparison.
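
For readers unfamiliar with the metric in the table, the sketch below shows one simple way to estimate an equal error rate from scores that follow this model's convention (higher = more bona fide). It is an assumption-laden illustration, not the Arena's scoring code: the function name `eer_percent` is hypothetical, the threshold sweep is brute force, and the Arena may interpolate differently.

```python
import numpy as np

def eer_percent(bonafide_scores, spoof_scores):
    """Approximate EER (%) by sweeping every observed score as a threshold."""
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # A trial is accepted as bona fide when its score is >= the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    i = np.argmin(np.abs(far - frr))      # point where the two error rates meet
    return 100.0 * (far[i] + frr[i]) / 2.0

print(eer_percent([2.0, 3.0], [0.0, 1.0]))  # 0.0 (perfectly separated)
print(eer_percent([0.0, 1.0], [0.0, 1.0]))  # 50.0 (indistinguishable)
```

The sweep is quadratic in the number of trials, so a sorted or cumulative-count formulation would be preferable for sets as large as ASVspoof2021_DF.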

## Usage

The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** (only used to build the wav2vec 2.0 architecture;
every weight is then overwritten by `MMpaper_model.pth`):

```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```

The input **must** be exactly 64,600 samples at 16 kHz mono — window the waveform
with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).

```python
import numpy as np
from xlsr_sls import XLSRSLS   # _net.py + xlsr_sls.py are in this repo

m = XLSRSLS()
m.load()                                          # loads MMpaper_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()
```

Internally the wrapper windows the input, runs the network, and returns
`output[:, 1]` (class 1 = bona fide; source `main.py`: `batch_score =
batch_out[:, 1]`). [`xlsr_sls.py`](./xlsr_sls.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.
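
The windowing rule described in this section (first 64,600 samples, tile-repeat if shorter) can be sketched as below. This is an illustrative reimplementation under those stated rules, not the wrapper's own `pad_fixed`; the name `pad_fixed_sketch` and the empty-input check are assumptions.

```python
import numpy as np

TARGET_LEN = 64_600  # fixed window length required by the network

def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Return exactly target_len samples: truncate, or tile-repeat if shorter."""
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]              # deterministic first-N window
    repeats = -(-target_len // wave.size)     # ceiling division
    return np.tile(wave, repeats)[:target_len]

short = np.arange(48_000, dtype=np.float32)   # 3 s of placeholder audio
window = pad_fixed_sketch(short)
print(window.shape)  # (64600,)
```

Because the window always starts at sample 0, repeated runs on the same file produce the same input, which is what makes the reported scores reproducible.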

## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={6765--6773},
  year={2024},
  doi={10.1145/3664647.3681345}
}
```

## License

MIT — see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).