XLSR-SLS
A wav2vec 2.0 (XLS-R 300M) + SLS audio-deepfake-detection model, from "Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier" (Zhang, Wen & Hu, ACM MM 2024). A self-supervised XLS-R front-end is paired with the SLS (Sensitive Layer Selection) classifier, which treats the 24 XLS-R transformer layers as a feature pyramid and learns to weight them. The model takes a raw speech waveform and returns a score where higher = more bona fide.
- Code: https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
- Paper: https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
- Parameters: 340,790,000 (340.79 M)
- Checkpoint: MMpaper_model.pth (the paper's released model)
The exact wrapper used to produce the Arena scores is in
xlsr_sls.py; the network definition is in _net.py.
Architecture
- wav2vec 2.0 XLS-R (300M) front-end: a self-supervised transformer (fairseq Wav2Vec2Model) producing 1024-d frame features from all 24 transformer layers.
- SLS (Sensitive Layer Selection) back-end: every layer's hidden state is average-pooled to a 1024-d descriptor and gated by a per-layer sigmoid attention (fc0 → sigmoid); the gates re-weight the full per-layer feature stack, which is summed across layers. The fused feature passes through BatchNorm + SELU + 3×3 max-pool, is flattened, and goes through a two-layer MLP (fc1: 22847→1024, fc3: 1024→2).
- The 2-class log-softmax output is read at index 1 = bona fide.
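The layer-gating step described above can be illustrated with a small NumPy sketch. It is not the code from _net.py: the function name sls_fuse is hypothetical, the weights are random stand-ins for the trained fc0 layer, and the 201-frame length assumes a 64,600-sample input. It only shows the shape flow of pooling each layer, gating it with a sigmoid, and summing across layers.

```python
import numpy as np

def sls_fuse(layers, w, b):
    """Gate and sum per-layer features (illustrative sketch only).

    layers: (L, T, D) hidden states of L transformer layers for one utterance.
    w, b:   stand-ins for the fc0 linear layer (D -> 1); random here, not trained.
    """
    pooled = layers.mean(axis=1)                          # (L, D): average over time
    gates = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))       # (L,): one sigmoid gate per layer
    fused = (layers * gates[:, None, None]).sum(axis=0)   # (T, D): weighted sum over layers
    return fused, gates

rng = np.random.default_rng(0)
L, T, D = 24, 201, 1024                                   # layers, frames, feature width
layers = rng.standard_normal((L, T, D)).astype(np.float32)
w = (0.01 * rng.standard_normal(D)).astype(np.float32)    # hypothetical fc0 weights
fused, gates = sls_fuse(layers, w, 0.0)
print(fused.shape, gates.shape)  # (201, 1024) (24,)
```

In the full model this fused (T, D) map is what goes on to BatchNorm, SELU, max-pooling and the MLP.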
How it was trained
- Data: ASVspoof 2019 Logical Access (LA).
- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). The window length is fixed: fc1 expects a 22,847-d flatten, so the 64,600-sample window is mandatory at inference.
- Output: 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
See the source repository for the full training and evaluation code.
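The 22,847-d flatten size can be checked with a little arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional encoder (seven layers with the kernel/stride pairs listed in the code) and a 3×3 max-pool with stride 3 and no padding. The helper names are illustrative, and the released model config remains the authoritative source.

```python
# Kernel/stride pairs of the standard wav2vec 2.0 feature encoder
# (an assumption here; check the released config to confirm).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples):
    """Number of frames the convolutional encoder emits for a raw waveform."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

def flatten_dim(num_samples, feat_dim=1024, pool=3):
    """Size of the flattened (frames x features) map after pool x pool max-pooling."""
    return (num_frames(num_samples) // pool) * (feat_dim // pool)

print(num_frames(64600))   # 201
print(flatten_dim(64600))  # 22847
```

A different input length gives a different frame count, and therefore a flatten size that no longer matches fc1.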
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file. Arena standing: 🥇 gold tier, rank #1 of 10.
| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST† | Notes |
|---|---|---|---|---|---|---|
| ASVspoof2019_LA | test | 0.23 | 71,237 | 0 | 0.22 | in-domain (training data) |
| ASVspoof2021_LA | test | 7.39 | 181,566 | 0 | 8.11 | cross-dataset generalization |
| ASVspoof2021_DF | test | 3.93 | 611,829 | 0 | 8.32 | cross-dataset generalization |
| InTheWild | test | 7.46 | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | 9.81 | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |
† The other XLS-R-based system on the same benchmark (XLS-R 300M + AASIST). XLSR-SLS's multi-layer SLS fusion wins on every cross-dataset and out-of-domain set, most strikingly on ASVspoof2021_DF (3.93 vs 8.32) and CD-ADD (9.81 vs 38.57), and is on par in-domain. The benchmark's ASVspoof2021 LA/DF use curated trial sets, so absolute EER differs from the paper's official-keys numbers (1.92 % DF, 7.46 % InTheWild; the latter matched here exactly); the relative ordering is the meaningful comparison.
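For readers unfamiliar with the metric, the equal error rate (EER) in the table is the operating point where the false-acceptance and false-rejection rates coincide. The sketch below is a generic NumPy approximation, assuming higher scores mean more bona fide. The function name compute_eer is illustrative, and the Arena's own scoring routine may differ in details such as interpolation between thresholds.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER in percent, with higher score = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection rate: fraction of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: fraction of spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(frr - far))        # threshold where the two rates are closest
    return 100.0 * (frr[idx] + far[idx]) / 2.0

print(compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))  # 0.0
```

Perfectly separated scores give an EER of 0 %, while fully overlapping score distributions approach 50 %.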
Usage
The checkpoint is a state_dict for the Model network defined in
_net.py. Constructing the network requires the base XLS-R 300M
checkpoint xlsr2_300m.pt (only used to build the wav2vec 2.0 architecture;
every weight is then overwritten by MMpaper_model.pth):
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
The input must be exactly 64,600 samples at 16 kHz mono: window the waveform
with pad_fixed (first 64,600 samples, tile-repeat if shorter).
import numpy as np
from xlsr_sls import XLSRSLS # _net.py + xlsr_sls.py are in this repo
m = XLSRSLS()
m.load() # loads MMpaper_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
m.unload()
Internally the wrapper windows the input, runs the network, and returns
output[:, 1] (class 1 = bona fide; source main.py: batch_score = batch_out[:, 1]). xlsr_sls.py is the exact
speech_spoof_bench model that produced the Arena scores.txt.
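The windowing rule mentioned above (keep the first 64,600 samples, tile-repeat if shorter) could be written roughly as follows. This is a sketch of the described behaviour, not the pad_fixed implementation shipped in xlsr_sls.py. The name pad_fixed_sketch and the empty-input check are assumptions made for illustration.

```python
import numpy as np

def pad_fixed_sketch(wave, target_len=64600):
    """Return exactly target_len samples: truncate long inputs, tile short ones."""
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]              # deterministic: first target_len samples
    repeats = -(-target_len // wave.size)     # ceiling division
    return np.tile(wave, repeats)[:target_len]

short = np.arange(48000, dtype=np.float32)    # 3 s at 16 kHz
print(pad_fixed_sketch(short).shape)          # (64600,)
```

Because the window always starts at sample 0, repeated runs on the same file produce identical inputs, which is what makes the scores reproducible.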
Citation
@inproceedings{zhang2024audio,
title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
pages={6765--6773},
year={2024},
doi={10.1145/3664647.3681345}
}
License
MIT; see the source repository.