system:
  name: "XLSR-SLS"
  slug: "xlsr-sls"
  description: >
    wav2vec 2.0 (XLS-R 300M) self-supervised front-end with the SLS
    (Sensitive Layer Selection) classifier for audio deepfake detection.
    SLS gates and fuses the hidden states of all XLS-R transformer layers
    (each layer contributing distinct discriminative cues) via a per-layer
    sigmoid attention, sums the weighted multi-layer feature, then a
    BN + max-pool + two-layer MLP head emits a 2-way log-softmax.
    Official QiShanZhang/SLSforASVspoof-2021-DF checkpoint (model_15,
    dev-EER 1.45%), trained on ASVspoof2019 LA, FP32, deterministic
    first-64600-sample window (no random crop).
  code: "https://github.com/QiShanZhang/SLSforASVspoof-2021-DF"
  checkpoint: "https://huggingface.co/SpeechAntiSpoofingBenchmarks/XLSR-SLS"
  params_millions: 340.7900
  paper:
    arxiv_id: "10.1145/3664647.3681345"  # no arXiv exists; ACM MM 2024 DOI (per user decision 2026-06-05)
    url: "https://doi.org/10.1145/3664647.3681345"
    bibtex: |
      @inproceedings{zhang2024audio,
        title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
        author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
        booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
        pages={6765--6773},
        year={2024},
        doi={10.1145/3664647.3681345}
      }
  notes: >
    XLS-R 300M (wav2vec 2.0) front-end + SLS (Sensitive Layer Selection)
    classifier, from QiShanZhang/SLSforASVspoof-2021-DF (ACM MM 2024).
    Architecture is built from the base xlsr2_300m.pt model config (shared
    with the W2V2-AASIST submission), then every weight is overwritten by
    the fine-tuned checkpoint. SLS pools every transformer layer's hidden
    state, gates each by a learned sigmoid attention, and fuses them before
    a small MLP head. Deterministic first-64600-sample window (no random
    crop); the head's fc1 expects this fixed length.
    score = log-softmax output for class 1 (bona fide); higher = more
    bona fide (source main.py: batch_score = batch_out[:, 1]).