| system: |
| name: "XLSR-SLS" |
| slug: "xlsr-sls" |
| description: > |
| wav2vec 2.0 (XLS-R 300M) self-supervised front-end with the SLS (Sensitive |
| Layer Selection) classifier for audio deepfake detection. Each XLS-R |
| transformer layer contributes distinct discriminative cues, so SLS gates the |
| hidden states of all layers with a per-layer sigmoid attention and sums the |
| weighted features into a single multi-layer feature. A BN + max-pool + |
| two-layer MLP head then emits a 2-way log-softmax. Official |
| QiShanZhang/SLSforASVspoof-2021-DF checkpoint |
| (model_15, dev-EER 1.45%), trained on ASVspoof2019 LA, FP32, deterministic |
| first-64600-sample window (no random crop). |
| code: "https://github.com/QiShanZhang/SLSforASVspoof-2021-DF" |
| checkpoint: "https://huggingface.co/SpeechAntiSpoofingBenchmarks/XLSR-SLS" |
| params_millions: 340.7900 |
| paper: |
| arxiv_id: "10.1145/3664647.3681345" |
| url: "https://doi.org/10.1145/3664647.3681345" |
| bibtex: | |
| @inproceedings{zhang2024audio, |
| title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier}, |
| author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao}, |
| booktitle={Proceedings of the 32nd ACM International Conference on Multimedia}, |
| pages={6765--6773}, |
| year={2024}, |
| doi={10.1145/3664647.3681345} |
| } |
| notes: > |
| XLS-R 300M (wav2vec 2.0) front-end + SLS (Sensitive Layer Selection) classifier, |
| from QiShanZhang/SLSforASVspoof-2021-DF (ACM MM 2024). Architecture is built from |
| the base xlsr2_300m.pt model config (shared with the W2V2-AASIST submission), |
| then every weight is overwritten by the fine-tuned checkpoint. SLS pools every |
| transformer layer's hidden state, gates each by a learned sigmoid attention, and |
| fuses them before a small MLP head. Deterministic first-64600-sample window (no |
| random crop); the head's fc1 expects this fixed length. The score is the |
| log-softmax output for class 1 (bona fide), so higher means more bona fide |
| (source main.py: batch_score = batch_out[:, 1]). |
| |