	---
	license: mit
	tags:
	- audio
	- anti-spoofing
	- audio-deepfake-detection
	- speech
	- asvspoof
	- wav2vec2
	---

	# XLSR-SLS

	[![EER% 0.23 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.23%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![EER% 7.39 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-7.39%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![EER% 3.93 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-3.93%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![EER% 7.46 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-7.46%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![EER% 9.81 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD-ADD-9.81%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
	[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)

	XLSR-SLS is a wav2vec 2.0 (XLS-R 300M) + SLS audio-deepfake-detection model from
	"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"
	(Zhang, Wen & Hu, ACM MM 2024). It pairs a self-supervised XLS-R front-end
	with the SLS (Sensitive Layer Selection) classifier, which treats the 24
	XLS-R transformer layers as a feature pyramid and learns to weight them. The
	model takes a raw speech waveform and returns a score where **higher = more
	bona fide**.

	- Code: https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
	- Paper: https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
	- Parameters: 340,790,000 (340.79 M)
	- Checkpoint: [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)

	The exact wrapper used to produce the Arena scores is in
	[`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).

	## Architecture

	1. wav2vec 2.0 XLS-R (300M) front-end — a self-supervised transformer
	(`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
	transformer layers**.
	2. SLS (Sensitive Layer Selection) back-end — every layer's hidden state is
	average-pooled to a 1024-d descriptor and gated by a per-layer **sigmoid
	attention** (`fc0` → sigmoid); the gates re-weight the full per-layer feature
	stack, which is summed across layers. The fused feature passes through
	BatchNorm + SELU + `3×3` max-pool, is flattened, and goes through a two-layer
	MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
	3. The network outputs a 2-class log-softmax; index 1 (bona fide) is read as the score.
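
The layer weighting in step 2 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the weights are random rather than trained, `sls_fuse`, `fc0_w` and `fc0_b` are hypothetical names, and the 201-frame length assumes the standard wav2vec 2.0 convolutional front-end applied to a 64,600-sample input. The actual network is the one in `_net.py`.

```python
import numpy as np

def sls_fuse(hidden, fc0_w, fc0_b):
    """Fuse per-layer features with one sigmoid gate per layer.

    hidden: (layers, frames, dim) stack of transformer hidden states.
    fc0_w:  (dim,) weight of a hypothetical dim -> 1 linear layer.
    fc0_b:  scalar bias of that layer.
    """
    pooled = hidden.mean(axis=1)                  # (layers, dim): average over frames
    logits = pooled @ fc0_w + fc0_b               # (layers,): one logit per layer
    gates = 1.0 / (1.0 + np.exp(-logits))         # sigmoid gate in (0, 1)
    fused = (gates[:, None, None] * hidden).sum(axis=0)  # (frames, dim): weighted sum
    return gates, fused

rng = np.random.default_rng(0)
hidden = rng.standard_normal((24, 201, 1024)).astype(np.float32)  # random stand-in features
fc0_w = rng.standard_normal(1024).astype(np.float32) * 0.01       # random, not trained
gates, fused = sls_fuse(hidden, fc0_w, 0.0)
print(gates.shape, fused.shape)  # (24,) (201, 1024)
```

Each gate scales its whole layer before the sum, so a layer with a gate near 0 contributes almost nothing to the fused `(frames, dim)` map that feeds the BatchNorm, max-pool and MLP stages.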

	## How it was trained

	- Data: ASVspoof 2019 Logical Access (LA).
	- Input length: raw 16 kHz audio cropped or padded to 64,600 samples
	(~4.04 s). The window length is fixed: `fc1` expects a 22,847-d flattened
	input, so the 64,600-sample window is also mandatory at inference.
	- Output: 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
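
The link between the 64,600-sample window and the 22,847-d input of `fc1` can be checked with a little arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional feature extractor (seven unpadded layers with the kernel/stride pairs listed in the code) and a `3×3` max-pool with stride 3 in floor mode. Neither detail is spelled out in this README, and `num_frames` and `flatten_size` are hypothetical helper names.

```python
# Assumed (kernel, stride) pairs of the standard fairseq wav2vec 2.0 / XLS-R
# convolutional feature extractor; not stated in this README.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples):
    """Frames left after the unpadded strided convolutions."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

def flatten_size(num_samples, dim=1024, pool=3):
    """Flattened size after a pool x pool max-pool (stride = pool, floor mode)."""
    return (num_frames(num_samples) // pool) * (dim // pool)

print(num_frames(64600))    # 201
print(flatten_size(64600))  # 22847
```

Under these assumptions, 64,600 samples give 201 frames, the pool reduces the 201×1024 map to 67×341, and 67 × 341 = 22,847. Any other input length yields a different frame count and therefore a flatten size that `fc1` cannot accept.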

	See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
	for the full training and evaluation code.

	## Benchmark result (Speech Anti-Spoofing Arena)

	Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
	Scores were computed with a deterministic first-64,600-sample window (no random
	crop), so the numbers are exactly reproducible from the pinned score file.
	Arena standing: 🥇 gold tier, rank #1 of 10.

	| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST† EER % | Notes |
	|---|---|---|---|---|---|---|
	| ASVspoof2019_LA | test | 0.23 | 71,237 | 0 | 0.22 | in-domain (training data) |
	| ASVspoof2021_LA | test | 7.39 | 181,566 | 0 | 8.11 | cross-dataset generalization |
	| ASVspoof2021_DF | test | 3.93 | 611,829 | 0 | 8.32 | cross-dataset generalization |
	| InTheWild | test | 7.46 | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
	| CD-ADD | test | 9.81 | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |

	† W2V2-AASIST (XLS-R 300M + AASIST) is the other XLS-R-based system on the same
	benchmark. XLSR-SLS's multi-layer SLS fusion wins on every set outside the
	training corpus, most strikingly on ASVspoof2021_DF (3.93 vs 8.32) and CD-ADD
	(9.81 vs 38.57), and is on par in-domain (0.23 vs 0.22). The benchmark's
	ASVspoof2021 LA/DF evaluations use curated trial sets, so the absolute EER
	differs from the paper's official-keys number (1.92 % on DF); there, the relative
	ordering is the meaningful comparison. The paper's InTheWild figure (7.46 %) is
	matched here exactly.
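
For readers new to the metric, the EER values above are the operating point where the miss rate on bona fide trials equals the false-alarm rate on spoofed trials. The Arena's own scoring code is not reproduced in this README, so the following is a minimal sketch only: `compute_eer` is a hypothetical helper that picks the closest crossing without interpolation, and the scores are made-up toy values.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate in percent, assuming higher score = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # Miss rate: fraction of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False-alarm rate: fraction of spoof trials scoring at or above it.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(frr - far))  # threshold where the two rates are closest
    return float(100.0 * (frr[idx] + far[idx]) / 2.0)

bona = [0.9, 0.8, 0.3, 0.7]   # toy bona fide scores
spoof = [0.1, 0.4, 0.2, 0.6]  # toy spoof scores
print(compute_eer(bona, spoof))  # 25.0
```

In the toy data, one of four bona fide trials falls below the threshold 0.6 and one of four spoof trials reaches it, so both error rates are 25 %.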

	## Usage

	The checkpoint is a `state_dict` for the `Model` network defined in
	[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
	checkpoint `xlsr2_300m.pt` (only used to build the wav2vec 2.0 architecture;
	every weight is then overwritten by `MMpaper_model.pth`):

	```bash
	wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
	```

	The input must be exactly 64,600 samples at 16 kHz mono — window the waveform
	with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).

	```python
	import numpy as np

	from xlsr_sls import XLSRSLS  # xlsr_sls.py and _net.py are in this repo

	m = XLSRSLS()
	m.load()  # loads MMpaper_model.pth (+ xlsr2_300m.pt)

	# 3 s of random noise as a stand-in for real speech: float32, mono, 16 kHz
	audio = np.random.randn(48000).astype(np.float32)
	print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
	m.unload()
	```
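
The windowing rule described above (first 64,600 samples, tile-repeat if shorter) might look roughly like this. The real `pad_fixed` is not shown in this README, so `pad_fixed_sketch` is a hypothetical stand-in written from that one-line description, not the wrapper's actual code.

```python
import numpy as np

def pad_fixed_sketch(waveform, target_len=64600):
    """Return exactly target_len samples: truncate, or tile-repeat if shorter."""
    x = np.asarray(waveform, dtype=np.float32)
    if x.ndim != 1 or x.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if x.size >= target_len:
        return x[:target_len]               # deterministic: keep the first samples
    repeats = -(-target_len // x.size)      # ceiling division
    return np.tile(x, repeats)[:target_len]

short = np.arange(48000, dtype=np.float32)  # 3 s at 16 kHz
window = pad_fixed_sketch(short)
print(window.shape)  # (64600,)
```

Because there is no random crop, the same file always maps to the same window, which is what makes the scores repeatable.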

	Internally the wrapper windows the input, runs the network, and returns
	`output[:, 1]` (class 1 = bona fide; source `main.py`: `batch_score =
	batch_out[:, 1]`). [`xlsr_sls.py`](./xlsr_sls.py) is the exact
	`speech_spoof_bench` model that produced the Arena `scores.txt`.
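
Since the score is a log-softmax output, exponentiating it recovers the network's softmax probability for the bona fide class. The sketch below illustrates that conversion; `bonafide_probability` and `is_bonafide` are hypothetical helpers, and the 0.5 threshold is purely illustrative (this card reports EER, not a fixed operating point, and the probability is not claimed to be calibrated).

```python
import math

def bonafide_probability(score):
    """Map a bona fide log-probability to the softmax probability it encodes."""
    return math.exp(score)

def is_bonafide(score, threshold=math.log(0.5)):
    """Hypothetical decision rule; choose the threshold on held-out data."""
    return score >= threshold

print(round(bonafide_probability(-0.1), 3))  # 0.905
print(is_bonafide(-0.1), is_bonafide(-3.0))  # True False
```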

	## Citation

	```bibtex
	@inproceedings{zhang2024audio,
	title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
	author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
	booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
	pages={6765--6773},
	year={2024},
	doi={10.1145/3664647.3681345}
	}
	```

	## License

	MIT — see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).