	---
	license: mit
	tags:
	- audio
	- deepfake-detection
	- anti-spoofing
	- wav2vec2
	- xlsr
	- speech
	- asvspoof
	datasets:
	- asvspoof2019
	- asvspoof2021
	metrics:
	- equal_error_rate
	pipeline_tag: audio-classification
	language:
	- en
	library_name: pytorch
	---

	# XLS-R + SLS Classifier for Audio Deepfake Detection

	Reproduction of "Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier" (Zhang et al., ACM Multimedia 2024).

	The Sensitive Layer Selection (SLS) classifier extracts attention-weighted features from all 24 transformer layers of [XLS-R 300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (wav2vec 2.0), then classifies bonafide vs. spoofed speech with a lightweight fully connected head. [RawBoost](https://arxiv.org/abs/2301.00693) data augmentation (algo=3, stationary signal-independent additive noise, SSI) is applied during training.
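The layer-weighting idea can be sketched in plain Python without PyTorch. This is only an illustration under stated assumptions: each layer gets one scalar sigmoid gate computed from its time-averaged features, the gate parameters are shared across layers, and the helper names (`layer_gate`, `weighted_layer_sum`) and toy dimensions are hypothetical. It is not the code from the paper or the repo.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_gate(frames, gate_weights, gate_bias=0.0):
    """Scalar gate in (0, 1) for one layer (hypothetical helper).

    frames: list of time steps, each a list of feature values.
    """
    dim = len(frames[0])
    # Average over time to get one summary vector for this layer.
    mean_vec = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return sigmoid(sum(w * m for w, m in zip(gate_weights, mean_vec)) + gate_bias)


def weighted_layer_sum(layers, gate_weights):
    """Scale each layer's frames by its gate, then sum across layers."""
    n_frames, dim = len(layers[0]), len(layers[0][0])
    fused = [[0.0] * dim for _ in range(n_frames)]
    for frames in layers:
        gate = layer_gate(frames, gate_weights)
        for t in range(n_frames):
            for d in range(dim):
                fused[t][d] += gate * frames[t][d]
    return fused


# Toy input: 3 "layers", 2 time steps, 2 features (the real model has 24 layers).
layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
# Zero gate weights give every layer a gate of exactly 0.5.
fused = weighted_layer_sum(layers, gate_weights=[0.0, 0.0])
```

In the toy call every layer contributes equally; with learned gate weights, layers whose summary vectors score higher would contribute more to the fused representation passed to the classification head.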

	## Available Checkpoints

	| File | Experiment | Description |
	|------|------------|-------------|
	| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
	| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |

	Recommended: use `v1/epoch_2.pth`. It generalizes better to the out-of-domain evaluation sets (ASVspoof 2021 DF and In-the-Wild).

	### Original authors' pretrained models

	The original pretrained checkpoints from Zhang et al. are available from:
	- [Google Drive](https://drive.google.com/drive/folders/13vw_AX1jHdYndRu1edlgpdNJpCX8OnrH?usp=sharing)
	- [Baidu Pan](https://pan.baidu.com/s/1dj-hjvf3fFPIYdtHWqtCmg?pwd=shan) (password: shan)

	## Results

	| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
	|-------|---------------|------------|------------|
	| ASVspoof 2021 DF | 1.92 | 2.14 | 3.75 |
	| ASVspoof 2021 LA | 2.87 | 3.51 | 3.47 |
	| In-the-Wild | 7.46 | 7.84 | 12.67 |

	v1 reproduces the paper results to within 0.22 to 0.64 percentage points of EER on all three tracks. v2 improves slightly on v1 for LA (3.47% vs. 3.51%) but degrades on DF and In-the-Wild, most likely because selecting the checkpoint on LA dev data overfits to the LA domain. This is a well-documented cross-domain generalization problem in audio deepfake detection ([Müller et al., Interspeech 2022](https://arxiv.org/abs/2203.16263)).
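For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate on spoofed trials equals the false rejection rate on bonafide trials. Below is a minimal sketch using a plain threshold sweep, assuming a higher score means "more likely bonafide". The function name `compute_eer` and the toy scores are illustrative; official ASVspoof scoring tools may interpolate between thresholds, so values can differ slightly.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bonafide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))  # reject-everything operating point
    best = None
    for thr in thresholds:
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # False rejection: a bonafide trial scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores: one spoofed trial (0.6) outscores one bonafide trial (0.4).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

With these toy scores, one of four spoofed trials is accepted and one of four bonafide trials is rejected at the crossing point, so the sweep reports an EER of 25%.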

	## Training Configuration

	Both experiments share the following setup:

	| Parameter | Value |
	|-----------|-------|
	| Training data | ASVspoof2019 LA train (25,380 utterances) |
	| Loss | Weighted Cross-Entropy [0.1, 0.9] |
	| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
	| Batch size | 5 |
	| RawBoost | algo=3 (SSI) |
	| Seed | 1234 |
	| SSL backbone | XLS-R 300M (frozen feature extractor) |
	| GPU | NVIDIA RTX 4080 (16 GB) |
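To illustrate what the class weights [0.1, 0.9] do, here is a small pure-Python sketch of weighted cross-entropy. It assumes the weighted-mean reduction that PyTorch's `CrossEntropyLoss` applies when a weight vector is passed (the weighted sum of per-sample losses divided by the sum of the applied weights). The card does not state which class index maps to bonafide or spoof, so the indices here are left abstract, and `weighted_cross_entropy` is a hypothetical helper, not the repo's loss code.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted CE over a batch, normalized by the sum of applied weights."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[target]  # negative log-likelihood of the target
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum


# Two samples, one per class, with uninformative (all-zero) logits.
loss = weighted_cross_entropy(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    targets=[0, 1],
    class_weights=[0.1, 0.9],
)
```

Each sample here has a per-sample loss of ln 2, so the weighted mean is also ln 2. When the per-sample losses differ, an error on the class with weight 0.9 contributes nine times as much as the same error on the class with weight 0.1, which offsets the class imbalance in the training data.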

	### v1 specifics
	- Early stopping: patience=1 on training loss
	- No validation set
	- 4 epochs trained, best at epoch 2 (train loss = 0.000661)

	### v2 specifics
	- Early stopping: patience=10 on validation loss
	- Validation: ASVspoof2019 LA dev (24,844 trials)
	- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
	- Bug fixes: `torch.no_grad()` in validation loop, correct `best_val_loss` tracking
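The two early-stopping settings can be sketched as a small loop over per-epoch losses. The strict `>` comparison below is an assumption, chosen because it matches the epoch counts reported above (best at epoch 2 with patience 1 stops after epoch 4; best at epoch 16 with patience 10 stops after epoch 27). The repo's actual training loop may be written differently, and the loss values are made up.

```python
def run_early_stopping(losses, patience):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Stops once the monitored loss has gone more than `patience`
    consecutive epochs without improving on the best value so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    stale_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # Update the running best only on a real improvement.
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs > patience:  # assumed comparison, see lead-in
                return best_epoch, epoch
    return best_epoch, len(losses)


# v1-like run: monitored training loss bottoms out at epoch 2.
v1_result = run_early_stopping([0.5, 0.1, 0.2, 0.3, 0.05], patience=1)

# v2-like run: monitored validation loss improves for 16 epochs, then stalls.
v2_losses = [1.0 - 0.01 * i for i in range(16)] + [2.0] * 20
v2_result = run_early_stopping(v2_losses, patience=10)
```

Note that in the v1-like run the fifth value (0.05) is never reached: with a patience of 1 the loop has already stopped, which shows how a small patience can end training before a later improvement.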

	## Usage

	### Download checkpoint

	```python
	from huggingface_hub import hf_hub_download

	# Download v1 checkpoint (recommended)
	checkpoint_path = hf_hub_download(
	    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
	    filename="v1/epoch_2.pth",
	)

	# Download v2 checkpoint
	# checkpoint_path = hf_hub_download(
	#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
	#     filename="v2/epoch_16.pth",
	# )
	```

	### Load and run inference

	```python
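	import torch
	from model import Model  # model.py from the GitHub repo

	device = "cuda" if torch.cuda.is_available() else "cpu"

	# checkpoint_path comes from the download step above;
	# ssl_cpkt_path points to the XLS-R 300M base checkpoint (see Requirements).
	model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
	model.load_state_dict(torch.load(checkpoint_path, map_location=device))
	model = model.to(device)
	model.eval()  # switch to inference mode (affects dropout and normalization layers)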
	```

	Full training and evaluation code: [GitHub Repository](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection)

	## Requirements

	- Python 3.7+
	- PyTorch 1.13.1 (CUDA 11.7)
	- fairseq (commit a54021305d6b3c)
	- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from [fairseq](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)

	See `environment.yml` in the [GitHub repo](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection) for the full environment.
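Before loading the model, it can help to confirm the prerequisites are in place. The following is a minimal sketch using only the standard library. `check_environment` is a hypothetical helper, it assumes `xlsr2_300m.pt` sits in the current working directory, and it checks only that the packages are importable, not that their versions match the ones listed above.

```python
import importlib.util
import sys
from pathlib import Path


def check_environment(ssl_checkpoint="xlsr2_300m.pt"):
    """Return a dict mapping each prerequisite to True/False."""
    return {
        "python>=3.7": sys.version_info >= (3, 7),
        # find_spec returns None when a top-level package is not installed.
        "torch": importlib.util.find_spec("torch") is not None,
        "fairseq": importlib.util.find_spec("fairseq") is not None,
        ssl_checkpoint: Path(ssl_checkpoint).is_file(),
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```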

	## Citation

	```bibtex
	@inproceedings{zhang2024audio,
	  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
	  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
	  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
	  year={2024},
	  publisher={ACM}
	}
	```

	## Acknowledgements

	- [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) (Babu et al., 2022)
	- [RawBoost](https://arxiv.org/abs/2301.00693) (Tak et al., ICASSP 2022)
	- [ASVspoof Challenge](https://www.asvspoof.org/)