---
license: mit
tags:
  - audio
  - deepfake-detection
  - anti-spoofing
  - wav2vec2
  - xlsr
  - speech
  - asvspoof
datasets:
  - asvspoof2019
  - asvspoof2021
metrics:
  - equal_error_rate
pipeline_tag: audio-classification
language:
  - en
library_name: pytorch
---

# XLS-R + SLS Classifier for Audio Deepfake Detection

Reproduction of **"Audio Deepfake Detection with XLS-R and SLS Classifier"** (Zhang et al., ACM Multimedia 2024).

The Selective Layer Summarization (SLS) classifier extracts attention-weighted features from all 24 transformer layers of [XLS-R 300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (wav2vec 2.0) and classifies bonafide vs. spoofed speech with a lightweight fully-connected head. During training, [RawBoost](https://arxiv.org/abs/2301.00693) data augmentation is applied with algo=3, i.e. stationary signal-independent (SSI) additive noise.

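
To make the layer-weighting step concrete, here is a minimal pure-Python sketch of selective layer summarization, assuming that each layer's hidden states are mean-pooled over time, passed through one shared fully-connected layer (D to 1) and a sigmoid to obtain a per-layer gate, and that the gated layers are then summed into a single feature map. It is an illustration, not the reference implementation: the parameters `w` and `b` and the toy dimensions are hypothetical. In the real model the inputs are 24 XLS-R layers of 1024-dimensional frames, and the summed features feed the fully-connected head.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def selective_layer_summarization(layer_outputs, w, b):
    """Gate and sum transformer layer outputs (illustrative sketch only).

    layer_outputs: L layers, each a list of T frames, each a list of D floats.
    w, b: hypothetical parameters of one shared fully-connected layer (D -> 1).
    Returns the gated sum (T x D) and the per-layer gate values.
    """
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    summary = [[0.0] * dim for _ in range(num_frames)]
    gates = []
    for layer in layer_outputs:
        # 1) Mean-pool the layer over time -> one D-dimensional vector
        pooled = [sum(frame[d] for frame in layer) / num_frames for d in range(dim)]
        # 2) Shared FC layer + sigmoid -> scalar gate in (0, 1) for this layer
        gate = sigmoid(sum(wd * pd for wd, pd in zip(w, pooled)) + b)
        gates.append(gate)
        # 3) Add the gated layer to the running summary
        for t in range(num_frames):
            for d in range(dim):
                summary[t][d] += gate * layer[t][d]
    return summary, gates


# Toy example: 3 layers, 2 frames, 2 features per frame
layers = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[-0.5, 0.5], [0.5, -0.5]],
]
summary, gates = selective_layer_summarization(layers, w=[0.8, -0.3], b=0.1)
print(gates)
```

Because each gate is computed from the layer's own pooled content, the classifier can emphasize whichever layers carry the most useful spoofing cues instead of relying on the last layer alone.
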
## Available Checkpoints

| File | Experiment | Description |
|------|------------|-------------|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Early stopping on training loss (patience = 1), no validation set, 4 epochs trained. |
| `v2/epoch_16.pth` | v2 (validation-based) | Early stopping on ASVspoof2019 LA dev loss (patience = 10), 27 epochs trained. |

**Recommended**: use `v1/epoch_2.pth`; it generalizes better to unseen attack types and recording conditions (ASVspoof 2021 DF, In-the-Wild).

## Results

Equal error rate (EER, %; lower is better). Bold marks the better of v1 and v2 on each track.

| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|-------|---------------|------------|------------|
| ASVspoof 2021 DF | 1.92 | **2.14** | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | **3.47** |
| In-the-Wild | 7.46 | **7.84** | 12.67 |

v1 closely reproduces the paper, landing within 0.22, 0.64 and 0.38 percentage points of the reported EER on DF, LA and In-the-Wild respectively. v2 improves LA only marginally (3.47 vs. 3.51) but degrades DF (3.75) and In-the-Wild (12.67) due to overfitting to the LA validation domain, a well-documented cross-domain generalization problem in audio deepfake detection ([Müller et al., Interspeech 2022](https://arxiv.org/abs/2203.16263)).

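
EER is the operating point at which the false-acceptance rate (spoofed trials accepted as bonafide) equals the false-rejection rate (bonafide trials rejected). Below is a minimal pure-Python sketch, assuming that higher scores mean "more likely bonafide" and sweeping every observed score as a candidate threshold. The numbers in the table above come from the official ASVspoof scoring tools, whose threshold interpolation and tie handling may differ slightly from this simplified version.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold); higher scores are assumed to mean 'bonafide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bonafide trials scored below the threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]


bonafide = [0.6, 0.8, 0.9, 0.4]
spoof = [0.1, 0.5, 0.3, 0.7]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```
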
## Training Configuration

Both experiments share the following setup:

| Parameter | Value |
|-----------|-------|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted cross-entropy, class weights [0.1, 0.9] |
| Optimizer | Adam (lr = 1e-6, weight_decay = 1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |

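
The class-weighted loss can be written out explicitly. This minimal pure-Python sketch assumes class index 0 = spoof and index 1 = bonafide, so the minority bonafide class receives weight 0.9 (verify the label mapping in the repository). It follows the convention of PyTorch's `nn.CrossEntropyLoss(weight=...)` with mean reduction, which divides by the sum of the target-class weights in the batch rather than by the batch size.

```python
import math

CLASS_WEIGHTS = [0.1, 0.9]  # assumed order: index 0 = spoof, index 1 = bonafide


def weighted_cross_entropy(batch_logits, labels, weights=CLASS_WEIGHTS):
    """Class-weighted cross-entropy with weight-normalized mean reduction."""
    weighted_sum, weight_total = 0.0, 0.0
    for logits, y in zip(batch_logits, labels):
        # Numerically stable log-sum-exp over the class logits
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        nll = log_sum_exp - logits[y]  # -log softmax(logits)[y]
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    return weighted_sum / weight_total


# Each batch has one correct and one wrong prediction; the batch whose error
# is on a bonafide trial gets a much larger loss than the one whose error is on a spoof.
bonafide_error = weighted_cross_entropy([[2.0, 0.0], [2.0, 0.0]], [0, 1])
spoof_error = weighted_cross_entropy([[0.0, 2.0], [0.0, 2.0]], [1, 0])
print(bonafide_error, spoof_error)
```
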
### v1 specifics

- Early stopping: patience = 1 on training loss
- No validation set
- 4 epochs trained; best at epoch 2 (train loss = 0.000661)

### v2 specifics

- Early stopping: patience = 10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained; best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: the validation loop runs under `torch.no_grad()`, and `best_val_loss` is tracked correctly

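
Both schedules follow the same patience rule: stop once the monitored loss (training loss for v1, validation loss for v2) has not improved for `patience` consecutive epochs, and keep the best epoch's checkpoint. A minimal sketch of that rule, assuming zero-based epoch numbering. Under that assumption the reported numbers are self-consistent: best epoch 2 plus 1 epoch of patience means epochs 0 to 3 (4 trained), and best epoch 16 plus 10 means epochs 0 to 26 (27 trained). The loss curves below are synthetic, not the actual training logs.

```python
import math


def run_early_stopping(losses, patience):
    """Return (best_epoch, epochs_trained) for a per-epoch loss sequence.

    Epochs are numbered from 0; training stops after `patience`
    consecutive epochs without improving on the best loss so far.
    """
    best_loss, best_epoch, bad_epochs = math.inf, -1, 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: this is where the checkpoint would be saved
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch + 1
    return best_epoch, len(losses)


# Synthetic curves shaped like the v1 and v2 runs (not the real logs)
v1_like = [0.05, 0.01, 0.0007, 0.0009, 0.0005]
v2_like = [1.0 / (e + 1) for e in range(17)] + [0.5] * 20
print(run_early_stopping(v1_like, patience=1))   # (2, 4)
print(run_early_stopping(v2_like, patience=10))  # (16, 27)
```

With patience = 1, a single noisy epoch ends training, which is why v1 stops after only 4 epochs.
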
## Usage

### Download checkpoint

```python
from huggingface_hub import hf_hub_download

# Download the v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth",
)

# Or download the v2 checkpoint instead
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth",
# )
```

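### Load the model for inference

```python
import torch
from model import Model  # model.py from the GitHub repository linked below

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build XLS-R + SLS; xlsr2_300m.pt is the fairseq XLS-R 300M base checkpoint (see Requirements)
model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")

# checkpoint_path comes from the download step above
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()  # inference mode: disables dropout, uses BatchNorm running statistics
```

Full training and evaluation code: [GitHub repository](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection)
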
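
Raw-waveform anti-spoofing models are typically fed fixed-length 16 kHz audio. Below is a minimal sketch of the crop-or-repeat padding used by several ASVspoof baselines, assuming a target length of 64,600 samples (about 4 s at 16 kHz). Check the evaluation script in the GitHub repository for the exact length and padding this model was trained with. `repeat_pad` is a hypothetical helper name.

```python
def repeat_pad(waveform, target_len=64600):
    """Crop or repeat-pad a 1-D waveform (list of samples) to target_len samples."""
    if not waveform:
        raise ValueError("waveform is empty")
    if len(waveform) >= target_len:
        # Long utterances: keep only the first target_len samples
        return waveform[:target_len]
    # Short utterances: tile the signal end-to-end, then crop
    num_repeats = target_len // len(waveform) + 1
    return (waveform * num_repeats)[:target_len]


print(repeat_pad([1, 2, 3], target_len=7))  # [1, 2, 3, 1, 2, 3, 1]
print(len(repeat_pad([0.0] * 100_000)))      # 64600
```

Repeating the utterance instead of zero-padding avoids feeding long stretches of digital silence, which a countermeasure could otherwise learn as a spurious cue.
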
## Requirements

- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from [fairseq](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)

See `environment.yml` in the [GitHub repo](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection) for the full environment.

## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2024},
  publisher={ACM}
}
```

## Acknowledgements

- [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) (Babu et al., 2022)
- [RawBoost](https://arxiv.org/abs/2301.00693) (Tak et al., Odyssey 2022)
- [ASVspoof Challenge](https://www.asvspoof.org/)