---
license: mit
tags:
- audio
- deepfake-detection
- anti-spoofing
- wav2vec2
- xlsr
- speech
- asvspoof
datasets:
- asvspoof2019
- asvspoof2021
metrics:
- equal_error_rate
pipeline_tag: audio-classification
language:
- en
library_name: pytorch
---

# XLS-R + SLS Classifier for Audio Deepfake Detection

Reproduction of **"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"** (Zhang et al., ACM Multimedia 2024).

The Sensitive Layer Selection (SLS) classifier extracts attention-weighted features from all 24 transformer layers of [XLS-R 300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (wav2vec 2.0), then classifies bonafide vs. spoofed speech with a lightweight fully-connected head. [RawBoost](https://arxiv.org/abs/2301.00693) (algo=3, SSI) data augmentation is applied during training.

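To make the layer-weighting idea concrete, here is a minimal PyTorch sketch. It is not the repository's `Model`. It assumes the 24 layer outputs are stacked into one tensor of shape `(batch, layers, time, hidden_dim)` and that each layer's weight comes from a sigmoid over a linear projection of its time-averaged features. The class name `LayerWeightedSum` and that gating choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LayerWeightedSum(nn.Module):
    """Hypothetical sketch: weight each transformer layer's output, then sum over layers."""

    def __init__(self, hidden_dim=1024):
        super().__init__()
        # One scalar score per layer, computed from that layer's time-averaged features
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, layer_feats):
        # layer_feats: (batch, layers, time, hidden_dim)
        pooled = layer_feats.mean(dim=2)                 # (batch, layers, hidden_dim)
        weights = torch.sigmoid(self.score(pooled))      # (batch, layers, 1)
        weighted = layer_feats * weights.unsqueeze(-1)   # broadcast over time and hidden_dim
        return weighted.sum(dim=1)                       # (batch, time, hidden_dim)


# Random stand-in for 24 layer outputs of a 1024-dimensional backbone
feats = torch.randn(2, 24, 50, 1024)
fused = LayerWeightedSum()(feats)
print(fused.shape)  # torch.Size([2, 50, 1024])
```

The fused `(batch, time, hidden_dim)` tensor is what a fully-connected head would then consume. See `model.py` in the GitHub repo for the actual implementation.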
## Available Checkpoints

| File | Experiment | Description |
|------|-----------|-------------|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |

**Recommended**: Use `v1/epoch_2.pth`, which generalizes better to the out-of-domain test sets (ASVspoof 2021 DF, In-the-Wild).

### Original authors' pretrained models

The original pretrained checkpoints from Zhang et al. are available from:
- [Google Drive](https://drive.google.com/drive/folders/13vw_AX1jHdYndRu1edlgpdNJpCX8OnrH?usp=sharing)
- [Baidu Pan](https://pan.baidu.com/s/1dj-hjvf3fFPIYdtHWqtCmg?pwd=shan) (password: shan)

## Results

| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|-------|--------------|------------|------------|
| ASVspoof 2021 DF | 1.92 | **2.14** | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | **3.47** |
| In-the-Wild | 7.46 | **7.84** | 12.67 |

Lower is better; bold marks the better of the two reproduced checkpoints on each track.

v1 closely reproduces the paper, landing within 0.22 to 0.64 percentage points of the published EER on every track. v2 improves LA slightly (3.47% vs. 3.51%) but degrades DF and In-the-Wild, which points to overfitting to the LA validation domain, a well-documented cross-domain generalization problem in audio deepfake detection ([Müller et al., Interspeech 2022](https://arxiv.org/abs/2203.16263)).

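All numbers above are equal error rates: the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bonafide trials. The sketch below shows one simple way to estimate it from two score lists, assuming higher scores mean "more bonafide". The function name `compute_eer` is hypothetical, the threshold sweep is quadratic in the number of trials, and official ASVspoof scoring tools may interpolate differently, so treat it as an illustration rather than the script that produced the table.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming a higher score means 'more bonafide'."""
    # Every observed score is a candidate decision threshold
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bonafide trial scored below the threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed trial scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores: one bonafide trial scores low, one spoofed trial scores high
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
print(eer, threshold)  # 0.25 0.6
```

Multiply the returned rate by 100 to compare against the percentages in the table.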
## Training Configuration

Both experiments share the following setup:

| Parameter | Value |
|-----------|-------|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted Cross-Entropy [0.1, 0.9] |
| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |

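The `[0.1, 0.9]` class weights in the loss row scale each utterance's loss by the weight of its true class. The pure-Python sketch below reproduces the weighted-mean reduction that PyTorch's `nn.CrossEntropyLoss(weight=...)` applies by default. It assumes label 0 = spoof and label 1 = bonafide, a common ASVspoof convention that should be checked against the repo; under that assumption the minority bonafide class gets the larger weight. `weighted_cross_entropy` is a hypothetical helper, not a function from the training code.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights=(0.1, 0.9)):
    """Weighted-mean cross-entropy over a batch of two-class logits."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Log of the softmax normalizer, with max subtraction for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        # Negative log-likelihood of the true class
        nll = log_norm - row[label]
        w = class_weights[label]
        total += w * nll
        weight_sum += w
    # Normalize by the summed weights of the true classes, not by the batch size
    return total / weight_sum


# Two utterances: an undecided spoof (label 0) and a misclassified bonafide (label 1)
loss = weighted_cross_entropy([[0.0, 0.0], [2.0, 0.0]], [0, 1])
print(round(loss, 4))
```

Because the bonafide error carries weight 0.9, it dominates the batch loss even though both utterances count once.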
### v1 specifics
- Early stopping: patience=1 on training loss
- No validation set
- 4 epochs trained, best at epoch 2 (train loss = 0.000661)

### v2 specifics
- Early stopping: patience=10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: the validation loop now runs under `torch.no_grad()`, and `best_val_loss` is tracked correctly

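The two runs differ mainly in which loss drives early stopping and how much patience is allowed. A minimal sketch of patience-based stopping follows. The rule "stop once the loss has failed to improve for more than `patience` epochs" is inferred from the reported epoch counts (2 + 1 + 1 = 4 and 16 + 10 + 1 = 27), not read from the training script, and the names `EarlyStopping` and `run` are hypothetical.

```python
class EarlyStopping:
    """Stop once the monitored loss has not improved for more than `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, loss):
        """Record one epoch; return True when training should stop."""
        if loss < self.best_loss:
            # New best: remember it and reset the counter
            self.best_loss = loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience


def run(losses, patience):
    """Feed per-epoch losses (epochs numbered from 1); return (best_epoch, last_epoch)."""
    stopper = EarlyStopping(patience)
    last_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        last_epoch = epoch
        if stopper.step(epoch, loss):
            break
    return stopper.best_epoch, last_epoch


# Made-up loss curve whose minimum is at epoch 2, monitored with patience=1
print(run([0.01, 0.000661, 0.001, 0.002, 0.003], patience=1))  # (2, 4)
```

With `patience=1` the loop keeps the epoch-2 weights and ends after epoch 4. The same rule with `patience=10` and a best epoch of 16 ends after epoch 27.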
## Usage

### Download checkpoint

```python
from huggingface_hub import hf_hub_download

# Download v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth"
)

# Download v2 checkpoint
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth"
# )
```

### Load and run inference

```python
import torch
from model import Model  # from the GitHub repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# `checkpoint_path` comes from the download step above;
# `xlsr2_300m.pt` is the XLS-R 300M base checkpoint listed under Requirements.
model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()  # switch to inference mode
```

Full training and evaluation code: [GitHub Repository](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection)

## Requirements

- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from [fairseq](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)

See `environment.yml` in the [GitHub repo](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection) for the full environment.

## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2024},
  publisher={ACM}
}
```

## Acknowledgements

- [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) (Babu et al., 2022)
- [RawBoost](https://arxiv.org/abs/2301.00693) (Tak et al., Odyssey 2022)
- [ASVspoof Challenge](https://www.asvspoof.org/)