---
license: mit
tags:
- audio
- deepfake-detection
- anti-spoofing
- wav2vec2
- xlsr
- speech
- asvspoof
datasets:
- asvspoof2019
- asvspoof2021
metrics:
- equal_error_rate
pipeline_tag: audio-classification
language:
- en
library_name: pytorch
---
# XLS-R + SLS Classifier for Audio Deepfake Detection
Reproduction of **"Audio Deepfake Detection with XLS-R and SLS Classifier"** (Zhang et al., ACM Multimedia 2024).
The Selective Layer Summarization (SLS) classifier extracts attention-weighted features from all 24 transformer layers of [XLS-R 300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (wav2vec 2.0), then classifies bonafide vs. spoofed speech via a lightweight fully-connected head. [RawBoost](https://arxiv.org/abs/2301.00693) (algo=3, SSI) data augmentation is applied during training.
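The core idea can be sketched as an attention-weighted sum over the per-layer hidden states. The snippet below is a simplified numpy illustration, not the authors' exact implementation (the real SLS module learns its layer weights jointly with the classifier, inside the torch model):

```python
import numpy as np

def sls_summarize(layer_feats, layer_logits):
    """Attention-weighted summary over transformer layers.

    layer_feats:  (L, T, D) hidden states from all L transformer layers
    layer_logits: (L,) learnable per-layer attention logits
    """
    a = np.exp(layer_logits - layer_logits.max())
    a = a / a.sum()                              # softmax over the layer axis
    return np.tensordot(a, layer_feats, axes=1)  # (T, D) summarized features

# With uniform logits, each of the 24 layers contributes equally
feats = sls_summarize(np.random.randn(24, 201, 1024), np.zeros(24))
```

The summarized `(T, D)` representation is what the lightweight fully-connected head classifies.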
## Available Checkpoints
| File | Experiment | Description |
|------|-----------|-------------|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |
**Recommended**: Use `v1/epoch_2.pth` — it generalizes better to unseen attack types (DF, In-the-Wild).
### Original authors' pretrained models
The original pretrained checkpoints from Zhang et al. are available from:
- [Google Drive](https://drive.google.com/drive/folders/13vw_AX1jHdYndRu1edlgpdNJpCX8OnrH?usp=sharing)
- [Baidu Pan](https://pan.baidu.com/s/1dj-hjvf3fFPIYdtHWqtCmg?pwd=shan) (password: shan)
## Results
| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|-------|--------------|------------|------------|
| ASVspoof 2021 DF | 1.92 | **2.14** | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | **3.47** |
| In-the-Wild | 7.46 | **7.84** | 12.67 |
v1 closely reproduces the paper's results. v2 improves LA slightly but degrades DF and In-the-Wild because it overfits to the LA validation domain, a well-documented cross-domain generalization problem in audio deepfake detection ([Müller et al., Interspeech 2022](https://arxiv.org/abs/2203.16263)).
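EER is the threshold-independent operating point where the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bonafide speech rejected). A minimal sketch of the computation, assuming higher scores indicate bonafide:

```python
import numpy as np

def compute_eer(bonafide, spoof):
    """Equal error rate from bonafide and spoof score arrays."""
    scores = np.concatenate([bonafide, spoof])
    labels = np.concatenate([np.ones_like(bonafide), np.zeros_like(spoof)])
    labels = labels[np.argsort(scores)]
    # Sweep the threshold over the sorted scores:
    frr = np.cumsum(labels) / labels.sum()                   # bonafide rejected at/below threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoof accepted above threshold
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```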
## Training Configuration
Both experiments share the following setup:
| Parameter | Value |
|-----------|-------|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted Cross-Entropy [0.1, 0.9] |
| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |
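The [0.1, 0.9] class weighting compensates for the class imbalance in ASVspoof2019 LA training data (spoofed utterances heavily outnumber bonafide ones), presumably up-weighting the minority class. A numpy sketch of how torch's weighted `CrossEntropyLoss` with mean reduction behaves; which index maps to which class is an assumption here, not taken from the repo:

```python
import numpy as np

def weighted_ce(logits, targets, weights=(0.1, 0.9)):
    """Weighted cross-entropy matching torch's mean reduction:
    sum of per-sample weighted NLL divided by the sum of sample weights."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    w = np.asarray(weights)[targets]                      # per-sample class weight
    nll = -np.log(p[np.arange(len(targets)), targets])
    return (w * nll).sum() / w.sum()
```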
### v1 specifics
- Early stopping: patience=1 on training loss
- No validation set
- 4 epochs trained, best at epoch 2 (train loss = 0.000661)
### v2 specifics
- Early stopping: patience=10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: `torch.no_grad()` in validation loop, correct `best_val_loss` tracking
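Both experiments use the same patience-based early-stopping rule, differing only in the monitored quantity (training loss for v1, validation loss for v2) and the patience value. A minimal sketch of the logic:

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Call once per epoch; returns True when training should stop."""
        if loss < self.best:
            self.best = loss       # improvement: the best checkpoint is saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```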
## Usage
### Download checkpoint
```python
from huggingface_hub import hf_hub_download

# Download v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth",
)

# Download v2 checkpoint
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth",
# )
```
### Load and run inference
```python
import torch

from model import Model  # SLS model definition from the GitHub repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# ssl_cpkt_path points to the fairseq XLS-R 300M checkpoint (see Requirements)
model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()
```
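The model expects fixed-length input. Many ASVspoof recipes crop or repeat-pad waveforms to 64,600 samples (about 4 s at 16 kHz); whether this repo uses exactly that length is an assumption, so check its evaluation script. A sketch of the usual preprocessing:

```python
import numpy as np

def pad_or_tile(wav, target_len=64600):
    """Crop long waveforms; repeat-pad short ones to a fixed length."""
    if len(wav) >= target_len:
        return wav[:target_len]
    reps = int(np.ceil(target_len / len(wav)))
    return np.tile(wav, reps)[:target_len]

# Hypothetical inference flow with the model loaded above:
# x = pad_or_tile(waveform)                        # mono float32 at 16 kHz
# batch = torch.from_numpy(x).unsqueeze(0).to(device)
# with torch.no_grad():
#     scores = model(batch)                        # per-class logits
```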
Full training and evaluation code: [GitHub Repository](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection)
## Requirements
- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from [fairseq](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)
See `environment.yml` in the [GitHub repo](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection) for the full environment.
## Citation
```bibtex
@inproceedings{zhang2024audio,
title={Audio Deepfake Detection with XLS-R and SLS Classifier},
author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
year={2024},
publisher={ACM}
}
```
## Acknowledgements
- [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) (Babu et al., 2022)
- [RawBoost](https://arxiv.org/abs/2301.00693) (Tak et al., Odyssey 2022)
- [ASVspoof Challenge](https://www.asvspoof.org/)