---
license: mit
tags:
  - audio
  - deepfake-detection
  - anti-spoofing
  - wav2vec2
  - xlsr
  - speech
  - asvspoof
datasets:
  - asvspoof2019
  - asvspoof2021
metrics:
  - equal_error_rate
pipeline_tag: audio-classification
language:
  - en
library_name: pytorch
---

# XLS-R + SLS Classifier for Audio Deepfake Detection

Reproduction of **"Audio Deepfake Detection with XLS-R and SLS Classifier"** (Zhang et al., ACM Multimedia 2024).

The Selective Layer Summarization (SLS) classifier extracts attention-weighted features from all 24 transformer layers of [XLS-R 300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) (wav2vec 2.0), then classifies bonafide vs. spoofed speech via a lightweight fully-connected head. [RawBoost](https://arxiv.org/abs/2301.00693) (algo=3, SSI) data augmentation is applied during training.
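
A minimal numpy sketch of the layer-weighting idea described above (shapes and variable names are illustrative, not the repository's actual code): per-layer scores are softmax-normalized and used to form a weighted sum over the stacked hidden states of all 24 transformer layers.

```python
import numpy as np

# Illustrative shapes: 24 XLS-R layers, T frames, D = 1024 feature dims
L, T, D = 24, 100, 1024
rng = np.random.default_rng(0)
layer_feats = rng.standard_normal((L, T, D))   # stacked hidden states

# Learnable per-layer scores (random here, purely for illustration)
scores = rng.standard_normal(L)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax over layers

# Attention-weighted summary across layers, fed to the classifier head
summary = np.tensordot(weights, layer_feats, axes=1)  # shape (T, D)
```

In the actual model the scores are trained end-to-end, so the network learns which layers carry the most spoofing-relevant information.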

## Available Checkpoints

| File | Experiment | Description |
|------|-----------|-------------|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |

**Recommended**: Use `v1/epoch_2.pth` — it generalizes better to unseen attack types (DF, In-the-Wild).

### Original authors' pretrained models

The original pretrained checkpoints from Zhang et al. are available from:
- [Google Drive](https://drive.google.com/drive/folders/13vw_AX1jHdYndRu1edlgpdNJpCX8OnrH?usp=sharing)
- [Baidu Pan](https://pan.baidu.com/s/1dj-hjvf3fFPIYdtHWqtCmg?pwd=shan) (password: shan)

## Results

| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|-------|--------------|------------|------------|
| ASVspoof 2021 DF | 1.92 | **2.14** | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | **3.47** |
| In-the-Wild | 7.46 | **7.84** | 12.67 |

v1 closely reproduces the paper results. v2 improves LA slightly but degrades DF and In-the-Wild due to overfitting to the LA validation domain — a well-documented cross-domain generalization problem in audio deepfake detection ([Müller et al., Interspeech 2022](https://arxiv.org/abs/2203.16263)).
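
For reference, EER is the operating point where the false-acceptance rate (spoof scored as bonafide) equals the false-rejection rate (bonafide scored as spoof). A simple, self-contained numpy version (a sketch, not the official ASVspoof scoring tool):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: threshold where FAR == FRR (higher score = more bonafide)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[idx] + frr[idx]) / 2

# Perfectly separated scores give EER 0.0
eer = compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3]))
```

Official evaluation should use the ASVspoof challenge scoring scripts, which interpolate the crossing point more carefully.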

## Training Configuration

Both experiments share the following setup:

| Parameter | Value |
|-----------|-------|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted Cross-Entropy [0.1, 0.9] |
| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |
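
The `[0.1, 0.9]` class weights above down-weight the majority spoof class. A numpy sketch of what that loss computes, mirroring `torch.nn.CrossEntropyLoss(weight=torch.tensor([0.1, 0.9]))` (function name and shapes are illustrative):

```python
import numpy as np

def weighted_ce(logits, labels, weights=(0.1, 0.9)):
    """Class-weighted cross-entropy over (N, 2) logits and (N,) integer labels."""
    w = np.asarray(weights)
    # log-softmax for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -w[labels] * log_probs[np.arange(len(labels)), labels]
    # PyTorch's default 'mean' reduction divides by the sum of applied weights
    return per_sample.sum() / w[labels].sum()

# Uniform logits over 2 classes give a loss of ln(2) regardless of weights
loss = weighted_ce(np.zeros((4, 2)), np.array([0, 1, 0, 1]))
```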

### v1 specifics
- Early stopping: patience=1 on training loss
- No validation set
- 4 epochs trained, best at epoch 2 (train loss = 0.000661)

### v2 specifics
- Early stopping: patience=10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: `torch.no_grad()` in validation loop, correct `best_val_loss` tracking
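
The shape of the fixed validation loop, with toy stand-ins for the real model and dev loader (the `nn.Linear` model and random batches are illustrative only):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real SLS model and ASVspoof2019 LA dev loader
model = nn.Linear(4, 2)
dev_batches = [(torch.randn(5, 4), torch.randint(0, 2, (5,))) for _ in range(3)]
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.1, 0.9]))

best_val_loss = float("inf")
model.eval()
with torch.no_grad():                  # the v2 fix: no gradients during validation
    total = 0.0
    for x, y in dev_batches:
        total += criterion(model(x), y).item() * x.size(0)
val_loss = total / sum(x.size(0) for x, _ in dev_batches)

if val_loss < best_val_loss:           # correct best-loss tracking (second v2 fix)
    best_val_loss = val_loss
    # torch.save(model.state_dict(), "epoch_best.pth")  # checkpoint here
```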

## Usage

### Download checkpoint

```python
from huggingface_hub import hf_hub_download

# Download v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth"
)

# Download v2 checkpoint
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth"
# )
```

### Load and run inference

```python
import torch
from model import Model  # from the GitHub repo

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()
```

Full training and evaluation code: [GitHub Repository](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection)

## Requirements

- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from [fairseq](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr)

See `environment.yml` in the [GitHub repo](https://github.com/Yash-Sukhdeve/XLS-R-SLS-Deepfake-Detection) for the full environment.

## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2024},
  publisher={ACM}
}
```

## Acknowledgements

- [XLS-R](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr) (Babu et al., 2022)
- [RawBoost](https://arxiv.org/abs/2301.00693) (Tak et al., Odyssey 2022)
- [ASVspoof Challenge](https://www.asvspoof.org/)