Add model card with Arena badges + results (gold, #1/10)
README.md
ADDED
@@ -0,0 +1,129 @@
---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
---

# XLSR-SLS

[Speech Anti-Spoofing Arena entry for `xlsr-sls`](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)

A **wav2vec 2.0 (XLS-R 300M) + SLS** audio-deepfake-detection model, from
*"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"*
(Zhang, Wen & Hu, **ACM MM 2024**). A self-supervised XLS-R front-end is paired
with the **SLS (Sensitive Layer Selection)** classifier, which treats the 24
XLS-R transformer layers as a feature pyramid and learns to weight them. The
model takes a raw speech waveform and returns a score where **higher = more
bona fide**.

- **Code:** https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
- **Paper:** https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
- **Parameters:** 340,790,000 (340.79 M)
- **Checkpoint:** [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)

The exact wrapper used to produce the Arena scores is in
[`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).

## Architecture

1. **wav2vec 2.0 XLS-R (300M) front-end**: a self-supervised transformer
   (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
   transformer layers**.
2. **SLS (Sensitive Layer Selection) back-end**: every layer's hidden state is
   average-pooled to a 1024-d descriptor and gated by a per-layer **sigmoid
   attention** (`fc0` → sigmoid); the gates re-weight the full per-layer feature
   stack, which is summed across layers. The fused feature passes through
   BatchNorm + SELU + `3×3` max-pool, is flattened, and goes through a two-layer
   MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
3. The 2-class **log-softmax** output is read at **index 1 = bona fide**.
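
The layer-gating step in item 2 can be sketched in NumPy. This is a minimal illustration and not the code in `_net.py`: the function name `sls_fuse`, the random inputs, and the single shared weight vector `w` and bias `b` standing in for `fc0` are assumptions made for the example. The BatchNorm, SELU, max-pool and MLP stages are left out.

```python
import numpy as np


def sls_fuse(layer_feats, w, b):
    """Gate each layer's feature map with a sigmoid weight and sum over layers.

    layer_feats: (L, T, D) array, one (T, D) hidden-state map per layer.
    w, b: hypothetical stand-ins for the `fc0` projection (D -> 1).
    """
    pooled = layer_feats.mean(axis=1)        # (L, D): average over time frames
    logits = pooled @ w + b                  # (L,): one scalar per layer
    gates = 1.0 / (1.0 + np.exp(-logits))    # (L,): sigmoid gate in (0, 1)
    # Re-weight every layer's full (T, D) map by its gate, then sum the stack.
    fused = (gates[:, None, None] * layer_feats).sum(axis=0)  # (T, D)
    return fused, gates


rng = np.random.default_rng(0)
# 24 layers, 201 frames, 1024 channels (random values, for shape checking only).
feats = rng.standard_normal((24, 201, 1024)).astype(np.float32)
w = (0.01 * rng.standard_normal(1024)).astype(np.float32)
fused, gates = sls_fuse(feats, w, 0.0)
print(fused.shape, gates.shape)
```

Because the gates are computed from pooled descriptors but applied to the unpooled maps, the fused output keeps the full time resolution of a single layer.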

## How it was trained

- **Data:** ASVspoof 2019 **Logical Access (LA)**.
- **Input length:** raw audio at 16 kHz cropped/padded to **64,600 samples**
  (~4.04 s). The window length is **fixed**: `fc1` expects a 22,847-d flattened
  input, so the 64,600-sample window is mandatory at inference.
- **Output:** 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
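
The 22,847 figure can be checked with a few lines of arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional feature encoder (seven unpadded layers with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) and a 3×3 max-pool with stride 3 over the frames × 1024 feature map. The helper names are hypothetical and do not come from the repository.

```python
# (kernel, stride) per conv layer; assumed default wav2vec 2.0 feature encoder.
W2V2_CONV = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def conv_out_len(n, kernel, stride):
    """Output length of an unpadded 1-D convolution."""
    return (n - kernel) // stride + 1


def flatten_dim(num_samples, feat_dim=1024, pool=3):
    """Return (frames, flattened size after a pool x pool max-pool)."""
    frames = num_samples
    for kernel, stride in W2V2_CONV:
        frames = conv_out_len(frames, kernel, stride)
    # Max-pooling with stride == kernel floors both axes.
    return frames, (frames // pool) * (feat_dim // pool)


frames, flat = flatten_dim(64600)
print(frames, flat)  # 201 22847
```

Under these assumptions 64,600 samples give 201 frames, and pooling the 201 × 1024 map yields 67 × 341 = 22,847 values, which is why any other window length would not match `fc1`.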

See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
**Arena standing: 🥇 gold tier, rank #1 of 10.**

| Dataset | Split | EER % | Trials | Skipped | W2V2-AASIST† EER % | Notes |
|---|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **0.23** | 71,237 | 0 | 0.22 | in-domain (training data) |
| ASVspoof2021_LA | test | **7.39** | 181,566 | 0 | 8.11 | cross-dataset generalization |
| ASVspoof2021_DF | test | **3.93** | 611,829 | 0 | 8.32 | cross-dataset generalization |
| InTheWild | test | **7.46** | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **9.81** | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |

† The other XLS-R-based system on the same benchmark (XLS-R 300M + AASIST).
XLSR-SLS's multi-layer SLS fusion has the lower EER on all four sets outside the
training domain, most strikingly on **ASVspoof2021_DF (3.93 vs 8.32)** and
**CD-ADD (9.81 vs 38.57)**, and is on par in-domain (0.23 vs 0.22). The
benchmark's ASVspoof2021 LA/DF use curated trial sets, so the absolute DF EER
differs from the paper's official-keys number (1.92 %); the paper's InTheWild
number (7.46 %) is matched here exactly. The relative ordering is the meaningful
comparison.
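
For readers new to the metric: the equal error rate (EER) is the operating point at which the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. The following is a minimal sketch that assumes the score convention used in this card (higher = more bona fide). `compute_eer` is a hypothetical helper, not the Arena's scoring code, which may interpolate between thresholds differently.

```python
import numpy as np


def compute_eer(bona_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold."""
    bona = np.sort(np.asarray(bona_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # A trial is accepted as bona fide when score >= threshold.
    # FRR: fraction of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # FAR: fraction of spoofed trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2.0)


eer = compute_eer([0.1, 0.6, 0.8, 0.9], [0.0, 0.2, 0.3, 0.7])
print(f"EER = {100 * eer:.2f} %")
```

With only a handful of trials the FAR and FRR curves are coarse step functions, so the value is an approximation; on sets with tens of thousands of trials the steps become negligible.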

## Usage

The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** (only used to build the wav2vec 2.0 architecture;
every weight is then overwritten by `MMpaper_model.pth`):

```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```

The input **must** be exactly 64,600 samples of 16 kHz mono audio. Window the
waveform with `pad_fixed`, which keeps the first 64,600 samples and tile-repeats
the signal if it is shorter.
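
The windowing rule can be sketched as follows. This is an illustration of the behaviour described above (truncate to the first 64,600 samples, tile-repeat shorter inputs), not the repository's `pad_fixed`; the name `pad_fixed_sketch` and the empty-input check are assumptions.

```python
import numpy as np


def pad_fixed_sketch(x, max_len=64600):
    """Return exactly `max_len` samples: truncate, or tile-repeat if shorter."""
    x = np.asarray(x, dtype=np.float32)
    if x.shape[0] == 0:
        raise ValueError("cannot window an empty waveform")
    if x.shape[0] >= max_len:
        return x[:max_len]                 # deterministic first-N window
    reps = max_len // x.shape[0] + 1       # enough copies to cover max_len
    return np.tile(x, reps)[:max_len]


short = np.arange(48000, dtype=np.float32)   # 3 s at 16 kHz
windowed = pad_fixed_sketch(short)
print(windowed.shape)  # (64600,)
```

A 48,000-sample input is followed by its own first 16,600 samples, so no zero padding is introduced.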

```python
import numpy as np
from xlsr_sls import XLSRSLS          # _net.py + xlsr_sls.py are in this repo

m = XLSRSLS()
m.load()                               # loads MMpaper_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32)   # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])           # higher = more bona fide
m.unload()
```

Internally the wrapper windows the input, runs the network, and returns
`output[:, 1]` (class 1 = bona fide; source `main.py`: `batch_score =
batch_out[:, 1]`). [`xlsr_sls.py`](./xlsr_sls.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.
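
To make the score concrete, here is a small NumPy sketch of reading the bona fide log-probability from a 2-class output. It starts from raw logits and applies log-softmax itself, which is an assumption for illustration; in the released network the log-softmax is already part of the model output, and `bona_fide_score` is a hypothetical name.

```python
import numpy as np


def bona_fide_score(logits):
    """Log-softmax over 2 classes, then take column 1 (bona fide)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[:, 1]


# Two made-up trials: one undecided, one leaning bona fide.
scores = bona_fide_score([[0.0, 0.0], [1.0, 3.0]])
print(scores)
```

The score is a log-probability, so it is always at most 0, and values closer to 0 indicate a more confident bona fide decision.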

## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={6765--6773},
  year={2024},
  doi={10.1145/3664647.3681345}
}
```

## License

MIT; see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).