korallll committed · Commit 2e8b862 · verified · 1 Parent(s): b61f57a

Add model card with Arena badges + results (gold, #1/10)

Files changed (1)
  1. README.md +129 -0
README.md ADDED
@@ -0,0 +1,129 @@
+ ---
+ license: mit
+ tags:
+ - audio
+ - anti-spoofing
+ - audio-deepfake-detection
+ - speech
+ - asvspoof
+ - wav2vec2
+ ---
+
+ # XLSR-SLS
+
+ [![EER% 0.23 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.23%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![EER% 7.39 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-7.39%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![EER% 3.93 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-3.93%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![EER% 7.46 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-7.46%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![EER% 9.81 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD-ADD-9.81%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+ [![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/xlsr-sls/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls)
+
+ A **wav2vec 2.0 (XLS-R 300M) + SLS** audio-deepfake-detection model, from
+ *"Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier"*
+ (Zhang, Wen & Hu, **ACM MM 2024**). A self-supervised XLS-R front-end is paired
+ with the **SLS (Sensitive Layer Selection)** classifier, which treats the 24
+ XLS-R transformer layers as a feature pyramid and learns to weight them. The
+ model takes a raw speech waveform and returns a score where **higher = more
+ bona fide**.
+
+ - **Code:** https://github.com/QiShanZhang/SLSforASVspoof-2021-DF
+ - **Paper:** https://doi.org/10.1145/3664647.3681345 (ACM MM 2024; no arXiv version)
+ - **Parameters:** 340,790,000 (340.79 M)
+ - **Checkpoint:** [`MMpaper_model.pth`](./MMpaper_model.pth) (the paper's released model)
+
+ The exact wrapper used to produce the Arena scores is in
+ [`xlsr_sls.py`](./xlsr_sls.py); the network definition is in [`_net.py`](./_net.py).
+
+ ## Architecture
+
+ 1. **wav2vec 2.0 XLS-R (300M) front-end**: a self-supervised transformer
+    (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features from **all 24
+    transformer layers**.
+ 2. **SLS (Sensitive Layer Selection) back-end**: every layer's hidden state is
+    average-pooled over time to a 1024-d descriptor, from which a per-layer
+    **sigmoid attention** gate is computed (`fc0` → sigmoid); the gates re-weight
+    the full per-layer feature stack, which is then summed across layers. The
+    fused feature passes through BatchNorm + SELU + `3×3` max-pool, is flattened,
+    and goes through a two-layer MLP (`fc1: 22847→1024`, `fc3: 1024→2`).
+ 3. The 2-class **log-softmax** output is read at **index 1 (bona fide)**.
+
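The layer-weighting step in item 2 can be illustrated with a toy, dependency-free sketch. It is not the code in `_net.py`: the function name `sls_fuse` and the arguments `w` and `b` (standing in for the `fc0` weights) are assumptions made for illustration, and the BatchNorm, SELU, max-pool and MLP stages are left out.

```python
import math


def sls_fuse(layers, w, b):
    """Toy SLS-style fusion (illustrative only).

    layers: list of per-layer outputs, each a T x D list of frame features.
    w, b:   weights and bias of a hypothetical fc0 mapping D -> 1.
    Returns the fused T x D feature map and the per-layer gates.
    """
    num_frames, dim = len(layers[0]), len(layers[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    gates = []
    for feats in layers:
        # Average-pool over time: one D-dim descriptor for this layer.
        pooled = [sum(frame[d] for frame in feats) / num_frames for d in range(dim)]
        # fc0 followed by a sigmoid gives a scalar gate in (0, 1).
        logit = sum(wi * pi for wi, pi in zip(w, pooled)) + b
        gate = 1.0 / (1.0 + math.exp(-logit))
        gates.append(gate)
        # Re-weight the full T x D map of this layer and sum across layers.
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += gate * feats[t][d]
    return fused, gates


# Two layers, one frame, two feature dims; zero weights give gates of 0.5.
fused, gates = sls_fuse([[[1.0, 1.0]], [[2.0, 2.0]]], w=[0.0, 0.0], b=0.0)
print(gates)  # [0.5, 0.5]
print(fused)  # [[1.5, 1.5]]
```

In the real model the same idea would run over 24 layers of 1024-d frames, with the gate weights learned during training.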
+ ## How it was trained
+
+ - **Data:** ASVspoof 2019 **Logical Access (LA)**.
+ - **Input length:** raw audio at 16 kHz cropped/padded to **64,600 samples**
+   (~4.04 s). The window length is **fixed**: `fc1` expects a 22,847-d flattened
+   input, so the 64,600-sample window is mandatory at inference.
+ - **Output:** 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
+
+ See the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)
+ for the full training and evaluation code.
+
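The link between the 64,600-sample window and the 22,847-d input of `fc1` can be checked with a little arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional encoder layout (kernel/stride pairs of 10/5, then four of 3/2, then two of 2/2, no padding) and a 3×3 max-pool with stride 3 over the frames × 1024 feature map. `num_frames` and `flatten_dim` are hypothetical helper names, not functions from the repository.

```python
# Assumed wav2vec 2.0 feature-encoder layout: (kernel, stride) per conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, conv_layers=CONV_LAYERS):
    """Number of frames the conv encoder emits for a raw waveform."""
    length = num_samples
    for kernel, stride in conv_layers:
        length = (length - kernel) // stride + 1  # unpadded 1-D convolution
    return length


def flatten_dim(num_samples, feature_dim=1024, pool=3):
    """Size of the flattened feature map after a pool x pool max-pool."""
    frames = num_frames(num_samples)
    return (frames // pool) * (feature_dim // pool)


print(64600 / 16000)       # 4.0375 (seconds)
print(num_frames(64600))   # 201
print(flatten_dim(64600))  # 22847
```

Under these assumptions any other window length yields a different flatten size (for example, 48,000 samples gives 16,709), which is why the input length cannot vary.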
+ ## Benchmark result (Speech Anti-Spoofing Arena)
+
+ Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=xlsr-sls).
+ Scores were computed with a **deterministic first-64,600-sample window** (no random
+ crop), so the numbers are exactly reproducible from the pinned score file.
+ **Arena standing: 🥇 gold tier, rank #1 of 10.**
+
+ | Dataset | Split | EER % (XLSR-SLS) | Trials | Skipped | EER % (W2V2-AASIST†) | Notes |
+ |---|---|---|---|---|---|---|
+ | ASVspoof2019_LA | test | **0.23** | 71,237 | 0 | 0.22 | in-domain (same corpus as the training data) |
+ | ASVspoof2021_LA | test | **7.39** | 181,566 | 0 | 8.11 | cross-dataset generalization |
+ | ASVspoof2021_DF | test | **3.93** | 611,829 | 0 | 8.32 | cross-dataset generalization |
+ | InTheWild | test | **7.46** | 31,779 | 0 | 11.22 | out-of-domain (real-world deepfakes) |
+ | CD-ADD | test | **9.81** | 20,786 | 0 | 38.57 | out-of-domain (modern neural-TTS) |
+
+ † EER % of the other XLS-R-based system on the same benchmark (XLS-R 300M + AASIST).
+ XLSR-SLS's multi-layer SLS fusion has the lower EER on **every cross-dataset and
+ out-of-domain set**, most strikingly on **ASVspoof2021_DF (3.93 vs 8.32)** and
+ **CD-ADD (9.81 vs 38.57)**, and is on par in-domain (0.23 vs 0.22). The benchmark
+ evaluates ASVspoof2021 LA/DF on curated trial sets, so absolute EER on those sets
+ differs from the paper's official-keys numbers (1.92 % on DF). The paper's 7.46 % on
+ InTheWild is matched here exactly. The relative ordering is the meaningful comparison.
+
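For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate (spoofed trials accepted as bona fide) equals the false-rejection rate (bona fide trials rejected). The following is a minimal sketch of that definition for scores where higher means more bona fide. It is not the Arena's implementation, `compute_eer` is a hypothetical name, and the brute-force threshold sweep is only suitable for small score lists.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted as bona fide when its score is >= the threshold.
    Returns a fraction in [0, 1]; multiply by 100 for EER %.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))  # threshold that rejects every trial
    best_gap, eer = None, None
    for thr in thresholds:
        # False rejection: bona fide trial scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trial scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# One bona fide trial scores below one spoofed trial: EER is 1/3.
print(compute_eer([0.9, 0.8, 0.4], [0.5, 0.2, 0.1]))
# Perfectly separated scores: EER is 0.
print(compute_eer([1.0, 2.0], [-1.0, -2.0]))
```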
+ ## Usage
+
+ The checkpoint is a `state_dict` for the `Model` network defined in
+ [`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
+ checkpoint **`xlsr2_300m.pt`** (only used to build the wav2vec 2.0 architecture;
+ every weight is then overwritten by `MMpaper_model.pth`):
+
+ ```bash
+ wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
+ ```
+
+ The network input **must** be exactly 64,600 samples at 16 kHz mono, so the
+ waveform is windowed with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).
+
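The windowing rule stated above (keep the first 64,600 samples, tile-repeat shorter inputs) can be sketched without any dependencies. This is an illustration of the described behaviour, not the repository's `pad_fixed`; the name `pad_fixed_sketch` and its signature are assumptions.

```python
def pad_fixed_sketch(samples, target_len=64600):
    """Return exactly target_len samples.

    Longer inputs are cut to their first target_len samples; shorter inputs
    are repeated end-to-end and then cut, so no silence is inserted.
    """
    if len(samples) == 0:
        raise ValueError("cannot window an empty waveform")
    if len(samples) >= target_len:
        return list(samples[:target_len])
    repeats = target_len // len(samples) + 1  # enough copies to cover target_len
    return (list(samples) * repeats)[:target_len]


print(pad_fixed_sketch([1, 2, 3], target_len=7))         # [1, 2, 3, 1, 2, 3, 1]
print(pad_fixed_sketch(list(range(10)), target_len=4))   # [0, 1, 2, 3]
print(len(pad_fixed_sketch([0.0] * 48000)))              # 64600
```

Because the crop always starts at sample 0, the same file always produces the same window, which is what makes the benchmark scores deterministic.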
+ ```python
+ import numpy as np
+ from xlsr_sls import XLSRSLS  # _net.py + xlsr_sls.py are in this repo
+
+ m = XLSRSLS()
+ m.load()  # loads MMpaper_model.pth (+ xlsr2_300m.pt)
+ # 3 s of float32 mono noise at 16 kHz; the wrapper windows it to 64,600 samples
+ audio = np.random.randn(48000).astype(np.float32)
+ print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
+ m.unload()
+ ```
+
+ Internally the wrapper windows the input, runs the network, and returns
+ `output[:, 1]` (class 1 = bona fide; source `main.py`: `batch_score = batch_out[:, 1]`).
+ [`xlsr_sls.py`](./xlsr_sls.py) is the exact `speech_spoof_bench` model that
+ produced the Arena `scores.txt`.
+
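To make the score's scale concrete, here is a small sketch of how a 2-class log-softmax turns a pair of logits into the bona-fide score. The helper names `log_softmax` and `bonafide_score` are illustrative and not part of `xlsr_sls.py`; the logits are made-up values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def bonafide_score(logits):
    """Score = log-probability of class 1 (bona fide); always <= 0."""
    return log_softmax(logits)[1]


print(bonafide_score([0.0, 0.0]))   # log(0.5): the model is undecided
print(bonafide_score([-2.0, 3.0]))  # close to 0: confidently bona fide
print(bonafide_score([3.0, -2.0]))  # strongly negative: confidently spoofed
```

Since the score is a log-probability, it is at most 0, and only its ordering across trials matters for EER.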
+ ## Citation
+
+ ```bibtex
+ @inproceedings{zhang2024audio,
+   title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
+   author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
+   booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
+   pages={6765--6773},
+   year={2024},
+   doi={10.1145/3664647.3681345}
+ }
+ ```
+
+ ## License
+
+ MIT; see the [source repository](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF).