korallll committed
Commit 8242d1b · verified · 1 Parent(s): f19b5fd

Add model card with Arena badges (all 5 datasets)

Files changed (1)
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
- nes2net
---

# Nes2Net

[![EER% 0.13 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.13%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![EER% 6.14 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-6.14%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![EER% 3.61 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-3.61%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![EER% 8.48 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-8.48%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![EER% 20.55 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-20.55%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/nes2net/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/nes2net/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)

A **wav2vec 2.0 (XLS-R 300M) + Nes2Net-X** anti-spoofing model, from
*"Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech
Anti-Spoofing"* (Liu, Truong, Das, Lee & Li, IEEE T-IFS 2025). A self-supervised
XLS-R front-end is fine-tuned end-to-end with a **nested Res2Net** back-end that
operates directly on the foundation-model features, with no dimensionality-reducing
neck, using only ~0.51 M back-end parameters. The model takes a raw speech
waveform and returns a score where **higher = more bona fide**.

- **Code:** https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW
- **Paper:** https://arxiv.org/abs/2504.05657 (DOI 10.1109/TIFS.2025.3626963)
- **Parameters:** 317,902,600 (317.90 M total; Nes2Net-X back-end only 0.51 M)
- **Checkpoint:** [`nes2net_x_DF1.65.pth`](./nes2net_x_DF1.65.pth) (single Nes2Net-X)

The exact wrapper used to produce the Arena scores is in
[`nes2net.py`](./nes2net.py); the network definition is in [`_net.py`](./_net.py).

## Architecture

1. **wav2vec 2.0 XLS-R (300M) front-end:** a self-supervised transformer
   (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features, fine-tuned
   end-to-end with the rest of the network.
2. **Nes2Net-X back-end:** a *nested* Res2Net TDNN. The outer Res2Net groups are
   each an inner Res2Net (`Bottle2neck`) with squeeze-and-excitation and a learnable
   weighted multi-scale sum, applied directly to the 1024-d XLS-R features
   (`Nes_ratio=[8,8]`, `SE_ratio=[1]`), followed by mean temporal pooling and a
   linear classifier.
3. The 2-logit output is read at **index 1 = bona fide**.

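The last two stages of the list above (mean temporal pooling, then a linear classifier read at index 1) can be illustrated with a toy, dependency-free sketch. This is not the code in `_net.py`: the helper names `mean_pool` and `linear`, the 2-d feature size, and the weights are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence

BONAFIDE_INDEX = 1  # per the model card: logit index 1 = bona fide


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time: (T, D) -> (D,)."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Dense layer: weight has shape (n_classes, D), bias has shape (n_classes,)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy sizes: 3 frames of 2-d features (the real back-end features are wider).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights, not from the checkpoint
bias = [0.0, 0.5]

pooled = mean_pool(frames)             # one vector per utterance
logits = linear(pooled, weight, bias)  # two logits: [spoof, bona fide]
score = logits[BONAFIDE_INDEX]         # higher = more bona fide
```

Because the pooling averages over time, the classifier sees one fixed-size vector per utterance regardless of how many frames the front-end produced.
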
## How it was trained

- **Data:** ASVspoof 2019 **Logical Access (LA)**, with RawBoost data augmentation.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4 s).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.

See the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW) for
the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **0.13** | 71,237 | 0 | in-domain (same corpus as training) |
| ASVspoof2021_DF | test | **3.61** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **6.14** | 181,566 | 0 | cross-dataset generalization |
| InTheWild | test | **8.48** | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **20.55** | 20,786 | 0 | out-of-domain (modern neural-TTS) |

Despite a back-end ~30× smaller than typical SSL countermeasures, Nes2Net-X
generalizes strongly to unseen attacks, beating a wav2vec 2.0 + AASIST baseline on
every dataset in this benchmark, most strikingly on CD-ADD and ASVspoof2021 DF.

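For readers unfamiliar with the metric, the EER values in the table are the operating point where the false-acceptance and false-rejection rates are equal. The sketch below is an assumption about how such a number can be approximated from scores and labels, not the Arena's own routine (which is not shown here and may interpolate between thresholds). The function name `equal_error_rate` and its arguments are hypothetical.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], is_bonafide: Sequence[bool]) -> float:
    """Approximate EER (as a fraction) for scores where higher = more bona fide."""
    bona = [s for s, b in zip(scores, is_bonafide) if b]
    spoof = [s for s, b in zip(scores, is_bonafide) if not b]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof trial")

    best_gap, eer = float("inf"), 1.0
    # Sweep every observed score as a threshold, plus +inf so that the
    # "reject everything" operating point is included.
    thresholds = sorted(set(scores)) + [float("inf")]
    for t in thresholds:
        # A trial is accepted as bona fide when its score is >= t.
        far = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials: one bona fide clip scores below one spoof clip.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [True, True, False, False]
toy_eer_percent = 100 * equal_error_rate(toy_scores, toy_labels)
```

This brute-force sweep is quadratic in the number of trials, so it is only meant for illustration. Sets as large as those in the table would call for a sorted, single-pass implementation.
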
## Usage

The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** next to the wrapper (only used to build the
wav2vec 2.0 architecture; every weight is then overwritten by the fine-tuned
checkpoint):

```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```

The input is windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed`
(first 64,600 samples, tile-repeat if shorter).

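The windowing rule described above can be sketched in a few lines. This is a re-implementation from the description only, using plain Python lists rather than arrays. The actual `pad_fixed` in the wrapper may differ in signature and edge-case handling, and the empty-input check is an added assumption.

```python
from typing import List, Sequence

TARGET_LEN = 64_600  # samples; 64,600 / 16,000 Hz is about 4.04 s


def pad_fixed(samples: Sequence[float], target_len: int = TARGET_LEN) -> List[float]:
    """Return exactly target_len samples: crop long inputs, tile short ones."""
    if len(samples) == 0:
        raise ValueError("cannot window an empty waveform")
    if len(samples) >= target_len:
        # Deterministic crop: always the first target_len samples.
        return list(samples[:target_len])
    # Repeat the clip end to end until it covers target_len, then trim.
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


short = pad_fixed([0.1, 0.2, 0.3], target_len=8)   # tiled, then trimmed
long = pad_fixed([float(i) for i in range(100_000)])  # cropped to the first window
```

Because the crop always starts at sample 0 and the tiling is a pure repeat, the same file always yields the same window, which is what makes the reported scores reproducible.
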
```python
import numpy as np
from nes2net import Nes2Net                 # _net.py + nes2net.py are in this repo

m = Nes2Net()
m.load()                                    # loads nes2net_x_DF1.65.pth (+ xlsr2_300m.pt)
# Placeholder input: 3 s of random noise standing in for real speech.
audio = np.random.randn(48000).astype(np.float32)   # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])   # higher = more bona fide
m.unload()
```

Internally the wrapper windows the input, runs the network, and returns
`logits[:, 1]` (class 1 = bona fide). [`nes2net.py`](./nes2net.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.

108
+
109
+ ## Citation
110
+
111
+ ```bibtex
112
+ @article{Nes2Net,
113
+ author={Liu, Tianchi and Truong, Duc-Tuan and Das, Rohan Kumar and Lee, Kong Aik and Li, Haizhou},
114
+ journal={IEEE Transactions on Information Forensics and Security},
115
+ title={Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing},
116
+ year={2025},
117
+ volume={20},
118
+ pages={12005--12018},
119
+ doi={10.1109/TIFS.2025.3626963}
120
+ }
121
+ ```
122
+
123
+ ## License
124
+
125
+ MIT — see the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW).