---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
- nes2net
---
# Nes2Net
A **wav2vec 2.0 (XLS-R 300M) + Nes2Net-X** anti-spoofing model, from
*"Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech
Anti-Spoofing"* (Liu, Truong, Das, Lee & Li, IEEE T-IFS 2025). A self-supervised
XLS-R front-end is fine-tuned end-to-end with a **nested Res2Net** back-end that
operates directly on the foundation-model features — no dimensionality-reducing
neck — using only ~0.51 M back-end parameters. The model takes a raw speech
waveform and returns a score where **higher = more bona fide**.
- **Code:** https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW
- **Paper:** https://arxiv.org/abs/2504.05657 (DOI 10.1109/TIFS.2025.3626963)
- **Parameters:** 317,902,600 (317.90 M total; Nes2Net-X back-end only 0.51 M)
- **Checkpoint:** [`nes2net_x_DF1.65.pth`](./nes2net_x_DF1.65.pth) (single Nes2Net-X)
The exact wrapper used to produce the Arena scores is in
[`nes2net.py`](./nes2net.py); the network definition is in [`_net.py`](./_net.py).
## Architecture
1. **wav2vec 2.0 XLS-R (300M) front-end** — a self-supervised transformer
(`fairseq` `Wav2Vec2Model`) producing 1024-d frame features, fine-tuned
end-to-end with the rest of the network.
2. **Nes2Net-X back-end** — a *nested* Res2Net TDNN: outer Res2Net groups, each an
inner Res2Net (`Bottle2neck`) with squeeze-and-excitation and a learnable
weighted multi-scale sum, applied directly to the 1024-d XLS-R features
(`Nes_ratio=[8,8]`, `SE_ratio=[1]`), then mean temporal pooling and a linear
classifier.
3. The 2-logit output is read at **index 1 = bona fide**.
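
The last two stages of the list (mean temporal pooling, then a linear classifier read at index 1) can be sketched in NumPy. This is only an illustration under stated assumptions: `classify_pooled` is a hypothetical helper rather than a function from `_net.py`, the weights are random, and the `(201, 1024)` feature shape is an illustrative choice, not a value taken from the checkpoint.

```python
import numpy as np


def classify_pooled(features, weight, bias):
    """Mean-pool frame features over time, then apply a linear 2-class layer.

    features: (frames, channels) back-end outputs for one utterance.
    weight:   (2, channels) classifier weights; bias: (2,) classifier bias.
    Returns (logits, score), where score is the bona fide logit at index 1.
    """
    pooled = features.mean(axis=0)    # mean temporal pooling -> (channels,)
    logits = weight @ pooled + bias   # linear classifier -> (2,)
    return logits, float(logits[1])   # index 1 = bona fide


# Random stand-ins for the learned features and classifier (assumptions).
rng = np.random.default_rng(0)
features = rng.standard_normal((201, 1024)).astype(np.float32)
weight = (0.01 * rng.standard_normal((2, 1024))).astype(np.float32)
bias = np.zeros(2, dtype=np.float32)

logits, score = classify_pooled(features, weight, bias)
print(logits.shape, score)
```

Because the score is a raw logit rather than a probability, it is only meaningful relative to a threshold or to other scores from the same model.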
## How it was trained
- **Data:** ASVspoof 2019 **Logical Access (LA)**, with RawBoost data augmentation.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4 s).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.
See the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW) for
the full training and evaluation code.
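
As a rough check on the input length above, the arithmetic below converts 64,600 samples to seconds and to a frame count. It assumes the standard wav2vec 2.0 convolutional feature encoder (seven unpadded layers with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); `conv_output_length` is a hypothetical helper, not part of this repository.

```python
SAMPLE_RATE = 16_000
WINDOW = 64_600

# Assumed wav2vec 2.0 feature-encoder layout (kernel size, stride) per layer.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_length(num_samples, kernels=KERNELS, strides=STRIDES):
    """Number of output frames after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


print(WINDOW / SAMPLE_RATE)        # window duration in seconds
print(conv_output_length(WINDOW))  # frames fed to the back-end
```

Under these assumptions one window lasts 4.0375 s and yields 201 frames, i.e. roughly one 1024-d feature vector every 20 ms.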
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **0.13** | 71,237 | 0 | in-domain (same corpus as the training data) |
| ASVspoof2021_DF | test | **3.61** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **6.14** | 181,566 | 0 | cross-dataset generalization |
| InTheWild | test | **8.48** | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **20.55** | 20,786 | 0 | out-of-domain (modern neural-TTS) |

Although its back-end is roughly 30× smaller than those of typical SSL-based
countermeasures, Nes2Net-X generalizes well to unseen attacks: on this benchmark
it beats a wav2vec 2.0 + AASIST baseline on every dataset, with the largest
margins on CD-ADD and ASVspoof2021 DF.
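
For readers unfamiliar with the metric, here is a minimal sketch of how an equal error rate can be computed from scores where higher means more bona fide. It is a plain threshold sweep, not the Arena's implementation, which may interpolate differently; `compute_eer` and the toy scores are illustrative assumptions.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate (as a fraction) for scores where higher = bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.sort(np.concatenate([bona, spoof, [np.inf]]))
    # A trial is accepted as bona fide when score >= threshold.
    # searchsorted(..., side="left") counts the scores strictly below each threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # EER is taken where false rejection and false acceptance are closest.
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)


# Toy scores, not taken from the benchmark.
eer = compute_eer([0.9, 0.8, 0.3], [0.1, 0.2, 0.7])
print(f"EER = {100 * eer:.2f} %")
```

The table reports this quantity as a percentage, so lower is better and 0 would mean the two classes are perfectly separated by some threshold.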
## Usage
The checkpoint is a `state_dict` for the `Model` network defined in
[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
checkpoint **`xlsr2_300m.pt`** next to the wrapper (only used to build the
wav2vec 2.0 architecture; every weight is then overwritten by the fine-tuned
checkpoint):
```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
```
The input is windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed`
(first 64,600 samples, tile-repeat if shorter).
```python
import numpy as np
from nes2net import Nes2Net  # _net.py + nes2net.py are in this repo
m = Nes2Net()
m.load()  # loads nes2net_x_DF1.65.pth (+ xlsr2_300m.pt)
# Placeholder input: 3 s of noise standing in for float32 mono speech at 16 kHz.
audio = np.random.randn(48000).astype(np.float32)
# One waveform and its sample rate; the first (and only) score is printed.
print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
m.unload()
```
Internally the wrapper windows the input, runs the network, and returns
`logits[:, 1]` (class 1 = bona fide). [`nes2net.py`](./nes2net.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.
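
The windowing step can be sketched as follows, going only by the description above (first 64,600 samples, tile-repeat if shorter). `pad_fixed_sketch` is a hypothetical re-implementation for illustration; the actual `pad_fixed` used by the wrapper may differ in details such as dtype handling or empty-input behaviour.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz (~4 s)


def pad_fixed_sketch(audio, target_len=TARGET_LEN):
    """Return exactly target_len samples: crop from the start, or tile-repeat."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1 or audio.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")
    if audio.size >= target_len:
        return audio[:target_len]            # deterministic first-window crop
    repeats = -(-target_len // audio.size)   # ceiling division
    return np.tile(audio, repeats)[:target_len]


short = np.arange(48_000, dtype=np.float32)   # 3 s clip, shorter than the window
long = np.arange(100_000, dtype=np.float32)   # longer than the window
print(pad_fixed_sketch(short).shape, pad_fixed_sketch(long).shape)
```

Cropping from the start instead of at a random offset is what makes the scores repeatable from run to run.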
## Citation
```bibtex
@article{Nes2Net,
author={Liu, Tianchi and Truong, Duc-Tuan and Das, Rohan Kumar and Lee, Kong Aik and Li, Haizhou},
journal={IEEE Transactions on Information Forensics and Security},
title={Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing},
year={2025},
volume={20},
pages={12005--12018},
doi={10.1109/TIFS.2025.3626963}
}
```
## License
MIT — see the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW).