| --- |
| license: mit |
| tags: |
| - audio |
| - anti-spoofing |
| - audio-deepfake-detection |
| - speech |
| - asvspoof |
| - wav2vec2 |
| - nes2net |
| --- |
| |
| # Nes2Net |
|
|
| [](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net) |
|
|
| A **wav2vec 2.0 (XLS-R 300M) + Nes2Net-X** anti-spoofing model, from |
| *"Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech |
| Anti-Spoofing"* (Liu, Truong, Das, Lee & Li, IEEE T-IFS 2025). A self-supervised |
| XLS-R front-end is fine-tuned end-to-end with a **nested Res2Net** back-end that |
| operates directly on the foundation-model features — no dimensionality-reducing |
| neck — using only ~0.51 M back-end parameters. The model takes a raw speech |
| waveform and returns a score where **higher = more bona fide**. |
|
|
| - **Code:** https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW |
| - **Paper:** https://arxiv.org/abs/2504.05657 (DOI 10.1109/TIFS.2025.3626963) |
| - **Parameters:** 317,902,600 (317.90 M total; Nes2Net-X back-end only 0.51 M) |
| - **Checkpoint:** [`nes2net_x_DF1.65.pth`](./nes2net_x_DF1.65.pth) (single Nes2Net-X) |
|
|
| The exact wrapper used to produce the Arena scores is in |
| [`nes2net.py`](./nes2net.py); the network definition is in [`_net.py`](./_net.py). |
|
|
| ## Architecture |
|
|
| 1. **wav2vec 2.0 XLS-R (300M) front-end** — a self-supervised transformer |
| (`fairseq` `Wav2Vec2Model`) producing 1024-d frame features, fine-tuned |
| end-to-end with the rest of the network. |
| 2. **Nes2Net-X back-end** — a *nested* Res2Net TDNN: outer Res2Net groups, each an |
| inner Res2Net (`Bottle2neck`) with squeeze-and-excitation and a learnable |
| weighted multi-scale sum, applied directly to the 1024-d XLS-R features |
| (`Nes_ratio=[8,8]`, `SE_ratio=[1]`), then mean temporal pooling and a linear |
| classifier. |
| 3. The 2-logit output is read at **index 1 = bona fide**. |
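The last two stages of the list above (mean temporal pooling, a linear classifier, and reading index 1) can be sketched in a few lines of NumPy. This is only an illustration with random weights, not the network in `_net.py`; the function name `pooled_score`, the frame count, and the assumption that the back-end emits 1024 channels are all hypothetical.

```python
import numpy as np


def pooled_score(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> float:
    """Mean-pool frame features over time, apply a linear layer, return logit 1.

    features: (T, C) frame-level back-end features (C assumed to be 1024 here).
    weight:   (2, C) classifier weights; bias: (2,) classifier bias.
    """
    pooled = features.mean(axis=0)   # (C,)  mean temporal pooling
    logits = weight @ pooled + bias  # (2,)  index 0 = spoof, index 1 = bona fide
    return float(logits[1])          # higher = more bona fide


# Random stand-ins for real features and trained weights (illustration only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((201, 1024)).astype(np.float32)  # hypothetical frame count
W = (rng.standard_normal((2, 1024)) * 0.01).astype(np.float32)
b = np.zeros(2, dtype=np.float32)
print(pooled_score(feats, W, b))
```

Because pooling averages over the time axis, the head produces a single score per utterance regardless of how many frames the front-end emits.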
|
|
| ## How it was trained |
|
|
| - **Data:** ASVspoof 2019 **Logical Access (LA)**, with RawBoost data augmentation. |
| - **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4 s). |
| - **Output:** 2-class logits; the bona-fide logit (index 1) is the score. |
|
|
| See the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW) for |
| the full training and evaluation code. |
|
|
| ## Benchmark result (Speech Anti-Spoofing Arena) |
|
|
| Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net). |
| Scores were computed with a **deterministic first-64,600-sample window** (no random |
| crop), so the numbers are exactly reproducible from the pinned score file. |
|
|
| | Dataset | Split | EER % | Trials | Skipped | Notes | |
| |---|---|---|---|---|---| |
| | ASVspoof2019_LA | test | **0.13** | 71,237 | 0 | in-domain (same corpus as the training data) | |
| | ASVspoof2021_DF | test | **3.61** | 611,829 | 0 | cross-dataset generalization | |
| | ASVspoof2021_LA | test | **6.14** | 181,566 | 0 | cross-dataset generalization | |
| | InTheWild | test | **8.48** | 31,779 | 0 | out-of-domain (real-world deepfakes) | |
| | CD-ADD | test | **20.55** | 20,786 | 0 | out-of-domain (modern neural-TTS) | |
| |
| Despite a back-end roughly 30× smaller than typical SSL countermeasures, Nes2Net-X
| generalizes strongly to unseen attacks: it beats a wav2vec 2.0 + AASIST baseline on
| every dataset in this benchmark, most strikingly on CD-ADD and
| ASVspoof2021 DF.
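For readers unfamiliar with the metric, the equal error rate (EER) in the table is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. The sketch below computes a simple discrete EER from scores where higher means more bona fide. It is a minimal illustration, not the Arena's scoring code: the function name `compute_eer` is hypothetical, and the Arena may use a different interpolation or tie-breaking convention, so small differences from the table are possible.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher means more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bona, spoof]))

    # searchsorted(..., side="left") counts the elements strictly below each threshold.
    # False rejection: bona fide trials scored below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance: spoof trials scored at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    idx = int(np.argmin(np.abs(far - frr)))  # threshold where the two rates are closest
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


# Toy example: perfectly separated scores give an EER of zero.
eer, thr = compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
print(f"EER = {100 * eer:.2f} %, threshold = {thr}")
```

Multiplying the returned rate by 100 gives a percentage comparable to the "EER %" column.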
| |
| ## Usage |
| |
| The checkpoint is a `state_dict` for the `Model` network defined in |
| [`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M |
| checkpoint **`xlsr2_300m.pt`** next to the wrapper (only used to build the |
| wav2vec 2.0 architecture; every weight is then overwritten by the fine-tuned |
| checkpoint): |
| |
| ```bash |
| wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt |
| ``` |
| |
| The input must be 16 kHz mono. `pad_fixed` windows it to exactly 64,600 samples:
| the first 64,600 samples are kept, and shorter inputs are tile-repeated until they reach that length.
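The windowing rule described above can be sketched as follows. This is an assumption-laden illustration rather than the `pad_fixed` implementation in `nes2net.py`: the name `pad_fixed_sketch` is hypothetical, and the handling of empty input is a choice made here, not something the source specifies.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, about 4.04 s


def pad_fixed_sketch(audio: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Return exactly target_len samples: the first window, or a tiled copy if short."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.size == 0:
        # Assumption: an empty waveform is rejected rather than zero-padded.
        raise ValueError("cannot window an empty waveform")
    if audio.size >= target_len:
        return audio[:target_len]            # deterministic: always the first window
    repeats = -(-target_len // audio.size)   # ceiling division
    return np.tile(audio, repeats)[:target_len]


short = np.arange(5, dtype=np.float32)
print(pad_fixed_sketch(short, target_len=12))  # the 5-sample ramp repeated, cut at 12
```

Taking the first window instead of a random crop is what makes repeated scoring runs produce identical numbers.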
| |
| ```python |
| import numpy as np |
| from nes2net import Nes2Net # _net.py + nes2net.py are in this repo |
| |
| m = Nes2Net() |
| m.load() # loads nes2net_x_DF1.65.pth (+ xlsr2_300m.pt) |
| audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz |
| print(m.score_batch([audio], [16000])[0]) # higher = more bona fide |
| m.unload() |
| ``` |
| |
| Internally the wrapper windows the input, runs the network, and returns |
| `logits[:, 1]` (class 1 = bona fide). [`nes2net.py`](./nes2net.py) is the exact |
| `speech_spoof_bench` model that produced the Arena `scores.txt`. |
| |
| ## Citation |
| |
| ```bibtex |
| @article{Nes2Net, |
| author={Liu, Tianchi and Truong, Duc-Tuan and Das, Rohan Kumar and Lee, Kong Aik and Li, Haizhou}, |
| journal={IEEE Transactions on Information Forensics and Security}, |
| title={Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing}, |
| year={2025}, |
| volume={20}, |
| pages={12005--12018}, |
| doi={10.1109/TIFS.2025.3626963} |
| } |
| ``` |
| |
| ## License |
| |
| MIT — see the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW). |
| |