Nes2Net

A wav2vec 2.0 (XLS-R 300M) + Nes2Net-X anti-spoofing model, from "Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing" (Liu, Truong, Das, Lee & Li, IEEE T-IFS 2025). A self-supervised XLS-R front-end is fine-tuned end-to-end with a nested Res2Net back-end that operates directly on the foundation-model features — no dimensionality-reducing neck — using only ~0.51 M back-end parameters. The model takes a raw speech waveform and returns a score where higher = more bona fide.

Code: https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW
Paper: https://arxiv.org/abs/2504.05657 (DOI 10.1109/TIFS.2025.3626963)
Parameters: 317,902,600 (317.90 M total; Nes2Net-X back-end only 0.51 M)
Checkpoint: nes2net_x_DF1.65.pth (single Nes2Net-X)

The exact wrapper used to produce the Arena scores is in nes2net.py; the network definition is in _net.py.

Architecture

wav2vec 2.0 XLS-R (300M) front-end — a self-supervised transformer (fairseq Wav2Vec2Model) producing 1024-d frame features, fine-tuned end-to-end with the rest of the network.
Nes2Net-X back-end — a nested Res2Net TDNN: outer Res2Net groups, each an inner Res2Net (Bottle2neck) with squeeze-and-excitation and a learnable weighted multi-scale sum, applied directly to the 1024-d XLS-R features (Nes_ratio=[8,8], SE_ratio=[1]), then mean temporal pooling and a linear classifier.
The 2-logit output is read at index 1 = bona fide.
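
As a rough illustration of the last two stages described above (mean temporal pooling, then a linear classifier whose index-1 logit is the score), here is a minimal NumPy sketch. It is not the code in _net.py: the function names, the random placeholder weights, the frame count of 201, and the assumption that the pooled vector is 1024-d are all hypothetical choices made only to keep the example self-contained.

```python
import numpy as np

FEATURE_DIM = 1024  # XLS-R 300M frame feature size (assumed to be kept up to pooling)


def pooled_logits(features, weight, bias):
    """Mean-pool (frames, channels) features over time, then apply a linear layer."""
    pooled = features.mean(axis=0)   # shape: (channels,)
    return weight @ pooled + bias    # shape: (2,)


def bona_fide_score(logits):
    """Index 1 is the bona fide class; a higher value means more bona fide."""
    return float(logits[1])


rng = np.random.default_rng(0)
# Hypothetical inputs: 201 frames of back-end output and placeholder classifier weights.
features = rng.standard_normal((201, FEATURE_DIM)).astype(np.float32)
weight = (0.01 * rng.standard_normal((2, FEATURE_DIM))).astype(np.float32)
bias = np.zeros(2, dtype=np.float32)

logits = pooled_logits(features, weight, bias)
print(logits.shape, bona_fide_score(logits))
```

Because pooling averages over the time axis, the head produces one pair of logits per utterance regardless of how many frames the front-end emits.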

How it was trained

Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4 s).
Output: 2-class logits; the bona-fide logit (index 1) is the score.

See the source repository for the full training and evaluation code.

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.

Dataset	Split	EER %	Trials	Notes
ASVspoof2019_LA	test	0.13	71,237	in-domain (training data)
ASVspoof2021_DF	test	3.61	611,829	cross-dataset generalization
ASVspoof2021_LA	test	6.14	181,566	cross-dataset generalization
InTheWild	test	8.48	31,779	out-of-domain (real-world deepfakes)
CD-ADD	test	20.55	20,786	out-of-domain (modern neural-TTS)
SONAR	test	33.33	3,948	out-of-domain (8 modern TTS systems)
LibriSeVoc	test	3.07	18,487	out-of-domain (neural vocoders)
CFAD	test	18.06	62,999	out-of-domain (Chinese audio deepfakes)
CVoiceFake_small	test	13.92	138,136	out-of-domain (multilingual TTS/VC)
ASVspoof5	test	22.25	680,774	out-of-domain (crowdsourced TTS/VC + adversarial)
ADD22_eval_31	test	26.58	112,861	out-of-domain (ADD 2022 Mandarin Track-3 fake-game)
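
For readers unfamiliar with the metric, the EER column can be understood through the threshold sweep sketched below. This is a generic approximation written for this card, not the Arena's own scoring routine, which may interpolate differently; the function name compute_eer and the label convention (1 = bona fide, 0 = spoof) are assumptions. It relies only on the fact stated above that a higher score means more bona fide.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate equal error rate in percent (labels: 1 = bona fide, 0 = spoof)."""
    scores = np.asarray(scores, dtype=np.float64)
    is_bona = np.asarray(labels).astype(bool)
    bona = np.sort(scores[is_bona])
    spoof = np.sort(scores[~is_bona])
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof trial")

    # Candidate thresholds: every observed score, plus +inf so that
    # the "reject everything" operating point is included.
    thresholds = np.append(np.unique(scores), np.inf)

    # A trial is accepted as bona fide when score >= threshold.
    # FRR: fraction of bona fide trials rejected (score < threshold).
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # FAR: fraction of spoof trials accepted (score >= threshold).
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    # Take the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[idx] + frr[idx]) / 2.0


# Toy example: two bona fide and two spoof trials, one of each on the wrong side.
print(compute_eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

On trial lists as large as those in the table, the gap between this discrete estimate and an interpolated one is typically small, but the pinned score file and the Arena's own tooling remain the reference for reproducing the reported numbers.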

Despite a back-end roughly 30× smaller than those of typical SSL countermeasures, Nes2Net-X generalizes strongly to unseen attacks: it beats a wav2vec 2.0 + AASIST baseline on every dataset in this benchmark, most strikingly outside the training data (CD-ADD and ASVspoof2021 DF).

Usage

The checkpoint is a state_dict for the Model network defined in _net.py. Constructing the network requires the base XLS-R 300M checkpoint xlsr2_300m.pt next to the wrapper (only used to build the wav2vec 2.0 architecture; every weight is then overwritten by the fine-tuned checkpoint):

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt

The input is windowed to exactly 64,600 samples at 16 kHz mono with pad_fixed (first 64,600 samples, tile-repeat if shorter).

import numpy as np
from nes2net import Nes2Net               # _net.py + nes2net.py are in this repo

m = Nes2Net()
m.load()                                          # loads nes2net_x_DF1.65.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # 3 s of noise, float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally the wrapper windows the input, runs the network, and returns logits[:, 1] (class 1 = bona fide). nes2net.py is the exact speech_spoof_bench model that produced the Arena scores.txt.
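
The windowing step described above (keep the first 64,600 samples, tile-repeat if the clip is shorter) can be sketched as follows. This is a reconstruction from that description, not the pad_fixed implementation in nes2net.py; the name pad_fixed_sketch and the handling of empty input are assumptions.

```python
import numpy as np

TARGET_LEN = 64_600  # about 4 s at 16 kHz


def pad_fixed_sketch(audio, target_len=TARGET_LEN):
    """Return exactly target_len samples: first-N crop, or tile-repeat if shorter."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    if audio.size == 0:
        raise ValueError("cannot window an empty signal")  # assumed behaviour
    if audio.size >= target_len:
        return audio[:target_len]            # deterministic crop, no random offset
    repeats = -(-target_len // audio.size)   # ceiling division
    return np.tile(audio, repeats)[:target_len]


short_clip = np.arange(48_000)   # 3 s clip: wraps around to its start after 48,000 samples
long_clip = np.arange(100_000)   # longer clip: only the first 64,600 samples are kept
print(pad_fixed_sketch(short_clip).shape, pad_fixed_sketch(long_clip).shape)
```

Because the crop always starts at sample 0, the same file always yields the same window, which is what makes the benchmark scores deterministic.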

Citation

@article{Nes2Net,
  author={Liu, Tianchi and Truong, Duc-Tuan and Das, Rohan Kumar and Lee, Kong Aik and Li, Haizhou},
  journal={IEEE Transactions on Information Forensics and Security},
  title={Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing},
  year={2025},
  volume={20},
  pages={12005--12018},
  doi={10.1109/TIFS.2025.3626963}
}

License

MIT — see the source repository.

Maintainer

Maintained by Kirill Borodin (SpeechAntiSpoofingBenchmarks).

Email: kborodin.research@gmail.com
Telegram: @korallll_ai

