	---
	license: mit
	tags:
	- audio
	- anti-spoofing
	- audio-deepfake-detection
	- speech
	- asvspoof
	- wav2vec2
	- nes2net
	---

	# Nes2Net

	[![EER% 0.13 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.13%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![EER% 6.14 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-6.14%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![EER% 3.61 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-3.61%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![EER% 8.48 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-8.48%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![EER% 20.55 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-20.55%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/nes2net/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)
	[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/nes2net/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net)

	A wav2vec 2.0 (XLS-R 300M) + Nes2Net-X anti-spoofing model, from
	*"Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech
	Anti-Spoofing"* (Liu, Truong, Das, Lee & Li, IEEE T-IFS 2025). A self-supervised
	XLS-R front-end is fine-tuned end-to-end with a nested Res2Net back-end that
	operates directly on the foundation-model features, with no
	dimensionality-reducing neck, and adds only ~0.51 M back-end parameters. The
	model takes a raw speech waveform and returns a single score; a higher score
	means the input is more likely bona fide.

	- Code: https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW
	- Paper: https://arxiv.org/abs/2504.05657 (DOI 10.1109/TIFS.2025.3626963)
	- Parameters: 317,902,600 (317.90 M total; Nes2Net-X back-end only 0.51 M)
	- Checkpoint: [`nes2net_x_DF1.65.pth`](./nes2net_x_DF1.65.pth) (single Nes2Net-X)

	The exact wrapper used to produce the Arena scores is in
	[`nes2net.py`](./nes2net.py); the network definition is in [`_net.py`](./_net.py).

	## Architecture

	1. wav2vec 2.0 XLS-R (300M) front-end — a self-supervised transformer
	(`fairseq` `Wav2Vec2Model`) producing 1024-d frame features, fine-tuned
	end-to-end with the rest of the network.
	2. Nes2Net-X back-end — a nested Res2Net TDNN: outer Res2Net groups, each an
	inner Res2Net (`Bottle2neck`) with squeeze-and-excitation and a learnable
	weighted multi-scale sum, applied directly to the 1024-d XLS-R features
	(`Nes_ratio=[8,8]`, `SE_ratio=[1]`), then mean temporal pooling and a linear
	classifier.
	3. Score readout: the classifier emits two logits, and the logit at index 1
	(bona fide) is used as the score.
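
The final stage of step 2 and the readout in step 3 can be illustrated with a small NumPy sketch. It only shows mean temporal pooling followed by a linear classifier. The weights are random, the frame count and the 1024-d width at the pooling stage are assumptions for illustration, and the names `pool_and_classify`, `W`, and `b` are hypothetical rather than taken from `_net.py`.

```python
import numpy as np

def pool_and_classify(features, W, b):
    """Mean-pool frame features over time, then apply a linear layer.

    features: (batch, frames, dim) frame-level features
    W:        (dim, 2) weights, b: (2,) bias  (hypothetical names)
    Returns (batch, 2) logits.
    """
    pooled = features.mean(axis=1)   # (batch, dim): average over the time axis
    return pooled @ W + b            # (batch, 2): two-class logits

rng = np.random.default_rng(0)
# Random stand-in for the back-end output: 3 clips, 201 frames, 1024 dims (assumed).
features = rng.standard_normal((3, 201, 1024)).astype(np.float32)
W = (rng.standard_normal((1024, 2)) * 0.01).astype(np.float32)  # untrained weights
b = np.zeros(2, dtype=np.float32)

logits = pool_and_classify(features, W, b)
scores = logits[:, 1]   # index 1 = bona fide; higher = more bona fide
print(logits.shape, scores.shape)
```

Because pooling averages over frames, the classifier sees one fixed-size vector per clip regardless of how many frames the front-end produced.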

	## How it was trained

	- Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
	- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4 s).
	- Output: 2-class logits; the bona-fide logit (index 1) is the score.

	See the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW) for
	the full training and evaluation code.

	## Benchmark result (Speech Anti-Spoofing Arena)

	Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=nes2net).
	Scores were computed with a deterministic first-64,600-sample window (no random
	crop), so the numbers are exactly reproducible from the pinned score file.

	| Dataset | Split | EER % | Trials | Skipped | Notes |
	|---|---|---|---|---|---|
	| ASVspoof2019_LA | test | 0.13 | 71,237 | 0 | in-domain (training data) |
	| ASVspoof2021_DF | test | 3.61 | 611,829 | 0 | cross-dataset generalization |
	| ASVspoof2021_LA | test | 6.14 | 181,566 | 0 | cross-dataset generalization |
	| InTheWild | test | 8.48 | 31,779 | 0 | out-of-domain (real-world deepfakes) |
	| CD-ADD | test | 20.55 | 20,786 | 0 | out-of-domain (modern neural-TTS) |

	Despite a back-end ~30× smaller than typical SSL countermeasures, Nes2Net-X
	generalizes strongly to unseen attacks: it beats a wav2vec 2.0 + AASIST baseline
	on every dataset in this benchmark, most strikingly on CD-ADD and
	ASVspoof2021 DF, neither of which was used for training.
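
For readers unfamiliar with the metric in the table, the equal error rate (EER) is the error rate at the threshold where the false acceptance rate (spoof accepted as bona fide) equals the false rejection rate (bona fide rejected). The following is a minimal sketch of one common way to compute it from scores where higher means more bona fide. It is not necessarily the Arena's implementation, which may for example interpolate between thresholds. The function name `compute_eer` and the toy scores are made up for illustration.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate for scores where higher = more bona fide.

    Uses every observed score as a candidate threshold and returns the error
    rate where false acceptance and false rejection are closest.
    """
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: fraction of bona fide trials scored below the threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # False acceptance rate: fraction of spoof trials scored at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)

# Toy scores: one spoof trial (1.0) outranks one bona fide trial (0.5).
eer = compute_eer([2.0, 1.5, 0.5, 3.0], [-1.0, 0.0, 1.0, -2.0])
print(f"EER = {100 * eer:.2f}%")   # EER = 25.00%
```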

	## Usage

	The checkpoint is a `state_dict` for the `Model` network defined in
	[`_net.py`](./_net.py). Constructing the network requires the base XLS-R 300M
	checkpoint `xlsr2_300m.pt` next to the wrapper (only used to build the
	wav2vec 2.0 architecture; every weight is then overwritten by the fine-tuned
	checkpoint):

	```bash
	wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt
	```
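
Since loading needs both files, a quick check before calling `load()` can save a confusing failure. The sketch below assumes both `nes2net_x_DF1.65.pth` and `xlsr2_300m.pt` are expected in the same directory as the wrapper. The helper `missing_checkpoints` is hypothetical and not part of this repo.

```python
from pathlib import Path

# File names as given in this model card; their location is an assumption.
REQUIRED_FILES = ("nes2net_x_DF1.65.pth", "xlsr2_300m.pt")

def missing_checkpoints(directory="."):
    """Return the required checkpoint files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

missing = missing_checkpoints(".")
if missing:
    print("Missing before load():", ", ".join(missing))
else:
    print("Both checkpoint files are present.")
```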

	The input is windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed`
	(first 64,600 samples, tile-repeat if shorter).
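
The windowing rule described above can be sketched in a few lines of NumPy. This is an illustration of "first 64,600 samples, tile-repeat if shorter", not the wrapper's own code. The name `pad_fixed_sketch` is hypothetical, and the real `pad_fixed` may differ in details such as how it handles empty input.

```python
import numpy as np

TARGET_LEN = 64_600  # samples, about 4.04 s at 16 kHz

def pad_fixed_sketch(audio, target_len=TARGET_LEN):
    """Return exactly target_len samples: truncate long inputs, tile short ones."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.size == 0:
        raise ValueError("cannot window an empty signal")
    if audio.size >= target_len:
        return audio[:target_len]            # deterministic: keep the first samples
    repeats = -(-target_len // audio.size)   # ceiling division
    return np.tile(audio, repeats)[:target_len]

short = np.arange(48_000, dtype=np.float32)  # 3 s at 16 kHz
windowed = pad_fixed_sketch(short)
print(windowed.shape)                 # (64600,)
print(windowed[48_000] == short[0])   # True: the clip wraps around to its start
```

Taking the first samples instead of a random crop is what keeps repeated runs on the same file identical.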

	```python
	import numpy as np
	from nes2net import Nes2Net # _net.py + nes2net.py are in this repo

	m = Nes2Net()
	m.load() # loads nes2net_x_DF1.65.pth (+ xlsr2_300m.pt)
	audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
	print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
	m.unload()
	```

	Internally the wrapper windows the input, runs the network, and returns
	`logits[:, 1]` (class 1 = bona fide). [`nes2net.py`](./nes2net.py) is the exact
	`speech_spoof_bench` model that produced the Arena `scores.txt`.

	## Citation

	```bibtex
	@article{Nes2Net,
	author={Liu, Tianchi and Truong, Duc-Tuan and Das, Rohan Kumar and Lee, Kong Aik and Li, Haizhou},
	journal={IEEE Transactions on Information Forensics and Security},
	title={Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-Spoofing},
	year={2025},
	volume={20},
	pages={12005--12018},
	doi={10.1109/TIFS.2025.3626963}
	}
	```

	## License

	MIT — see the [source repository](https://github.com/Liu-Tianchi/Nes2Net_ASVspoof_ITW).