	---
	license: mit
	tags:
	- audio
	- anti-spoofing
	- audio-deepfake-detection
	- speech
	- asvspoof
	---

	# RawTFNet

	[![EER% 1.99 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.99%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![EER% 8.03 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-8.03%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![EER% 15.16 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-15.16%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![EER% 38.51 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-38.51%25-orange)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![EER% 52.85 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-52.85%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rawtfnet/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)
	[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rawtfnet/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet)

	A lightweight raw-waveform CNN for audio anti-spoofing (voice-deepfake detection),
	proposed in "RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing"
	(Xiao, Dang & Das, 2025). The model takes a raw speech waveform and returns a score;
	higher values mean the input is more likely bona fide.

	- Code: https://github.com/swagshaw/RawTFNet-Pytorch
	- Paper: https://arxiv.org/abs/2507.08227
	- Parameters: 177,540 (0.178 M)
	- Checkpoint: [`Best_RawTFNet_32.pth`](./Best_RawTFNet_32.pth)

	This repo is self-contained for inference: the network definition is in
	[`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in
	[`rawtfnet.py`](./rawtfnet.py).

	## Architecture

	RawTFNet operates directly on the raw waveform:

	1. Sinc-convolution front-end (`SincConv`, AASIST-style): fixed band-pass
	filters that turn the waveform into a time–frequency representation, followed by a
	ResNet-style block and three depthwise-separable Res2Net-SE blocks
	(`DWS_Frontend_SE`).
	2. TF-SepNet classifier (`TfSepNet`, depth=10, width=32): stacked
	time–frequency separable convolution blocks with channel shuffle and adaptive
	residual normalization, ending in a 1×1 convolution to 2 classes whose output is
	pooled over time and frequency.
	3. Scoring: of the two output logits, the one at index 1 (bona fide) is used as
	the score.
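
To make the "fixed band-pass filter" idea in step 1 concrete, here is a minimal NumPy sketch of a single windowed-sinc band-pass filter applied to a waveform. It is not the repository's `SincConv`: the function names `sinc_bandpass` and `gain_at`, the 129-tap kernel, and the 1–2 kHz band are assumptions chosen for illustration, and the real front-end uses a bank of such filters whose band edges may be spaced differently.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass FIR filter (illustrative, not the repo's SincConv)."""
    # Symmetric tap positions centred on zero, e.g. -64..64 for 129 taps.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cut-off frequencies normalised to cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # The difference of two ideal low-pass filters is an ideal band-pass filter.
    # np.sinc(x) is the normalised sinc, sin(pi x) / (pi x).
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window reduces the ripple caused by truncating the ideal response.
    return h * np.hamming(kernel_size)

def gain_at(h, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at a single frequency."""
    n = np.arange(len(h))
    return np.abs(np.sum(h * np.exp(-2j * np.pi * freq_hz / sample_rate * n)))

h = sinc_bandpass(1000.0, 2000.0)
x = np.random.default_rng(0).standard_normal(64600)  # stand-in waveform
y = np.convolve(x, h, mode="same")                   # one band of a sinc front-end
print(y.shape)  # (64600,)
```

Stacking the outputs of many such filters, one per band, gives the 2-D time–frequency map that the later convolution blocks operate on.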

	## How it was trained

	- Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
	- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
	- Output: 2-class logits; the bona-fide logit (index 1) is the score.

	See the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch) for the
	full training and evaluation code.

	## Benchmark result (Speech Anti-Spoofing Arena)

	Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet).
	Scores were computed with a deterministic first-64,600-sample window (no random
	crop), so the numbers are exactly reproducible from the pinned score file.

	| Dataset | Split | EER % | Trials | Skipped | Notes |
	|---|---|---|---|---|---|
	| ASVspoof2019_LA | test | 1.99 | 71,237 | 0 | in-domain (training data) |
	| ASVspoof2021_LA | test | 8.03 | 181,566 | 0 | cross-dataset generalization |
	| ASVspoof2021_DF | test | 15.16 | 611,829 | 0 | cross-dataset generalization |
	| InTheWild | test | 38.51 | 31,779 | 0 | out-of-domain (real-world deepfakes) |
	| CD-ADD | test | 52.85 | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |

	The model was trained only on ASVspoof2019 LA, so the in-domain EER is low (1.99 %),
	while the cross-dataset and out-of-domain sets measure generalization to newer,
	unseen attacks. On CD-ADD the EER of 52.85 % is at chance level (50 %), meaning the
	scores do not separate bona fide from spoofed speech on that set. RawTFNet
	generalizes notably better than the reference TCN/capsule models on
	ASVspoof2021_LA, ASVspoof2021_DF, and InTheWild.
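
For readers unfamiliar with the metric, the sketch below shows one common way to compute an equal error rate from per-trial scores where higher means more bona fide. It is an illustration only: the Arena's own EER routine is not reproduced in this README and may interpolate differently, and the function name `compute_eer` and the toy scores are made up for the example.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate as a fraction, for scores where higher = more bona fide."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=float))
    spoof = np.sort(np.asarray(spoof_scores, dtype=float))
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: share of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / len(bonafide)
    # False acceptance rate: share of spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / len(spoof)
    # The EER is where the two error curves cross; take the closest sampled point.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Toy scores (hypothetical): one bona fide trial scores low, one spoof scores high.
toy_bonafide = [0.9, 0.8, 0.7, 0.2]
toy_spoof = [0.1, 0.3, 0.4, 0.75]
eer = compute_eer(toy_bonafide, toy_spoof)
print(f"{100 * eer:.2f} %")  # prints 25.00 %
```

An EER of 0 % means the two score distributions are perfectly separated, and values near 50 % mean the scores carry no usable information.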

	## Usage

	The checkpoint is a `state_dict` for the `RawTFNet` network defined in
	[`_net.py`](./_net.py). The input must be exactly 64,600 samples at 16 kHz mono —
	window the waveform with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).

	```python
	import numpy as np
	from rawtfnet import RawTFNetModel  # _net.py and rawtfnet.py are in this repo

	m = RawTFNetModel()
	m.load()  # loads Best_RawTFNet_32.pth
	# Placeholder input: 3 s of random noise (float32, mono, 16 kHz), so the score is not meaningful.
	audio = np.random.randn(48000).astype(np.float32)
	print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
	m.unload()
	```

	Internally the wrapper windows the input, runs the network, and returns
	`logits[:, 1]` (class 1 = bona fide). [`rawtfnet.py`](./rawtfnet.py) is the exact
	`speech_spoof_bench` model that produced the Arena `scores.txt`.
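
The windowing and score-selection steps described above can be sketched in plain NumPy. This is a re-implementation under stated assumptions, not the code in `rawtfnet.py`: the names `pad_fixed_sketch` and `bonafide_scores` are hypothetical, the input is assumed to be a 1-D waveform, and the logits are fabricated to show the indexing.

```python
import numpy as np

TARGET_LEN = 64_600  # samples; 64,600 / 16,000 Hz = 4.0375 s

def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Return exactly target_len samples: the first window, tile-repeated if short."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if len(wave) >= target_len:
        return wave[:target_len]            # deterministic: always the first window
    repeats = -(-target_len // len(wave))   # ceiling division
    return np.tile(wave, repeats)[:target_len]

def bonafide_scores(logits):
    """Select the bona fide column (index 1) from a (batch, 2) array of logits."""
    return np.asarray(logits)[:, 1]

short = np.arange(48_000, dtype=np.float32)  # 3 s at 16 kHz
fixed = pad_fixed_sketch(short)
print(fixed.shape)                 # (64600,)
print(fixed[48_000] == short[0])   # True: the clip wraps around to its start

fake_logits = np.array([[2.0, -1.0], [-0.5, 3.0]])  # made-up network output
print(bonafide_scores(fake_logits).tolist())        # [-1.0, 3.0]
```

Because the window always starts at sample 0, scoring the same file twice yields the same input tensor, which is what makes the published scores reproducible.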

	## Citation

	This model / paper:

	```bibtex
	@article{xiao2025rawtfnet,
	title={RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing},
	author={Xiao, Yang and Dang, Ting and Das, Rohan Kumar},
	journal={arXiv preprint arXiv:2507.08227},
	year={2025}
	}
	```

	Training dataset — ASVspoof 2019:

	```bibtex
	@article{wang2020asvspoof,
	title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
	author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
	journal={Computer Speech \& Language},
	volume={64},
	pages={101114},
	year={2020},
	publisher={Elsevier}
	}
	```

	## License

	MIT — see the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch).