---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---
RawTFNet
A lightweight raw-waveform CNN for audio anti-spoofing (voice-deepfake detection), proposed in "RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing" (Xiao, Dang & Das, 2025). The model takes a raw speech waveform and returns a score where higher = more bona fide.
- Code: https://github.com/swagshaw/RawTFNet-Pytorch
- Paper: https://arxiv.org/abs/2507.08227
- Parameters: 177,540 (0.178 M)
- Checkpoint: `Best_RawTFNet_32.pth`
This repo is self-contained for inference: the network definition is in `_net.py`, and the exact wrapper used to produce the Arena scores is in `rawtfnet.py`.
Architecture
RawTFNet operates directly on the raw waveform:
- Sinc-convolution front-end (`SincConv`, AASIST-style): fixed band-pass filters that turn the waveform into a time–frequency representation, followed by a ResNet-style block and three depthwise-separable Res2Net-SE blocks (`DWS_Frontend_SE`).
- Tf-SepNet classifier (`TfSepNet`, depth=10, width=32): stacked time–frequency separable convolution blocks with channel shuffle and adaptive residual normalization, ending in a 1×1 conv to 2 classes pooled over time and frequency.
- The 2-logit output is read at index 1 = bona fide.
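To make the "fixed band-pass filters" idea concrete, here is a minimal NumPy sketch of a windowed-sinc filter bank. It is not the code in `_net.py`: the mel-spaced band edges, the Hamming window, the filter count (8), and the kernel length (129) are all illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def sinc_filter_bank(n_filters=8, kernel_size=129, sample_rate=16000):
    """Build fixed windowed-sinc band-pass kernels (assumed mel-spaced edges)."""
    # n_filters + 1 band edges, evenly spaced on the mel scale from 0 Hz to Nyquist.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 1))
    n = np.arange(kernel_size) - (kernel_size - 1) / 2  # symmetric sample index
    window = np.hamming(kernel_size)
    bank = np.empty((n_filters, kernel_size))
    for i in range(n_filters):
        # Band edges in cycles per sample.
        lo, hi = edges[i] / sample_rate, edges[i + 1] / sample_rate
        # Band-pass = low-pass(hi) - low-pass(lo); np.sinc(x) = sin(pi x) / (pi x).
        ideal = 2 * hi * np.sinc(2 * hi * n) - 2 * lo * np.sinc(2 * lo * n)
        bank[i] = ideal * window
    return edges, bank

edges, bank = sinc_filter_bank()

# Filtering is a 1-D convolution of the waveform with each kernel.
wave = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
features = np.stack([np.convolve(wave, h, mode="valid") for h in bank])
print(features.shape)  # (n_filters, len(wave) - kernel_size + 1)
```

Because the kernels are computed from band edges rather than learned, this stage adds no trainable parameters, which is consistent with the small parameter count reported above.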
How it was trained
- Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
- Output: 2-class logits; the bona-fide logit (index 1) is the score.
See the source repository for the full training and evaluation code.
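The output convention above (the raw logit at index 1 is the score) is worth keeping separate from a softmax probability. The sketch below uses made-up logits and assumes index 0 is the spoof class. It shows that the two quantities can rank trials differently, so thresholds and EER values computed from one are not interchangeable with the other.

```python
import numpy as np

# Made-up logits for two trials; columns assumed to be [spoof, bona fide].
logits = np.array([[2.0, 1.0],
                   [-3.0, 0.5]])

raw_score = logits[:, 1]  # the score convention described above

# Softmax probability of class 1, equal to sigmoid(logit_1 - logit_0).
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
prob_bonafide = np.exp(shifted[:, 1]) / np.exp(shifted).sum(axis=1)

print(raw_score)      # trial 0 has the higher raw logit
print(prob_bonafide)  # trial 1 has the higher softmax probability
```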
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | 1.99 | 71,237 | 0 | in-domain (training data) |
| ASVspoof2021_LA | test | 8.03 | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | 15.16 | 611,829 | 0 | cross-dataset generalization |
| InTheWild | test | 38.51 | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | 52.85 | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
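The model is trained only on ASVspoof2019 LA, so the in-domain EER is low (1.99 %), while the cross-dataset and out-of-domain sets measure generalization to newer, unseen attacks. EER rises to 8.03 % on ASVspoof2021_LA, 15.16 % on ASVspoof2021_DF, and 38.51 % on InTheWild; on CD-ADD it reaches 52.85 %, which is no better than chance (50 %). RawTFNet generalizes notably better than the reference TCN/capsule models on ASVspoof2021_LA, ASVspoof2021_DF, and InTheWild.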
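For readers unfamiliar with the metric in the table, the following is a minimal sketch of an equal error rate (EER) computation under the "higher = more bona fide" convention. It is not the Arena's implementation, which may interpolate between thresholds or handle ties differently; `compute_eer` and the toy scores are hypothetical.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bona, spoof]))  # sorted, no duplicates
    # A trial is accepted as bona fide when score >= threshold.
    # FRR: fraction of bona fide trials with score < threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # FAR: fraction of spoof trials with score >= threshold.
    far = (spoof.size - np.searchsorted(spoof, thresholds, side="left")) / spoof.size
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, thresholds[i]

# Toy scores: one bona fide trial (0.4) scores below one spoof trial (0.5).
eer, thr = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.5, 0.3, 0.2])
print(f"{100 * eer:.2f} %")  # 25.00 %
```

An EER near 50 % means the bona fide and spoof score distributions overlap almost completely, which is how the CD-ADD row should be read.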
Usage
The checkpoint is a `state_dict` for the RawTFNet network defined in `_net.py`. The input must be exactly 64,600 samples of 16 kHz mono audio: window the waveform with `pad_fixed` (take the first 64,600 samples, or tile-repeat the waveform if it is shorter).
```python
import numpy as np
from rawtfnet import RawTFNetModel  # _net.py and rawtfnet.py are in this repo

m = RawTFNetModel()
m.load()  # loads Best_RawTFNet_32.pth

# 3 s of noise as a stand-in for real speech: float32, mono, 16 kHz.
# It is shorter than 64,600 samples, so the wrapper tile-repeats it.
audio = np.random.randn(48000).astype(np.float32)

print(m.score_batch([audio], [16000])[0])  # higher = more bona fide
m.unload()
```
Internally the wrapper windows the input, runs the network, and returns `logits[:, 1]` (class 1 = bona fide). `rawtfnet.py` is the exact speech_spoof_bench model that produced the Arena scores.txt.
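The windowing rule described above (first 64,600 samples, tile-repeat if shorter) can be sketched as follows. This is an illustration written from that description, not the `pad_fixed` in `rawtfnet.py`; the name `pad_fixed_sketch` and the empty-input check are assumptions.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, i.e. 64600 / 16000 = 4.0375 s

def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Deterministically crop or tile a mono waveform to target_len samples."""
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]  # keep the first window, no random crop
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]

short = pad_fixed_sketch(np.arange(5), target_len=12)
print(short)  # [0. 1. 2. 3. 4. 0. 1. 2. 3. 4. 0. 1.]

clip = np.random.default_rng(0).standard_normal(48000)
print(pad_fixed_sketch(clip).shape)  # (64600,)
```

Because the window always starts at sample 0, the same file yields the same input tensor on every run, which is what makes the scores reproducible.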
Citation
This model / paper:
```bibtex
@article{xiao2025rawtfnet,
  title={RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing},
  author={Xiao, Yang and Dang, Ting and Das, Rohan Kumar},
  journal={arXiv preprint arXiv:2507.08227},
  year={2025}
}
```
Training dataset — ASVspoof 2019:
```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```
License
MIT — see the source repository.