metadata
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof

RawTFNet


A lightweight raw-waveform CNN for audio anti-spoofing (voice-deepfake detection), proposed in "RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing" (Xiao, Dang & Das, 2025). The model takes a raw speech waveform and returns a score where higher = more bona fide.

This repo is self-contained for inference: the network definition is in _net.py, and the exact wrapper used to produce the Arena scores is in rawtfnet.py.

Architecture

RawTFNet operates directly on the raw waveform:

  1. Sinc-convolution front-end (SincConv, AASIST-style) — fixed band-pass filters that turn the waveform into a time–frequency representation, followed by a ResNet-style block and three depthwise-separable Res2Net-SE blocks (DWS_Frontend_SE).
  2. Tf-SepNet classifier (TfSepNet, depth=10, width=32) — stacked time–frequency separable convolution blocks with channel shuffle and adaptive residual normalization, ending in a 1×1 conv to 2 classes pooled over time and frequency.
  3. The 2-logit output is read at index 1 = bona fide.
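To make the first step concrete, here is a minimal NumPy sketch of how fixed windowed-sinc band-pass filters can turn a waveform into a time–frequency map. It is an illustration only, not the SincConv layer from _net.py: the function names, the 129-tap kernel, the Hamming window, and the three band edges are all assumptions chosen for the example.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel (illustrative, not the model's filters)."""
    # Tap positions centred on zero; an odd kernel_size keeps them integer-valued.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2

    def lowpass(cutoff_hz):
        # Ideal low-pass impulse response for a cutoff given in Hz.
        fc = cutoff_hz / sample_rate
        return 2 * fc * np.sinc(2 * fc * n)

    # A band-pass is the difference of two low-pass filters, tapered by a window.
    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(kernel_size)

def filterbank(wave, band_edges_hz, sample_rate=16000):
    """Stack one filtered copy of the waveform per band: shape (bands, time)."""
    rows = [
        np.convolve(wave, sinc_bandpass(lo, hi, sample_rate=sample_rate), mode="same")
        for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])
    ]
    return np.stack(rows)

# A 1 kHz tone should put most of its energy in the 500-2000 Hz band (row 1).
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 1000 * t)
tf_map = filterbank(tone, [0, 500, 2000, 8000])
band_energy = (tf_map ** 2).sum(axis=1)
```

Because the filters are fixed, this stage has no trainable weights in the sketch; the learned capacity sits in the blocks that follow it.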

How it was trained

  • Data: ASVspoof 2019 Logical Access (LA), with RawBoost data augmentation.
  • Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  • Output: 2-class logits; the bona-fide logit (index 1) is the score.

See the source repository for the full training and evaluation code.

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | 1.99 | 71,237 | 0 | in-domain (training data) |
| ASVspoof2021_LA | test | 8.03 | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | 15.16 | 611,829 | 0 | cross-dataset generalization |
| InTheWild | test | 38.51 | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | 52.85 | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |

The model was trained only on ASVspoof2019 LA, so the in-domain EER is low (1.99 %), while the cross-dataset and out-of-domain sets measure generalization to newer, unseen attacks. On CD-ADD the EER of 52.85 % is at chance level (50 %), so the scores do not separate bona fide from spoofed speech on that set. RawTFNet generalizes notably better than the reference TCN/capsule models on ASVspoof2021_LA, ASVspoof2021_DF, and InTheWild.
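For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate can be computed from scores that follow this model's convention (higher = more bona fide). It is not the Arena's scoring code: `compute_eer` is a hypothetical helper, it ignores tied scores, and it picks the nearest operating point rather than interpolating, so it may differ slightly from the official numbers.

```python
import numpy as np

def compute_eer(scores, labels):
    """EER in percent. labels: 1 = bona fide, 0 = spoof (assumed convention)."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)

    # Sort trials by score; a threshold after position k rejects the k lowest.
    labels_sorted = labels[np.argsort(scores)]
    n_bona = labels.sum()
    n_spoof = (~labels).sum()

    # False rejection rate: share of bona fide trials below the threshold.
    frr = np.concatenate([[0.0], np.cumsum(labels_sorted) / n_bona])
    # False acceptance rate: share of spoof trials at or above the threshold.
    far = np.concatenate([[1.0], 1.0 - np.cumsum(~labels_sorted) / n_spoof])

    # The EER is read at the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(frr - far))
    return 100.0 * (frr[idx] + far[idx]) / 2.0

# Perfectly separated toy scores give an EER of 0 %.
eer_clean = compute_eer([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

An EER near 50 % means the two error rates only balance at a threshold that misclassifies about half of each class, which is what random scores would produce.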

Usage

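The checkpoint is a state_dict for the RawTFNet network defined in _net.py. The network itself expects exactly 64,600 samples of 16 kHz mono audio. The wrapper in rawtfnet.py windows the input internally, so the example below can pass a clip of any length. If you call the network directly, window the waveform with pad_fixed first (first 64,600 samples, tile-repeated if shorter).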

import numpy as np
from rawtfnet import RawTFNetModel   # _net.py + rawtfnet.py are in this repo

m = RawTFNetModel()
m.load()                                          # loads Best_RawTFNet_32.pth
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally the wrapper windows the input, runs the network, and returns logits[:, 1] (class 1 = bona fide). rawtfnet.py is the exact speech_spoof_bench model that produced the Arena scores.txt.
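The windowing and score read-out described above can be sketched in plain NumPy as follows. This is an assumption-based illustration, not the repo's pad_fixed or wrapper code: `window_fixed` and `bona_fide_scores` are hypothetical names, and the exact behavior of pad_fixed (for example on empty input) is not documented here.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, about 4.04 s

def window_fixed(wave, target_len=TARGET_LEN):
    """Keep the first target_len samples; tile-repeat shorter clips to fill it."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]          # deterministic crop, no random offset
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]

def bona_fide_scores(logits):
    """Read the bona-fide score (class index 1) from (batch, 2) logits."""
    return np.asarray(logits)[:, 1]

# A 3 s clip (48,000 samples) is repeated from its start to reach 64,600 samples.
short_clip = np.arange(48_000, dtype=np.float32)
windowed = window_fixed(short_clip)
scores = bona_fide_scores([[0.2, 1.5], [3.0, -1.0]])
```

Taking the first window rather than a random crop is what makes repeated runs produce identical scores for the same file.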

Citation

This model / paper:

@article{xiao2025rawtfnet,
  title={RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing},
  author={Xiao, Yang and Dang, Ting and Das, Rohan Kumar},
  journal={arXiv preprint arXiv:2507.08227},
  year={2025}
}

Training dataset — ASVspoof 2019:

@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}

License

MIT — see the source repository.