---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---
# RawTFNet
A lightweight raw-waveform CNN for audio anti-spoofing (voice-deepfake detection),
proposed in *"RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing"*
(Xiao, Dang & Das, 2025). The model takes a raw speech waveform and returns a score
where **higher = more bona fide**.
- **Code:** https://github.com/swagshaw/RawTFNet-Pytorch
- **Paper:** https://arxiv.org/abs/2507.08227
- **Parameters:** 177,540 (0.178 M)
- **Checkpoint:** [`Best_RawTFNet_32.pth`](./Best_RawTFNet_32.pth)
This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in
[`rawtfnet.py`](./rawtfnet.py).
## Architecture
RawTFNet operates directly on the raw waveform:
1. **Sinc-convolution front-end** (`SincConv`, AASIST-style) — fixed band-pass
filters that turn the waveform into a time–frequency representation, followed by a
ResNet-style block and three **depthwise-separable Res2Net-SE** blocks
(`DWS_Frontend_SE`).
2. **TF-SepNet classifier** (`TfSepNet`, depth=10, width=32) — stacked
**time–frequency separable** convolution blocks with channel shuffle and adaptive
residual normalization, ending in a 1×1 conv to 2 classes pooled over time and
frequency.
3. The 2-logit output is read at **index 1 = bona fide**.
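To make the first stage concrete, here is a minimal NumPy sketch of a fixed windowed-sinc band-pass filter, the general idea behind a sinc-convolution front-end. It is an illustration only: the function names, the 129-tap kernel, the Hamming window, and the band edges are assumptions chosen for the example, not values read from `_net.py`.

```python
import numpy as np

def sinc_bandpass(low_hz, high_hz, kernel_size=129, sample_rate=16_000):
    """Windowed-sinc band-pass FIR filter (illustrative, not taken from _net.py)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the filter is symmetric")
    # Tap indices centred on zero: -(K-1)/2 ... (K-1)/2
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cut-off frequencies in cycles per sample
    f_low = low_hz / sample_rate
    f_high = high_hz / sample_rate
    # Ideal low-pass with cut-off f is 2f * sinc(2f n); np.sinc(x) = sin(pi x) / (pi x)
    lowpass_high = 2 * f_high * np.sinc(2 * f_high * n)
    lowpass_low = 2 * f_low * np.sinc(2 * f_low * n)
    # Band-pass = difference of two low-pass filters, tapered by a Hamming window
    return (lowpass_high - lowpass_low) * np.hamming(kernel_size)

def apply_filters(waveform, bands, **kwargs):
    """Filter a 1-D waveform with each (low_hz, high_hz) band -> (n_bands, n_samples)."""
    rows = [
        np.convolve(waveform, sinc_bandpass(lo, hi, **kwargs), mode="same")
        for lo, hi in bands
    ]
    return np.stack(rows)
```

Calling `apply_filters(waveform, [(0, 1000), (1000, 2000)])` on a one-second clip returns a `(2, 16000)` array, one row per band; stacking many such bands is what gives the later 2-D blocks a time-frequency map to work on.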
## How it was trained
- **Data:** ASVspoof 2019 **Logical Access (LA)**, with RawBoost data augmentation.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.
See the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch) for the
full training and evaluation code.
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.99** | 71,237 | 0 | in-domain (same corpus as training) |
| ASVspoof2021_LA | test | **8.03** | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | **15.16** | 611,829 | 0 | cross-dataset generalization |
| InTheWild | test | **38.51** | 31,779 | 0 | out-of-domain (real-world deepfakes) |
| CD-ADD | test | **52.85** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
The model was trained only on ASVspoof 2019 LA, so the in-domain EER is low (1.99 %),
while the cross-dataset and out-of-domain sets measure generalization to newer, unseen
attacks. An EER near 50 % is chance level, so the CD-ADD result (52.85 %) means the
scores carry no usable signal on that set. RawTFNet generalizes notably better than
the reference TCN/capsule models on ASVspoof2021_LA, ASVspoof2021_DF, and InTheWild.
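For readers unfamiliar with the metric in the table, the equal error rate (EER) is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is one simple way to estimate it from two score arrays with "higher = more bona fide". The function name `compute_eer` is hypothetical, and the Arena's own implementation may differ, for example in how it interpolates between thresholds.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER (as a fraction in [0, 1]) from two sets of scores."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    # Candidate thresholds: every observed score
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection: bona fide trials scoring strictly below the threshold
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance: spoof trials scoring at or above the threshold
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # Take the threshold where the two error rates are closest and average them
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)
```

Multiply the result by 100 to compare with the percentages in the table. Perfectly separated scores give 0.0, and indistinguishable score distributions give about 0.5.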
## Usage
The checkpoint is a `state_dict` for the `RawTFNet` network defined in
[`_net.py`](./_net.py). The input **must** be exactly 64,600 samples at 16 kHz mono —
window the waveform with `pad_fixed` (first 64,600 samples, tile-repeat if shorter).
```python
import numpy as np
from rawtfnet import RawTFNetModel # _net.py + rawtfnet.py are in this repo
m = RawTFNetModel()
m.load() # loads Best_RawTFNet_32.pth
audio = np.random.randn(48000).astype(np.float32) # placeholder: 3 s of float32 mono noise at 16 kHz
print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
m.unload()
```
Internally the wrapper windows the input, runs the network, and returns
`logits[:, 1]` (class 1 = bona fide). [`rawtfnet.py`](./rawtfnet.py) is the exact
`speech_spoof_bench` model that produced the Arena `scores.txt`.
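The windowing rule stated in this section (keep the first 64,600 samples, tile-repeat if shorter) can be sketched as follows. This is a minimal reimplementation under that description, not a copy of the repo's `pad_fixed`: the name `pad_fixed_sketch` and the empty-input check are assumptions added for the example.

```python
import numpy as np

TARGET_LEN = 64_600  # about 4.04 s at 16 kHz

def pad_fixed_sketch(waveform, target_len=TARGET_LEN):
    """Return exactly target_len samples: truncate long clips, tile short ones."""
    wave = np.asarray(waveform, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        # Deterministic crop: always the first target_len samples
        return wave[:target_len]
    # Repeat the clip enough times to cover target_len, then cut to length
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]
```

Because the crop always starts at sample 0, the same file yields the same window on every run, which is what makes the published scores reproducible.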
## Citation
**This model / paper:**
```bibtex
@article{xiao2025rawtfnet,
title={RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing},
author={Xiao, Yang and Dang, Ting and Das, Rohan Kumar},
journal={arXiv preprint arXiv:2507.08227},
year={2025}
}
```
**Training dataset — ASVspoof 2019:**
```bibtex
@article{wang2020asvspoof,
title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
journal={Computer Speech \& Language},
volume={64},
pages={101114},
year={2020},
publisher={Elsevier}
}
```
## License
MIT — see the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch).