| --- |
| license: mit |
| tags: |
| - audio |
| - anti-spoofing |
| - audio-deepfake-detection |
| - speech |
| - asvspoof |
| --- |
| |
| # RawTFNet |
|
|
|
|
| A lightweight raw-waveform CNN for audio anti-spoofing (voice-deepfake detection), |
| proposed in *"RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing"* |
| (Xiao, Dang & Das, 2025). The model takes a raw speech waveform and returns a score |
| where **higher = more bona fide**. |
|
|
| - **Code:** https://github.com/swagshaw/RawTFNet-Pytorch |
| - **Paper:** https://arxiv.org/abs/2507.08227 |
| - **Parameters:** 177,540 (0.178 M) |
| - **Checkpoint:** [`Best_RawTFNet_32.pth`](./Best_RawTFNet_32.pth) |
|
|
| This repo is self-contained for inference: the network definition is in |
| [`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in |
| [`rawtfnet.py`](./rawtfnet.py). |
|
|
| ## Architecture |
|
|
| RawTFNet operates directly on the raw waveform: |
|
|
| 1. **Sinc-convolution front-end** (`SincConv`, AASIST-style) — fixed band-pass |
| filters that turn the waveform into a time–frequency representation, followed by a |
| ResNet-style block and three **depthwise-separable Res2Net-SE** blocks |
| (`DWS_Frontend_SE`). |
| 2. **Tf-SepNet classifier** (`TfSepNet`, depth=10, width=32) — stacked |
| **time–frequency separable** convolution blocks with channel shuffle and adaptive |
| residual normalization, ending in a 1×1 convolution that maps to 2 classes, with the |
| result pooled over time and frequency. |
| 3. The 2-logit output is read at **index 1 = bona fide**. |
|
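As a rough illustration of the band-pass idea behind step 1, the sketch below builds a single windowed-sinc band-pass filter in NumPy. It is not the `SincConv` code from `_net.py`: the function name `sinc_bandpass`, the 129-tap kernel, the Hamming window, and the 1–2 kHz band are all assumptions chosen for the example.

```python
import numpy as np


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Hypothetical helper: one fixed band-pass FIR kernel built from sinc functions."""
    # Tap positions centred on zero so the kernel is symmetric (linear phase).
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cut-off frequencies in cycles per sample.
    lo = f_low_hz / sample_rate
    hi = f_high_hz / sample_rate
    # A band-pass is the difference of two ideal low-pass filters.
    # np.sinc(x) is sin(pi * x) / (pi * x).
    kernel = 2 * hi * np.sinc(2 * hi * n) - 2 * lo * np.sinc(2 * lo * n)
    # A Hamming window tames the ripple caused by truncating the ideal filter.
    return (kernel * np.hamming(kernel_size)).astype(np.float32)


# One illustrative band; a filter bank would stack one such kernel per band.
h = sinc_bandpass(1000.0, 2000.0)
response = np.abs(np.fft.rfft(h, n=16000))  # with n=16000, bin k corresponds to k Hz
print(response[1500], response[5000])  # inside the band vs. far outside it
```

Convolving the waveform with a bank of such kernels yields one output channel per frequency band, which is why the front-end output can be treated as a time–frequency map by the later blocks.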
|
| ## How it was trained |
|
|
| - **Data:** ASVspoof 2019 **Logical Access (LA)**, with RawBoost data augmentation. |
| - **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). |
| - **Output:** 2-class logits; the bona-fide logit (index 1) is the score. |
|
|
| See the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch) for the |
| full training and evaluation code. |
|
|
| ## Benchmark result (Speech Anti-Spoofing Arena) |
|
|
| Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rawtfnet). |
| Scores were computed with a **deterministic first-64,600-sample window** (no random |
| crop), so the numbers are exactly reproducible from the pinned score file. |
|
|
| | Dataset | Split | EER % | Trials | Skipped | Notes | |
| |---|---|---|---|---|---| |
| ASVspoof2019_LA | test | **1.99** | 71,237 | 0 | in-domain (same corpus as the training data) |
| | ASVspoof2021_LA | test | **8.03** | 181,566 | 0 | cross-dataset generalization | |
| | ASVspoof2021_DF | test | **15.16** | 611,829 | 0 | cross-dataset generalization | |
| | InTheWild | test | **38.51** | 31,779 | 0 | out-of-domain (real-world deepfakes) | |
| | CD-ADD | test | **52.85** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize | |
| |
| The model is trained only on ASVspoof2019 LA, so the in-domain EER is low (1.99 %), while |
| the cross-dataset and out-of-domain sets measure generalization to newer, unseen |
| attacks. On CD-ADD the EER of 52.85 % is at chance level (50 %), so the scores carry no |
| usable signal on that set. RawTFNet generalizes notably better than the reference |
| TCN/capsule models on ASVspoof2021_LA, ASVspoof2021_DF, and InTheWild. |
| |
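For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the fraction of bona fide trials rejected equals the fraction of spoof trials accepted. The sketch below is a minimal NumPy version under stated assumptions: the name `compute_eer` is hypothetical, ties between scores are not treated specially, and the Arena's own implementation may differ (for example by interpolating between thresholds).

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Hypothetical helper: EER as a fraction in [0, 1]; higher score = more bona fide."""
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    scores = np.concatenate([bona, spoof])
    is_bona = np.concatenate([np.ones(bona.size), np.zeros(spoof.size)])
    # Sort trials by score so the threshold can be swept from low to high.
    is_bona = is_bona[np.argsort(scores, kind="mergesort")]
    # After k steps the k lowest-scoring trials are rejected.
    # False rejection rate: share of bona fide trials rejected so far.
    frr = np.concatenate([[0.0], np.cumsum(is_bona) / bona.size])
    # False acceptance rate: share of spoof trials still accepted.
    far = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - is_bona) / spoof.size])
    # Take the threshold where the two error rates are closest and average them.
    k = np.argmin(np.abs(frr - far))
    return float((frr[k] + far[k]) / 2.0)


# Toy scores: one spoof trial (0.65) outranks one bona fide trial (0.6).
eer = compute_eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65])
print(f"EER = {100 * eer:.2f} %")
```

Multiplying the returned fraction by 100 gives the percentage form used in the table.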
| ## Usage |
| |
| The checkpoint is a `state_dict` for the `RawTFNet` network defined in |
| [`_net.py`](./_net.py). The input **must** be exactly 64,600 samples at 16 kHz mono — |
| window the waveform with `pad_fixed` (first 64,600 samples, tile-repeat if shorter). |
|
|
| ```python |
| import numpy as np |
| from rawtfnet import RawTFNetModel # _net.py + rawtfnet.py are in this repo |
| |
| m = RawTFNetModel() |
| m.load() # loads Best_RawTFNet_32.pth |
| # 48,000 samples (3 s) of placeholder noise; substitute a real float32 mono waveform at 16 kHz |
| audio = np.random.randn(48000).astype(np.float32) |
| print(m.score_batch([audio], [16000])[0]) # higher = more bona fide |
| m.unload() |
| ``` |
|
|
| Internally the wrapper windows the input, runs the network, and returns |
| `logits[:, 1]` (class 1 = bona fide). [`rawtfnet.py`](./rawtfnet.py) is the exact |
| `speech_spoof_bench` model that produced the Arena `scores.txt`. |
|
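The windowing rule described above (keep the first 64,600 samples, tile-repeat if shorter) can be sketched in a few lines of NumPy. This is an illustration only, not the `pad_fixed` function shipped in this repo: the name `pad_fixed_sketch` and the error handling for empty input are assumptions.

```python
import numpy as np

TARGET_LEN = 64_600  # about 4.04 s at 16 kHz


def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Hypothetical stand-in for pad_fixed: deterministic crop or tile to target_len."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty mono (1-D) waveform")
    if wave.size >= target_len:
        # Long enough: keep the first target_len samples (no random crop).
        return wave[:target_len]
    # Too short: repeat the waveform end to end, then trim to the exact length.
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]


short = np.arange(1, 4, dtype=np.float32)  # 3 samples
print(pad_fixed_sketch(short, target_len=7))  # tiles 1, 2, 3 and trims to 7 samples
print(pad_fixed_sketch(np.zeros(48_000, dtype=np.float32)).shape)
```

Because the crop always starts at sample 0 and the tiling is deterministic, the same input file always yields the same network input, which is what makes the Arena scores repeatable.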
|
| ## Citation |
|
|
| **This model / paper:** |
|
|
| ```bibtex |
| @article{xiao2025rawtfnet, |
| title={RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing}, |
| author={Xiao, Yang and Dang, Ting and Das, Rohan Kumar}, |
| journal={arXiv preprint arXiv:2507.08227}, |
| year={2025} |
| } |
| ``` |
|
|
| **Training dataset — ASVspoof 2019:** |
|
|
| ```bibtex |
| @article{wang2020asvspoof, |
| title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech}, |
| author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others}, |
| journal={Computer Speech \& Language}, |
| volume={64}, |
| pages={101114}, |
| year={2020}, |
| publisher={Elsevier} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT — see the [source repository](https://github.com/swagshaw/RawTFNet-Pytorch). |
|
|