---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---

# Res2TCNGuard

Res2TCNGuard is a TCN-based audio anti-spoofing (voice-deepfake detection)
countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
score where **higher = more bona fide**.

- **Code:** https://github.com/lab260ru/Res2TCNGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 172,102 (0.172 M)
- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer is in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores is in
[`res2tcnguard.py`](./res2tcnguard.py).

## Architecture

Res2TCNGuard operates directly on the raw waveform:

1. **Sinc-convolution front-end** (`SincConv_fast`): learnable band-pass filters
   that turn the waveform into a time–frequency representation.
2. **Res2Net encoder**: stacked `Res2Block`s with multi-scale residual connections
   and squeeze-and-excitation (SE) attention.
3. **Dual temporal convolutional networks**: two `TemporalConvNet` branches model
   the time and spectral axes separately; their pooled features are concatenated and
   passed to a small linear classifier (bona fide vs. spoof).
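
To make step 1 concrete, here is a minimal NumPy sketch of a single sinc band-pass filter, built as the difference of two windowed low-pass sinc filters. It is not the repository's `SincConv_fast`: the function name `sinc_bandpass`, the 129-tap length, and the Hamming window are illustrative assumptions. In a learnable front-end, the two cutoff frequencies would be trainable parameters rather than fixed arguments.

```python
import numpy as np


def sinc_bandpass(f_low, f_high, num_taps=129, sample_rate=16000):
    """Return FIR taps passing roughly f_low..f_high Hz (hypothetical helper)."""
    if not (0 <= f_low < f_high <= sample_rate / 2):
        raise ValueError("need 0 <= f_low < f_high <= Nyquist")
    # Symmetric time axis in seconds, centred on the middle tap.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    t = n / sample_rate
    # An ideal low-pass with cutoff f has impulse response 2f * sinc(2 f t).
    # np.sinc(x) is the normalized sinc, sin(pi x) / (pi x).
    low_pass_high = 2 * f_high * np.sinc(2 * f_high * t)
    low_pass_low = 2 * f_low * np.sinc(2 * f_low * t)
    # Band-pass = difference of the two low-pass responses; window to reduce ripple.
    taps = (low_pass_high - low_pass_low) * np.hamming(num_taps)
    # Dividing by the sample rate converts to discrete time (pass-band gain near 1).
    return taps / sample_rate


taps = sinc_bandpass(300.0, 3000.0)
# Magnitude response on a 1024-point FFT grid.
response = np.abs(np.fft.rfft(taps, n=1024))
freqs = np.fft.rfftfreq(1024, d=1 / 16000)
in_band = response[(freqs > 800) & (freqs < 2500)].mean()
out_band = response[freqs > 5000].mean()
print(f"pass-band gain ~ {in_band:.2f}, stop-band gain ~ {out_band:.4f}")
```

Applying such a filter to a waveform is an ordinary 1-D convolution (for example `np.convolve(waveform, taps, mode="same")`); a bank of these filters with different cutoffs yields one output channel per band.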

## How it was trained

- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
  in the paper, the model is trained and validated on subsets representing a *single*
  attack type and then evaluated on the eval split, which contains *more advanced and
  unseen* spoofing attacks. This tests the model's ability to generalize to harder
  attack scenarios.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  During training a random segment is cut from each utterance, so reported numbers can
  vary slightly between runs.
- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
  the best eval EER is kept.
- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.
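
The random-segment step described under **Input length** can be sketched as follows. This is an illustration, not the notebook's code: the helper name `random_segment`, the seeded `numpy.random.Generator`, and the repeat-to-fill handling of clips shorter than the window are assumptions.

```python
import numpy as np

NUM_SAMPLES = 64_600  # 64,600 samples / 16,000 Hz = 4.0375 s


def random_segment(waveform, rng, num_samples=NUM_SAMPLES):
    """Cut a random fixed-length segment from a 1-D waveform (hypothetical helper)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty 1-D waveform")
    if waveform.size <= num_samples:
        # Too short to crop: np.resize fills the window with repeated copies
        # of the clip (assumed behaviour for short utterances).
        return np.resize(waveform, num_samples)
    # Start index drawn uniformly from all positions that leave a full window.
    start = rng.integers(0, waveform.size - num_samples + 1)
    return waveform[start:start + num_samples]


rng = np.random.default_rng(seed=0)  # fixing the seed makes the crops repeatable
utterance = np.arange(100_000, dtype=np.float32)  # stand-in for a 6.25 s clip
segment = random_segment(utterance, rng)
print(segment.shape, NUM_SAMPLES / 16_000)
```

Because the start index changes from draw to draw, two evaluations of the same utterance can see different audio, which is why run-to-run numbers differ slightly unless the seed is fixed.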

See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 | in-domain (same dataset as training) |
| ASVspoof2021_DF | test | **17.02** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **13.67** | 181,566 | 0 | cross-dataset generalization |
| CD-ADD | test | **56.10** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **52.52** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |

The ASVspoof2019_LA result (1.50 %) matches the paper's reported 1.49 % on the LA eval
set to within 0.01 percentage points. ASVspoof2021_DF is an out-of-domain test (the
model was trained only on ASVspoof2019 LA), so a higher EER is expected: it measures
generalization to newer, unseen attacks. The EERs above 50 % on CD-ADD and InTheWild
are no better than chance.
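
For readers unfamiliar with the metric in the table above, the EER is the operating point where the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). The sketch below computes it with NumPy under this card's convention that higher scores mean more bona fide. It is not the Arena's scoring code; the function name `compute_eer` and the toy scores are illustrative.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER (as a fraction) by sweeping all observed scores as thresholds."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # A trial is accepted as bona fide when score >= threshold.
    # False rejection: bona fide trials scoring below the threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # False acceptance: spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Toy scores, not model output: one spoof trial outscores one bona fide trial.
bonafide = [2.1, 1.7, 0.9, 0.2]
spoof = [-1.5, -0.8, 0.4, -2.0]
print(f"EER = {100 * compute_eer(bonafide, spoof):.1f} %")  # EER = 25.0 %
```

With four trials per class and one overlapping pair, a threshold of 0.4 wrongly rejects one bona fide trial and wrongly accepts one spoof trial, so both error rates are 25 %.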

## Usage

The checkpoint is a `state_dict` for the `TestModel` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input
**must** be exactly 64,600 samples of 16 kHz mono audio because the classifier head
is fixed-size, so window the waveform with `pad_fixed` (first 64,600 samples,
tile-repeat if shorter).
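
The windowing rule above (first 64,600 samples, tile-repeat if shorter) could look like the sketch below. The card refers to a `pad_fixed` helper; this stand-alone version is named `pad_fixed_sketch` to make clear that it is an assumption about the behaviour, not a copy of that function.

```python
import numpy as np


def pad_fixed_sketch(waveform, num_samples=64_600):
    """Deterministically window a 1-D waveform to exactly num_samples samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty 1-D (mono) waveform")
    if waveform.size >= num_samples:
        # Long enough: keep the first num_samples samples and drop the rest.
        return waveform[:num_samples]
    # Too short: repeat the whole clip end to end, then trim to the target length.
    repeats = int(np.ceil(num_samples / waveform.size))
    return np.tile(waveform, repeats)[:num_samples]


short_clip = np.array([0.1, 0.2, 0.3], dtype=np.float32)
print(pad_fixed_sketch(short_clip, num_samples=7))  # the three values repeat, then trim
print(pad_fixed_sketch(np.zeros(80_000)).shape)     # (64600,)
```

Unlike a random crop, this window depends only on the input, so scoring the same file twice gives the same result.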

Score one file from the command line:

```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float> (higher = more bona fide)
```

Or from Python:

```python
import numpy as np
from evaluate import load_model, score  # _net.py and evaluate.py are in this repo

model = load_model("best_1.495.pth", device="cpu")
# Placeholder input: 3 s of random noise. Replace with real float32 mono audio at 16 kHz.
audio = np.random.randn(48000).astype(np.float32)
print(score(model, audio))  # higher = more bona fide
```

Internally, `score` runs `_, logits = model(x)` on the windowed input and returns
`logits[:, 1]` (class 1 = bona fide). [`res2tcnguard.py`](./res2tcnguard.py) packages
the same logic as a `speech_spoof_bench` model; it is the exact code that produced
the Arena `scores.txt`.

## Citation

**This model / paper:**

```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology \& Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```

**Training dataset (ASVspoof 2019):**

```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```

## License

MIT; see the [source repository](https://github.com/lab260ru/Res2TCNGuard).