+---
+license: mit
+tags:
+  - audio
+  - anti-spoofing
+  - audio-deepfake-detection
+  - speech
+  - asvspoof
+---
+# Res2TCNGuard
+[![EER% 1.5 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.5%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+
+Res2TCNGuard is a TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
+*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
+(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
+score where **higher = more bona fide**.
+- **Code:** https://github.com/lab260ru/Res2TCNGuard
+- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
+- **Parameters:** ~0.17 M
+- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)
+## Architecture
+Res2TCNGuard operates directly on the raw waveform:
+1. **Sinc-convolution front-end** (`SincConv_fast`) — learnable band-pass filters
+   that turn the waveform into a time–frequency representation.
+2. **Res2Net encoder** — stacked `Res2Block`s with multi-scale residual connections
+   and squeeze-and-excitation (SE) attention.
+3. **Dual temporal convolutional networks** — two `TemporalConvNet` branches model
+   the time and spectral axes separately; their pooled features are concatenated and
+   passed to a small linear classifier (bona fide vs. spoof).
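
To make step 1 concrete, here is a minimal sketch of one windowed-sinc band-pass filter, the building block of SincNet-style front-ends, where only the two cutoff frequencies of each filter are trainable. It is not the repository's `SincConv_fast`; the function names, the 129-tap kernel size, and the Hamming window are assumptions chosen for illustration.

```python
import math


def sinc_bandpass_kernel(low_hz, high_hz, sample_rate=16000, kernel_size=129):
    """Windowed-sinc band-pass FIR kernel (difference of two low-pass filters)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    half = kernel_size // 2
    f_lo = low_hz / sample_rate   # cutoffs in cycles per sample
    f_hi = high_hz / sample_rate

    def lowpass(fc, n):
        # Ideal low-pass impulse response: 2*fc*sinc(2*fc*n), with sinc(0) = 1.
        if n == 0:
            return 2.0 * fc
        return math.sin(2.0 * math.pi * fc * n) / (math.pi * n)

    kernel = []
    for i in range(kernel_size):
        n = i - half
        # Hamming window (an illustrative choice) to reduce stop-band ripple.
        hamming = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (kernel_size - 1))
        kernel.append((lowpass(f_hi, n) - lowpass(f_lo, n)) * hamming)
    return kernel


def gain_at(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at one frequency."""
    half = len(kernel) // 2
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * (i - half)) for i, h in enumerate(kernel))
    im = sum(h * math.sin(w * (i - half)) for i, h in enumerate(kernel))
    return math.hypot(re, im)
```

For example, `gain_at(sinc_bandpass_kernel(1000, 2000), 1500)` is close to 1, while the gain at 5000 Hz is close to 0. A learnable layer would hold many such filters and update their cutoffs by gradient descent.
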
+## How it was trained
+- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
+  in the paper, the model is trained and validated on subsets representing a *single*
+  attack type and then evaluated on the eval split, which contains *more advanced and
+  unseen* spoofing attacks — testing the model's ability to generalize to harder
+  attack scenarios.
+- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
+  During training a random segment is cut from each utterance (so reported numbers can
+  vary slightly between runs).
+- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
+  the best eval EER is kept.
+- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.
+
+See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
+for the full training and evaluation code.
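
The fixed input length described above can be sketched as follows. The helper `fix_length` is hypothetical and not taken from the training notebook. The card says short utterances are padded but not how, so padding by repetition is an assumption borrowed from common ASVspoof baselines. Plain Python lists are used to keep the sketch dependency-free; real code would operate on tensors.

```python
import random

TARGET_LEN = 64_600  # samples, about 4.04 s at 16 kHz (value from this card)


def fix_length(waveform, target_len=TARGET_LEN, random_crop=False, rng=None):
    """Return exactly target_len samples from a 1-D sequence of samples."""
    n = len(waveform)
    if n == 0:
        raise ValueError("empty waveform")
    if n < target_len:
        # Assumption: pad by repeating the utterance, then truncate.
        repeats = -(-target_len // n)  # ceiling division
        return (list(waveform) * repeats)[:target_len]
    start = 0
    if random_crop:
        # Training-style crop: random start offset (inclusive on both ends).
        rng = rng or random
        start = rng.randint(0, n - target_len)
    # With random_crop=False this is the deterministic first window.
    return list(waveform[start:start + target_len])
```

Calling `fix_length(samples, random_crop=True)` mimics the training-time random segment, while the default call always returns the first 64,600 samples and is therefore repeatable.
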
+## Benchmark result (Speech Anti-Spoofing Arena)
+Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
+Scores were computed with a **deterministic first-64,600-sample window** (no random
+crop), so the numbers are exactly reproducible from the pinned score file.
+| Dataset | Split | EER % | Trials | Skipped |
+|---|---|---|---|---|
+| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 |
+
+This matches the paper's reported 1.49 % on the ASVspoof 2019 LA eval set to within 0.01 percentage points.
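
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is a simple threshold sweep under the score convention of this card (higher = more bona fide). It is not the Arena's scoring code, and `compute_eer` is a hypothetical name.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate (as a fraction) for scores where higher = more bona fide."""
    if not bonafide_scores or not spoof_scores:
        raise ValueError("need at least one score from each class")
    # Label 1 = bona fide, 0 = spoof; sort ascending by score.
    trials = sorted([(s, 1) for s in bonafide_scores] +
                    [(s, 0) for s in spoof_scores])
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)
    rejected_bona = 0    # bona fide trials scoring below the threshold
    rejected_spoof = 0   # spoof trials scoring below the threshold
    frr, far = 0.0, 1.0  # threshold below every score: everything is accepted
    best_gap, eer = abs(far - frr), (far + frr) / 2
    i = 0
    while i < len(trials):
        score = trials[i][0]
        # Move the threshold just above this score; tied trials move together.
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1] == 1:
                rejected_bona += 1
            else:
                rejected_spoof += 1
            i += 1
        frr = rejected_bona / n_bona
        far = 1.0 - rejected_spoof / n_spoof
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Multiply the result by 100 to compare it with the EER % column above. With finite trial lists the two error curves rarely cross exactly, so this sketch reports the midpoint at the closest threshold; other tools may interpolate and differ in the last digit.
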
+## Usage
+This checkpoint is a `state_dict` for the `TestModel` network defined in the
+[source repository](https://github.com/lab260ru/Res2TCNGuard). Load the architecture
+from there, then:
+```python
+import torch
+from TCN import TestModel  # network definition from the source repo
+
+model = TestModel()
+model.load_state_dict(torch.load("best_1.495.pth", map_location="cpu"))
+model.eval()
+
+# x: float32 waveform, 16 kHz mono, shape (batch, 64600).
+# Placeholder input (silence); replace with real audio.
+x = torch.zeros(1, 64600)
+
+with torch.no_grad():  # inference only, no gradients needed
+    _, logits = model(x)
+bonafide_score = logits[:, 1]   # higher = more bona fide
+```
+## Citation
+**This model / paper:**
+```bibtex
+@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
+  place={Greece},
+  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
+  volume={14},
+  number={6},
+  url={https://etasr.com/index.php/ETASR/article/view/8906},
+  DOI={10.48084/etasr.8906},
+  journal={Engineering, Technology \& Applied Science Research},
+  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
+  year={2024},
+  month={Dec.},
+  pages={18409--18414}
+}
+```
+**Training dataset — ASVspoof 2019:**
+```bibtex
+@article{wang2020asvspoof,
+  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
+  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
+  journal={Computer Speech \& Language},
+  volume={64},
+  pages={101114},
+  year={2020},
+  publisher={Elsevier}
+}
+```
+## License
+MIT — see the [source repository](https://github.com/lab260ru/Res2TCNGuard).