---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---
# Res2TCNGuard
TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
score where **higher = more bona fide**.
- **Code:** https://github.com/lab260ru/Res2TCNGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 172,102 (0.172 M)
- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer is in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores is in
[`res2tcnguard.py`](./res2tcnguard.py).
## Architecture
Res2TCNGuard operates directly on the raw waveform:
1. **Sinc-convolution front-end** (`SincConv_fast`) — learnable band-pass filters
that turn the waveform into a time–frequency representation.
2. **Res2Net encoder** — stacked `Res2Block`s with multi-scale residual connections
and squeeze-and-excitation (SE) attention.
3. **Dual temporal convolutional networks** — two `TemporalConvNet` branches model
the time and spectral axes separately; their pooled features are concatenated and
passed to a small linear classifier (bona fide vs. spoof).
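As a rough illustration of step 1, a sinc front-end can build each filter as the difference of two windowed-sinc low-pass kernels, so only the two cutoff frequencies need to be learned. The sketch below assumes that SincNet-style parameterization; `sinc_bandpass`, the cutoffs, and the kernel size are hypothetical choices for illustration, not code or values taken from `_net.py`.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel (hypothetical helper, not from _net.py)."""
    # Symmetric time axis in seconds, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    # np.sinc(x) = sin(pi*x)/(pi*x), so 2*f*sinc(2*f*t) is an ideal low-pass at f Hz.
    low_pass_high = 2 * f_high_hz * np.sinc(2 * f_high_hz * t)
    low_pass_low = 2 * f_low_hz * np.sinc(2 * f_low_hz * t)
    # Band-pass = (low-pass at f_high) - (low-pass at f_low), tapered by a Hamming window.
    kernel = (low_pass_high - low_pass_low) * np.hamming(kernel_size)
    # Divide by the sample rate so the passband gain is close to 1.
    return (kernel / sample_rate).astype(np.float32)

kernel = sinc_bandpass(300.0, 3000.0)
# With an FFT length equal to the sample rate, bin k corresponds to k Hz.
response = np.abs(np.fft.rfft(kernel, n=16000))
print(f"gain at 1 kHz: {response[1000]:.3f}, gain at 6 kHz: {response[6000]:.3f}")
```

Convolving the waveform with a bank of such kernels, one per cutoff pair, gives one output channel per frequency band, which is the time–frequency representation the encoder consumes.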
## How it was trained
- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
in the paper, the model is trained and validated on subsets representing a *single*
attack type and then evaluated on the eval split, which contains *more advanced and
unseen* spoofing attacks — testing the model's ability to generalize to harder
attack scenarios.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
During training a random segment is cut from each utterance (so reported numbers can
vary slightly between runs).
- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
the best eval EER is kept.
- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.

See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
for the full training and evaluation code.
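The random training crop described above can be sketched as follows. This is a minimal illustration, not the notebook's code: `random_segment` is a hypothetical helper, and it only handles utterances that are already at least 64,600 samples long, because the training-time padding scheme is not spelled out in this card.

```python
import numpy as np

NUM_SAMPLES = 64_600  # about 4.04 s at 16 kHz

def random_segment(waveform, rng, num_samples=NUM_SAMPLES):
    """Cut a random fixed-length segment (hypothetical helper for illustration)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.size < num_samples:
        raise ValueError("utterance shorter than the window; pad it first")
    # Any start index that leaves room for a full window is equally likely.
    start = int(rng.integers(0, waveform.size - num_samples + 1))
    return waveform[start:start + num_samples]

utterance = np.arange(100_000, dtype=np.float32)  # stand-in for a real waveform
seg_a = random_segment(utterance, np.random.default_rng(1))
seg_b = random_segment(utterance, np.random.default_rng(2))
print(seg_a[0], seg_b[0])  # different seeds generally pick different start samples
```

Because each run sees different segments of the same utterances, two training runs with different seeds can end at slightly different EERs.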
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 | in-domain (same corpus as the training data) |
| ASVspoof2021_DF | test | **17.02** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **13.67** | 181,566 | 0 | cross-dataset generalization |
| CD-ADD | test | **56.10** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **52.52** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |

The ASVspoof2019_LA result (1.50 %) closely matches the paper's reported 1.49 % on the LA eval set.
ASVspoof2021_DF and ASVspoof2021_LA are cross-dataset tests (the model was trained only on
ASVspoof 2019 LA), so a higher EER is expected; they measure generalization to newer, unseen
attacks. On CD-ADD and InTheWild the EER is above 50 %, which is no better than chance.
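For readers unfamiliar with the metric, the EER is the operating point where the false-rejection rate on bona fide trials equals the false-acceptance rate on spoof trials. The sketch below is one common threshold-sweep definition, assuming scores where higher means more bona fide; `equal_error_rate` is a hypothetical helper, and the Arena's own implementation may differ in details such as interpolation.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Return the EER in [0, 1] for scores where higher = more bona fide."""
    if len(bona_fide_scores) == 0 or len(spoof_scores) == 0:
        raise ValueError("need at least one trial of each class")
    # Tag each score with its class (True = bona fide) and sort ascending.
    trials = sorted([(float(s), True) for s in bona_fide_scores] +
                    [(float(s), False) for s in spoof_scores])
    n_bona, n_spoof = len(bona_fide_scores), len(spoof_scores)
    rejected_bona = rejected_spoof = 0
    # Threshold below every score: nothing rejected, so FRR = 0 and FAR = 1.
    best_gap, eer = 1.0, 0.5
    i = 0
    while i < len(trials):
        score = trials[i][0]
        # Raise the threshold just above `score`; tied trials are rejected together.
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1]:
                rejected_bona += 1
            else:
                rejected_spoof += 1
            i += 1
        frr = rejected_bona / n_bona                 # bona fide wrongly rejected
        far = (n_spoof - rejected_spoof) / n_spoof   # spoof wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

# Toy scores, invented for illustration only.
print(equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.6, 0.4]))
```

An EER near 0 means the two score distributions barely overlap, while an EER near 0.5 means the scores carry almost no information about the class.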
## Usage
The checkpoint is a `state_dict` for the `TestModel` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input
**must** be exactly 64,600 samples at 16 kHz mono — the classifier head is
fixed-size — so window the waveform with `pad_fixed` (first 64,600 samples,
tile-repeat if shorter).
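The windowing rule described above (keep the first 64,600 samples, tile-repeat shorter inputs) can be sketched as below. This is an illustration of the rule only: `pad_fixed_sketch` is a hypothetical name, and the actual `pad_fixed` in this repo may differ in signature and edge-case handling.

```python
import numpy as np

def pad_fixed_sketch(waveform, num_samples=64_600):
    """Return exactly `num_samples` samples: truncate, or tile-repeat then truncate."""
    waveform = np.asarray(waveform, dtype=np.float32).reshape(-1)
    if waveform.size == 0:
        raise ValueError("cannot window an empty waveform")
    if waveform.size >= num_samples:
        return waveform[:num_samples]          # deterministic first window
    repeats = -(-num_samples // waveform.size)  # ceiling division
    return np.tile(waveform, repeats)[:num_samples]

short = np.arange(48_000, dtype=np.float32)  # 3 s at 16 kHz
windowed = pad_fixed_sketch(short)
print(windowed.shape)  # the clip is repeated from its start to fill the window
```

Because the window always starts at sample 0, the same file always produces the same model input, which is what makes the Arena scores reproducible.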
Score one file from the command line:
```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float> (higher = more bona fide)
```
Or from Python:
```python
import numpy as np
from evaluate import load_model, score # _net.py + evaluate.py are in this repo
model = load_model("best_1.495.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32)  # placeholder: 3 s of noise; pass real float32 mono 16 kHz audio
print(score(model, audio)) # higher = more bona fide
```
Internally `score` does `_, logits = model(x)` on the windowed input and returns
`logits[:, 1]` (class 1 = bona fide). [`res2tcnguard.py`](./res2tcnguard.py) is the
same logic packaged as a `speech_spoof_bench` model — the exact code that produced
the Arena `scores.txt`.
## Citation
**This model / paper:**
```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
place={Greece},
title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
volume={14},
number={6},
url={https://etasr.com/index.php/ETASR/article/view/8906},
DOI={10.48084/etasr.8906},
journal={Engineering, Technology & Applied Science Research},
author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
year={2024},
month={Dec.},
pages={18409--18414}
}
```
**Training dataset — ASVspoof 2019:**
```bibtex
@article{wang2020asvspoof,
title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
journal={Computer Speech \& Language},
volume={64},
pages={101114},
year={2020},
publisher={Elsevier}
}
```
## License
MIT — see the [source repository](https://github.com/lab260ru/Res2TCNGuard).