Add model card with badges, training notes, citations
README.md
ADDED
@@ -0,0 +1,123 @@
---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---

# Res2TCNGuard

Res2TCNGuard is a TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure
proposed in *"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
score where **higher = more bona fide**.

- **Code:** https://github.com/lab260ru/Res2TCNGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** ~0.17 M
- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)

## Architecture

Res2TCNGuard operates directly on the raw waveform:

1. **Sinc-convolution front-end** (`SincConv_fast`): learnable band-pass filters
   that turn the waveform into a time–frequency representation.
2. **Res2Net encoder**: stacked `Res2Block`s with multi-scale residual connections
   and squeeze-and-excitation (SE) attention.
3. **Dual temporal convolutional networks**: two `TemporalConvNet` branches model
   the time and spectral axes separately; their pooled features are concatenated and
   passed to a small linear classifier (bona fide vs. spoof).

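
To make the idea of a sinc band-pass filter concrete, the sketch below builds a single fixed filter as the difference of two windowed-sinc low-pass filters, which is the usual SincNet-style formulation. It is an illustration only: the function names, the 129-tap kernel size, and the Hamming window are assumptions, and the repository's `SincConv_fast` may be parameterized differently. In a learnable front-end the two cut-off frequencies would be trainable parameters rather than constants.

```python
import math


def sinc_bandpass(low_hz, high_hz, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass impulse response (hypothetical helper).

    Built as (low-pass at high_hz) minus (low-pass at low_hz).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the filter is symmetric")
    f1 = low_hz / sample_rate   # normalised cut-offs in cycles per sample
    f2 = high_hz / sample_rate
    half = kernel_size // 2
    taps = []
    for i in range(kernel_size):
        n = i - half
        if n == 0:
            ideal = 2.0 * (f2 - f1)  # limit of the sinc difference at n = 0
        else:
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        # Hamming window (an assumption) to reduce ripple from truncation
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(t * math.cos(w * k) for k, t in enumerate(taps))
    im = sum(t * math.sin(w * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


taps = sinc_bandpass(1000, 2000)
print(round(gain_at(taps, 1500), 3), round(gain_at(taps, 4000), 3))
```

A bank of such filters, each with its own pair of cut-offs, applied as a 1-D convolution over the waveform yields one output channel per band, which is what gives the front-end its time–frequency character.
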
## How it was trained

- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
  in the paper, the model is trained and validated on subsets representing a *single*
  attack type and then evaluated on the eval split, which contains *more advanced and
  unseen* spoofing attacks. This tests the model's ability to generalize to harder
  attack scenarios.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  During training a random segment is cut from each utterance (so reported numbers can
  vary slightly between runs).
- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
  the best eval EER is kept.
- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.

See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
for the full training and evaluation code.

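
The fixed input length described above can be sketched as follows. The helper name `fix_length` is hypothetical, and padding short clips by repeating the signal is an assumption borrowed from common raw-waveform ASVspoof baselines; the model card only says "cropped/padded", so the notebook may pad differently (for example with zeros).

```python
import random

TARGET_LEN = 64_600   # samples, as stated in the model card
SAMPLE_RATE = 16_000  # Hz


def fix_length(wave, target_len=TARGET_LEN, random_crop=False, rng=None):
    """Return exactly target_len samples from a non-empty sequence of floats."""
    n = len(wave)
    if n == 0:
        raise ValueError("empty waveform")
    if n >= target_len:
        start = 0  # deterministic: keep the first target_len samples
        if random_crop:
            rng = rng or random
            start = rng.randint(0, n - target_len)  # inclusive bounds
        return list(wave[start:start + target_len])
    # Assumption: short clips are padded by repeating the signal.
    repeats = target_len // n + 1
    return (list(wave) * repeats)[:target_len]


print(TARGET_LEN / SAMPLE_RATE)  # duration of one input window in seconds
```

With `random_crop=True` the helper mimics the training-time behaviour (a random segment per utterance); with the default it keeps the first 64,600 samples, so repeated runs see identical input.
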
## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped |
|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 |

This matches the paper's reported 1.49 % on the ASVspoof 2019 LA eval set to within
0.01 percentage points.

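
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is a minimal, unoptimized way to estimate it from two lists of scores; `compute_eer` is a hypothetical helper, not the Arena's implementation, and its quadratic threshold sweep is written for clarity rather than for tens of thousands of trials.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER for scores where higher means more bona fide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        # A trial is accepted as bona fide when its score is >= thr.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example: one of four spoof trials outscores one of four bona fide trials.
print(compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7]))
```
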
## Usage

This checkpoint is a `state_dict` for the `TestModel` network defined in the
[source repository](https://github.com/lab260ru/Res2TCNGuard). Load the architecture
from there, then:

```python
import torch
from TCN import TestModel  # network definition from the source repo

model = TestModel()
model.load_state_dict(torch.load("best_1.495.pth", map_location="cpu"))
model.eval()

# x: float32 waveform, 16 kHz mono, shape (batch, 64600).
# Placeholder input (silence); replace it with real audio.
x = torch.zeros(1, 64600, dtype=torch.float32)

with torch.no_grad():  # inference only, no gradients needed
    _, logits = model(x)
bonafide_score = logits[:, 1]  # higher = more bona fide
```

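
The usage snippet expects a float waveform at 16 kHz mono. As one possible way to obtain it without extra dependencies, the sketch below reads a 16-bit PCM WAV file with Python's standard `wave` module. The helper name `read_pcm16_mono` and the scaling by 32768 are assumptions, and it does not handle FLAC or other sample widths; an audio library would be needed for those.

```python
import struct
import wave


def read_pcm16_mono(path, expected_rate=16000):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wav.getframerate()} Hz"
            )
        raw = wav.readframes(wav.getnframes())
    # Little-endian signed 16-bit integers, scaled to floats.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]
```

The returned list could then be cropped or padded to 64,600 samples and converted with `torch.tensor(samples, dtype=torch.float32).unsqueeze(0)` to get the `(batch, 64600)` shape.
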
## Citation

**This model / paper:**

```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology \& Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```

**Training dataset (ASVspoof 2019):**

```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```

## License

MIT; see the [source repository](https://github.com/lab260ru/Res2TCNGuard).