korallll committed on
Commit c8057ae · verified · 1 Parent(s): 047789b

Add model card with badges, training notes, citations

Files changed (1)
  1. README.md +123 -0
README.md ADDED
@@ -0,0 +1,123 @@
+ ---
+ license: mit
+ tags:
+ - audio
+ - anti-spoofing
+ - audio-deepfake-detection
+ - speech
+ - asvspoof
+ ---
+
+ # Res2TCNGuard
+
+ [![EER% 1.5 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.5%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+ [![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+ [![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
+
+ Res2TCNGuard is a TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure
+ proposed in *"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
+ (Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
+ score where **higher = more bona fide**.
+
+ - **Code:** https://github.com/lab260ru/Res2TCNGuard
+ - **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
+ - **Parameters:** ~0.17 M
+ - **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)
+
+ ## Architecture
+
+ Res2TCNGuard operates directly on the raw waveform:
+
+ 1. **Sinc-convolution front-end** (`SincConv_fast`): learnable band-pass filters
+    that turn the waveform into a time–frequency representation.
+ 2. **Res2Net encoder**: stacked `Res2Block`s with multi-scale residual connections
+    and squeeze-and-excitation (SE) attention.
+ 3. **Dual temporal convolutional networks**: two `TemporalConvNet` branches model
+    the time and spectral axes separately; their pooled features are concatenated and
+    passed to a small linear classifier (bona fide vs. spoof).
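
To make the first step above concrete, here is a minimal, standard-library sketch of a single windowed-sinc band-pass filter, the kind of kernel a sinc-convolution front-end parameterizes by its two cut-off frequencies. It is not the repository's `SincConv_fast`: the function names, the Hamming window, the kernel size of 129 and the example cut-offs are all assumptions chosen for illustration.

```python
import math


def sinc_bandpass(low_hz, high_hz, kernel_size=129, sample_rate=16000):
    """Return the taps of a Hamming-windowed ideal band-pass filter.

    Illustrative only: kernel_size, the window and the cut-offs are
    assumptions, not values taken from the Res2TCNGuard code.
    """
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd integer >= 3")
    if not 0 <= low_hz < high_hz <= sample_rate / 2:
        raise ValueError("need 0 <= low_hz < high_hz <= Nyquist")

    half = kernel_size // 2
    f1 = low_hz / sample_rate   # normalized cut-offs (cycles per sample)
    f2 = high_hz / sample_rate
    taps = []
    for n in range(-half, half + 1):
        if n == 0:
            ideal = 2.0 * (f2 - f1)  # limit of the sinc difference at n = 0
        else:
            # Difference of two low-pass sinc filters = band-pass filter.
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * (n + half) / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    re = sum(t * math.cos(2 * math.pi * freq_hz * k / sample_rate)
             for k, t in enumerate(taps))
    im = sum(t * math.sin(2 * math.pi * freq_hz * k / sample_rate)
             for k, t in enumerate(taps))
    return math.hypot(re, im)


taps = sinc_bandpass(1000.0, 2000.0)
print(round(gain_at(taps, 1500.0), 3), round(gain_at(taps, 0.0), 3))
```

In a learnable front-end, only the two cut-offs per filter are trained and the kernel is rebuilt from them on each forward pass, which is why such a layer needs far fewer parameters than a free-form convolution of the same length.
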
+
+ ## How it was trained
+
+ - **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
+   in the paper, the model is trained and validated on subsets representing a *single*
+   attack type and then evaluated on the eval split, which contains *more advanced and
+   unseen* spoofing attacks, testing the model's ability to generalize to harder
+   attack scenarios.
+ - **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
+   During training a random segment is cut from each utterance (so reported numbers can
+   vary slightly between runs).
+ - **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
+   the best eval EER is kept.
+ - **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.
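
The input-length handling described in the list above can be sketched as follows. This is a hedged illustration, not the notebook's code: `fix_length` is a hypothetical helper, and padding short utterances by repeating them is an assumption (the model card only says "cropped/padded").

```python
import random

SAMPLE_RATE = 16_000
TARGET_LEN = 64_600  # 64,600 / 16,000 = 4.0375 s


def fix_length(samples, target_len=TARGET_LEN, random_crop=False, rng=None):
    """Crop or pad a 1-D sequence of samples to exactly target_len items.

    random_crop=True  -> random segment, as described for training
    random_crop=False -> first target_len samples (deterministic)
    Short inputs are padded by repetition (an assumption, see lead-in).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("empty waveform")
    if n >= target_len:
        if random_crop:
            rng = rng or random
            start = rng.randrange(n - target_len + 1)
        else:
            start = 0
        return list(samples[start:start + target_len])
    repeats = -(-target_len // n)  # ceiling division
    return (list(samples) * repeats)[:target_len]


print(fix_length([1, 2, 3], target_len=7))
print(fix_length(list(range(10)), target_len=4))
```

Passing a seeded `random.Random` instance as `rng` makes the random crop repeatable, which helps when comparing training runs.
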
+
+ See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
+ for the full training and evaluation code.
+
+ ## Benchmark result (Speech Anti-Spoofing Arena)
+
+ The model was evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
+ Scores were computed with a **deterministic first-64,600-sample window** (no random
+ crop), so the numbers are exactly reproducible from the pinned score file.
+
+ | Dataset | Split | EER % | Trials | Skipped |
+ |---|---|---|---|---|
+ | ASVspoof2019_LA | test | **1.50** | 71,237 | 0 |
+
+ This is within 0.01 percentage points of the paper's reported 1.49 % on the ASVspoof 2019 LA eval set.
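
For readers unfamiliar with the metric in the table, the equal error rate (EER) is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is one simple way to estimate it from two lists of scores; it is not the arena's scoring code, and `compute_eer` is a hypothetical helper that assumes higher scores mean "more bona fide", as stated in this model card.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER (as a fraction in [0, 1]) by sweeping a threshold.

    A trial is accepted when its score is strictly above the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)
    if n_bona == 0 or n_spoof == 0:
        raise ValueError("need at least one score of each class")

    trials = sorted(
        [(s, True) for s in bonafide_scores] + [(s, False) for s in spoof_scores],
        key=lambda t: t[0],
    )

    rejected_bona = rejected_spoof = 0
    frr, far = 0.0, 1.0  # threshold below every score: accept everything
    best_gap, eer = abs(far - frr), (far + frr) / 2

    i = 0
    while i < len(trials):
        score = trials[i][0]
        # Raise the threshold to `score`; tied trials are rejected together.
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1]:
                rejected_bona += 1
            else:
                rejected_spoof += 1
            i += 1
        frr = rejected_bona / n_bona
        far = 1 - rejected_spoof / n_spoof
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


bona = [0.9, 0.8, 0.2, 0.7]
spoof = [0.1, 0.3, 0.85, 0.4]
print(f"EER = {100 * compute_eer(bona, spoof):.2f} %")
```

With only a handful of trials, FAR and FRR move in coarse steps; on a set the size of the one in the table (71,237 trials) the two curves cross much more smoothly, so the estimate is far less sensitive to how ties and the crossing point are handled.
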
+
+ ## Usage
+
+ This checkpoint is a `state_dict` for the `TestModel` network defined in the
+ [source repository](https://github.com/lab260ru/Res2TCNGuard). Load the architecture
+ from there, then:
+
+ ```python
+ import torch
+ from TCN import TestModel  # network definition from the source repo
+
+ model = TestModel()
+ model.load_state_dict(torch.load("best_1.495.pth", map_location="cpu"))
+ model.eval()
+
+ # x: float32 waveform, 16 kHz mono, shape (batch, 64600)
+ x = torch.zeros(1, 64600)  # placeholder input; replace with real audio
+ with torch.no_grad():  # inference only, no gradients needed
+     _, logits = model(x)
+ bonafide_score = logits[:, 1]  # higher = more bona fide
+ ```
+
+ ## Citation
+
+ **This model / paper:**
+
+ ```bibtex
+ @article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
+   place={Greece},
+   title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
+   volume={14},
+   number={6},
+   url={https://etasr.com/index.php/ETASR/article/view/8906},
+   DOI={10.48084/etasr.8906},
+   journal={Engineering, Technology \& Applied Science Research},
+   author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
+   year={2024},
+   month={Dec.},
+   pages={18409--18414}
+ }
+ ```
+
+ **Training dataset (ASVspoof 2019):**
+
+ ```bibtex
+ @article{wang2020asvspoof,
+   title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
+   author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
+   journal={Computer Speech \& Language},
+   volume={64},
+   pages={101114},
+   year={2020},
+   publisher={Elsevier}
+ }
+ ```
+
+ ## License
+
+ MIT; see the [source repository](https://github.com/lab260ru/Res2TCNGuard).