---
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof
---

# Res2TCNGuard

[![EER% 1.5 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.5%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 17.02 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-17.02%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 13.67 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-13.67%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 56.10 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-56.10%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 52.52 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-52.52%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)

TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
score where **higher = more bona fide**.

- **Code:** https://github.com/lab260ru/Res2TCNGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 172,102 (0.172 M)
- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer is in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores is in
[`res2tcnguard.py`](./res2tcnguard.py).

## Architecture

Res2TCNGuard operates directly on the raw waveform:

1. **Sinc-convolution front-end** (`SincConv_fast`) — learnable band-pass filters
   that turn the waveform into a time–frequency representation.
2. **Res2Net encoder** — stacked `Res2Block`s with multi-scale residual connections
   and squeeze-and-excitation (SE) attention.
3. **Dual temporal convolutional networks** — two `TemporalConvNet` branches model
   the time and spectral axes separately; their pooled features are concatenated and
   passed to a small linear classifier (bona fide vs. spoof).
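
To make the idea behind step 1 concrete, here is a minimal NumPy sketch of a single windowed-sinc band-pass kernel. It is not the `SincConv_fast` code from `_net.py`: the function name, the kernel size, and the band edges are illustrative assumptions, and the cutoffs here are fixed numbers, whereas a sinc front-end of this kind treats them as trainable parameters.

```python
import numpy as np


def sinc_bandpass_sketch(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass kernel (illustrative, not the repo's layer)."""
    # Symmetric time axis centred on the middle tap.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2

    def lowpass(cutoff_hz):
        # Ideal low-pass impulse response; np.sinc(x) = sin(pi x) / (pi x).
        fc = cutoff_hz / sample_rate
        return 2 * fc * np.sinc(2 * fc * n)

    # Band-pass = difference of two low-pass filters, tapered by a Hamming window.
    return (lowpass(f_high_hz) - lowpass(f_low_hz)) * np.hamming(kernel_size)


kernel = sinc_bandpass_sketch(500.0, 1500.0)
filtered = np.convolve(np.random.randn(16000), kernel, mode="same")
```

A bank of such kernels with different band edges, applied as a 1-D convolution, yields one output channel per band, which is what gives the encoder a time–frequency view of the waveform.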

## How it was trained

- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
  in the paper, the model is trained and validated on subsets representing a *single*
  attack type and then evaluated on the eval split, which contains *more advanced and
  unseen* spoofing attacks — testing the model's ability to generalize to harder
  attack scenarios.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  During training a random segment is cut from each utterance (so reported numbers can
  vary slightly between runs).
- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
  the best eval EER is kept.
- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.

See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
for the full training and evaluation code.
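
The random training crop described above can be sketched as follows. This is an illustration, not code taken from the notebook: the function name `random_crop_sketch` is hypothetical, and handling short utterances by tile-repeating them is an assumption.

```python
import numpy as np


def random_crop_sketch(wave, n_samples=64_600, rng=None):
    """Cut a random n_samples segment; tile-repeat utterances that are too short."""
    rng = np.random.default_rng() if rng is None else rng
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if wave.size <= n_samples:
        # Assumption: short inputs are repeated until they fill the window.
        reps = -(-n_samples // wave.size)  # ceiling division
        return np.tile(wave, reps)[:n_samples]
    # Start index drawn uniformly so the segment always fits inside the utterance.
    start = int(rng.integers(0, wave.size - n_samples + 1))
    return wave[start:start + n_samples]
```

Because the start index changes on every call, two evaluation runs that use this crop can score slightly different segments of the same file.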

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 | in-domain (same dataset as training) |
| ASVspoof2021_DF | test | **17.02** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **13.67** | 181,566 | 0 | cross-dataset generalization |
| CD-ADD | test | **56.10** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **52.52** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |

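The ASVspoof2019_LA result (1.50 %) matches the paper's reported 1.49 % on the LA eval
set to within 0.01 percentage points. The ASVspoof2021 DF and LA sets are cross-dataset
tests (the model was trained only on ASVspoof2019 LA), so a higher EER is expected: they
measure generalization to newer, unseen attacks. On CD-ADD and InTheWild the EER is above
50 %, which is no better than chance.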
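
For readers who want to recompute an EER from a score file, the following is a minimal sketch of the metric under this model's convention (higher score = more bona fide). It is not the Arena's implementation: the function name `compute_eer_sketch` is hypothetical, and the EER is approximated at the threshold where the two error rates are closest rather than by interpolation.

```python
import numpy as np


def compute_eer_sketch(bonafide_scores, spoof_scores):
    """Approximate equal error rate (as a fraction) for higher = more bona fide."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scoring below the threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # False acceptance rate: spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # Take the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)


print(100 * compute_eer_sketch([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))  # EER in percent
```

An EER near 50 % means the score distributions of bona fide and spoofed trials overlap almost completely, which is why such results are read as chance-level.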

## Usage

The checkpoint is a `state_dict` for the `TestModel` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input
**must** be exactly 64,600 samples at 16 kHz mono — the classifier head is
fixed-size — so window the waveform with `pad_fixed` (first 64,600 samples,
tile-repeat if shorter).
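
The windowing rule just described can be sketched in a few lines of NumPy. This is an illustration of the stated behavior (first 64,600 samples, tile-repeat if shorter), not the `pad_fixed` code shipped in this repo; the name `pad_fixed_sketch` and the empty-input check are assumptions.

```python
import numpy as np

N_SAMPLES = 64_600  # about 4.04 s at 16 kHz


def pad_fixed_sketch(wave, n_samples=N_SAMPLES):
    """Return exactly n_samples: truncate long inputs, tile-repeat short ones."""
    wave = np.asarray(wave, dtype=np.float32).reshape(-1)
    if wave.size == 0:
        raise ValueError("empty waveform")
    if wave.size >= n_samples:
        # Deterministic: always keep the first n_samples.
        return wave[:n_samples]
    reps = -(-n_samples // wave.size)  # ceiling division
    return np.tile(wave, reps)[:n_samples]


window = pad_fixed_sketch(np.zeros(48_000))
print(window.shape)
```

Since the window always starts at sample 0, the same file yields the same input tensor on every run.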

Score one file from the command line:

```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float>  (higher = more bona fide)
```

Or from Python:

```python
import numpy as np
from evaluate import load_model, score   # _net.py + evaluate.py are in this repo

model = load_model("best_1.495.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32)  # placeholder: 3 s of noise standing in for float32 mono 16 kHz audio
print(score(model, audio))                          # higher = more bona fide
```

Internally `score` does `_, logits = model(x)` on the windowed input and returns
`logits[:, 1]` (class 1 = bona fide). [`res2tcnguard.py`](./res2tcnguard.py) is the
same logic packaged as a `speech_spoof_bench` model — the exact code that produced
the Arena `scores.txt`.

## Citation

**This model / paper:**

```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology & Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```

**Training dataset — ASVspoof 2019:**

```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```

## License

MIT — see the [source repository](https://github.com/lab260ru/Res2TCNGuard).