---
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof
---

# Res2TCNGuard

[![EER% 1.5 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.5%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 17.02 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-17.02%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 13.67 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-13.67%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 56.10 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-56.10%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![EER% 52.52 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-52.52%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/res2tcnguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard)

Res2TCNGuard is a TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a
score where **higher = more bona fide**.

- **Code:** https://github.com/lab260ru/Res2TCNGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 172,102 (0.172 M)
- **Checkpoint:** [`best_1.495.pth`](./best_1.495.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores in
[`res2tcnguard.py`](./res2tcnguard.py).

## Architecture

Res2TCNGuard operates directly on the raw waveform:

1. **Sinc-convolution front-end** (`SincConv_fast`) — learnable band-pass filters
   that turn the waveform into a time–frequency representation.
2. **Res2Net encoder** — stacked `Res2Block`s with multi-scale residual connections
   and squeeze-and-excitation (SE) attention.
3. **Dual temporal convolutional networks** — two `TemporalConvNet` branches model
   the time and spectral axes separately; their pooled features are concatenated and
   passed to a small linear classifier (bona fide vs. spoof).
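
To make step 1 more concrete, here is a minimal NumPy sketch of a sinc-based band-pass filter bank applied to a waveform. It illustrates the general windowed-sinc technique only and is not the repository's `SincConv_fast`: the function names, kernel size, Hamming window, and fixed cutoff frequencies are all assumptions chosen for the example (in a learnable front-end the cutoffs would be trainable parameters).

```python
import numpy as np

def sinc_bandpass_bank(low_hz, high_hz, kernel_size=129, sample_rate=16000):
    """Build one windowed-sinc band-pass kernel per (low, high) cutoff pair.

    Hypothetical helper for illustration; cutoffs are fixed here, not learned.
    """
    low_hz = np.asarray(low_hz, dtype=np.float64)[:, None]
    high_hz = np.asarray(high_hz, dtype=np.float64)[:, None]
    # Symmetric time axis in seconds, centred on zero.
    t = ((np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate)[None, :]
    # Ideal low-pass impulse response: 2*fc*sinc(2*fc*t).
    # np.sinc(x) is the normalised sinc, sin(pi*x) / (pi*x).
    low_pass_hi = 2 * high_hz * np.sinc(2 * high_hz * t)
    low_pass_lo = 2 * low_hz * np.sinc(2 * low_hz * t)
    # A band-pass filter is the difference of two low-pass filters.
    kernels = (low_pass_hi - low_pass_lo) * np.hamming(kernel_size)[None, :]
    # Divide by the sample rate so the passband gain is approximately 1.
    return kernels / sample_rate

def apply_bank(waveform, kernels):
    """Filter a 1-D waveform with every kernel; returns (n_filters, n_samples)."""
    return np.stack([np.convolve(waveform, k, mode="same") for k in kernels])

# One second of a 1 kHz tone at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)

# Two bands: 500-1500 Hz (contains the tone) and 3000-4000 Hz (does not).
kernels = sinc_bandpass_bank(low_hz=[500, 3000], high_hz=[1500, 4000])
out = apply_bank(tone, kernels)
energy = (out ** 2).sum(axis=1)
print(out.shape, energy[0] > energy[1])
```

Each row of `out` is one filter channel over time, which is the kind of time-frequency representation the encoder consumes.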

## How it was trained

- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset. Following the protocol
  in the paper, the model is trained and validated on subsets representing a *single*
  attack type and then evaluated on the eval split, which contains *more advanced and
  unseen* spoofing attacks — testing the model's ability to generalize to harder
  attack scenarios.
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  During training a random segment is cut from each utterance (so reported numbers can
  vary slightly between runs).
- **Optimization:** Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with
  the best eval EER is kept.
- **Best reported result (paper):** EER = **1.49 %**, min t-DCF = 0.0451.
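
The random training crop mentioned above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the helper name `random_segment` is hypothetical, and utterances shorter than the target length are returned unchanged here, with padding left as a separate step.

```python
import numpy as np

def random_segment(waveform, length=64600, rng=None):
    """Cut a random contiguous segment of `length` samples (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    if len(waveform) <= length:
        # Nothing to crop; padding short utterances is handled elsewhere.
        return waveform
    # Valid start offsets are 0 .. len(waveform) - length, inclusive.
    start = rng.integers(0, len(waveform) - length + 1)
    return waveform[start:start + length]

# A ramp makes the chosen offset visible: segment[0] equals the start index.
utterance = np.arange(100_000, dtype=np.float32)
seg_a = random_segment(utterance, rng=np.random.default_rng(0))
seg_b = random_segment(utterance, rng=np.random.default_rng(0))
seg_c = random_segment(utterance, rng=np.random.default_rng(1))
print(len(seg_a), int(seg_a[0]), int(seg_c[0]))
```

With an unseeded generator the start offset differs from run to run, which is why scores computed on random crops are not exactly repeatable.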

See the [training notebook](https://github.com/lab260ru/Res2TCNGuard/blob/main/TCN.ipynb)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=res2tcnguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.50** | 71,237 | 0 | in-domain (same corpus as the training data) |
| ASVspoof2021_DF | test | **17.02** | 611,829 | 0 | cross-dataset generalization |
| ASVspoof2021_LA | test | **13.67** | 181,566 | 0 | cross-dataset generalization |
| CD-ADD | test | **56.10** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **52.52** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |

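The ASVspoof2019_LA result (1.50 %) matches the paper's reported 1.49 % on the LA eval
set to within 0.01 percentage points. ASVspoof2021_DF and ASVspoof2021_LA are
cross-dataset tests (the model was trained only on ASVspoof2019 LA), so higher EERs are
expected; they measure generalization to newer, unseen attacks. On CD-ADD and InTheWild
the EER is above 50 %, which is no better than chance.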
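
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). The sketch below computes it from two score arrays with NumPy. The function name `compute_eer` is hypothetical, and the Arena's own implementation may differ in details such as interpolation between thresholds.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return the EER in percent, assuming higher score = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection: fraction of bona fide scores strictly below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / len(bona)
    # False acceptance: fraction of spoof scores at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / len(spoof)
    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2

# Perfectly separated scores give 0 %; identical score sets give 50 %.
print(compute_eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))
print(compute_eer([0.0, 1.0], [0.0, 1.0]))
```

An EER near 50 % therefore means the scores carry essentially no information about the bona fide vs. spoof label on that dataset.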

## Usage

The checkpoint is a `state_dict` for the `TestModel` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input
**must** be exactly 64,600 samples at 16 kHz mono — the classifier head is
fixed-size — so window the waveform with `pad_fixed` (first 64,600 samples,
tile-repeat if shorter).
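
The windowing rule described above (take the first 64,600 samples, tile-repeat if the clip is shorter) can be sketched in NumPy as follows. This is an illustration of the stated behaviour, not the repository's `pad_fixed`; the name `pad_fixed_sketch` and the input validation are assumptions.

```python
import numpy as np

TARGET_LEN = 64600  # ~4.04 s at 16 kHz

def pad_fixed_sketch(waveform, target_len=TARGET_LEN):
    """Return exactly `target_len` samples: truncate long clips, tile short ones."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")
    if len(waveform) >= target_len:
        # Deterministic window: always the first target_len samples.
        return waveform[:target_len]
    # Repeat the clip enough times to cover target_len, then trim.
    repeats = -(-target_len // len(waveform))  # ceiling division
    return np.tile(waveform, repeats)[:target_len]

short_clip = np.arange(48000, dtype=np.float32)   # 3 s at 16 kHz
long_clip = np.arange(100_000, dtype=np.float32)  # 6.25 s at 16 kHz
windowed_short = pad_fixed_sketch(short_clip)
windowed_long = pad_fixed_sketch(long_clip)
print(windowed_short.shape, windowed_long.shape)
```

Because no random offset is involved, the same file always yields the same window and therefore the same score.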

Score one file from the command line:

```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float>  (higher = more bona fide)
```

Or from Python:

```python
import numpy as np
from evaluate import load_model, score   # _net.py + evaluate.py are in this repo

model = load_model("best_1.495.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32)  # 3 s of noise standing in for float32 mono 16 kHz audio
print(score(model, audio))                          # higher = more bona fide
```

Internally `score` does `_, logits = model(x)` on the windowed input and returns
`logits[:, 1]` (class 1 = bona fide). [`res2tcnguard.py`](./res2tcnguard.py) is the
same logic packaged as a `speech_spoof_bench` model — the exact code that produced
the Arena `scores.txt`.

## Citation

**This model / paper:**

```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology & Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```

**Training dataset — ASVspoof 2019:**

```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```

## License

MIT — see the [source repository](https://github.com/lab260ru/Res2TCNGuard).