---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---
# AASIST
[Speech Anti-Spoofing Arena entry for AASIST](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist)

AASIST is an audio anti-spoofing (voice-deepfake detection) countermeasure from
*"AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention
Networks"* (Jung et al., ICASSP 2022). This is the **official `AASIST` variant**
(not AASIST-L), using the upstream [clovaai/aasist](https://github.com/clovaai/aasist)
ASVspoof2019 LA pretrained checkpoint. The model takes a raw speech waveform and
returns a score where **higher = more bona fide**.
- **Code:** https://github.com/clovaai/aasist
- **Paper:** https://arxiv.org/abs/2110.01200
- **Parameters:** 297,866 (0.298 M)
- **Checkpoint:** [`AASIST.pth`](./AASIST.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), and the exact wrapper used to produce the Arena scores is in
[`aasist.py`](./aasist.py).
## Architecture
AASIST operates directly on the raw waveform: a sinc-convolution front-end and a
RawNet2-style residual encoder produce a spectro-temporal feature map, which is
modelled by heterogeneous stacking graph attention layers over spectral and
temporal sub-graphs with a learnable max/average readout, followed by a 2-class
output (bona fide vs. spoof). The Arena score is the bona-fide logit.
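As a rough illustration of the last step, the sketch below turns a batch of 2-class outputs into Arena-style scores. It assumes that column 1 of the output holds the bona-fide logit, which is how the upstream evaluation code appears to index it; `arena_scores` and `BONAFIDE_INDEX` are hypothetical names used only for this example, not part of the wrapper.

```python
import numpy as np

# Assumption: column 1 is the bona-fide class, column 0 is spoof.
BONAFIDE_INDEX = 1


def arena_scores(logits):
    """Return the raw bona-fide logit for each utterance (no softmax)."""
    logits = np.asarray(logits, dtype=np.float32)
    if logits.ndim != 2 or logits.shape[1] != 2:
        raise ValueError("expected logits of shape (batch, 2)")
    return logits[:, BONAFIDE_INDEX]


# Two toy utterances: the first leans spoof, the second leans bona fide.
logits = np.array([[2.0, -1.0], [-0.5, 3.0]], dtype=np.float32)
scores = arena_scores(logits)  # higher = more bona fide
```

Because the score is a raw logit rather than a probability, it is unbounded; only the ranking of scores matters for EER.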
## Reproducing the Arena scores
Inference uses a deterministic window: the first 64600 samples of each utterance
(about 4.04 s at 16 kHz), with no random crop, matching the upstream
`data_utils.pad()` used at evaluation. Audio must be supplied as float32 mono at
16 kHz; the wrapper does not resample.
```python
from aasist import AASIST

# wav: one utterance as a float32 mono waveform sampled at 16 kHz
m = AASIST()
m.load()
scores = m.score_batch([wav], [16000])  # higher = more bona fide
```
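The deterministic window described above can be sketched as follows. This is a minimal illustration, not the wrapper's code: `fixed_window` is a hypothetical helper, and the handling of clips shorter than 64600 samples (repeat the clip, then truncate) is an assumption about how the upstream `data_utils.pad()` behaves.

```python
import numpy as np

WINDOW = 64600  # samples, about 4.04 s at 16 kHz


def fixed_window(wav, max_len=WINDOW):
    """Return exactly max_len samples, always starting at sample 0."""
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim != 1 or wav.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")
    if wav.shape[0] >= max_len:
        # Long clip: keep the first max_len samples (no random crop).
        return wav[:max_len]
    # Assumption: short clips are repeated end to end, then truncated.
    repeats = max_len // wav.shape[0] + 1
    return np.tile(wav, repeats)[:max_len]


long_clip = np.arange(100000, dtype=np.float32)
short_clip = np.arange(1000, dtype=np.float32)
long_out = fixed_window(long_clip)
short_out = fixed_window(short_clip)
```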
| Dataset | EER % | n_trials |
|---------|------:|---------:|
| ASVspoof2019_LA (in-domain) | 0.83 | 71,237 |
| ASVspoof2021_LA | 12.35 | 181,566 |
| ASVspoof2021_DF | 17.04 | 611,829 |
| InTheWild | 43.01 | 31,779 |
| CD-ADD | 51.05 | 20,786 |

The in-domain ASVspoof2019 LA result reproduces the paper's reported EER (~0.83%).
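For readers unfamiliar with the metric, EER is the operating point at which the false acceptance rate (spoof accepted) equals the false rejection rate (bona fide rejected). The sketch below shows one common way to estimate it from per-trial scores, assuming higher scores mean more bona fide. `compute_eer` is a hypothetical helper written for this example and is not necessarily the scoring code the Arena uses.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate as a fraction in [0, 1]."""
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    scores = np.concatenate([bona, spoof])
    labels = np.concatenate([np.ones(bona.size), np.zeros(spoof.size)])

    # Sort trials by score; a threshold just above trial i rejects trials 0..i.
    labels = labels[np.argsort(scores, kind="mergesort")]

    # FRR: share of bona fide trials rejected at each threshold.
    frr = np.concatenate([[0.0], np.cumsum(labels) / bona.size])
    # FAR: share of spoof trials still accepted at each threshold.
    far = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - labels) / spoof.size])

    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0)


# Toy data: one bona fide trial scores below one spoof trial.
eer = compute_eer([0.1, 0.9, 0.8, 0.7], [0.2, 0.0, -0.1, -0.2])
print(f"EER = {100 * eer:.2f} %")  # 25.00 % for this toy data
```

Multiplying the returned fraction by 100 gives the percentage form used in the table.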
## License
MIT (inherited from clovaai/aasist; see [`LICENSE`](./LICENSE)).