---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---

# AASIST-L

[AASIST-L on the Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist-l)

AASIST-L is the **lightweight variant** of AASIST, an audio anti-spoofing
(voice-deepfake detection) model from *"AASIST: Audio Anti-Spoofing using Integrated
Spectro-Temporal Graph Attention Networks"* (Jung et al., ICASSP 2022). This repo
uses the upstream [clovaai/aasist](https://github.com/clovaai/aasist) `AASIST-L`
checkpoint, pretrained on ASVspoof2019 LA. The model takes a raw speech waveform
and returns a score where **higher = more bona fide**.

- **Code:** https://github.com/clovaai/aasist
- **Paper:** https://arxiv.org/abs/2110.01200
- **Parameters:** 85,306 (0.085 M)
- **Checkpoint:** [`AASIST-L.pth`](./AASIST-L.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py) (identical to the one used for the full AASIST), and the
exact wrapper used to produce the Arena scores is in [`aasist_l.py`](./aasist_l.py).
AASIST-L shares the AASIST architecture but uses narrower residual-stack and
graph dimensions (~85k parameters vs. ~298k).

## Architecture

AASIST operates directly on the raw waveform. A sinc-convolution front-end and a
RawNet2-style residual encoder produce a spectro-temporal feature map. This map is
modelled by heterogeneous stacking graph attention layers over spectral and
temporal sub-graphs with a learnable max/average readout, followed by a 2-class
output (bona fide vs. spoof). The Arena score is the bona-fide logit. The "-L"
variant narrows the residual channels (`…[32,24],[24,24]`) and graph dimensions
(`[24,32]`).

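To make the scoring convention concrete, the sketch below takes a 2-class logit pair and returns the bona-fide logit as the score. It assumes the output is ordered `[spoof, bona fide]` (index 1 is bona fide), which is believed to match the upstream evaluation code but is not stated in this card. The names `arena_score` and `bonafide_probability` are hypothetical and not part of this repo. Because the score is a raw logit, it is unbounded and is not a probability; the softmax is shown only for comparison.

```python
import math

BONAFIDE_INDEX = 1  # assumption: logits are ordered [spoof, bona fide]


def arena_score(logits):
    """Return the raw bona-fide logit (higher = more bona fide)."""
    return logits[BONAFIDE_INDEX]


def bonafide_probability(logits):
    """Softmax probability of the bona-fide class, for comparison only."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    return exps[BONAFIDE_INDEX] / sum(exps)


logits = [-1.5, 2.0]  # made-up example values, not real model output
print(arena_score(logits))  # 2.0
print(bonafide_probability(logits) > 0.5)  # True
```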
## Reproducing the Arena scores

Inference uses a deterministic window of the first 64,600 samples (no random crop),
matching the upstream `data_utils.pad()` used at evaluation. The wrapper expects
float32 mono audio at 16 kHz and does not resample.

```python
from aasist_l import AASIST_L

m = AASIST_L()
m.load()  # load the pretrained checkpoint

# wav: a float32 mono waveform sampled at 16 kHz
scores = m.score_batch([wav], [16000])  # higher = more bona fide
```

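The fixed-window behaviour described above can be sketched as follows. This is a minimal pure-Python illustration, not the wrapper's code: `fixed_window` is a hypothetical name, and the repeat-then-cut handling of clips shorter than 64,600 samples is an assumption based on how the upstream `pad()` helper is understood to work. At 16 kHz, 64,600 samples is about 4 seconds of audio.

```python
def fixed_window(samples, max_len=64600):
    """Return exactly max_len samples, deterministically.

    Long clips keep only their first max_len samples (no random crop).
    Short clips are repeated end to end and then cut to max_len
    (assumed to mirror the upstream data_utils.pad() helper).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("empty waveform")
    if n >= max_len:
        return list(samples[:max_len])
    repeats = max_len // n + 1  # enough copies to cover max_len
    return (list(samples) * repeats)[:max_len]


long_clip = [0.0] * 80000
short_clip = [0.1, 0.2, 0.3]
print(len(fixed_window(long_clip)))         # 64600
print(fixed_window(short_clip, max_len=7))  # [0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1]
```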
| Dataset | EER (%) | Trials |
|---------|--------:|-------:|
| ASVspoof2019_LA (in-domain) | 0.99 | 71,237 |
| ASVspoof2021_LA | 13.15 | 181,566 |
| ASVspoof2021_DF | 15.96 | 611,829 |
| InTheWild | 44.45 | 31,779 |
| CD-ADD | 50.72 | 20,786 |

The in-domain ASVspoof2019 LA result (0.99%) reproduces the AASIST-L EER reported
in the paper. AASIST-L comes close to the full AASIST with roughly 3.5× fewer
parameters (~85k vs. ~298k).

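For readers unfamiliar with the metric, the equal error rate (EER) in the table is the operating point where the rate of spoofed trials accepted equals the rate of bona fide trials rejected. The sketch below is a minimal threshold sweep under the card's convention that higher scores mean more bona fide. The name `compute_eer` is hypothetical, and the Arena's exact routine (for example, whether it interpolates between thresholds) is not specified here, so small numerical differences are possible.

```python
from bisect import bisect_left


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER for scores where higher = more bona fide."""
    bona = sorted(bonafide_scores)
    spoof = sorted(spoof_scores)
    best_gap, eer = float("inf"), None
    # Try every observed score as a threshold: score >= t is accepted.
    for t in sorted(set(bona) | set(spoof)):
        frr = bisect_left(bona, t) / len(bona)  # bona fide rejected
        far = (len(spoof) - bisect_left(spoof, t)) / len(spoof)  # spoof accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores for illustration, not real model output.
bona = [0.9, 0.8, 0.4, 0.7]
spoof = [0.1, 0.5, 0.3, 0.2]
print(compute_eer(bona, spoof))  # 0.25
```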
## License

MIT (inherited from clovaai/aasist; see [`LICENSE`](./LICENSE)).