---
library_name: pytorch
tags:
- audio
- spoofing-detection
- anti-spoofing
- wav2vec2
- aasist
license: apache-2.0
pipeline_tag: audio-classification
model-index:
- name: spectra_aasist
  results:
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof19_LA
      type: ASVspoof19_LA
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 0.159
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof21_LA
      type: ASVspoof21_LA
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 5.164
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof21_DF
      type: ASVspoof21_DF
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 2.568
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof5
      type: ASVspoof5
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 14.056
  - task:
      type: Speech Antispoofing
    dataset:
      name: ADD2022
      type: ADD2022
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 15.205
  - task:
      type: Speech Antispoofing
    dataset:
      name: In-the-Wild
      type: In-the-Wild
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 1.461
  - task:
      type: Speech Antispoofing
    dataset:
      name: AD2R1
      type: AD2R1
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 0.939
  - task:
      type: Speech Antispoofing
    dataset:
      name: AD2R2
      type: AD2R2
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 1.802
  - task:
      type: Speech Antispoofing
    dataset:
      name: AD3R1
      type: AD3R1
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 6.502
  - task:
      type: Speech Antispoofing
    dataset:
      name: AD3R2
      type: AD3R2
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 14.481
---
## Model Card: Spectra-AASIST (anti-spoofing / bonafide vs spoof)
`Spectra-AASIST` is a model for **speech spoofing detection** (binary classification: `bonafide` vs `spoof`) from **raw audio waveforms**. Architecture: SSL encoder (`Wav2Vec2`) → MLP projection → `AASIST` 2-class classifier.
- **Input**: waveform (`float32`), shape `(batch, num_samples)` (typically 16 kHz).
- **Output**: logits of shape `(batch, 2)`, where **index 0 = spoof**, **index 1 = bonafide**.

On first run, the model will automatically download the SSL encoder `facebook/wav2vec2-xls-r-300m` via `transformers`.
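
To make the output convention concrete, here is a minimal sketch of reading a `(batch, 2)` logits tensor. The logits are made-up values standing in for a real model call, and the softmax and argmax steps are illustrative post-processing, not something this card says the model does internally.

```python
import torch

# Hypothetical logits standing in for `model(audio)`; shape (batch, 2).
logits = torch.tensor([[2.0, -1.0], [-0.5, 1.5]])

# Softmax turns each row into two values that sum to 1.
probs = torch.softmax(logits, dim=-1)
p_spoof = probs[:, 0]      # index 0 = spoof
p_bonafide = probs[:, 1]   # index 1 = bonafide

# Argmax is one simple decision rule; it ignores any tuned threshold.
labels = ["spoof" if i == 0 else "bonafide" for i in logits.argmax(dim=-1).tolist()]
print(labels)
```

Argmax picks whichever logit is larger, which is not the same rule as thresholding the bonafide logit described later in this card.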
## Evaluation Results
| Model | ASVspoof19 LA | ASVspoof21 LA | ASVspoof21 DF | ASVspoof5 | ADD2022 | In-the-Wild | AD2R1 | AD2R2 | AD3R1 | AD3R2 |
|-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| [Res2TCNGuard](https://github.com/mtuciru/Res2TCNGuard) | 7.487 | 19.130 | 19.883 | 37.620 | 49.538 | 49.246 | 34.683 | 35.343 | 48.051 | 39.558 |
| [AASIST3](https://huggingface.co/lab260/AASIST3) | 27.585 | 37.407 | 33.099 | 41.001 | 47.192 | 39.626 | 36.581 | 37.351 | 41.333 | 44.278 |
| [XSLS](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF) | 0.231 | 7.714 | 4.220 | 17.688 | 33.951 | 7.453 | 14.386 | 15.743 | 19.368 | 21.095 |
| [TCM-ADD](https://github.com/ductuantruong/tcm_add) | **0.152** | 6.655 | 3.444 | 19.505 | 35.252 | 7.767 | 16.951 | 17.688 | 21.913 | 18.627 |
| [DF Arena 1B](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1) | 43.793 | 40.137 | 42.994 | 35.333 | 42.139 | 17.598 | 12.442 | 13.292 | 33.381 | 43.42 |
| [Spectra-0](https://huggingface.co/lab260/spectra_0) | 0.181 | 6.475 | 5.410 | 14.426 | **14.716** | 1.026 | 1.578 | 2.372 | 6.535 | 15.154 |
| Spectra-AASIST | 0.159 | 5.164 | 2.568 | 14.056 | 15.205 | 1.461 | 0.939 | **1.802** | **6.427** | **12.968** |
| **[Spectra-AASIST3](https://huggingface.co/lab260/Spectra-AASIST3)** | 0.723 | **4.506** | **1.998** | **13.82** | 15.187 | **0.961** | **0.727** | 1.806 | 6.502 | 14.481 |
## Quickstart
### Clone from Hugging Face
This repository is hosted on Hugging Face Hub: `https://huggingface.co/lab260/spectra_aasist`.
```bash
git lfs install
git clone https://huggingface.co/lab260/spectra_aasist
cd spectra_aasist
```
### Install dependencies
```bash
pip install -U torch torchaudio transformers huggingface_hub safetensors soundfile
```
### Single-file inference (example preprocessing)
```python
import random

import soundfile as sf
import torch
import torchaudio

from model import spectra_aasist


def pad_random(x: torch.Tensor, max_len: int = 64600) -> torch.Tensor:
    # x: (num_samples,) or (1, num_samples)
    if x.ndim > 1:
        x = x.squeeze()
    x_len = x.shape[0]
    if x_len >= max_len:
        # Long input: take a random crop of max_len samples.
        start = random.randint(0, x_len - max_len)
        return x[start:start + max_len]
    # Short input: tile the waveform until it covers max_len samples.
    num_repeats = int(max_len / x_len) + 1
    return x.repeat(num_repeats)[:max_len]


def load_audio_mono(path: str) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32")
    audio = torch.from_numpy(audio)
    if audio.ndim > 1:
        # (num_samples, channels) -> mono
        audio = audio.mean(dim=1)
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    return audio


device = "cuda" if torch.cuda.is_available() else "cpu"
model = spectra_aasist.from_pretrained(pretrained_model_name_or_path=".").eval().to(device)

audio = load_audio_mono("path/to/audio.wav")
audio = torchaudio.functional.preemphasis(audio.unsqueeze(0))  # (1, T)
audio = pad_random(audio.squeeze(0), 64600).unsqueeze(0)  # (1, 64600)

with torch.inference_mode():
    logits = model(audio.to(device))  # (1, 2)

score_spoof = logits[0, 0].item()
score_bonafide = logits[0, 1].item()
print({"score_bonafide": score_bonafide, "score_spoof": score_spoof})
```
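
Because `pad_random` crops long recordings at a random offset, repeated runs on the same file can score different segments. One possible alternative, sketched here as an assumption and not taken from this repository, is to cut the waveform into fixed 64600-sample windows and score all of them. `split_windows` is a hypothetical helper, and averaging the per-window logits is just one aggregation choice.

```python
import torch


def split_windows(x: torch.Tensor, win: int = 64600, hop: int = 64600) -> torch.Tensor:
    """Cut a 1-D waveform into windows of `win` samples, tiling a short tail."""
    if x.ndim != 1 or x.numel() == 0:
        raise ValueError("expected a non-empty 1-D waveform")
    windows = []
    for start in range(0, x.shape[0], hop):
        chunk = x[start:start + win]
        if chunk.shape[0] < win:
            # Tile the short chunk, the same repeat-padding idea as `pad_random`.
            repeats = win // chunk.shape[0] + 1
            chunk = chunk.repeat(repeats)[:win]
        windows.append(chunk)
    return torch.stack(windows)  # (num_windows, win)


# Stand-in waveform; in practice this would come from `load_audio_mono`.
batch = split_windows(torch.randn(150_000))
print(batch.shape)

# With a loaded model (assumption: mean of logits as the file-level score):
# with torch.inference_mode():
#     file_logits = model(batch.to(device)).mean(dim=0)
```

Whether averaging logits matches how the reported EER values were obtained is not stated in this card, so treat the aggregation step as something to validate on your own data.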
## Threshold-based classification (and how to tune it)
In `model.py`, the `SpectraAASIST` class provides `classify()` with a **default threshold** chosen as an “optimal” value for the original setting:
- **Default threshold**: `-1.140625` (it thresholds `logit_bonafide = logits[:, 1]`)
- **Note**: this threshold **may not be optimal** on a different dataset/domain. It’s recommended to tune the threshold on your dataset using **EER** (Equal Error Rate) or a target FAR/FRR.

Example:
```python
with torch.inference_mode():
    pred = model.classify(audio.to(device), threshold=-1.140625)  # 1=bonafide, 0=spoof
```
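
To show what thresholding the bonafide logit amounts to, here is a minimal sketch that works directly on a logits tensor. `classify_from_logits` is a hypothetical stand-in, not the repository's `classify()`, and treating a score exactly equal to the threshold as bonafide is an assumption.

```python
import torch

DEFAULT_THRESHOLD = -1.140625  # default value quoted in this card


def classify_from_logits(logits: torch.Tensor, threshold: float = DEFAULT_THRESHOLD) -> torch.Tensor:
    """Hypothetical stand-in for `classify()`: returns 1 = bonafide, 0 = spoof."""
    logit_bonafide = logits[:, 1]
    # Assumption: scores at or above the threshold count as bonafide.
    return (logit_bonafide >= threshold).long()


# Made-up logits, shape (batch, 2): column 0 = spoof, column 1 = bonafide.
logits = torch.tensor([[0.3, -0.2], [1.0, -2.5]])
print(classify_from_logits(logits).tolist())  # [1, 0]
```

Raising the threshold rejects more audio as spoof (fewer false acceptances, more false rejections), and lowering it does the opposite.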
### Tuning the threshold via EER (typical workflow)
1) Run the model on a labeled set and collect scores for both classes.
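2) Compute the EER and the threshold at the EER operating point (where the false acceptance rate equals the false rejection rate), then pass that threshold to `classify()`.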
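
The two steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the evaluation code behind the reported numbers: `compute_eer` is a hypothetical helper, scores are the bonafide logits (`logits[:, 1]`), and a sample is accepted as bonafide when its score is at or above the threshold.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher means more bonafide."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=np.float64)
    spoof_scores = np.asarray(spoof_scores, dtype=np.float64)
    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FRR: bonafide samples rejected (score < t).
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # FAR: spoof samples accepted (score >= t).
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # EER point: the threshold where FAR and FRR are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2
    return float(eer), float(thresholds[idx])


# Made-up bonafide-logit scores for a tiny labeled set.
eer, threshold = compute_eer(
    bonafide_scores=[2.0, 1.5, 0.5, 3.0],
    spoof_scores=[-2.0, -1.0, 1.0, -3.0],
)
print(eer, threshold)  # 0.25 1.0
```

The loop over thresholds is quadratic in the number of scores, which is fine for a small validation set. For large sets, a sorted cumulative-count approach or an ROC utility from a metrics library would be faster.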
## Limitations and notes
- This is a **pre-release** model.
- Significantly stronger models are planned for **Q3–Q4 2026** — stay tuned.
## License
Apache-2.0 (see the `license` field in the model repo header).
## Contacts
- TG channel: https://t.me/korallll_ai
- Email: k.n.borodin@mtuci.ru
- Website: https://lab260.ru/