| --- |
| library_name: pytorch |
| tags: |
| - audio |
| - spoofing-detection |
| - anti-spoofing |
| - wav2vec2 |
| - aasist |
| license: apache-2.0 |
| pipeline_tag: audio-classification |
| model-index: |
| - name: spectra_aasist |
| results: |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: ASVspoof19_LA |
| type: ASVspoof19_LA |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 0.159 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: ASVspoof21_LA |
| type: ASVspoof21_LA |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 5.164 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: ASVspoof21_DF |
| type: ASVspoof21_DF |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 2.568 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: ASVspoof5 |
| type: ASVspoof5 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 14.056 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: ADD2022 |
| type: ADD2022 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 15.205 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: In-the-Wild |
| type: In-the-Wild |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 1.461 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: AD2R1 |
| type: AD2R1 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 0.939 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: AD2R2 |
| type: AD2R2 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 1.802 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: AD3R1 |
| type: AD3R1 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 6.427 |
| - task: |
| type: Speech Antispoofing |
| dataset: |
| name: AD3R2 |
| type: AD3R2 |
| metrics: |
| - name: Equal Error Rate |
| type: Equal Error Rate |
| value: 12.968 |
| |
| --- |
| |
| ## Model Card: Spectra-AASIST (anti-spoofing / bonafide vs spoof) |
|
|
| `Spectra-AASIST` is a model for **speech spoofing detection** (binary classification: `bonafide` vs `spoof`) from **raw audio waveforms**. Architecture: SSL encoder (`Wav2Vec2`) → MLP projection → `AASIST` 2-class classifier. |
|
|
| - **Input**: `float32` waveform of shape `(batch, num_samples)`, expected at 16 kHz (the Quickstart example resamples other rates to 16 kHz). |
| - **Output**: logits of shape `(batch, 2)`, where **index 0 = spoof**, **index 1 = bonafide**. |
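
As a small illustration of the output convention above, the sketch below turns one pair of logits into class probabilities and an argmax label. It is framework-free for clarity: `logits_to_prediction` is a hypothetical helper, not part of this repository, and the logit values are made up. On a real batch, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math

LABELS = ("spoof", "bonafide")  # index 0 = spoof, index 1 = bonafide


def logits_to_prediction(logit_spoof: float, logit_bonafide: float) -> dict:
    """Softmax over one (spoof, bonafide) logit pair (hypothetical helper)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_spoof, logit_bonafide)
    e_spoof = math.exp(logit_spoof - m)
    e_bona = math.exp(logit_bonafide - m)
    total = e_spoof + e_bona
    probs = (e_spoof / total, e_bona / total)
    # Plain argmax label; a tuned threshold on the bonafide logit may decide differently.
    label = LABELS[0] if probs[0] > probs[1] else LABELS[1]
    return {"p_spoof": probs[0], "p_bonafide": probs[1], "label": label}


# Made-up logits; with a real model, logits[0].tolist() yields such a pair.
print(logits_to_prediction(2.0, -1.0))
```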
|
|
| On first run, the model will automatically download the SSL encoder `facebook/wav2vec2-xls-r-300m` via `transformers`. |
|
|
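| ## Evaluation Results |
|
| All values are Equal Error Rate (EER, %); lower is better. The best result per dataset is in bold. |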
|
|
| Model | ASVspoof19 LA | ASVspoof21 LA | ASVspoof21 DF | ASVspoof5 | ADD2022 | In-the-Wild | AD2R1 | AD2R2 | AD3R1 | AD3R2 |
| |-----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------| |
| | [Res2TCNGuard](https://github.com/mtuciru/Res2TCNGuard) | 7.487 | 19.130 | 19.883 | 37.620 | 49.538 | 49.246 | 34.683 | 35.343 | 48.051 | 39.558 | |
| | [AASIST3](https://huggingface.co/lab260/AASIST3) | 27.585 | 37.407 | 33.099 | 41.001 | 47.192 | 39.626 | 36.581 | 37.351 | 41.333 | 44.278 | |
| | [XSLS](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF) | 0.231 | 7.714 | 4.220 | 17.688 | 33.951 | 7.453 | 14.386 | 15.743 | 19.368 | 21.095 | |
| | [TCM-ADD](https://github.com/ductuantruong/tcm_add) | **0.152** | 6.655 | 3.444 | 19.505 | 35.252 | 7.767 | 16.951 | 17.688 | 21.913 | 18.627 | |
| | [DF Arena 1B](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1) | 43.793 | 40.137 | 42.994 | 35.333 | 42.139 | 17.598 | 12.442 | 13.292 | 33.381 | 43.42 | |
| | [Spectra-0](https://huggingface.co/lab260/spectra_0) | 0.181 | 6.475 | 5.410 | 14.426 | **14.716** | 1.026 | 1.578 | 2.372 | 6.535 | 15.154 | |
| | Spectra-AASIST | 0.159 | 5.164 | 2.568 | 14.056 | 15.205 | 1.461 | 0.939 | **1.802** | **6.427** | **12.968** | |
| | **[Spectra-AASIST3](https://huggingface.co/lab260/Spectra-AASIST3)** | 0.723 | **4.506** | **1.998** | **13.82** | 15.187 | **0.961** | **0.727** | 1.806 | 6.502 | 14.481 | |
|
|
|
|
| ## Quickstart |
|
|
| ### Clone from Hugging Face |
|
|
| This repository is hosted on Hugging Face Hub: `https://huggingface.co/lab260/spectra_aasist`. |
|
|
```bash
git lfs install
git clone https://huggingface.co/lab260/spectra_aasist
cd spectra_aasist
```
|
|
| ### Install dependencies |
|
|
```bash
pip install -U torch torchaudio transformers huggingface_hub safetensors soundfile
```
|
|
| ### Single-file inference (example preprocessing) |
|
|
```python
import random
import torch
import torchaudio
import soundfile as sf

from model import spectra_aasist


def pad_random(x: torch.Tensor, max_len: int = 64600) -> torch.Tensor:
    # x: (num_samples,) or (1, num_samples)
    if x.ndim > 1:
        x = x.squeeze()
    x_len = x.shape[0]
    if x_len >= max_len:
        start = random.randint(0, x_len - max_len)
        return x[start:start + max_len]
    num_repeats = int(max_len / x_len) + 1
    return x.repeat(num_repeats)[:max_len]


def load_audio_mono(path: str) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32")
    audio = torch.from_numpy(audio)
    if audio.ndim > 1:
        # (num_samples, channels) -> mono
        audio = audio.mean(dim=1)
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    return audio


device = "cuda" if torch.cuda.is_available() else "cpu"
model = spectra_aasist.from_pretrained(pretrained_model_name_or_path=".").eval().to(device)

audio = load_audio_mono("path/to/audio.wav")
audio = torchaudio.functional.preemphasis(audio.unsqueeze(0))  # (1, T)
audio = pad_random(audio.squeeze(0), 64600).unsqueeze(0)  # (1, 64600)

with torch.inference_mode():
    logits = model(audio.to(device))  # (1, 2)
    score_spoof = logits[0, 0].item()
    score_bonafide = logits[0, 1].item()

print({"score_bonafide": score_bonafide, "score_spoof": score_spoof})
```
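
`pad_random` crops a random segment, so scores for files longer than 64600 samples can vary between runs. One possible alternative, offered as an assumption rather than the authors' evaluation protocol, is to score fixed, evenly spaced windows and average the results. `window_starts` is a hypothetical helper. The 64600-sample window matches the example above, and the hop size is an arbitrary choice.

```python
def window_starts(num_samples: int, win: int = 64600, hop: int = 32300) -> list[int]:
    """Start offsets of fixed-length windows covering a waveform (hypothetical helper)."""
    if win <= 0 or hop <= 0:
        raise ValueError("win and hop must be positive")
    if num_samples <= win:
        # Short clips get a single window; pad (e.g. by repetition) before scoring.
        return [0]
    starts = list(range(0, num_samples - win + 1, hop))
    # Add a final window flush with the end so the tail is not dropped.
    if starts[-1] != num_samples - win:
        starts.append(num_samples - win)
    return starts


# A 10-second clip at 16 kHz (160000 samples):
print(window_starts(160000))  # [0, 32300, 64600, 95400]
```

Each window `audio[s:s + 64600]` can then be passed through the model as in the example above, and the per-window bonafide logits averaged into one file-level score.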
|
|
| ## Threshold-based classification (and how to tune it) |
|
|
| In `model.py`, the `SpectraAASIST` class provides `classify()` with a **default threshold** chosen as an “optimal” value for the original setting: |
|
|
| - **Default threshold**: `-1.140625`, applied to the bonafide logit `logits[:, 1]`. |
| - **Note**: this threshold **may not be optimal** on a different dataset or domain. Tune it on your own data, using either the **EER** (Equal Error Rate) operating point or a target FAR/FRR (false-acceptance / false-rejection rate). |
|
|
| Example: |
|
|
```python
with torch.inference_mode():
    pred = model.classify(audio.to(device), threshold=-1.140625)  # 1=bonafide, 0=spoof
```
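
When a fixed false-acceptance rate (FAR, the fraction of spoofed clips accepted as bonafide) matters more than the EER operating point, a threshold can be derived from spoof scores alone. The sketch below assumes that scores are bonafide logits and that a clip is accepted when its score is at or above the threshold. Whether `classify()` compares with `>=` or `>` is not stated here, so treat that as an assumption. `threshold_for_far` is a hypothetical helper, and the scores are made up.

```python
import math


def threshold_for_far(spoof_scores: list[float], target_far: float) -> float:
    """Smallest threshold whose FAR on the given spoof scores is <= target_far.

    Assumes a clip is accepted as bonafide when score >= threshold.
    """
    if not spoof_scores:
        raise ValueError("need at least one spoof score")
    if not 0.0 <= target_far < 1.0:
        raise ValueError("target_far must be in [0, 1)")
    ordered = sorted(spoof_scores, reverse=True)
    # Number of spoof clips that may be accepted without exceeding the target.
    allowed = math.floor(target_far * len(ordered))
    # Place the threshold just above the (allowed + 1)-th highest spoof score.
    return math.nextafter(ordered[allowed], math.inf)


# Made-up spoof scores for illustration only.
spoof_scores = [-3.0, -2.5, -1.0, 0.5, -4.0]
thr = threshold_for_far(spoof_scores, target_far=0.2)
far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
print(f"threshold = {thr}, FAR = {far:.2f}")
```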
|
|
| ### Tuning the threshold via EER (typical workflow) |
|
|
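| 1) Run the model on a labeled set and collect the bonafide logit (`logits[:, 1]`) for each bonafide and each spoof utterance. |
|
|
| 2) Compute the EER and the threshold at which it is reached, i.e. where the false-acceptance and false-rejection rates are equal, then pass that value as `threshold` to `classify()`. |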
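
The two steps above can be sketched without extra dependencies as follows. This is a minimal illustration, not the evaluation code behind the reported numbers: `compute_eer` is a hypothetical helper, the scores are made up, and a clip is assumed to be accepted as bonafide when its bonafide logit is at or above the threshold.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    FAR: fraction of spoof scores >= threshold (spoof accepted).
    FRR: fraction of bonafide scores < threshold (bonafide rejected).
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    # Every observed score is a candidate threshold.
    for thr in sorted(set(bonafide_scores) | set(spoof_scores)):
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Made-up bonafide logits for illustration only.
bonafide = [2.1, 1.4, 0.3, -0.5, 1.9]
spoof = [-3.0, -2.2, 0.1, -1.7, -4.1]
eer, thr = compute_eer(bonafide, spoof)
print(f"EER = {eer:.3f}, threshold = {thr}")
```

The returned threshold can be passed to `model.classify(..., threshold=thr)`. The loop is quadratic in the number of scores; for large sets, a sorted sweep or `sklearn.metrics.roc_curve` is the usual choice.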
|
|
| ## Limitations and notes |
|
|
| - This is a **pre-release** model. |
| - Significantly stronger models are planned for **Q3–Q4 2026** — stay tuned. |
|
|
| ## License |
|
|
| Apache-2.0 (see the `license` field in the model repo header). |
|
|
| ## Contacts |
|
|
| - Telegram channel: https://t.me/korallll_ai |
| - Email: k.n.borodin@mtuci.ru |
| - Website: https://lab260.ru/ |
| |