---
library_name: pytorch
tags:
- audio
- spoofing-detection
- anti-spoofing
- wav2vec2
- ecapa-tdnn
license: apache-2.0
pipeline_tag: audio-classification
model-index:
- name: spectra_0
  results:
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof19_LA
      type: ASVspoof19_LA
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 0.181
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof21_LA
      type: ASVspoof21_LA
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 6.475
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof21_DF
      type: ASVspoof21_DF
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 5.41
  - task:
      type: Speech Antispoofing
    dataset:
      name: ASVspoof5
      type: ASVspoof5
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 14.426
  - task:
      type: Speech Antispoofing
    dataset:
      name: ADD2022
      type: ADD2022
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 14.716
  - task:
      type: Speech Antispoofing
    dataset:
      name: In-the-Wild
      type: In-the-Wild
    metrics:
    - name: Equal Error Rate
      type: Equal Error Rate
      value: 1.026
---

## Model Card: Spectra-0 (anti-spoofing / bonafide vs spoof)

`Spectra-0` is a model for **speech spoofing detection** (binary classification: `bonafide` vs `spoof`) from **raw audio waveforms**. Architecture: SSL encoder (`Wav2Vec2`) → MLP projection → `ECAPA-TDNN` 2-class classifier.

- **Input**: `float32` waveform, shape `(batch, num_samples)` (typically 16 kHz).
- **Output**: logits of shape `(batch, 2)`, where **index 0 = spoof**, **index 1 = bonafide**.
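
As a minimal sketch of reading that output layout, the snippet below turns a `(batch, 2)` logits tensor into per-class probabilities and string labels. The logits are made-up values for illustration, not model output. The softmax and argmax decision rule is an assumption for this example and is not necessarily how the repository's own `classify()` decides.

```python
import torch

# Hypothetical logits for a batch of two clips; columns are (spoof, bonafide).
logits = torch.tensor([[2.0, -1.0],
                       [-0.5, 3.0]])

# Softmax over the class axis gives per-class probabilities (each row sums to 1).
probs = torch.softmax(logits, dim=-1)
p_spoof, p_bonafide = probs[:, 0], probs[:, 1]

# Index 0 = spoof, index 1 = bonafide, as stated in the output description.
class_names = ["spoof", "bonafide"]
labels = [class_names[i] for i in logits.argmax(dim=-1).tolist()]

print(labels)
print(p_bonafide.tolist())
```

Argmax is equivalent to comparing the two logits directly, so it ignores any tuned operating point. A thresholded score is usually preferable when false acceptances and false rejections have different costs.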

On first run, the model will automatically download the SSL encoder `facebook/wav2vec2-xls-r-300m` via `transformers`.

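## Evaluation Results

Equal Error Rate (EER, in percent) on each benchmark. Lower is better, and the best result in each column is in bold.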

| Model     | ASVspoof19 LA | ASVspoof21 LA | ASVspoof21 DF | ASVspoof5 | ADD2022  | In-the-Wild |
|-----------|--------|--------|--------|--------|--------|--------|
| [Res2TCNGuard](https://github.com/mtuciru/Res2TCNGuard)      | 7.487  | 19.130 | 19.883 | 37.620 | 49.538 | 49.246 |
| [AASIST3](https://huggingface.co/MTUCI/AASIST3)    | 27.585 | 37.407 | 33.099 | 41.001 | 47.192 | 39.626 | 
| [XSLS](https://github.com/QiShanZhang/SLSforASVspoof-2021-DF)      | 0.231  | 7.714  | 4.220  | 17.688 | 33.951 | 7.453 | 
| [TCM-ADD](https://github.com/ductuantruong/tcm_add)       | **0.152** | 6.655  | **3.444** | 19.505 | 35.252 | 7.767 |
| [DF Arena 1B](https://huggingface.co/Speech-Arena-2025/DF_Arena_1B_V_1)    | 43.793 | 40.137 | 42.994 | 35.333 | 42.139 | 17.598 |
| **Spectra-0** | 0.181  | **6.475** | 5.410  | **14.426** | **14.716** | **1.026** |

## Quickstart

### Clone from Hugging Face

This repository is hosted on Hugging Face Hub: `https://huggingface.co/MTUCI/spectra_0`.

```bash
git lfs install
git clone https://huggingface.co/MTUCI/spectra_0
cd spectra_0
```

### Install dependencies

```bash
pip install -U torch torchaudio transformers huggingface_hub safetensors soundfile
```

### Single-file inference (example preprocessing)

```python
import random
import torch
import torchaudio
import soundfile as sf

from model import spectra_0


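def pad_random(x: torch.Tensor, max_len: int = 64600) -> torch.Tensor:
    """Crop a random window of max_len samples, or tile short audio up to max_len."""
    # x: (num_samples,) or (1, num_samples)
    if x.ndim > 1:
        x = x.squeeze()
    x_len = x.shape[0]
    if x_len >= max_len:
        # Long input: take a random crop (scores can vary between runs unless seeded).
        start = random.randint(0, x_len - max_len)
        return x[start:start + max_len]
    # Short input: repeat the waveform and truncate to max_len.
    num_repeats = max_len // x_len + 1
    return x.repeat(num_repeats)[:max_len]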


def load_audio_mono(path: str) -> torch.Tensor:
    audio, sr = sf.read(path, dtype="float32")
    audio = torch.from_numpy(audio)
    if audio.ndim > 1:
        # (num_samples, channels) -> mono
        audio = audio.mean(dim=1)
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    return audio


device = "cuda" if torch.cuda.is_available() else "cpu"
model = spectra_0.from_pretrained(pretrained_model_name_or_path=".").eval().to(device)

audio = load_audio_mono("path/to/audio.wav")
audio = torchaudio.functional.preemphasis(audio.unsqueeze(0))  # (1, T)
audio = pad_random(audio.squeeze(0), 64600).unsqueeze(0)       # (1, 64600)

with torch.inference_mode():
    logits = model(audio.to(device))  # (1, 2)
    score_spoof = logits[0, 0].item()
    score_bonafide = logits[0, 1].item()

print({"score_bonafide": score_bonafide, "score_spoof": score_spoof})
```

## Threshold-based classification (and how to tune it)

In `model.py`, the `Spectra0Model` class provides `classify()` with a **default threshold** chosen as an “optimal” value for the original setting:

- **Default threshold**: `-1.0625009` (it thresholds `logit_bonafide = logits[:, 1]`)
- **Note**: this threshold **may not be optimal** on a different dataset/domain. It’s recommended to tune the threshold on your dataset using **EER** (Equal Error Rate) or a target FAR/FRR.

Example:

```python
with torch.inference_mode():
    pred = model.classify(audio.to(device), threshold=-1.0625009)  # 1=bonafide, 0=spoof
```
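
For readers who want to apply a custom threshold to raw logits, the rule described above can be sketched as follows. `classify_by_threshold` is a hypothetical helper written for this example, not a copy of the code in `model.py`. In particular, the strict `>` comparison is an assumption.

```python
import torch


def classify_by_threshold(logits: torch.Tensor, threshold: float = -1.0625009) -> torch.Tensor:
    """Return 1 (bonafide) where the bonafide logit exceeds the threshold, else 0 (spoof)."""
    logit_bonafide = logits[:, 1]  # column 1 holds the bonafide score
    return (logit_bonafide > threshold).long()


# Made-up logits: the first clip scores below the threshold, the second above it.
example_logits = torch.tensor([[0.0, -2.0],
                               [0.0, 0.5]])
print(classify_by_threshold(example_logits).tolist())
```

Raising the threshold rejects more audio as spoof (fewer false acceptances, more false rejections), and lowering it does the opposite.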

### Tuning the threshold via EER (typical workflow)

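1) Run the model on a labeled set and collect the bonafide scores (`logits[:, 1]`) separately for the bonafide and the spoof utterances.

2) Compute the EER and take the threshold at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal.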
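
The two steps above can be sketched with NumPy as shown below. `compute_eer` is a hypothetical helper, not part of this repository. It assumes that a higher score means "more bonafide" and that a sample is accepted when `score >= threshold`. The scores in the usage lines are made up.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), with eer as a fraction in [0, 1]."""
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)

    # Candidate thresholds: every observed score (np.unique returns them sorted).
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # FRR: bonafide samples rejected (score < t).
    frr = np.array([(bona < t).mean() for t in thresholds])
    # FAR: spoof samples accepted (score >= t).
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER point is where FAR and FRR are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


# Made-up scores for four bonafide and four spoof utterances.
eer, threshold = compute_eer([2.0, 1.5, 0.5, 3.0], [-1.0, 0.0, 1.0, -2.0])
print(f"EER = {eer * 100:.1f}%, threshold = {threshold}")
```

The returned threshold can then be passed as the `threshold` argument of `classify()`. This sweep is quadratic in the number of scores. For large evaluation sets, an ROC-based routine such as `sklearn.metrics.roc_curve` is a common alternative.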

## Limitations and notes

- This is a **pre-release** model.
- Significantly stronger models are planned for **Q3–Q4 2026** — stay tuned.

## License

Apache-2.0 (see the `license` field in the model repo header).

## Contacts

- TG channel: https://t.me/korallll_ai
- Email: k.n.borodin@mtuci.ru
- Website: https://lab260.ru/

## Benchmarks on Papers with Code

```
@misc{wang2020asvspoof2019largescalepublic,
      title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech}, 
      author={Xin Wang and Junichi Yamagishi and Massimiliano Todisco and Hector Delgado and Andreas Nautsch and Nicholas Evans and Md Sahidullah and Ville Vestman and Tomi Kinnunen and Kong Aik Lee and Lauri Juvela and Paavo Alku and Yu-Huai Peng and Hsin-Te Hwang and Yu Tsao and Hsin-Min Wang and Sebastien Le Maguer and Markus Becker and Fergus Henderson and Rob Clark and Yu Zhang and Quan Wang and Ye Jia and Kai Onuma and Koji Mushika and Takashi Kaneda and Yuan Jiang and Li-Juan Liu and Yi-Chiao Wu and Wen-Chin Huang and Tomoki Toda and Kou Tanaka and Hirokazu Kameoka and Ingmar Steiner and Driss Matrouf and Jean-Francois Bonastre and Avashna Govender and Srikanth Ronanki and Jing-Xuan Zhang and Zhen-Hua Ling},
      year={2020},
      eprint={1911.01601},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/1911.01601}, 
}
@article{210900535,
  title={{ASVspoof 2021: Automatic Speaker Verification Spoofing and  …}},
  author={{}},
  year={{2021}},
  eprint={{2109.00535}},
  archivePrefix={{arXiv}}
}
@misc{wang2024asvspoof5crowdsourcedspeech,
      title={ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale}, 
      author={Xin Wang and Hector Delgado and Hemlata Tak and Jee-weon Jung and Hye-jin Shim and Massimiliano Todisco and Ivan Kukanov and Xuechen Liu and Md Sahidullah and Tomi Kinnunen and Nicholas Evans and Kong Aik Lee and Junichi Yamagishi},
      year={2024},
      eprint={2408.08739},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2408.08739}, 
}
@misc{yi2024add2022audiodeep,
      title={ADD 2022: the First Audio Deep Synthesis Detection Challenge}, 
      author={Jiangyan Yi and Ruibo Fu and Jianhua Tao and Shuai Nie and Haoxin Ma and Chenglong Wang and Tao Wang and Zhengkun Tian and Xiaohui Zhang and Ye Bai and Cunhang Fan and Shan Liang and Shiming Wang and Shuai Zhang and Xinrui Yan and Le Xu and Zhengqi Wen and Haizhou Li and Zheng Lian and Bin Liu},
      year={2024},
      eprint={2202.08433},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2202.08433}, 
}
@article{220316263,
  title={{Does Audio Deepfake Detection Generalize?}},
  author={{Nicolas M. Müller et al.}},
  year={{2022}},
  eprint={{2203.16263}},
  archivePrefix={{arXiv}}
}
```