---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: cc-by-nc-nd-4.0
datasets:
- mueller91/MLAAD
- jungjee/asvspoof5
- Bisher/ASVspoof_2019_LA
language:
- en
pipeline_tag: audio-classification
---
# AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection
⚠️ **Deprecation Notice**: This model is outdated and no longer maintained.
Please use the updated version: **[lab260/Spectra-AASIST3](https://huggingface.co/lab260/Spectra-AASIST3)** for improved performance and support.
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-blue)](https://huggingface.co/MTUCI/AASIST3)
[![License](https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-red.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
## 🛡️ Speech Anti-Spoofing Arena
Independently re-scored on the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3) (**EER %, lower is better**; the model returns a score where higher = more bona fide):
[![EER% 9.44 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-9.44%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 28.73 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-28.73%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 29.72 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-29.72%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 30.73 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-30.73%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 32.06 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-32.06%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 23.24 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-23.24%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 31.82 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-31.82%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 33.13 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-33.13%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 44.23 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-44.23%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 34.59 on ASVspoof5](https://img.shields.io/badge/EER%25%20on%20ASVspoof5-34.59%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 35.91 on DeepVoice](https://img.shields.io/badge/EER%25%20on%20DeepVoice-35.91%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 36.91 on ArAD](https://img.shields.io/badge/EER%25%20on%20ArAD-36.91%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 23.58 on DECRO](https://img.shields.io/badge/EER%25%20on%20DECRO-23.58%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 16.16 on J-SPAW_LA](https://img.shields.io/badge/EER%25%20on%20J--SPAW__LA-16.16%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 39.04 on ODSS](https://img.shields.io/badge/EER%25%20on%20ODSS-39.04%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 27.17 on HABLA](https://img.shields.io/badge/EER%25%20on%20HABLA-27.17%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 1.4 on DFADD](https://img.shields.io/badge/EER%25%20on%20DFADD-1.4%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 26.56 on PyAra](https://img.shields.io/badge/EER%25%20on%20PyAra-26.56%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 28.84 on XMAD](https://img.shields.io/badge/EER%25%20on%20XMAD-28.84%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![1-SRR% 2.96 on LRLspoof](https://img.shields.io/badge/1--SRR%25%20on%20LRLspoof-2.96%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 24.69 on ADD22_eval_31](https://img.shields.io/badge/EER%25%20on%20ADD22__eval__31-24.69%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 44.12 on ADD2023_track12_test_r1](https://img.shields.io/badge/EER%25%20on%20ADD2023__track12__test__r1-44.12%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![EER% 7.31 on EmoFake_test](https://img.shields.io/badge/EER%25%20on%20EmoFake__test-7.31%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![1-SRR% 18.61 on EmoSpoofTTS](https://img.shields.io/badge/1--SRR%25%20on%20EmoSpoofTTS-18.61%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/aasist3/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/aasist3/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
| Dataset | EER % | Trials |
|---|---|---|
| ASVspoof2019_LA | **9.44** | 71,237 |
| ASVspoof2021_DF | **28.73** | 611,829 |
| InTheWild | **29.72** | 31,779 |
| CD-ADD | **30.73** | 20,786 |
| ASVspoof2021_LA | **32.06** | 181,566 |
> Scores produced with the `speech-spoof-bench` wrapper: preemphasis (0.97) + a deterministic first-64,600-sample window; score = output logit for class 1 (bona fide). Pinned score files live under [`.eval_results/`](./tree/main/.eval_results).
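The preprocessing named in the note can be sketched without any dependencies. This is an illustration only, not the `speech-spoof-bench` code. The helper names `preemphasize` and `first_window` are hypothetical, the filter form y[n] = x[n] - 0.97 * x[n-1] is the conventional pre-emphasis definition and is assumed here, and both the order of the two steps and the zero-padding of short clips are assumptions.

```python
from typing import List, Sequence

WINDOW = 64_600  # samples, i.e. about 4 s at 16 kHz


def preemphasize(samples: Sequence[float], coeff: float = 0.97) -> List[float]:
    """Conventional pre-emphasis: y[0] = x[0], y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [float(samples[0])]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def first_window(samples: Sequence[float], length: int = WINDOW) -> List[float]:
    """Keep the first `length` samples; zero-pad shorter clips (assumed behavior)."""
    head = list(samples[:length])
    return head + [0.0] * (length - len(head))


if __name__ == "__main__":
    clip = [1.0, 1.0, 0.5, 0.0]  # toy waveform, not real audio
    window = first_window(preemphasize(clip))
    print(len(window), window[:4])
```

Because the window always starts at sample 0, the same file yields the same input on every run, which is what makes the scores reproducible.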
This repository contains the original implementation of **AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge**.
## Paper
**AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge**
*This is the original implementation of the paper. The model weights provided here are NOT the same weights used in the paper results.*
## Overview
AASIST3 is an enhanced version of the AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks) architecture that incorporates **Kolmogorov-Arnold Networks (KAN)** for improved speech deepfake detection. The model leverages:
- **Self-Supervised Learning (SSL) Features**: Uses a Wav2Vec2 encoder for robust audio representation
- **KAN Linear Layers**: Kolmogorov-Arnold Networks for enhanced feature transformation
- **Graph Attention Networks (GAT)**: For spectral and temporal feature modeling
- **Multi-branch Inference**: Multiple inference branches for robust decision making
## Architecture
The AASIST3 model consists of several key components:
1. **Wav2Vec2 Encoder**: Extracts SSL features from raw audio
2. **KAN Bridge**: Transforms SSL features using Kolmogorov-Arnold Networks
3. **Residual Encoder**: Processes features through multiple residual blocks
4. **Graph Attention Networks**:
   - GAT-S: Spectral attention mechanism
   - GAT-T: Temporal attention mechanism
5. **Multi-branch Inference**: Four parallel inference branches with master tokens
6. **KAN Output Layer**: Final classification using KAN linear layers
### Key Innovations
- **KAN Integration**: Replaces traditional linear layers with KAN linear layers for better feature approximation
- **Enhanced Regularization**: Additional dropout and regularization techniques
- **Multi-dataset Training**: Trained on a combination of ASVspoof and multilingual speech datasets for robustness
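To make the idea of a KAN linear layer concrete: where an ordinary linear layer computes a weighted sum of its inputs, a KAN layer sums a learnable univariate function of each input. The sketch below is dependency-free and purely illustrative; it is not the layer used in this repository. The names `kan_linear`, `base_w`, and `basis_w` are hypothetical, and Gaussian bumps stand in for the B-spline basis of the original KAN formulation to keep the example short.

```python
import math
from typing import List, Sequence


def silu(x: float) -> float:
    """SiLU activation, used as the fixed base function of each edge."""
    return x / (1.0 + math.exp(-x))


def gaussian_basis(x: float, centers: Sequence[float], width: float) -> List[float]:
    """Gaussian bumps evaluated at x (a stand-in for B-splines)."""
    return [math.exp(-(((x - c) / width) ** 2)) for c in centers]


def kan_linear(inputs, base_w, basis_w, centers, width=1.0):
    """out[j] = sum_i phi_ji(x_i), where
    phi_ji(x) = base_w[j][i] * silu(x) + sum_k basis_w[j][i][k] * g_k(x)."""
    outputs = []
    for j in range(len(base_w)):
        total = 0.0
        for i, x in enumerate(inputs):
            bumps = gaussian_basis(x, centers, width)
            learned = sum(w * b for w, b in zip(basis_w[j][i], bumps))
            total += base_w[j][i] * silu(x) + learned
        outputs.append(total)
    return outputs


if __name__ == "__main__":
    centers = [-1.0, 0.0, 1.0]
    x = [0.0, 1.0]                                  # two input features
    base_w = [[1.0, 1.0]]                           # one output unit
    basis_w = [[[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]]  # per-edge basis weights
    print(kan_linear(x, base_w, basis_w, centers))
```

In a trained layer, `base_w` and `basis_w` would be the learned parameters, so each input-output edge carries its own nonlinearity rather than a single scalar weight.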
## 🚀 Quick Start
### Installation
```bash
git clone https://github.com/mtuciru/AASIST3.git
cd AASIST3
pip install -r requirements.txt
```
### Loading the Model
```python
from model import aasist3
# Load the model from Hugging Face Hub
model = aasist3.from_pretrained("MTUCI/AASIST3")
model.eval()
```
### Basic Usage
```python
import torch
import torchaudio

# Load and preprocess audio
audio, sr = torchaudio.load("audio_file.wav")

# Ensure audio is 16 kHz and mono
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
if audio.shape[0] > 1:
    audio = torch.mean(audio, dim=0, keepdim=True)

# Prepare input (model expects ~4 seconds of audio at 16 kHz)
# Pad or truncate to 64600 samples
if audio.shape[1] < 64600:
    audio = torch.nn.functional.pad(audio, (0, 64600 - audio.shape[1]))
else:
    audio = audio[:, :64600]

# Run inference (`model` is the instance loaded in the previous section)
with torch.no_grad():
    output = model(audio)
    probabilities = torch.softmax(output, dim=1)
    prediction = torch.argmax(probabilities, dim=1)

# Class indices: 0 = spoof, 1 = bona fide (class 1 is the logit used for the Arena scores)
print(f"Prediction: {'Bonafide' if prediction.item() == 1 else 'Spoof'}")
print(f"Confidence: {probabilities.max().item():.3f}")
```
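If a continuous score is more useful than a hard label, the two output logits can be turned into a bona fide probability. The sketch below is dependency-free and assumes, following the Arena scoring note above, that index 1 of the output is the bona fide class; `bonafide_probability` is a hypothetical helper name.

```python
import math
from typing import Sequence


def bonafide_probability(logits: Sequence[float], bonafide_index: int = 1) -> float:
    """Numerically stable softmax over the class logits, read at the bona fide index."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    return exps[bonafide_index] / sum(exps)


if __name__ == "__main__":
    # Hypothetical logits for one clip, ordered [spoof, bona fide]
    print(round(bonafide_probability([-1.2, 2.3]), 3))
```

With the PyTorch output from the snippet above, `output[0].tolist()` should give such a list, assuming the output has shape `(batch, 2)`.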
## Training Details
### Datasets Used
The model was trained on a combination of multiple datasets:
- **ASVspoof 2019 LA** (Logical Access)
- **ASVspoof 2024 (ASVspoof5)**
- **MLAAD** (Multi-Language Audio Anti-Spoofing Dataset)
- **M-AILABS** (Multi-Language Audio Dataset)
### Training Configuration
- **Epochs**: 20
- **Batch Size**: 12 (training), 24 (validation)
- **Learning Rate**: 1e-4
- **Optimizer**: AdamW
- **Loss Function**: CrossEntropyLoss
- **Gradient Accumulation Steps**: 2
### Hardware
- **GPUs**: 2xA100 40GB
- **Framework**: PyTorch with Accelerate for distributed training
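Taken together, the configuration and hardware above imply an effective batch size per optimizer step. The arithmetic below assumes that the training batch size of 12 is per GPU, which is the default behavior with Accelerate, where each process draws its own dataloader batches. If 12 is instead the global batch size, drop the GPU factor.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device * accumulation_steps * num_devices


# 12 samples per GPU, 2 accumulation steps, 2 GPUs
print(effective_batch_size(12, 2, 2))  # 48
```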
## Advanced Usage
### Custom Training
```bash
# Train the model
bash train.sh
```
### Validation
```bash
# Run validation on test sets
bash validate.sh
```
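The contents of `validate.sh` are not reproduced here. If you want to score your own outputs with the metric used in the Arena table (EER), it can be approximated from per-utterance scores and labels as below. This is a dependency-free sketch, not the Arena's or the repository's evaluation code. `compute_eer` is a hypothetical name, scores follow the "higher = more bona fide" convention, and the EER is read at the threshold where the false acceptance and false rejection rates are closest.

```python
from typing import Sequence, Tuple


def compute_eer(scores: Sequence[float], is_bonafide: Sequence[bool]) -> Tuple[float, float]:
    """Return (eer, threshold); a trial is accepted when score >= threshold."""
    trials = sorted(zip(scores, is_bonafide))
    n_bona = sum(1 for _, b in trials if b)
    n_spoof = len(trials) - n_bona
    if n_bona == 0 or n_spoof == 0:
        raise ValueError("need at least one bona fide and one spoof trial")

    rejected_bona = 0         # bona fide trials with score < threshold
    accepted_spoof = n_spoof  # spoof trials with score >= threshold
    best_gap, eer, threshold = float("inf"), 1.0, float("-inf")
    i = 0
    while True:
        thr = trials[i][0] if i < len(trials) else float("inf")
        far = accepted_spoof / n_spoof  # false acceptance rate
        frr = rejected_bona / n_bona    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer, threshold = abs(far - frr), (far + frr) / 2, thr
        if i == len(trials):
            break
        # Move every trial tied at this score below the next threshold.
        while i < len(trials) and trials[i][0] == thr:
            if trials[i][1]:
                rejected_bona += 1
            else:
                accepted_spoof -= 1
            i += 1
    return eer, threshold


if __name__ == "__main__":
    # Toy scores and labels, for illustration only
    eer, thr = compute_eer([0.9, 0.4, 0.6, 0.1], [True, True, False, False])
    print(f"EER: {100 * eer:.2f}% at threshold {thr}")
```

The single sorted sweep keeps the cost at O(n log n), which matters for test sets with hundreds of thousands of trials.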
### Model Configuration
The model can be configured through the `configs/train.yaml` file:
```yaml
# Key parameters
num_epochs: 20
train_batch_size: 12
val_batch_size: 24
learning_rate: 1e-4
gradient_accumulation_steps: 2
```
## 🤝 Citation
If you use this implementation in your research, please cite the original paper:
```bibtex
@inproceedings{borodin24_asvspoof,
title = {AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge},
author = {Kirill Borodin and Vasiliy Kudryavtsev and Dmitrii Korzh and Alexey Efimenko and Grach Mkrtchian and Mikhail Gorodnichev and Oleg Y. Rogov},
year = {2024},
booktitle = {The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024)},
pages = {48--55},
doi = {10.21437/ASVspoof.2024-8},
}
```
## License
This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0) - see the [LICENSE](LICENSE) file for details.
This license allows you to:
- **Share**: Copy and redistribute the material in any medium or format

Under the following terms:
- **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made
- **NonCommercial**: You may not use the material for commercial purposes
- **NoDerivatives**: You may not distribute modified versions of the material
For more information, visit: https://creativecommons.org/licenses/by-nc-nd/4.0/
**Disclaimer**: This is a research implementation. The model weights provided are for demonstration purposes and may not match the exact performance reported in the paper.
## Contact
- Email: kborodin.research@gmail.com
- Telegram: [@korallll_ai](https://t.me/korallll_ai)