	---
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	license: cc-by-nc-4.0
	datasets:
	- mueller91/MLAAD
	- jungjee/asvspoof5
	- Bisher/ASVspoof_2019_LA
	language:
	- en
	pipeline_tag: audio-classification
	---

	# AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection

	⚠️ Deprecation Notice: This model is outdated and no longer maintained.
	Please use the updated version: [lab260/Spectra-AASIST3](https://huggingface.co/lab260/Spectra-AASIST3) for improved performance and support.

	[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-blue)](https://huggingface.co/MTUCI/AASIST3)
	[![License](https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-red.svg)](https://creativecommons.org/licenses/by-nc-nd/4.0/)

	## 🛡️ Speech Anti-Spoofing Arena

	Independently re-scored on the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3) (EER %, lower is better; the model returns a score where higher = more bona fide):

	[![EER% 9.44 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-9.44%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 28.73 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-28.73%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 29.72 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-29.72%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 30.73 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-30.73%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 32.06 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-32.06%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 23.24 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-23.24%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 31.82 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-31.82%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 33.13 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-33.13%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 44.23 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-44.23%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 34.59 on ASVspoof5](https://img.shields.io/badge/EER%25%20on%20ASVspoof5-34.59%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 35.91 on DeepVoice](https://img.shields.io/badge/EER%25%20on%20DeepVoice-35.91%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 36.91 on ArAD](https://img.shields.io/badge/EER%25%20on%20ArAD-36.91%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 23.58 on DECRO](https://img.shields.io/badge/EER%25%20on%20DECRO-23.58%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 16.16 on J-SPAW_LA](https://img.shields.io/badge/EER%25%20on%20J--SPAW__LA-16.16%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 39.04 on ODSS](https://img.shields.io/badge/EER%25%20on%20ODSS-39.04%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 27.17 on HABLA](https://img.shields.io/badge/EER%25%20on%20HABLA-27.17%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 1.4 on DFADD](https://img.shields.io/badge/EER%25%20on%20DFADD-1.4%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 26.56 on PyAra](https://img.shields.io/badge/EER%25%20on%20PyAra-26.56%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 28.84 on XMAD](https://img.shields.io/badge/EER%25%20on%20XMAD-28.84%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![1-SRR% 2.96 on LRLspoof](https://img.shields.io/badge/1--SRR%25%20on%20LRLspoof-2.96%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 24.69 on ADD22_eval_31](https://img.shields.io/badge/EER%25%20on%20ADD22__eval__31-24.69%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 44.12 on ADD2023_track12_test_r1](https://img.shields.io/badge/EER%25%20on%20ADD2023__track12__test__r1-44.12%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![EER% 7.31 on EmoFake_test](https://img.shields.io/badge/EER%25%20on%20EmoFake__test-7.31%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![1-SRR% 18.61 on EmoSpoofTTS](https://img.shields.io/badge/1--SRR%25%20on%20EmoSpoofTTS-18.61%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/aasist3/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)
	[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/aasist3/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=aasist3)

	| Dataset | EER % | Trials |
	|---|---|---|
	| ASVspoof2019_LA | 9.44 | 71,237 |
	| ASVspoof2021_DF | 28.73 | 611,829 |
	| InTheWild | 29.72 | 31,779 |
	| CD-ADD | 30.73 | 20,786 |
	| ASVspoof2021_LA | 32.06 | 181,566 |
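
For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate (EER) can be computed from two lists of scores, assuming higher scores mean "more bona fide" as stated above. The function name `compute_eer` and the threshold-sweep approach are illustrative assumptions; the Arena's own scoring code may differ in details such as interpolation.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return the EER in percent (illustrative helper, not the Arena's code).

    Assumes higher score = more bona fide.
    """
    bona = np.sort(np.asarray(bonafide_scores, dtype=float))
    spoof = np.sort(np.asarray(spoof_scores, dtype=float))
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # False rejection rate: fraction of bona fide trials scoring below t.
    frr = np.searchsorted(bona, thresholds, side="left") / len(bona)
    # False acceptance rate: fraction of spoof trials scoring at or above t.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / len(spoof)

    # Pick the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(frr - far))
    return 100.0 * (frr[idx] + far[idx]) / 2.0
```

Perfectly separated scores give an EER of 0 %, while scores that carry no information about the class give roughly 50 %.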

	> Scores produced with the `speech-spoof-bench` wrapper: preemphasis (0.97) + a deterministic first-64,600-sample window; score = output logit for class 1 (bona fide). Pinned score files live under [`.eval_results/`](./tree/main/.eval_results).
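
The scoring recipe in the note above can be sketched as follows. This is a hedged reconstruction, not the `speech-spoof-bench` wrapper itself: the helper names are hypothetical, applying pre-emphasis before windowing is an assumption, and zero-padding short clips is also an assumption since the note does not say how they are handled.

```python
import torch


def preemphasis(wave, coeff=0.97):
    # y[0] = x[0]; y[t] = x[t] - coeff * x[t-1] (first-sample handling is an assumption)
    return torch.cat([wave[..., :1], wave[..., 1:] - coeff * wave[..., :-1]], dim=-1)


def first_window(wave, num_samples=64600):
    # Deterministic window: always the first `num_samples` samples.
    if wave.shape[-1] >= num_samples:
        return wave[..., :num_samples]
    # Assumption: clips shorter than the window are zero-padded on the right.
    return torch.nn.functional.pad(wave, (0, num_samples - wave.shape[-1]))


def bonafide_score(model, wave):
    """Return the class-1 logit (higher = more bona fide) for a (batch, samples) tensor."""
    x = first_window(preemphasis(wave))
    with torch.no_grad():
        logits = model(x)  # expected shape: (batch, 2)
    return logits[:, 1]
```

Any callable that maps a `(batch, 64600)` tensor to `(batch, 2)` logits can stand in for `model` when trying this out.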


	This repository contains the original implementation of AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge.

	## Paper

	AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

	This repository is the original implementation of the paper. The model weights provided here are NOT the weights used to produce the paper's results.

	## Overview

	AASIST3 is an enhanced version of the AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks) architecture that incorporates Kolmogorov-Arnold Networks (KAN) for improved speech deepfake detection. The model leverages:

	- Self-Supervised Learning (SSL) Features: Uses Wav2Vec2 encoder for robust audio representation
	- KAN Linear Layers: Kolmogorov-Arnold Networks for enhanced feature transformation
	- Graph Attention Networks (GAT): For spectral and temporal feature modeling
	- Multi-branch Inference: Multiple inference branches for robust decision making

	## Architecture

	The AASIST3 model consists of several key components:

	1. Wav2Vec2 Encoder: Extracts SSL features from raw audio
	2. KAN Bridge: Transforms SSL features using Kolmogorov-Arnold Networks
	3. Residual Encoder: Processes features through multiple residual blocks
	4. Graph Attention Networks:
	   - GAT-S: Spectral attention mechanism
	   - GAT-T: Temporal attention mechanism
	5. Multi-branch Inference: Four parallel inference branches with master tokens
	6. KAN Output Layer: Final classification using KAN linear layers

	### Key Innovations

	- KAN Integration: Replaces traditional linear layers with KAN linear layers for better feature approximation
	- Enhanced Regularization: Additional dropout and regularization techniques
	- Multi-dataset Training: Trained on a combination of ASVspoof, MLAAD, and M-AILABS data for robustness
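
To make the idea of a KAN linear layer concrete, here is a minimal sketch. It is not the layer used in this repository: it assumes each input-output edge carries a learnable one-dimensional function built from a SiLU base term plus a sum of Gaussian radial basis functions, and the class name `ToyKANLinear` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn


class ToyKANLinear(nn.Module):
    """Illustrative KAN-style layer (hypothetical, not the AASIST3 implementation).

    A standard linear layer computes out_j = sum_i w_ji * x_i. Here each edge
    instead applies a learnable function phi_ji(x_i), and the outputs are summed.
    """

    def __init__(self, in_features, out_features, num_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        # Fixed grid of basis-function centers over the expected input range.
        self.register_buffer("centers", torch.linspace(x_min, x_max, num_basis))
        self.width = (x_max - x_min) / (num_basis - 1)
        # Weight of the SiLU base term on every edge.
        self.base_weight = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        # Per-edge coefficients of the radial basis expansion.
        self.basis_weight = nn.Parameter(0.1 * torch.randn(out_features, in_features, num_basis))

    def forward(self, x):
        # x: (batch, in_features)
        base = nn.functional.silu(x) @ self.base_weight.T  # (batch, out_features)
        # Evaluate every basis function on every input: (batch, in_features, num_basis)
        basis = torch.exp(-(((x.unsqueeze(-1) - self.centers) / self.width) ** 2))
        learned = torch.einsum("bik,oik->bo", basis, self.basis_weight)
        return base + learned
```

A layer like this is a drop-in replacement for `nn.Linear` in terms of input and output shapes, at the cost of `num_basis` times more parameters per edge.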

	## 🚀 Quick Start

	### Installation

	```bash
	git clone https://github.com/mtuciru/AASIST3.git
	cd AASIST3
	pip install -r requirements.txt
	```

	### Loading the Model

	```python
	from model import aasist3

	# Load the model from Hugging Face Hub
	model = aasist3.from_pretrained("MTUCI/AASIST3")
	model.eval()
	```

	### Basic Usage

	```python
	import torch
	import torchaudio

	# Assumes `model` was loaded as shown in "Loading the Model".

	# Load audio; torchaudio returns a (channels, samples) tensor
	audio, sr = torchaudio.load("audio_file.wav")

	# Ensure audio is 16 kHz and mono
	if sr != 16000:
	    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
	if audio.shape[0] > 1:
	    audio = torch.mean(audio, dim=0, keepdim=True)

	# Prepare input (model expects ~4 seconds of audio at 16 kHz):
	# zero-pad or truncate to 64600 samples
	if audio.shape[1] < 64600:
	    audio = torch.nn.functional.pad(audio, (0, 64600 - audio.shape[1]))
	else:
	    audio = audio[:, :64600]

	# Run inference; `audio` has shape (1, 64600), i.e. a batch of one
	with torch.no_grad():
	    output = model(audio)
	    probabilities = torch.softmax(output, dim=1)
	    prediction = torch.argmax(probabilities, dim=1)

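	# prediction: 1 = bonafide, 0 = spoof (consistent with the Arena scoring note,
	# where the score is the output logit for class 1 and higher = more bona fide)
	print(f"Prediction: {'Bonafide' if prediction.item() == 1 else 'Spoof'}")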
	print(f"Confidence: {probabilities.max().item():.3f}")
	```
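
The example above scores only the first 64600 samples. For longer recordings, one option is to split the audio into fixed-size windows and average the per-window probabilities. The sketch below illustrates this under stated assumptions: the helper names are hypothetical, the windows do not overlap, and averaging is one of several possible ways to combine window scores; none of this comes from the original repository.

```python
import torch


def split_into_windows(audio, window=64600):
    """Split a (1, samples) tensor into (num_windows, window), zero-padding the last window."""
    total = audio.shape[1]
    num_windows = max(1, -(-total // window))  # ceiling division
    padded = torch.nn.functional.pad(audio, (0, num_windows * window - total))
    return padded.reshape(num_windows, window)


def score_long_audio(model, audio, window=64600):
    """Return class probabilities averaged over all windows (shape: (2,))."""
    chunks = split_into_windows(audio, window)
    with torch.no_grad():
        probabilities = torch.softmax(model(chunks), dim=1)
    return probabilities.mean(dim=0)
```

Calling `score_long_audio(model, audio)` on a 16 kHz mono tensor of shape `(1, samples)` returns one probability per class for the whole recording.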

	## Training Details

	### Datasets Used

	The model was trained on a combination of multiple datasets:

	- ASVspoof 2019 LA (Logical Access)
	- ASVspoof 2024 (ASVspoof5)
	- MLAAD (Multi-Language Audio Anti-Spoofing Dataset)
	- M-AILABS (Multi-Language Audio Dataset)

	### Training Configuration

	- Epochs: 20
	- Batch Size: 12 (training), 24 (validation)
	- Learning Rate: 1e-4
	- Optimizer: AdamW
	- Loss Function: CrossEntropyLoss
	- Gradient Accumulation Steps: 2

	### Hardware

	- GPUs: 2 × NVIDIA A100 (40 GB)
	- Framework: PyTorch with Accelerate for distributed training
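
With a training batch size of 12 and 2 gradient accumulation steps, each process sees 12 × 2 = 24 samples per optimizer step. Whether 12 is a per-GPU or a global batch size is not stated here, so the effective batch size across both GPUs is left open. The sketch below shows the accumulation pattern in plain PyTorch with a stand-in linear model and random data; the actual training script uses Accelerate and may be organized differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data (hypothetical); only the training pattern matters here.
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

batch_size = 12
accumulation_steps = 2
batches = [
    (torch.randn(batch_size, 8), torch.randint(0, 2, (batch_size,)))
    for _ in range(4)
]

optimizer_steps = 0
optimizer.zero_grad()
for step, (features, labels) in enumerate(batches, start=1):
    # Scale the loss so the accumulated gradient is an average over micro-batches.
    loss = criterion(model(features), labels) / accumulation_steps
    loss.backward()
    # Update the weights only after every `accumulation_steps` micro-batches.
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

samples_per_optimizer_step = batch_size * accumulation_steps
```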

	## Advanced Usage

	### Custom Training

	```bash
	# Train the model
	bash train.sh
	```

	### Validation

	```bash
	# Run validation on test sets
	bash validate.sh
	```

	### Model Configuration

	The model can be configured through the `configs/train.yaml` file:

	```yaml
	# Key parameters
	num_epochs: 20
	train_batch_size: 12
	val_batch_size: 24
	learning_rate: 1e-4
	gradient_accumulation_steps: 2
	```
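
If you read these parameters from your own script, the sketch below shows one way to do it with PyYAML. It assumes the keys sit at the top level of the file exactly as in the excerpt above, which may not match the full `configs/train.yaml`; the inline `config_text` stands in for the file contents to keep the example self-contained. One detail worth knowing: PyYAML can load `1e-4` as a string because its float pattern expects a decimal point, so the value is cast explicitly.

```python
import yaml

# Stand-in for the contents of configs/train.yaml (only the keys shown above).
config_text = """
num_epochs: 20
train_batch_size: 12
val_batch_size: 24
learning_rate: 1e-4
gradient_accumulation_steps: 2
"""

config = yaml.safe_load(config_text)

# Cast explicitly: depending on the parser, `1e-4` may arrive as a string.
learning_rate = float(config["learning_rate"])
num_epochs = int(config["num_epochs"])
```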


	## 🤝 Citation

	If you use this implementation in your research, please cite the original paper:

	```bibtex
	@inproceedings{borodin24_asvspoof,
	title = {AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge},
	author = {Kirill Borodin and Vasiliy Kudryavtsev and Dmitrii Korzh and Alexey Efimenko and Grach Mkrtchian and Mikhail Gorodnichev and Oleg Y. Rogov},
	year = {2024},
	booktitle = {The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024)},
	pages = {48--55},
	doi = {10.21437/ASVspoof.2024-8},
	}
	```

	## License

	This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0) - see the [LICENSE](LICENSE) file for details.

	This license allows you to:
	- Share: Copy and redistribute the material in any medium or format

	Under the following conditions:
	- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made
	- NonCommercial: You may not use the material for commercial purposes
	- NoDerivatives: If you remix, transform, or build upon the material, you may not distribute the modified material

	For more information, visit: https://creativecommons.org/licenses/by-nc-nd/4.0/


	Disclaimer: This is a research implementation. The model weights provided are for demonstration purposes and may not match the exact performance reported in the paper.

	## Contact

	- Email: kborodin.research@gmail.com
	- Telegram: [@korallll_ai](https://t.me/korallll_ai)