DF Arena 500M — Speech Anti-Spoofing Arena results

RAPTOR universal anti-spoofing model. A wav2vec 2.0 XLS-R 300M self-supervised front-end whose per-layer hidden states are combined by learnable attention pooling (a layer-wise sigmoid gate over an attention-pooled summary), then passed through a 4-block Conformer head with a class token to a 2-way classifier. Inference runs in FP32 on a deterministic window of the first 64,600 samples (~4.04 s at 16 kHz); shorter clips are tile-repeated to that length, with no random crop and no resampling. The score is softmax(logits)[bonafide], so higher means more bona fide. Weights are the official Speech-Arena-2025/DF_Arena_500M_V_1 checkpoint.
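The windowing and scoring rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the checkpoint's own preprocessing code: the helper names `fixed_window` and `bonafide_score` are hypothetical, and the position of the bona fide class in the logits (`bonafide_index`) is an assumption that should be checked against the model's label mapping.

```python
import numpy as np

WINDOW = 64600  # samples, about 4.04 s at 16 kHz (value taken from the model card)

def fixed_window(audio, window=WINDOW):
    """Keep the first `window` samples; tile-repeat shorter clips up to that length."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.size == 0:
        raise ValueError("empty audio")
    if audio.size >= window:
        return audio[:window]          # deterministic: always the first samples
    repeats = -(-window // audio.size)  # ceiling division
    return np.tile(audio, repeats)[:window]

def bonafide_score(logits, bonafide_index=1):
    """softmax(logits)[bonafide]; `bonafide_index` is an assumed class position."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                    # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs[bonafide_index])
```

Because the window is always the first 64,600 samples, repeated runs on the same file give the same input tensor, which is what makes the reported scores reproducible.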

Paper: arXiv:2603.06164 · Params: 436M · Checkpoint: SpeechAntiSpoofingBenchmarks/DF_Arena_500M_V_1

Arena standing


Live leaderboard: DF Arena 500M on the Speech Anti-Spoofing Arena

Per-dataset results (24 datasets; mean EER 5.09% over the 22 EER rows)

| Dataset | Metric | Score |
|---|---|---|
| J-SPAW_LA | EER | 0% |
| ArAD | EER | 9.01% |
| DFADD | EER | 0% |
| SONAR | EER | 2.11% |
| DeepVoice | EER | 9.71% |
| EmoFake_test | EER | 2.63% |
| LibriSeVoc | EER | 0.11% |
| CD-ADD | EER | 2.46% |
| ODSS | EER | 8.4% |
| InTheWild | EER | 1.87% |
| DECRO | EER | 4.33% |
| CFAD | EER | 8% |
| ASVspoof2019_LA | EER | 1.19% |
| HABLA | EER | 3.27% |
| CVoiceFake_small | EER | 7.9% |
| ASVspoof2021_LA | EER | 5.78% |
| PyAra | EER | 15.96% |
| XMAD | EER | 2.83% |
| ASVspoof2021_DF | EER | 3.5% |
| ASVspoof5 | EER | 13.43% |
| ADD22_eval_31 | EER | 1.97% |
| ADD2023_track12_test_r1 | EER | 7.44% |
| EmoSpoofTTS | 1-SRR | 3.1% |
| LRLspoof | 1-SRR | 1.61% |
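The headline mean can be checked against the table. The short sketch below averages the 22 EER rows and leaves out the two 1-SRR rows; that this is how the card computes its mean is an inference from the arithmetic, not something the card states.

```python
# EER values (%) copied from the per-dataset table; 1-SRR rows are excluded.
eer = {
    "J-SPAW_LA": 0.0, "ArAD": 9.01, "DFADD": 0.0, "SONAR": 2.11,
    "DeepVoice": 9.71, "EmoFake_test": 2.63, "LibriSeVoc": 0.11,
    "CD-ADD": 2.46, "ODSS": 8.4, "InTheWild": 1.87, "DECRO": 4.33,
    "CFAD": 8.0, "ASVspoof2019_LA": 1.19, "HABLA": 3.27,
    "CVoiceFake_small": 7.9, "ASVspoof2021_LA": 5.78, "PyAra": 15.96,
    "XMAD": 2.83, "ASVspoof2021_DF": 3.5, "ASVspoof5": 13.43,
    "ADD22_eval_31": 1.97, "ADD2023_track12_test_r1": 7.44,
}

mean_eer = sum(eer.values()) / len(eer)
print(len(eer), round(mean_eer, 2))  # prints: 22 5.09
```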

EER = Equal Error Rate (lower is better). 1-SRR = 1 minus the Spoof Recall Rate, reported for the spoof-only sets and evaluated at the model's own DeepVoice EER operating point (lower is better). All rows are scoring-verified (reproduce --scoring, Δ 0.0) and were computed with the TensorRT engine, which is parity-verified against PyTorch.
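To make the two metrics concrete, here is a minimal NumPy sketch of how an EER and a fixed-threshold 1-SRR could be computed from raw scores. It assumes the card's convention that a higher score means more bona fide. The function names, the label encoding (1 = bona fide, 0 = spoof), and the simple threshold sweep without interpolation are all assumptions; the arena's own scoring code may handle ties and interpolation differently.

```python
import numpy as np

def eer_and_threshold(scores, labels):
    """Return (EER, threshold). labels: 1 = bona fide, 0 = spoof (assumed encoding)."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    bona = scores[labels == 1]
    spoof = scores[labels == 0]
    thresholds = np.unique(scores)
    # A clip is accepted as bona fide when its score is >= the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    i = int(np.argmin(np.abs(far - frr)))                      # closest crossing point
    return float((far[i] + frr[i]) / 2), float(thresholds[i])

def one_minus_srr(spoof_scores, threshold):
    """Fraction of spoof clips still accepted as bona fide at a fixed threshold."""
    return float((np.asarray(spoof_scores, dtype=np.float64) >= threshold).mean())
```

In this sketch the threshold returned by `eer_and_threshold` on a labeled set would be reused unchanged in `one_minus_srr` for a spoof-only set, which mirrors the "own DeepVoice EER operating point" wording above.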

Usage

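import librosa
from transformers import pipeline

# "antispoofing" is a custom task, so the checkpoint's remote code must be trusted.
pipe = pipeline("antispoofing", model="SpeechAntiSpoofingBenchmarks/DF_Arena_500M_V_1", trust_remote_code=True, device="cuda")

# Load the file as 16 kHz mono, matching the 16 kHz window described above.
audio, sr = librosa.load("sample.wav", sr=16000)
print(pipe(audio))   # {'label': 'bonafide'|'spoof', 'all_scores': {...}}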

Citation

@misc{kulkarni2026compactsslbackbonesmatter,
  title={Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR},
  author={Ajinkya Kulkarni and Sandipana Dowerah and Atharva Kulkarni and Tanel Alumäe and Mathew Magimai Doss},
  year={2026},
  eprint={2603.06164},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.06164}
}