Improved version available: caa-speech-detection-asvspoof2019/lcnn-v7-cqt. That model, lcnn_v7 (CQT features, label smoothing, cosine schedule, gradient clipping), reaches 3.26% eval EER (vs. 8.43% for this model) and a tandem min t-DCF of 0.4930.

LCNN — ASVspoof 2019 LA Countermeasure

Light CNN for binary classification of bonafide vs spoofed speech, trained on the ASVspoof 2019 Logical Access (LA) dataset.

This is one of three models compared in our study (LCNN, RawNet2, Wav2Vec 2.0) under identical training/evaluation conditions.

Architecture

2D CNN over LFCC features
Reference: Lavrentyeva et al., "Audio Replay Attack Detection with Deep Learning Frameworks", Interspeech 2017


Parameter	Value
Input	LFCC (60 coefficients, 512-point FFT, hop length 160, ~4 s of audio)
Channels	`[32, 48, 64, 128]`
Kernel sizes	`[5, 5, 3, 3]`
FC hidden units	64
Dropout	0.3

See config.yaml for the full training/model configuration.
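As a quick sanity check on the input settings listed above, the number of LFCC frames per utterance can be estimated with a few lines of arithmetic. This is only a sketch: it assumes the hop length of 160 is given in samples, that "~4 s" means exactly 64,000 samples at 16 kHz, and that the STFT is centre-padded (the default in `torch.stft`). None of these is confirmed by this card, so config.yaml remains the authority.

```python
def num_frames(n_samples: int, n_fft: int, hop: int, center: bool = True) -> int:
    """Number of STFT frames produced for a signal of n_samples samples."""
    if center:
        # Centre padding adds n_fft // 2 samples on each side of the signal.
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop


SAMPLE_RATE = 16_000  # Hz, from the Training section
N_FFT = 512           # from the architecture table
HOP = 160             # assumed to be in samples
N_LFCC = 60           # from the architecture table
DURATION_S = 4.0      # assumed exact; the card only says "~4 s"

n_samples = int(SAMPLE_RATE * DURATION_S)
frames = num_frames(n_samples, N_FFT, HOP)
feature_shape = (N_LFCC, frames)       # (coefficients, time frames)
hop_ms = 1000 * HOP / SAMPLE_RATE      # frame shift in milliseconds
```

Under these assumptions each utterance becomes a 60-by-401 feature map with a 10 ms frame shift; without centre padding the frame count would be 397.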

Training

Dataset: ASVspoof 2019 LA train split (~25k utterances)
Batch size: 128
Learning rate: 1e-4, cosine schedule
Gradient clipping: 1.0
Sample rate: 16 kHz mono
No data augmentation
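
The learning-rate schedule and gradient clipping listed above can be illustrated in plain Python. This is a hedged sketch, not the training code: the card does not state the optimiser, the total number of steps, the final learning rate, or whether warm-up is used, so `total_steps` and `min_lr` below are hypothetical. It also assumes "Gradient clipping: 1.0" means clipping the global L2 norm (in the spirit of `torch.nn.utils.clip_grad_norm_`) rather than clipping each value.

```python
import math
from typing import List, Sequence


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` (no warm-up, no restarts)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

With these definitions the learning rate starts at 1e-4, passes through half that value at the midpoint, and decays to `min_lr` at the last step, while any gradient vector with norm above 1.0 is scaled back onto the unit sphere.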

Results

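Baseline to beat: 8.09% eval EER (LFCC+GMM, the ASVspoof 2019 LA baseline). This model (8.43%) falls just short of it; lcnn_v7 (3.26%) beats it.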

Split	EER	tandem min t-DCF	In-the-Wild EER
Dev (baseline → improved)	0.902% → 0.708%	—	—
Eval	8.43%	—	—
Eval (improved: lcnn_v7)	3.26%	0.4930	33.41%

Trajectory and loss curves: see learning_curves.png and metrics.csv.

Note on the metric: t-DCF values are normalized so that 0 is a perfect countermeasure and 1 is no better than a non-informative countermeasure that accepts or rejects every trial. The tandem t-DCF is computed with ASV scores, following the official ASVspoof 2019 formula.

Caveat on dev performance: the dev split contains the same attacks (A01–A06) as the training split, so dev EER mostly measures performance on known attacks. Eval-set performance against unseen attacks (A07–A19) is the meaningful generalisation metric.
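
For readers unfamiliar with the EER figures reported above, the following sketch shows one simple way to compute an equal error rate from countermeasure scores. It is not the evaluation code used for this card and not the official ASVspoof script, which may treat ties and interpolation differently. It assumes that a higher score means "more likely bonafide" and that a trial is accepted when its score is at or above the threshold.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) using a discrete sweep over observed scores."""
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)
    if n_bona == 0 or n_spoof == 0:
        raise ValueError("need at least one bonafide and one spoof score")

    # Each trial is (score, is_bonafide), sorted by ascending score.
    trials = sorted([(s, True) for s in bonafide_scores] +
                    [(s, False) for s in spoof_scores])

    rejected_bona = rejected_spoof = 0
    best = None  # (gap, eer, threshold)
    i = 0
    while True:
        threshold = trials[i][0] if i < len(trials) else float("inf")
        frr = rejected_bona / n_bona                # bonafide wrongly rejected
        far = (n_spoof - rejected_spoof) / n_spoof  # spoof wrongly accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
        if i == len(trials):
            break
        # Reject every trial tied at this score before raising the threshold.
        while i < len(trials) and trials[i][0] == threshold:
            if trials[i][1]:
                rejected_bona += 1
            else:
                rejected_spoof += 1
            i += 1
    return best[1], best[2]


# Toy scores, purely illustrative (not taken from this model).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.35], [0.1, 0.2, 0.3, 0.4])
```

In the toy example one of four bonafide trials and one of four spoof trials end up on the wrong side of the threshold, so the two error rates meet at 25%.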

Usage

import torch
# Load checkpoint
state = torch.load("best.pt", map_location="cpu")
# Plug into the LCNN model from the source repo:
# https://github.com/sebastiaoteixeira/caa-ai-generated-speech-detector
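
The layout of best.pt is not documented on this card. The helper below is a minimal sketch that assumes the file holds either a bare state dict or a dict wrapping one under a common key; the candidate key names are guesses based on widespread PyTorch conventions, not on the source repo. It also strips the `module.` prefix that `torch.nn.DataParallel` adds to parameter names.

```python
from collections.abc import Mapping

# Hypothetical wrapper keys; check the source repo for the real layout.
CANDIDATE_KEYS = ("model_state_dict", "state_dict", "model")


def extract_state_dict(checkpoint):
    """Return a flat {parameter_name: value} mapping from a loaded checkpoint."""
    if not isinstance(checkpoint, Mapping):
        raise TypeError("expected a dict-like checkpoint")

    # Unwrap one level if the weights are nested under a known key.
    for key in CANDIDATE_KEYS:
        inner = checkpoint.get(key)
        if isinstance(inner, Mapping):
            checkpoint = inner
            break

    # Drop the "module." prefix left behind by torch.nn.DataParallel.
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in checkpoint.items()
    }
```

With a model instance built from the repo's LCNN class (its name and constructor arguments are not given here), the result could then be passed to `model.load_state_dict(extract_state_dict(state))`, followed by `model.eval()` before inference.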

License

MIT
