Voice Detection Model

Binary audio classifier that labels speech as FAKE (AI-generated) or REAL (human).

Built by fine-tuning facebook/wav2vec2-large-xlsr-53 on the garystafford/deepfake-audio-detection dataset.

Model Details


Property	Value
Architecture	`Wav2Vec2ForSequenceClassification`
Base model	`facebook/wav2vec2-large-xlsr-53`
Parameters	315,701,634 (F32)
File size	1.26 GB
Format	Safetensors
Transformers	`4.57.6`

Labels

{
  "id2label": { "0": "FAKE", "1": "REAL" },
  "label2id": { "FAKE": 0, "REAL": 1 }
}

ID	Label	Meaning
`0`	FAKE	AI-generated / synthetic / deepfake
`1`	REAL	Authentic human speech

Architecture

Transformer Encoder


Setting	Value
Hidden size	1024
Intermediate size	4096
Layers	24
Attention heads	16
Activation	gelu
Stable layer norm	✅
Layer norm eps	1e-05
Layerdrop	0.1

CNN Feature Extractor (7 layers)

Layer	Channels	Kernel	Stride
1	512	10	5
2	512	3	2
3	512	3	2
4	512	3	2
5	512	3	2
6	512	2	2
7	512	2	2

Activation: gelu · Norm: layer · Bias: true
Convolutional positional embeddings: kernel size 128, 16 groups
Feature encoder was frozen during fine-tuning
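
The kernel and stride columns above determine how many feature frames the Transformer encoder receives for a given clip. The following is a minimal sketch of that arithmetic, assuming unpadded 1-D convolutions as in the standard wav2vec 2.0 feature encoder; the function names are illustrative and not part of this repository.

```python
from typing import Tuple

# (kernel, stride) for the 7 convolutional layers listed in the table.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_stride() -> Tuple[int, int]:
    """Samples covered by one frame, and the hop between consecutive frames."""
    field, hop = 1, 1
    for kernel, stride in CONV_LAYERS:
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


if __name__ == "__main__":
    print(receptive_field_and_stride())  # (400, 320)
    print(num_frames(80_000))            # 249
```

At 16 kHz, each frame therefore covers 25 ms of audio with a 20 ms hop, and a 5.0 s clip yields 249 frames.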

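Classifier Head

`Wav2Vec2ForSequenceClassification` projects the 1024-dim encoder output to 256 dimensions (classifier projection), mean-pools over time, and applies a linear layer to the 2 labels.

TDNN / X-vector Settings (config.json)

These values are present in config.json but are read only by `Wav2Vec2ForXVector`; they add no weights to this model.

Layer	Dim	Kernel	Dilation
1	512	5	1
2	512	3	2
3	512	3	3
4	512	1	1
5	1500	1	1

X-vector output dim: 512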
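
As a cross-check, the architecture figures above can be tallied against the reported parameter count. This sketch assumes the usual Transformers layout for `Wav2Vec2ForSequenceClassification`: a weight-normalized grouped convolution for positional embeddings, one learned mask embedding, and a head made of a linear projection plus a linear classifier, with no TDNN layers instantiated. The helper names are hypothetical.

```python
from typing import Dict

HIDDEN, INTERMEDIATE, LAYERS = 1024, 4096, 24
CONV_DIM = 512
CONV_KERNELS = [10, 3, 3, 3, 3, 2, 2]
POS_KERNEL, POS_GROUPS = 128, 16
PROJ_DIM, NUM_LABELS = 256, 2


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale plus shift of a layer norm."""
    return 2 * dim


def count_parameters() -> Dict[str, int]:
    # CNN feature extractor: conv weight + bias + layer norm for each layer.
    cnn, in_channels = 0, 1
    for kernel in CONV_KERNELS:
        cnn += in_channels * CONV_DIM * kernel + CONV_DIM + layer_norm(CONV_DIM)
        in_channels = CONV_DIM

    feature_projection = layer_norm(CONV_DIM) + linear(CONV_DIM, HIDDEN)
    mask_embedding = HIDDEN  # learned vector substituted for masked frames

    # Grouped conv with weight norm: direction tensor, one gain per kernel tap, bias.
    pos_conv = HIDDEN * (HIDDEN // POS_GROUPS) * POS_KERNEL + POS_KERNEL + HIDDEN

    encoder_layer = (
        4 * linear(HIDDEN, HIDDEN)        # query, key, value, output projections
        + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
        + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
        + 2 * layer_norm(HIDDEN)
    )
    encoder = LAYERS * encoder_layer + layer_norm(HIDDEN)

    head = linear(HIDDEN, PROJ_DIM) + linear(PROJ_DIM, NUM_LABELS)

    parts = {
        "cnn": cnn,
        "feature_projection": feature_projection,
        "mask_embedding": mask_embedding,
        "pos_conv": pos_conv,
        "encoder": encoder,
        "head": head,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    print(count_parameters()["total"])  # 315701634
```

Under these assumptions the tally reproduces the 315,701,634 parameters listed in Model Details, and at 4 bytes per F32 value that is about 1.26 GB, in line with the reported file size.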

Regularization


Setting	Value
Attention dropout	0.1
Hidden dropout	0.1
Feature proj dropout	0.1
Activation dropout	0.0
Final dropout	0.0
SpecAugment	✅ enabled
Time mask prob	0.075
Time mask length	10
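
To make the two SpecAugment values concrete, here is a simplified sketch of span-based time masking: the probability sets the expected fraction of frames covered by mask spans, and the length sets the size of each span. It is an illustration only; the actual masking routine in Transformers has extra options (such as a minimum number of spans) that are omitted here, and `time_mask` is a hypothetical name.

```python
import random
from typing import List, Optional


def time_mask(
    num_frames: int,
    mask_prob: float = 0.075,
    mask_length: int = 10,
    rng: Optional[random.Random] = None,
) -> List[bool]:
    """Return a per-frame mask (True = masked) made of fixed-length spans."""
    rng = rng or random.Random()
    if num_frames < mask_length:
        return [False] * num_frames

    # Expected number of spans, rounded stochastically so the average is preserved.
    num_spans = int(mask_prob * num_frames / mask_length + rng.random())
    candidates = range(num_frames - mask_length + 1)
    num_spans = min(num_spans, len(candidates))

    mask = [False] * num_frames
    # Distinct start positions; spans may still overlap each other.
    for start in rng.sample(candidates, num_spans):
        for i in range(start, start + mask_length):
            mask[i] = True
    return mask


if __name__ == "__main__":
    mask = time_mask(249, rng=random.Random(0))
    print(sum(mask), "of", len(mask), "frames masked")
```

For a 249-frame clip this draws one or two spans of 10 frames each.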

Preprocessor

From preprocessor_config.json:


Setting	Value
Type	`Wav2Vec2FeatureExtractor`
Sampling rate	16000 Hz
Feature size	1 (mono)
Normalize	✅
Return attention mask	✅
Padding side	right
Padding value	0
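
The Normalize, Padding side, Padding value, and attention-mask settings can be pictured with a small pure-Python sketch. It assumes zero-mean, unit-variance normalization per clip with a small stabilizing epsilon (1e-7 is an assumption here, not a value taken from preprocessor_config.json); both function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def zero_mean_unit_var(samples: Sequence[float], eps: float = 1e-7) -> List[float]:
    """Normalize one clip to zero mean and unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in samples]


def pad_right(
    batch: Sequence[Sequence[float]], padding_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad clips to the longest one and build matching attention masks."""
    longest = max(len(clip) for clip in batch)
    padded, masks = [], []
    for clip in batch:
        n_pad = longest - len(clip)
        padded.append(list(clip) + [padding_value] * n_pad)
        masks.append([1] * len(clip) + [0] * n_pad)  # 1 = real audio, 0 = padding
    return padded, masks


if __name__ == "__main__":
    clips = [zero_mean_unit_var([0.1, 0.3, 0.2]), zero_mean_unit_var([0.5, -0.5])]
    values, attention = pad_right(clips)
    print(attention)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask lets the model ignore the zero padding when clips of different lengths share a batch.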

Training

Dataset

garystafford/deepfake-audio-detection — 1,866 samples total.

Splits are group-aware (speaker isolation), so no speaker appears in more than one split:

Split	Samples
Train	1,467
Validation	206
Test	193
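
A group-aware split of this kind can be sketched as follows. The exact procedure used for this model is not documented here, so treat the grouping key (a speaker ID per sample), the 10% validation and test fractions, and the function name as assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def group_split(
    speaker_ids: Sequence[str],
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Assign sample indices to splits so each speaker lands in exactly one split."""
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(index)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = len(speaker_ids)
    targets = {"test": test_frac * total, "validation": val_frac * total}
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}

    for speaker in speakers:
        # Fill test, then validation, up to their target sizes; the rest is train.
        if len(splits["test"]) < targets["test"]:
            name = "test"
        elif len(splits["validation"]) < targets["validation"]:
            name = "validation"
        else:
            name = "train"
        splits[name].extend(by_speaker[speaker])
    return splits


if __name__ == "__main__":
    ids = [f"spk{i % 20}" for i in range(200)]  # synthetic speaker labels
    print({name: len(idx) for name, idx in group_split(ids).items()})
```

Because whole speakers are assigned at once, split sizes only approximate the requested fractions, which is consistent with the uneven validation and test counts in the table.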

Hyperparameters


Hyperparameter	Value
Learning rate	3e-5
Batch size	8 per device
Gradient accumulation	2 (effective batch 16)
Max epochs	10
Early stopping patience	3 (metric: F1)
Warmup ratio	0.1
Weight decay	0.01
Max grad norm	1.0
Precision	FP16
Loss	Weighted CrossEntropy (class-balanced)
Seed	42
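
Two of these settings are easier to read as arithmetic. The sketch below derives the optimizer-step schedule from the table, assuming a single GPU and that all 10 epochs run (early stopping may end training sooner), and shows one common way to compute class-balanced loss weights. The weighting formula `N / (K * n_c)` is an assumption, since the exact scheme is not stated, and the class counts in the example are made up.

```python
import math
from typing import Dict

TRAIN_SAMPLES = 1_467
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 2
MAX_EPOCHS = 10
WARMUP_RATIO = 0.1


def schedule(num_devices: int = 1) -> Dict[str, int]:
    """Back-of-the-envelope optimizer-step counts for the settings above."""
    batches_per_epoch = math.ceil(TRAIN_SAMPLES / (PER_DEVICE_BATCH * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
    total_updates = updates_per_epoch * MAX_EPOCHS
    return {
        "effective_batch": PER_DEVICE_BATCH * GRAD_ACCUM * num_devices,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_updates": round(WARMUP_RATIO * total_updates),
    }


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by N / (K * n_c), so rarer classes count more in the loss."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    print(schedule())
    # Hypothetical counts, NOT the real class distribution of the dataset.
    print(balanced_class_weights({"FAKE": 300, "REAL": 100}))
```

With these numbers there are 92 optimizer updates per epoch, at most 920 in total, and the learning rate warms up over roughly the first 92 of them.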

Preprocessing

Silence trimming (librosa.effects.trim, 30 dB threshold)
Truncate/pad to 5.0s (80,000 samples at 16 kHz)
Random crop during training, center crop during eval
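
The length-fixing step can be sketched in plain Python. Silence trimming (for example `librosa.effects.trim(audio, top_db=30)`) is assumed to have run beforehand and is not shown; right-side zero padding is also an assumption carried over from the preprocessor settings, and `fix_length` is a hypothetical helper name.

```python
import random
from typing import List, Optional, Sequence

TARGET_SAMPLES = 80_000  # 5.0 s at 16 kHz


def fix_length(
    samples: Sequence[float],
    train: bool,
    target: int = TARGET_SAMPLES,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Pad short clips with zeros; crop long clips (random in training, centered in eval)."""
    n = len(samples)
    if n <= target:
        # Short clip: keep everything and zero-pad on the right.
        return list(samples) + [0.0] * (target - n)

    max_start = n - target
    if train:
        start = (rng or random).randint(0, max_start)  # inclusive on both ends
    else:
        start = max_start // 2
    return list(samples[start:start + target])


if __name__ == "__main__":
    clip = [0.0] * 100_000
    print(len(fix_length(clip, train=True)), len(fix_length(clip[:1_000], train=False)))
```

Random cropping exposes the model to different 5-second windows of long clips across epochs, while center cropping keeps evaluation deterministic.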

Augmentation (training only)

Transform	Range	Probability
Gaussian noise	0.001–0.01 amplitude	0.4
Time stretch	0.9–1.1×	0.3
Pitch shift	±2 semitones	0.3
Gain	±6 dB	0.5
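
The two waveform-level transforms in the table can be sketched without any audio library. This assumes the noise range refers to the standard deviation of additive Gaussian noise and that transforms are applied independently in the order shown; neither detail is confirmed by the source. Time stretch and pitch shift are omitted because they need a signal-processing library such as librosa.

```python
import random
from typing import List, Optional, Sequence


def db_to_amplitude(db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def augment(samples: Sequence[float], rng: Optional[random.Random] = None) -> List[float]:
    """Apply Gaussian noise (p=0.4) and random gain (p=0.5) to one clip."""
    rng = rng or random.Random()
    out = list(samples)

    if rng.random() < 0.4:
        sigma = rng.uniform(0.001, 0.01)  # noise level drawn per clip
        out = [x + rng.gauss(0.0, sigma) for x in out]

    if rng.random() < 0.5:
        factor = db_to_amplitude(rng.uniform(-6.0, 6.0))
        out = [x * factor for x in out]

    return out


if __name__ == "__main__":
    print(round(db_to_amplitude(6.0), 3), round(db_to_amplitude(-6.0), 3))  # 1.995 0.501
```

A gain of ±6 dB corresponds to roughly doubling or halving the amplitude.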

Infrastructure


Component	Value
Platform	Google Colab
GPU	NVIDIA T4
Python	3.12

Usage

import torch
import librosa
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_id = "shivam-2211/voice-detection-model"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load audio at 16 kHz mono
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)

# Normalize the waveform and wrap it in a batch of size 1
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 2)
    probs = torch.softmax(logits, dim=-1)    # class probabilities
    pred_id = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][pred_id].item()

# id2label maps 0 -> FAKE, 1 -> REAL
label = model.config.id2label[pred_id]
print(f"{label} ({confidence:.2%})")

Repository Files

File	Description
`model.safetensors`	Model weights (1.26 GB)
`config.json`	Architecture and label config
`preprocessor_config.json`	Feature extractor settings
`training_args.bin`	Serialized training hyperparameters

Limitations

Performance may degrade on heavily compressed, noisy, or very short audio.
Newer voice synthesis methods may produce artifacts not represented in training data.
Predictions should not be used as the sole evidence of authenticity without expert review.

License

MIT
