Voice Detection Model

A binary audio classifier that labels speech as FAKE (AI-generated) or REAL (human).

Built by fine-tuning facebook/wav2vec2-large-xlsr-53 on the garystafford/deepfake-audio-detection dataset.


Model Details

| Property | Value |
|---|---|
| Architecture | Wav2Vec2ForSequenceClassification |
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Parameters | 315,701,634 (F32) |
| File size | 1.26 GB |
| Format | Safetensors |
| Transformers | 4.57.6 |

Labels

{
  "id2label": { "0": "FAKE", "1": "REAL" },
  "label2id": { "FAKE": 0, "REAL": 1 }
}

| ID | Label | Meaning |
|---|---|---|
| 0 | FAKE | AI-generated / synthetic / deepfake |
| 1 | REAL | Authentic human speech |

Architecture

Transformer Encoder

| Setting | Value |
|---|---|
| Hidden size | 1024 |
| Intermediate size | 4096 |
| Layers | 24 |
| Attention heads | 16 |
| Activation | gelu |
| Stable layer norm | ✅ |
| Layer norm eps | 1e-05 |
| Layerdrop | 0.1 |

CNN Feature Extractor (7 layers)

| Layer | Channels | Kernel | Stride |
|---|---|---|---|
| 1 | 512 | 10 | 5 |
| 2 | 512 | 3 | 2 |
| 3 | 512 | 3 | 2 |
| 4 | 512 | 3 | 2 |
| 5 | 512 | 3 | 2 |
| 6 | 512 | 2 | 2 |
| 7 | 512 | 2 | 2 |

  • Activation: gelu · Norm: layer · Bias: true
  • Conv positional embeddings: 128, 16 groups
  • Feature encoder was frozen during fine-tuning
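
The kernel and stride columns of the feature extractor table fix how many feature frames the transformer receives for a clip. A minimal sketch, assuming unpadded 1-D convolutions (output length `floor((L - kernel) / stride) + 1` per layer); the name `num_frames` is illustrative:

```python
# (kernel, stride) for each of the 7 conv layers, copied from the table.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Frames remaining after all conv layers, assuming no padding."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


# The product of the strides is the hop between consecutive frames.
total_stride = 1
for _, stride in CONV_LAYERS:
    total_stride *= stride

print(total_stride)        # 320 samples, i.e. 20 ms per frame at 16 kHz
print(num_frames(80_000))  # 249 frames for a 5.0 s clip
```

Under that assumption, each 5.0 s training clip becomes a sequence of 249 frames of width 512 before projection into the 1024-dimensional encoder.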

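TDNN Classifier Head

The TDNN and x-vector values below are stored in the model config, but in Transformers they are read by Wav2Vec2ForXVector. Wav2Vec2ForSequenceClassification, the architecture used here, classifies through the 256-dimensional projection, mean pooling over time, and a final linear layer over the two labels.

| Layer | Dim | Kernel | Dilation |
|---|---|---|---|
| 1 | 512 | 5 | 1 |
| 2 | 512 | 3 | 2 |
| 3 | 512 | 3 | 3 |
| 4 | 512 | 1 | 1 |
| 5 | 1500 | 1 | 1 |

  • Classifier projection: 256
  • X-vector output dim: 512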
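
As a rough check on the size of the classification head, the sketch below counts its parameters, assuming the projection-then-linear head that Transformers uses for Wav2Vec2ForSequenceClassification (a 1024 to 256 projection followed by a 256 to 2 classifier). The helper `linear_params` is illustrative:

```python
HIDDEN_SIZE = 1024  # transformer hidden size
PROJ_SIZE = 256     # classifier projection
NUM_LABELS = 2      # FAKE, REAL


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


projector = linear_params(HIDDEN_SIZE, PROJ_SIZE)   # 262,400
classifier = linear_params(PROJ_SIZE, NUM_LABELS)   # 514
head_total = projector + classifier

print(head_total)  # 262914
```

Against the 315,701,634 total, the head is well under 0.1% of the parameters; almost all capacity sits in the pretrained encoder.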

Regularization

| Setting | Value |
|---|---|
| Attention dropout | 0.1 |
| Hidden dropout | 0.1 |
| Feature proj dropout | 0.1 |
| Activation dropout | 0.0 |
| Final dropout | 0.0 |
| SpecAugment | ✅ enabled |
| Time mask prob | 0.075 |
| Time mask length | 10 |
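
To illustrate what the two time-mask settings control, here is a simplified sketch of SpecAugment-style span masking along the time axis. It is not the exact routine in Transformers; the function name and the `min_spans` parameter are assumptions made for the example:

```python
import random


def sample_time_mask(num_frames, mask_prob=0.075, mask_length=10,
                     min_spans=1, rng=None):
    """Return a list of booleans; True marks a frame to be masked."""
    rng = rng or random.Random()
    max_start = num_frames - mask_length
    if max_start < 0:
        # Sequence is shorter than one span, so nothing is masked.
        return [False] * num_frames

    # Expected span count, with random rounding of the fractional part.
    num_spans = int(mask_prob * num_frames / mask_length + rng.random())
    num_spans = max(num_spans, min_spans)
    num_spans = min(num_spans, max_start + 1)

    # Distinct start positions; spans may still overlap each other.
    starts = rng.sample(range(max_start + 1), num_spans)
    mask = [False] * num_frames
    for start in starts:
        for i in range(start, start + mask_length):
            mask[i] = True
    return mask


mask = sample_time_mask(249, rng=random.Random(42))
print(sum(mask), "of", len(mask), "frames masked")
```

With 249 frames per 5.0 s clip, these settings mask one or two spans of 10 frames in this sketch. Masking of this kind applies only in training mode.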

Preprocessor

From preprocessor_config.json:

| Setting | Value |
|---|---|
| Type | Wav2Vec2FeatureExtractor |
| Sampling rate | 16000 Hz |
| Feature size | 1 (mono) |
| Normalize | ✅ |
| Return attention mask | ✅ |
| Padding side | right |
| Padding value | 0 |
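
The normalize, padding, and attention-mask settings can be pictured with a small pure-Python sketch. It assumes per-utterance zero-mean, unit-variance scaling with a small epsilon (`1e-7` here is an assumption) and uses plain lists instead of arrays; the function names are illustrative and this is not the feature extractor's own code:

```python
import math


def normalize(waveform):
    """Scale one utterance to zero mean and unit variance."""
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    scale = math.sqrt(var + 1e-7)  # epsilon guards against silent clips
    return [(x - mean) / scale for x in waveform]


def pad_right(batch, padding_value=0.0):
    """Right-pad to the longest item and build the attention mask."""
    max_len = max(len(w) for w in batch)
    padded, attention_mask = [], []
    for w in batch:
        pad = max_len - len(w)
        padded.append(list(w) + [padding_value] * pad)
        attention_mask.append([1] * len(w) + [0] * pad)
    return padded, attention_mask


batch = [normalize([0.1, 0.2, 0.3, 0.4]), normalize([0.5, -0.5])]
inputs, mask = pad_right(batch)
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask lets the model ignore the padded tail of shorter clips in a batch.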

Training

Dataset

garystafford/deepfake-audio-detection — 1,866 samples total.

Splits are group-aware, with speakers isolated so that no speaker appears in more than one split:

| Split | Samples |
|---|---|
| Train | 1,467 |
| Validation | 206 |
| Test | 193 |
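
The card does not say how the splits were built. One way to get speaker-isolated splits is to assign whole speakers, not individual clips, to each split. A minimal sketch, in which the `(sample_id, speaker_id)` input format, the 80/10/10 target fractions, and the function name are all assumptions:

```python
import random
from collections import defaultdict


def group_split(samples, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split (sample_id, speaker_id) pairs so speakers never cross splits."""
    by_speaker = defaultdict(list)
    for sample_id, speaker in samples:
        by_speaker[speaker].append(sample_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    targets = [f * len(samples) for f in fractions]
    splits = [[], [], []]  # train, validation, test
    idx = 0
    for speaker in speakers:
        # Advance to the next split once the current one is full.
        while idx < 2 and len(splits[idx]) >= targets[idx]:
            idx += 1
        splits[idx].extend(by_speaker[speaker])
    return splits


# Toy data: 200 clips from 20 hypothetical speakers.
toy = [(i, f"spk{i % 20}") for i in range(200)]
train, val, test = group_split(toy)
print(len(train), len(val), len(test))
```

Because speakers are assigned whole, split sizes only approximate the target fractions, which would be consistent with the uneven 1,467 / 206 / 193 counts.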

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Batch size | 8 per device |
| Gradient accumulation | 2 (effective batch 16) |
| Max epochs | 10 |
| Early stopping patience | 3 (metric: F1) |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | FP16 |
| Loss | Weighted CrossEntropy (class-balanced) |
| Seed | 42 |
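
The step counts implied by these settings can be worked out directly. The sketch below assumes a single GPU and that the last partial batch is kept; early stopping can end training before the maximum. The class-weight helper shows one common "balanced" weighting, `N / (K * n_c)`, which may differ from the weighting actually used, and the class counts passed to it are hypothetical because the card does not give a per-class breakdown:

```python
import math

# Values from the hyperparameter list; one GPU is assumed.
TRAIN_SAMPLES = 1467
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 2
MAX_EPOCHS = 10
WARMUP_RATIO = 0.1

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
batches_per_epoch = math.ceil(TRAIN_SAMPLES / PER_DEVICE_BATCH)
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
max_steps = steps_per_epoch * MAX_EPOCHS
warmup_steps = math.ceil(max_steps * WARMUP_RATIO)

print(effective_batch, steps_per_epoch, max_steps, warmup_steps)
# 16 92 920 92


def balanced_class_weights(class_counts):
    """Weight each class by N / (K * n_c), so rarer classes count more."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {c: total / (num_classes * n) for c, n in class_counts.items()}


# HYPOTHETICAL counts, chosen only so that they sum to 1,467.
weights = balanced_class_weights({"FAKE": 900, "REAL": 567})
print(weights)
```

So the learning rate would warm up over roughly the first epoch's worth of optimizer steps before decaying.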

Preprocessing

  • Silence trimming (librosa.effects.trim, 30 dB threshold)
  • Truncate/pad to 5.0s (80,000 samples at 16 kHz)
  • Random crop during training, center crop during eval
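
The truncate/pad and crop steps can be sketched as follows. Silence trimming is left out (it would run first, for example with `librosa.effects.trim(y, top_db=30)`), zero padding on the right is an assumption, and plain lists stand in for audio arrays; `fix_length` is an illustrative name:

```python
import random

TARGET_LEN = 80_000  # 5.0 s at 16 kHz


def fix_length(waveform, training, rng=None):
    """Pad short clips with zeros; crop long ones to TARGET_LEN samples."""
    n = len(waveform)
    if n < TARGET_LEN:
        return list(waveform) + [0.0] * (TARGET_LEN - n)
    if training:
        # Random crop: a different window each time the clip is seen.
        rng = rng or random
        start = rng.randint(0, n - TARGET_LEN)
    else:
        # Center crop: deterministic, so evaluation is repeatable.
        start = (n - TARGET_LEN) // 2
    return list(waveform[start:start + TARGET_LEN])


clip = [0.0] * 100_000  # a 6.25 s clip of silence as toy input
print(len(fix_length(clip, training=True)))   # 80000
print(len(fix_length(clip, training=False)))  # 80000
```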

Augmentation (training only)

| Transform | Range | Probability |
|---|---|---|
| Gaussian noise | 0.001–0.01 amplitude | 0.4 |
| Time stretch | 0.9–1.1× | 0.3 |
| Pitch shift | ±2 semitones | 0.3 |
| Gain | ±6 dB | 0.5 |
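
Two of these transforms are simple enough to sketch without an audio library. The example assumes "amplitude" means the standard deviation of the added noise and that each transform is drawn independently with its listed probability. Time stretch and pitch shift are omitted because they need a resampling library. All function names are illustrative:

```python
import random


def add_gaussian_noise(wave, rng, min_amp=0.001, max_amp=0.01):
    """Add zero-mean noise with a randomly chosen standard deviation."""
    sigma = rng.uniform(min_amp, max_amp)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def apply_gain(wave, rng, max_db=6.0):
    """Scale by a random gain in [-max_db, +max_db] decibels."""
    gain_db = rng.uniform(-max_db, max_db)
    factor = 10 ** (gain_db / 20)  # dB to linear amplitude ratio
    return [x * factor for x in wave]


def augment(wave, rng):
    """Apply each transform independently with its table probability."""
    if rng.random() < 0.4:
        wave = add_gaussian_noise(wave, rng)
    if rng.random() < 0.5:
        wave = apply_gain(wave, rng)
    return wave


rng = random.Random(42)
augmented = augment([0.1, -0.2, 0.3, -0.4], rng)
print(len(augmented))  # 4
```

A gain of ±6 dB corresponds to scaling the waveform by roughly 0.5× to 2×.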

Infrastructure

| Component | Value |
|---|---|
| Platform | Google Colab |
| GPU | NVIDIA T4 |
| Python | 3.12 |

Usage

import torch
import librosa
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_id = "shivam-2211/voice-detection-model"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load audio as 16 kHz mono, the rate the feature extractor expects
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)

# Normalize the waveform and wrap it as a batch of one
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)  # class probabilities
    pred_id = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][pred_id].item()

# Map the class index back to FAKE or REAL
label = model.config.id2label[pred_id]
print(f"{label} ({confidence:.2%})")

Repository Files

| File | Description |
|---|---|
| model.safetensors | Model weights (1.26 GB) |
| config.json | Architecture and label config |
| preprocessor_config.json | Feature extractor settings |
| training_args.bin | Serialized training hyperparameters |

Limitations

  • Performance may degrade on heavily compressed, noisy, or very short audio.
  • Newer voice synthesis methods may produce artifacts not represented in training data.
  • Predictions should not be used as sole evidence; pair them with expert review.

License

MIT
