Voice Detection Model
Binary audio classifier that labels speech as FAKE (AI-generated) or REAL (human).
Built by fine-tuning facebook/wav2vec2-large-xlsr-53 on the garystafford/deepfake-audio-detection dataset.
Model Details
| Property | Value |
|---|---|
| Architecture | Wav2Vec2ForSequenceClassification |
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Parameters | 315,701,634 (F32) |
| File size | 1.26 GB |
| Format | Safetensors |
| Transformers | 4.57.6 |
Labels
```json
{
  "id2label": { "0": "FAKE", "1": "REAL" },
  "label2id": { "FAKE": 0, "REAL": 1 }
}
```
| ID | Label | Meaning |
|---|---|---|
| 0 | FAKE | AI-generated / synthetic / deepfake |
| 1 | REAL | Authentic human speech |
Architecture
Transformer Encoder
| Setting | Value |
|---|---|
| Hidden size | 1024 |
| Intermediate size | 4096 |
| Layers | 24 |
| Attention heads | 16 |
| Activation | gelu |
| Stable layer norm | ✅ |
| Layer norm eps | 1e-05 |
| Layerdrop | 0.1 |
CNN Feature Extractor (7 layers)
| Layer | Channels | Kernel | Stride |
|---|---|---|---|
| 1 | 512 | 10 | 5 |
| 2 | 512 | 3 | 2 |
| 3 | 512 | 3 | 2 |
| 4 | 512 | 3 | 2 |
| 5 | 512 | 3 | 2 |
| 6 | 512 | 2 | 2 |
| 7 | 512 | 2 | 2 |
- Activation: gelu · Norm: layer · Bias: true
- Conv positional embeddings: 128, 16 groups
- Feature encoder was frozen during fine-tuning
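The kernel and stride columns above determine how many feature frames the transformer sees for a given clip. The sketch below works that out, assuming each layer is an unpadded 1-D convolution (the usual Wav2Vec2 arrangement); the function names are illustrative and not part of this repository.

```python
# Kernel and stride of each of the 7 convolutional layers, from the table above.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Feature frames produced for `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Unpadded convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of all layer strides: input samples advanced per output frame."""
    product = 1
    for stride in CONV_STRIDES:
        product *= stride
    return product


if __name__ == "__main__":
    print(total_stride())            # 320 samples, i.e. 20 ms at 16 kHz
    print(num_output_frames(80_000)) # frames for a 5.0 s clip at 16 kHz
```

Under these assumptions a 5.0 s clip (80,000 samples) yields 249 frames, one every 20 ms.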
TDNN Classifier Head
| Layer | Dim | Kernel | Dilation |
|---|---|---|---|
| 1 | 512 | 5 | 1 |
| 2 | 512 | 3 | 2 |
| 3 | 512 | 3 | 3 |
| 4 | 512 | 1 | 1 |
| 5 | 1500 | 1 | 1 |
- Classifier projection: 256
- X-vector output dim: 512
Regularization
| Setting | Value |
|---|---|
| Attention dropout | 0.1 |
| Hidden dropout | 0.1 |
| Feature proj dropout | 0.1 |
| Activation dropout | 0.0 |
| Final dropout | 0.0 |
| SpecAugment | ✅ enabled |
| Time mask prob | 0.075 |
| Time mask length | 10 |
Preprocessor
From preprocessor_config.json:
| Setting | Value |
|---|---|
| Type | Wav2Vec2FeatureExtractor |
| Sampling rate | 16000 Hz |
| Feature size | 1 (mono) |
| Normalize | ✅ |
| Return attention mask | ✅ |
| Padding side | right |
| Padding value | 0 |
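To make the "Normalize" row concrete: assuming it corresponds to `do_normalize: true`, Wav2Vec2FeatureExtractor rescales each waveform to zero mean and unit variance before it reaches the model. The following is a minimal pure-Python sketch of that operation, not the library's own code; the function name is illustrative.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale one waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    # eps keeps the division finite for silent (constant) input
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in samples]


if __name__ == "__main__":
    print(zero_mean_unit_var([0.1, 0.2, 0.3, 0.4]))
```

In practice the extractor does this for you, so audio passed to it should be raw 16 kHz mono samples rather than pre-normalized values.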
Training
Dataset
garystafford/deepfake-audio-detection, 1,866 samples in total.
Group-aware splits with speaker isolation, so that no speaker appears in more than one split:
| Split | Samples |
|---|---|
| Train | 1,467 |
| Validation | 206 |
| Test | 193 |
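The exact splitting procedure is not documented here, so the following is only a sketch of what a speaker-isolated split can look like. It assumes each sample has a speaker (group) identifier available; `speaker_ids`, the fractions, and the function name are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(speaker_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Split sample indices so that every speaker lands in exactly one split."""
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)

    # Shuffle speakers, not samples, so a speaker's clips stay together
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = len(speaker_ids)
    splits = {"train": [], "val": [], "test": []}
    for speaker in speakers:
        if len(splits["test"]) < test_frac * total:
            target = "test"
        elif len(splits["val"]) < val_frac * total:
            target = "val"
        else:
            target = "train"
        splits[target].extend(by_speaker[speaker])
    return splits


if __name__ == "__main__":
    ids = [f"spk{i % 20}" for i in range(200)]  # hypothetical speaker labels
    result = group_split(ids)
    print({name: len(indices) for name, indices in result.items()})
```

Because whole speakers are assigned at once, split sizes only approximate the requested fractions, which is consistent with the uneven validation and test counts in the table.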
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Batch size | 8 per device |
| Gradient accumulation | 2 (effective batch 16) |
| Max epochs | 10 |
| Early stopping patience | 3 (metric: F1) |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Precision | FP16 |
| Loss | Weighted CrossEntropy (class-balanced) |
| Seed | 42 |
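A few quantities follow from the table by arithmetic. The sketch below derives the optimizer-step schedule for a single GPU and shows one common way to compute class-balanced loss weights, `total / (num_classes * class_count)`. Whether training used exactly this weighting is an assumption, and the class counts in the example are made up for illustration.

```python
import math

TRAIN_SAMPLES = 1467
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 2
MAX_EPOCHS = 10
WARMUP_RATIO = 0.1


def schedule(num_devices: int = 1):
    """Return (effective_batch, steps_per_epoch, max_steps, warmup_steps)."""
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * num_devices
    steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
    max_steps = steps_per_epoch * MAX_EPOCHS  # upper bound; early stopping may end sooner
    warmup_steps = math.ceil(max_steps * WARMUP_RATIO)
    return effective_batch, steps_per_epoch, max_steps, warmup_steps


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


if __name__ == "__main__":
    print(schedule())
    # Hypothetical class counts, not taken from the dataset
    print(balanced_class_weights({"FAKE": 3, "REAL": 1}))
```

On one device this gives an effective batch of 16, 92 optimizer steps per epoch, at most 920 steps, and 92 warmup steps; the training framework's own rounding may differ by a step.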
Preprocessing
- Silence trimming (librosa.effects.trim, 30 dB threshold)
- Truncate/pad to 5.0s (80,000 samples at 16 kHz)
- Random crop during training, center crop during eval
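The truncate/pad and crop steps above can be sketched as follows. This is a plain-Python illustration that assumes right-side zero padding for short clips (matching the preprocessor's padding side) and runs after silence trimming; the function name and `rng` parameter are illustrative.

```python
import random

TARGET_LEN = 80_000  # 5.0 s at 16 kHz


def fix_length(samples, target_len=TARGET_LEN, train=False, rng=None):
    """Crop or zero-pad a 1-D sequence of samples to exactly `target_len`."""
    samples = list(samples)
    n = len(samples)
    if n > target_len:
        extra = n - target_len
        if train:
            rng = rng or random.Random()
            start = rng.randint(0, extra)  # random crop during training
        else:
            start = extra // 2             # center crop during evaluation
        return samples[start:start + target_len]
    # Short clips: pad on the right with zeros (assumed padding side)
    return samples + [0.0] * (target_len - n)


if __name__ == "__main__":
    print(fix_length(range(10), target_len=4))          # center crop
    print(fix_length([1.0, 2.0], target_len=4))         # right padding
```

With NumPy arrays the same logic would typically use slicing and `np.pad` instead of list operations.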
Augmentation (training only)
| Transform | Range | Probability |
|---|---|---|
| Gaussian noise | 0.001–0.01 amplitude | 0.4 |
| Time stretch | 0.9–1.1× | 0.3 |
| Pitch shift | ±2 semitones | 0.3 |
| Gain | ±6 dB | 0.5 |
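As a rough illustration of how two of these transforms can be applied with their listed probabilities, the sketch below implements Gaussian noise and gain in plain Python. It assumes the noise "amplitude" is the standard deviation of the added noise, and it omits time stretch and pitch shift, which need a signal-processing library. The function names are illustrative, not the training code.

```python
import random


def db_to_gain(db: float) -> float:
    """Convert a decibel change to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def augment(samples, rng):
    """Apply noise (p=0.4) and gain (p=0.5) to a list of float samples."""
    out = list(samples)
    if rng.random() < 0.4:
        sigma = rng.uniform(0.001, 0.01)  # assumed: range is the noise std
        out = [x + rng.gauss(0.0, sigma) for x in out]
    if rng.random() < 0.5:
        gain = db_to_gain(rng.uniform(-6.0, 6.0))
        out = [x * gain for x in out]
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    print(augment([0.0, 0.1, -0.1, 0.2], rng))
```

A gain of ±6 dB corresponds to scaling the waveform by roughly 0.5× to 2×.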
Infrastructure
| Component | Value |
|---|---|
| Platform | Google Colab |
| GPU | NVIDIA T4 |
| Python | 3.12 |
Usage
```python
import torch
import librosa
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

model_id = "shivam-2211/voice-detection-model"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Load as 16 kHz mono to match the preprocessor settings
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred_id = torch.argmax(probs, dim=-1).item()
confidence = probs[0][pred_id].item()
label = model.config.id2label[pred_id]
print(f"{label} ({confidence:.2%})")
```
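Since the softmax confidence is not a calibrated probability, callers may want to treat low-confidence predictions as undecided rather than forcing a label. The sketch below shows one way to do that on raw logits; the 0.9 threshold and the "UNCERTAIN" outcome are hypothetical choices, not part of the model.

```python
import math

ID2LABEL = {0: "FAKE", 1: "REAL"}  # mirrors the label mapping in config.json


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, min_confidence=0.9):
    """Return (label, confidence); label is 'UNCERTAIN' below the threshold."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[pred_id]
    label = ID2LABEL[pred_id] if confidence >= min_confidence else "UNCERTAIN"
    return label, confidence


if __name__ == "__main__":
    print(decide([2.0, -2.0]))
    print(decide([0.1, 0.0]))
```

A suitable threshold would need to be chosen on held-out data for the intended use case.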
Repository Files
| File | Description |
|---|---|
| model.safetensors | Model weights (1.26 GB) |
| config.json | Architecture and label config |
| preprocessor_config.json | Feature extractor settings |
| training_args.bin | Serialized training hyperparameters |
Limitations
- Performance may degrade on heavily compressed, noisy, or very short audio.
- Newer voice synthesis methods may produce artifacts not represented in training data.
- Should not be used as sole evidence without expert review.
License
MIT