Simma7 committed
Commit 1b63c88 · verified · 1 parent(s): 7f89d7b

Update README.md

Files changed (1):
  1. README.md +219 -59

README.md CHANGED
@@ -1,68 +1,228 @@
- ---
- license: mit
- language: en
- pipeline_tag: audio-classification
  library_name: transformers
  tags:
- - deepfake
- - audio
- - wav2vec2
- - pytorch
- ---
-
- # 🔊 Deepfake Audio Detection Model
-
- ## 📌 Overview
- This model detects whether an audio file is **REAL or FAKE (AI-generated voice)**.
-
- It is based on **Wav2Vec2 architecture** and uses transformer-based audio embeddings.
-
- ---
-
- ## 🎯 Task
- Binary Classification:
- - 0 → REAL AUDIO
- - 1 → FAKE AUDIO
-
- ---
-
- ## 📥 Input
- - Audio file (.wav)
- - Sampling rate: 16kHz
-
- ---
-
- ## 📤 Output
- - Fake probability (0 to 1)
-
- ---
-
- ## ⚙️ Model Files
- - pytorch_model.bin
- - config.json
- - preprocessor_config.json
- - tokenizer files
-
- ---
-
- ## 🚀 Usage
-
- ```python
- from transformers import AutoProcessor, AutoModel
- import librosa
  import torch
-
- processor = AutoProcessor.from_pretrained("Simma7/audio_model")
- model = AutoModel.from_pretrained("Simma7/audio_model")
-
- audio, sr = librosa.load("test.wav", sr=16000)
-
- inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
  with torch.no_grad():
      outputs = model(**inputs)
-
- embedding = outputs.last_hidden_state.mean(dim=1)
- prob = torch.sigmoid(embedding.mean()).item()
-
- print(prob)
+ ---
  library_name: transformers
+ base_model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification
+ base_model_relation: finetune
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: audio-classification
  tags:
+ - audio
+ - wav2vec2
+ - deepfake-detection
+ - synthetic-speech
+ - tts
+ - voice-cloning
+ datasets:
+ - garystafford/deepfake-audio-detection
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ - roc_auc
+ ---
+
+ # Deepfake Audio Detection Model
+
+ Fine-tuned Wav2Vec2 model for detecting AI-generated speech. It determines whether audio was spoken by a human or created by AI text-to-speech or voice-cloning software.
+
+ ## Model Details
+
+ ### Model Description
+
+ A fine-tuned Wav2Vec2 transformer for binary audio classification (real vs. AI-generated speech), trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice-cloning services, including:
+
+ - ElevenLabs
+ - Amazon Polly
+ - Hexgrad Kokoro
+ - Hume AI
+ - Speechify
+ - Luvvoice
+
+ **Developed by:** Gary A. Stafford
+
+ **Note:** This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS-engine overlap with the base model's training data.
+
+ ## How to Use
+
+ ### Installation
+
+ Install the required dependencies:
+
+ ```bash
+ pip install transformers torch librosa
+ ```
+
+ Optional, for GPU acceleration (recommended):
+
+ ```bash
+ # For CUDA 11.8
+ pip install torch --index-url https://download.pytorch.org/whl/cu118
+
+ # For CUDA 12.1
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+ ```
+
+ ### Quick Start
+
+ ```python
  import torch
+ import librosa
+ from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
+
+ # Load model and feature extractor
+ model_name = "garystafford/wav2vec2-deepfake-voice-detector"
+ model = AutoModelForAudioClassification.from_pretrained(model_name)
+ feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
+
+ # Move to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+ model.eval()
+
+ # Load and preprocess audio (automatically resamples to 16 kHz)
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+ inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ # Run inference
  with torch.no_grad():
      outputs = model(**inputs)
+ logits = outputs.logits
+ probs = torch.nn.functional.softmax(logits, dim=-1)
+
+ # Get prediction
+ prob_real = probs[0][0].item()
+ prob_fake = probs[0][1].item()
+ prediction = "fake" if prob_fake > 0.5 else "real"
+
+ print(f"Prediction: {prediction}")
+ print(f"Confidence: {max(prob_real, prob_fake):.2%}")
+ print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")
+ ```
+
+ ### Expected Input
+
+ - Audio format: WAV, MP3, FLAC, or any format supported by librosa
+ - Sample rate: automatically resampled to 16 kHz
+ - Channels: converted to mono
+ - Duration: optimal performance on 2.5-13 second clips (the model's training range)
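Since the card recommends 2.5-13 second clips, longer recordings are best split into windows of that length before scoring each window separately. A minimal sketch of such pre-chunking (the 10 s window size here is an illustrative choice, not part of the model card):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                window_s: float = 10.0, min_s: float = 2.5) -> list:
    """Split a mono waveform into fixed-size windows, dropping a
    trailing remainder shorter than min_s seconds."""
    window = int(window_s * sr)
    chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
    if chunks and len(chunks[-1]) < int(min_s * sr):
        chunks.pop()  # too short to score reliably
    return chunks

# Example: a 25-second clip at 16 kHz -> two 10 s windows plus one 5 s window
chunks = chunk_audio(np.zeros(25 * 16000))
print([len(c) / 16000 for c in chunks])  # [10.0, 10.0, 5.0]
```

Each chunk can then be passed through the feature extractor and model exactly as in the Quick Start above, and the per-window fake probabilities averaged or max-pooled.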
+
+ ### Output
+
+ The model outputs logits (raw, unnormalized scores) for two classes:
+
+ - Class 0: real (human) audio
+ - Class 1: fake (AI-generated) audio
+
+ **Converting logits to probabilities:** apply softmax to turn raw logits into interpretable probability scores:
+
+ ```python
+ probs = torch.nn.functional.softmax(logits, dim=-1)
+ ```
+
+ - Single sample: `logits.shape = (1, 2)` → `probs.shape = (1, 2)`, where `probs[0]` contains `[prob_real, prob_fake]` summing to 1.0
+ - Batch processing: `logits.shape = (N, 2)` → `probs.shape = (N, 2)`, where each sample's probabilities sum to 1.0 independently
+ - `dim=-1` applies softmax across classes for each sample, not across samples
+
+ ### Batch Processing Example
+
+ ```python
+ import glob
+
+ audio_files = glob.glob("audio_folder/*.wav")
+
+ for audio_path in audio_files:
+     audio, _ = librosa.load(audio_path, sr=16000, mono=True)
+     inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+     probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     prediction = "fake" if probs[0][1] > 0.5 else "real"
+     print(f"{audio_path}: {prediction} ({probs[0][1]:.2%} fake)")
+ ```
+
+ ## Training Details
+
+ ### Dataset
+
+ **Source:** garystafford/deepfake-audio-detection
+
+ Composition:
+
+ - Real audio: YouTube recordings from 14 source videos, human speech samples
+ - Synthetic audio: generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
+ - Format: FLAC, 16 kHz mono, 2.5-13 second chunks
+ - Total samples: 1,866 (balanced: 933 real, 933 fake)
+ - Processing: two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking
+
+ Split:
+
+ | Split      | Real | Fake | Total | Percentage |
+ |------------|------|------|-------|------------|
+ | Train      | 746  | 746  | 1,492 | 80%        |
+ | Validation | 93   | 94   | 187   | 10%        |
+ | Test       | 94   | 93   | 187   | 10%        |
+
+ Stratified splitting was applied to ensure a balanced class distribution across all splits.
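An 80/10/10 stratified split like the table above can be reproduced with, for example, scikit-learn; this is a sketch only, since the card does not give the actual split code or random seed:

```python
from sklearn.model_selection import train_test_split

# 1,866 balanced samples: 933 real (label 0) and 933 fake (label 1)
labels = [0] * 933 + [1] * 933
indices = list(range(len(labels)))

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying on the label at each step to keep the classes balanced
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5,
    stratify=[labels[i] for i in rest_idx], random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))
```

Stratifying at each step is what keeps the real/fake counts per split within one sample of each other, as in the table.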
+
+ ### Training Approach
+
+ **Base model:** Gustking/wav2vec2-large-xlsr-deepfake-audio-classification, a Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.
+
+ **Method:** transfer learning with selective layer freezing.
+
+ - Frozen:
+   - Wav2Vec2 feature extractor (convolutional layers)
+   - Bottom 12 transformer encoder layers
+ - Trained:
+   - Top 12 transformer encoder layers (upper half)
+   - Classification head (256-dimensional projection + linear classifier)
+   - ~160M trainable parameters (approximately half the model)
+
+ **Rationale:** freezing low-level acoustic features while training high-level semantic layers lets the model adapt to this dataset's specific TTS characteristics and speaker patterns while preserving general audio understanding.
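The freezing scheme can be sketched with plain PyTorch. The stand-in modules below mimic the CNN front-end and 24-layer encoder (dimensions shrunk for brevity); in real code you would iterate over the loaded model's `wav2vec2.feature_extractor` and `wav2vec2.encoder.layers` attributes instead, which is an assumption about the checkpoint's layout:

```python
import torch.nn as nn

# Stand-ins for the Wav2Vec2 CNN front-end and its 24 transformer layers
feature_extractor = nn.Conv1d(1, 32, kernel_size=10)
encoder_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
    for _ in range(24)
)

# Freeze the convolutional feature extractor entirely
for p in feature_extractor.parameters():
    p.requires_grad = False

# Freeze the bottom 12 encoder layers; the top 12 stay trainable
for layer in encoder_layers[:12]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = [i for i, layer in enumerate(encoder_layers)
             if all(p.requires_grad for p in layer.parameters())]
print(trainable)  # layers 12..23 remain trainable
```

Only parameters with `requires_grad=True` receive gradients, so the optimizer updates just the top half of the encoder and the classification head.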
+
+ ### Hyperparameters
+
+ | Parameter                   | Value          |
+ |-----------------------------|----------------|
+ | Learning rate               | 3e-5           |
+ | Epochs (max)                | 5              |
+ | Early stopping patience     | 3 evaluations  |
+ | Evaluation frequency        | Every 30 steps |
+ | Per-device batch size       | 4              |
+ | Gradient accumulation steps | 4              |
+ | Effective batch size        | 16             |
+ | Optimizer                   | AdamW          |
+ | Warmup ratio                | 0.1 (10%)      |
+ | Weight decay                | 0.01           |
+ | Save strategy               | Every 30 steps |
+ | Metric for best model       | ROC-AUC        |
+ | Precision                   | FP16           |
+
+ Training statistics:
+
+ - Training samples: 1,492 (746 real, 746 fake)
+ - Validation samples: 187 (93 real, 94 fake)
+ - Trainable parameters: 160,336,770 (~160M, approximately 50% of the full model)
+ - Training approach: freeze the feature extractor and bottom 12 transformer layers; train the top 12 transformer layers plus the classification head
+ - Convergence: efficient convergence (typically ~3-4 epochs) due to the base model's existing deepfake detection capabilities
+ - Why the high performance? Transfer learning from a specialist deepfake detector allows rapid adaptation to this dataset while training substantial portions of the model to capture dataset-specific patterns
+
+ ### Architecture
+
+ The model uses AutoModelForAudioClassification with a two-class output (0 = real, 1 = fake):
+
+ - Feature extractor (frozen): 7 convolutional layers extract acoustic features from raw audio
+ - Transformer encoder:
+   - Layers 0-11 (frozen): preserve low-level acoustic and phonetic representations
+   - Layers 12-23 (trained): adapt high-level semantic features to deepfake patterns
+ - Classification head (trained): 256-dimensional projection + linear classifier
+
+ This architecture balances efficiency with adaptability: frozen layers preserve general audio understanding while the trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.
+
+ ## Model Performance
+
+ ⚠️ **Important context:** these high metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. The results demonstrate successful adaptation to this specific dataset of 1,866 samples, not general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.
+
+ ### Validation Set Performance
+
+ The model performs well on the validation set of 187 audio clips (94 real, 93 fake).
+
+ Validation results (at threshold 0.5):
+
+ - Accuracy: 97.9% (183 of 187 samples correctly classified)
+ - ROC-AUC: 0.998 (near-perfect class separation)
+ - Balanced accuracy: 97.9%
+
+ Per-class metrics (threshold 0.5):
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Real  | 1.00      | 0.96   | 0.98     | 94      |
+ | Fake  | 0.96      | 1.00   | 0.98     | 93      |
+
+ Confusion matrix (threshold 0.5):
+
+ |           | Pred Real | Pred Fake |
+ |-----------|-----------|-----------|
+ | True Real | 90        | 4         |
+ | True Fake | 0         | 93        |
+
+ **Note:** the best balanced accuracy of 98.4% is achieved at threshold 0.9 (96.8% real recall, 100% fake recall).
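Moving the decision threshold, as that note suggests, is a one-line change at inference time. A sketch with made-up probabilities (the 0.72 and 0.95 values are illustrative, not from the card):

```python
def classify(prob_fake: float, threshold: float = 0.5) -> str:
    """Label a clip from its softmax fake-probability."""
    return "fake" if prob_fake >= threshold else "real"

# A borderline clip flagged as fake at the default threshold is
# treated as real at the stricter 0.9 threshold
print(classify(0.72))                 # fake
print(classify(0.72, threshold=0.9))  # real
print(classify(0.95, threshold=0.9))  # fake
```

A higher threshold trades fake recall for real recall; pick it by sweeping thresholds on a held-out set, as was done to find 0.9 here.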
208
+
209
+ Important Notes on Performance
210
+ Context for High Performance:
211
+
212
+ Moderate validation set: 187 samples provides reasonable evaluation, though larger test sets recommended for production validation
213
+ Transfer learning: Base model already trained for deepfake detection on similar TTS engines - fine-tuning adapts existing knowledge
214
+ Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
215
+ ROC-AUC of 0.998: Indicates near-perfect ranking/separation of classes; 4 real samples misclassified as fake at threshold 0.5, while all fake samples correctly identified
216
+ Recommended validation: Test on TTS engines NOT in training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for true generalization assessment
217
+ Generalization Limitations:
218
+
219
+ Model may not generalize well to:
220
+ Novel TTS engines not represented in training data
221
+ Advanced voice cloning/conversion systems
222
+ Real-time voice manipulation
223
+ Low-quality recordings with significant noise
224
+ Inference Performance
225
+ Estimated based on model architecture:
226
+
227
+ Latency: ~50-100ms per sample (varies by hardware)
228
+ Recommended use: Batch processing for efficiency