Simma7 committed
Commit 1b63c88 · verified · 1 parent(s): 7f89d7b

Update README.md

Files changed (1):
  1. README.md +219 -59

README.md CHANGED
@@ -1,68 +1,228 @@
- ---
- license: mit
- language: en
- pipeline_tag: audio-classification
  library_name: transformers
  tags:
- - deepfake
- - audio
- - wav2vec2
- - pytorch
- ---
-
- # 🔊 Deepfake Audio Detection Model
-
- ## 📌 Overview
- This model detects whether an audio file is **REAL or FAKE (AI-generated voice)**.
-
- It is based on **Wav2Vec2 architecture** and uses transformer-based audio embeddings.
-
- ---
-
- ## 🎯 Task
- Binary Classification:
- - 0 → REAL AUDIO
- - 1 → FAKE AUDIO
-
- ---
-
- ## 📥 Input
- - Audio file (.wav)
- - Sampling rate: 16kHz
-
- ---
-
- ## 📤 Output
- - Fake probability (0 to 1)
-
- ---
-
- ## ⚙️ Model Files
- - pytorch_model.bin
- - config.json
- - preprocessor_config.json
- - tokenizer files
-
- ---
-
- ## 🚀 Usage
-
- ```python
- from transformers import AutoProcessor, AutoModel
- import librosa
  import torch
-
- processor = AutoProcessor.from_pretrained("Simma7/audio_model")
- model = AutoModel.from_pretrained("Simma7/audio_model")
-
- audio, sr = librosa.load("test.wav", sr=16000)
-
- inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
  with torch.no_grad():
      outputs = model(**inputs)
-
- embedding = outputs.last_hidden_state.mean(dim=1)
- prob = torch.sigmoid(embedding.mean()).item()
-
- print(prob)
+ ---
  library_name: transformers
+ base_model: Gustking/wav2vec2-large-xlsr-deepfake-audio-classification
+ base_model_relation: finetune
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: audio-classification
  tags:
+ - audio
+ - wav2vec2
+ - deepfake-detection
+ - synthetic-speech
+ - tts
+ - voice-cloning
+ datasets:
+ - garystafford/deepfake-audio-detection
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ - roc_auc
+ ---
+
+ # Deepfake Audio Detection Model
+
+ Fine-tuned Wav2Vec2 model for detecting AI-generated speech. It determines whether audio was spoken by a human or created by AI text-to-speech or voice-cloning software.
+
+ ## Model Details
+
+ ### Model Description
+
+ A fine-tuned Wav2Vec2 transformer for binary audio classification (real vs. AI-generated speech), trained to distinguish authentic human speech from synthetic audio generated by AI text-to-speech and voice-cloning services, including:
+
+ - ElevenLabs
+ - Amazon Polly
+ - Hexgrad Kokoro
+ - Hume AI
+ - Speechify
+ - Luvvoice
+
+ **Developed by:** Gary A. Stafford
+
+ **Note:** This model uses transfer learning from a base model already trained for deepfake detection. Fast convergence is expected due to task similarity and TTS-engine overlap with the base model's training data.
+
+ ## How to Use
+
+ ### Installation
+
+ Install the required dependencies:
+
+ ```bash
+ pip install transformers torch librosa
+ ```
+
+ Optional, for GPU acceleration (recommended):
+
+ ```bash
+ # For CUDA 11.8
+ pip install torch --index-url https://download.pytorch.org/whl/cu118
+
+ # For CUDA 12.1
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+ ```
+
+ ### Quick Start
+
+ ```python
  import torch
+ import librosa
+ from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
+
+ # Load model and feature extractor
+ model_name = "garystafford/wav2vec2-deepfake-voice-detector"
+ model = AutoModelForAudioClassification.from_pretrained(model_name)
+ feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
+
+ # Move to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+ model.eval()
+
+ # Load and preprocess audio (automatically resamples to 16 kHz)
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+ inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ # Run inference
  with torch.no_grad():
      outputs = model(**inputs)
+ logits = outputs.logits
+ probs = torch.nn.functional.softmax(logits, dim=-1)
+
+ # Get prediction
+ prob_real = probs[0][0].item()
+ prob_fake = probs[0][1].item()
+ prediction = "fake" if prob_fake > 0.5 else "real"
+
+ print(f"Prediction: {prediction}")
+ print(f"Confidence: {max(prob_real, prob_fake):.2%}")
+ print(f"Probabilities - Real: {prob_real:.2%}, Fake: {prob_fake:.2%}")
+ ```
+
+ ### Expected Input
+
+ - Audio format: WAV, MP3, FLAC, or any format supported by librosa
+ - Sample rate: automatically resampled to 16 kHz
+ - Channels: converted to mono
+ - Duration: optimal performance on 2.5-13 second clips (the model's training range)
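Since the card recommends 2.5-13 second clips, longer recordings are best split into windows of that length before scoring each window separately. A minimal sketch of such pre-chunking (the 10 s window size here is an illustrative choice, not part of the model card):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                window_s: float = 10.0, min_s: float = 2.5) -> list:
    """Split a mono waveform into fixed-size windows, dropping a
    trailing remainder shorter than min_s seconds."""
    window = int(window_s * sr)
    chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
    if chunks and len(chunks[-1]) < int(min_s * sr):
        chunks.pop()  # too short to score reliably
    return chunks

# Example: a 25-second clip at 16 kHz -> two 10 s windows plus one 5 s window
chunks = chunk_audio(np.zeros(25 * 16000))
print([len(c) / 16000 for c in chunks])  # [10.0, 10.0, 5.0]
```

Each chunk can then be passed through the feature extractor and model exactly as in the Quick Start above, and the per-window fake probabilities averaged or max-pooled.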
+
+ ### Output
+
+ The model outputs logits (raw, unnormalized scores) for two classes:
+
+ - Class 0: real (human) audio
+ - Class 1: fake (AI-generated) audio
+
+ **Converting logits to probabilities:** apply softmax to turn raw logits into interpretable probability scores:
+
+ ```python
+ probs = torch.nn.functional.softmax(logits, dim=-1)
+ ```
+
+ - Single sample: `logits.shape = (1, 2)` → `probs.shape = (1, 2)`, where `probs[0]` contains `[prob_real, prob_fake]` summing to 1.0
+ - Batch processing: `logits.shape = (N, 2)` → `probs.shape = (N, 2)`, where each sample's probabilities sum to 1.0 independently
+ - `dim=-1` applies softmax across classes for each sample, not across samples
+
+ ### Batch Processing Example
+
+ ```python
+ import glob
+
+ audio_files = glob.glob("audio_folder/*.wav")
+
+ for audio_path in audio_files:
+     audio, _ = librosa.load(audio_path, sr=16000, mono=True)
+     inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+     probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+     prediction = "fake" if probs[0][1] > 0.5 else "real"
+     print(f"{audio_path}: {prediction} ({probs[0][1]:.2%} fake)")
+ ```
+
+ ## Training Details
+
+ ### Dataset
+
+ **Source:** garystafford/deepfake-audio-detection
+
+ Composition:
+
+ - Real audio: YouTube recordings from 14 source videos, human speech samples
+ - Synthetic audio: generated using 6 TTS platforms (ElevenLabs, Amazon Polly, Hexgrad Kokoro, Hume AI, Speechify, Luvvoice)
+ - Format: FLAC, 16 kHz mono, 2.5-13 second chunks
+ - Total samples: 1,866 (balanced: 933 real, 933 fake)
+ - Processing: two-pass audio splitting with silence detection, concatenation of short segments, and VAD-based sub-chunking
+
+ Split:
+
+ | Split      | Real | Fake | Total | Percentage |
+ |------------|------|------|-------|------------|
+ | Train      | 746  | 746  | 1,492 | 80%        |
+ | Validation | 93   | 94   | 187   | 10%        |
+ | Test       | 94   | 93   | 187   | 10%        |
+
+ Stratified splitting was applied to ensure a balanced class distribution across all splits.
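An 80/10/10 stratified split like the table above can be reproduced with, for example, scikit-learn; this is a sketch only, since the card does not give the actual split code or random seed:

```python
from sklearn.model_selection import train_test_split

# 1,866 balanced samples: 933 real (label 0) and 933 fake (label 1)
labels = [0] * 933 + [1] * 933
indices = list(range(len(labels)))

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying on the label at each step to keep the classes balanced
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5,
    stratify=[labels[i] for i in rest_idx], random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))
```

Stratifying at each step is what keeps the real/fake counts per split within one sample of each other, as in the table.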
+
+ ### Training Approach
+
+ **Base model:** Gustking/wav2vec2-large-xlsr-deepfake-audio-classification, a Wav2Vec2-XLSR model pre-trained on 53 languages and already fine-tuned for deepfake audio detection.
+
+ **Method:** transfer learning with selective layer freezing.
+
+ - Frozen:
+   - Wav2Vec2 feature extractor (convolutional layers)
+   - Bottom 12 transformer encoder layers
+ - Trained:
+   - Top 12 transformer encoder layers (upper half)
+   - Classification head (256-dimensional projection + linear classifier)
+   - ~160M trainable parameters (approximately half the model)
+
+ **Rationale:** freezing low-level acoustic features while training high-level semantic layers lets the model adapt to this dataset's specific TTS characteristics and speaker patterns while preserving general audio understanding.
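The freezing scheme can be sketched with plain PyTorch. The stand-in modules below mimic the CNN front-end and 24-layer encoder (dimensions shrunk for brevity); in real code you would iterate over the loaded model's `wav2vec2.feature_extractor` and `wav2vec2.encoder.layers` attributes instead, which is an assumption about the checkpoint's layout:

```python
import torch.nn as nn

# Stand-ins for the Wav2Vec2 CNN front-end and its 24 transformer layers
feature_extractor = nn.Conv1d(1, 32, kernel_size=10)
encoder_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
    for _ in range(24)
)

# Freeze the convolutional feature extractor entirely
for p in feature_extractor.parameters():
    p.requires_grad = False

# Freeze the bottom 12 encoder layers; the top 12 stay trainable
for layer in encoder_layers[:12]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = [i for i, layer in enumerate(encoder_layers)
             if all(p.requires_grad for p in layer.parameters())]
print(trainable)  # layers 12..23 remain trainable
```

Only parameters with `requires_grad=True` receive gradients, so the optimizer updates just the top half of the encoder and the classification head.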
+
+ ### Hyperparameters
+
+ | Parameter                   | Value          |
+ |-----------------------------|----------------|
+ | Learning rate               | 3e-5           |
+ | Epochs (max)                | 5              |
+ | Early stopping patience     | 3 evaluations  |
+ | Evaluation frequency        | Every 30 steps |
+ | Per-device batch size       | 4              |
+ | Gradient accumulation steps | 4              |
+ | Effective batch size        | 16             |
+ | Optimizer                   | AdamW          |
+ | Warmup ratio                | 0.1 (10%)      |
+ | Weight decay                | 0.01           |
+ | Save strategy               | Every 30 steps |
+ | Metric for best model       | ROC-AUC        |
+ | Precision                   | FP16           |
+
+ Training statistics:
+
+ - Training samples: 1,492 (746 real, 746 fake)
+ - Validation samples: 187 (93 real, 94 fake)
+ - Trainable parameters: 160,336,770 (~160M, approximately 50% of the full model)
+ - Training approach: freeze the feature extractor and bottom 12 transformer layers; train the top 12 transformer layers plus the classification head
+ - Convergence: efficient convergence (typically ~3-4 epochs) due to the base model's existing deepfake detection capabilities
+ - Why the high performance? Transfer learning from a specialist deepfake detector allows rapid adaptation to this dataset while training substantial portions of the model to capture dataset-specific patterns
+
+ ### Architecture
+
+ The model uses AutoModelForAudioClassification with a two-class output (0 = real, 1 = fake):
+
+ - Feature extractor (frozen): 7 convolutional layers extract acoustic features from raw audio
+ - Transformer encoder:
+   - Layers 0-11 (frozen): preserve low-level acoustic and phonetic representations
+   - Layers 12-23 (trained): adapt high-level semantic features to deepfake patterns
+ - Classification head (trained): 256-dimensional projection + linear classifier
+
+ This architecture balances efficiency with adaptability: frozen layers preserve general audio understanding while the trained layers (~160M parameters) learn dataset-specific deepfake detection patterns.
+
+ ## Model Performance
+
+ ⚠️ **Important context:** these high metrics reflect fine-tuning a specialist model on its own domain. The base model (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification) was already trained for deepfake detection, likely on similar TTS engines. The results demonstrate successful adaptation to this specific dataset of 1,866 samples, not general deepfake detection capability from scratch. The excellent ROC-AUC (0.998) indicates near-perfect class separation, though 4 samples (2.1%) are still misclassified at the default 0.5 threshold.
+
+ ### Validation Set Performance
+
+ The model performs well on the validation set of 187 audio clips (94 real, 93 fake).
+
+ Validation results (at threshold 0.5):
+
+ - Accuracy: 97.9% (183 of 187 samples correctly classified)
+ - ROC-AUC: 0.998 (near-perfect class separation)
+ - Balanced accuracy: 97.9%
+
+ Per-class metrics (threshold 0.5):
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Real  | 1.00      | 0.96   | 0.98     | 94      |
+ | Fake  | 0.96      | 1.00   | 0.98     | 93      |
+
+ Confusion matrix (threshold 0.5):
+
+ |           | Pred Real | Pred Fake |
+ |-----------|-----------|-----------|
+ | True Real | 90        | 4         |
+ | True Fake | 0         | 93        |
+
+ **Note:** the best balanced accuracy of 98.4% is achieved at threshold 0.9 (96.8% real recall, 100% fake recall).
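Moving the decision threshold, as that note suggests, is a one-line change at inference time. A sketch with made-up probabilities (the 0.72 and 0.95 values are illustrative, not from the card):

```python
def classify(prob_fake: float, threshold: float = 0.5) -> str:
    """Label a clip from its softmax fake-probability."""
    return "fake" if prob_fake >= threshold else "real"

# A borderline clip flagged as fake at the default threshold is
# treated as real at the stricter 0.9 threshold
print(classify(0.72))                 # fake
print(classify(0.72, threshold=0.9))  # real
print(classify(0.95, threshold=0.9))  # fake
```

A higher threshold trades fake recall for real recall; pick it by sweeping thresholds on a held-out set, as was done to find 0.9 here.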
208
+
209
+ Important Notes on Performance
210
+ Context for High Performance:
211
+
212
+ Moderate validation set: 187 samples provides reasonable evaluation, though larger test sets recommended for production validation
213
+ Transfer learning: Base model already trained for deepfake detection on similar TTS engines - fine-tuning adapts existing knowledge
214
+ Dataset characteristics: TTS-generated audio has distinctive artifacts (prosody patterns, spectral signatures) that differentiate it from human speech
215
+ ROC-AUC of 0.998: Indicates near-perfect ranking/separation of classes; 4 real samples misclassified as fake at threshold 0.5, while all fake samples correctly identified
216
+ Recommended validation: Test on TTS engines NOT in training data (e.g., OpenAI TTS, Azure Neural, advanced voice cloning systems) for true generalization assessment
217
+ Generalization Limitations:
218
+
219
+ Model may not generalize well to:
220
+ Novel TTS engines not represented in training data
221
+ Advanced voice cloning/conversion systems
222
+ Real-time voice manipulation
223
+ Low-quality recordings with significant noise
224
+ Inference Performance
225
+ Estimated based on model architecture:
226
+
227
+ Latency: ~50-100ms per sample (varies by hardware)
228
+ Recommended use: Batch processing for efficiency