# 🎵 Deepfake Audio Detection Model A machine learning model to detect deepfake/synthetic audio using Wav2Vec2 embeddings and classical ML classifiers. [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow)](https://huggingface.co/hjsgfd/deepfake_audio_classifier) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) ## 📊 Model Performance | Model | Accuracy | Precision | Recall | F1-Score | |-------|----------|-----------|--------|----------| | **Logistic Regression** | **92.86%** | 0.95 | 0.93 | 0.93 | | SVM | 85.71% | 0.89 | 0.86 | 0.85 | | Random Forest | 78.57% | 0.85 | 0.79 | 0.76 | **Best Model: Logistic Regression with 92.86% accuracy** ## 🎯 Approach ### 1. Dataset - **Source**: [Real vs Fake Human Voice Deepfake Audio Dataset](https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio) - **Size**: 70 audio samples - **Classes**: 5 classes (0, 1, 2, 3, 4) - **Distribution**: Perfectly balanced (14 samples per class) ### 2. Feature Extraction We use **Wav2Vec2** (facebook/wav2vec2-base-960h) to extract deep audio embeddings: - Pre-trained self-supervised model - Extracts 768-dimensional feature vectors - Captures semantic audio information - Handles variable-length audio automatically **Pipeline:** ``` Audio File → Wav2Vec2 → 768-dim Embedding → Classifier → Prediction ``` ### 3. Model Training Three classifiers were trained and compared: #### Logistic Regression (Best) - **Accuracy**: 92.86% - Multi-class classification with OvR strategy - Max iterations: 1000 - Features: StandardScaler normalized #### SVM - **Accuracy**: 85.71% - RBF kernel - Probability estimates enabled #### Random Forest - **Accuracy**: 78.57% - 200 estimators - Parallel processing enabled ### 4. Preprocessing - **Audio Loading**: Support for both URLs and local files - **Resampling**: All audio converted to 16kHz - **Stereo to Mono**: Averaged across channels - **Normalization**: StandardScaler on embeddings ## 🚀 Quick Start ### Installation ```bash pip install transformers torch librosa soundfile scikit-learn huggingface-hub requests numpy ``` ### Usage #### Simple Prediction ```python from predict_from_hf import AudioDeepfakeDetectorFromHF # Initialize detector (downloads model automatically) detector = AudioDeepfakeDetectorFromHF("hjsgfd/deepfake_audio_classifier") # Predict from URL result = detector.predict("https://your-audio-file.wav", is_url=True) print(f"Prediction: {result['label']} ({result['confidence']:.1%})") ``` #### Batch Prediction ```python from predict_from_hf import AudioDeepfakeDetectorFromHF detector = AudioDeepfakeDetectorFromHF("hjsgfd/deepfake_audio_classifier") # Multiple URLs audio_urls = [ "https://example.com/audio1.wav", "https://example.com/audio2.wav", "https://example.com/audio3.wav", ] results = detector.predict_batch(audio_urls, are_urls=True) # Print results for result in results: if 'prediction' in result: print(f"{result['audio_source']}: {result['label']} ({result['confidence']:.1%})") ``` #### Local Files ```python # Single file result = detector.predict("path/to/audio.wav", is_url=False) # Multiple files local_files = ["audio1.wav", "audio2.wav", "audio3.wav"] results = detector.predict_batch(local_files, are_urls=False) ``` ## 📁 Model Files The model consists of three files hosted on Hugging Face: 1. **deepfake_audio_classifier.pkl** - Trained Logistic Regression classifier 2. **audio_scaler.pkl** - StandardScaler for feature normalization 3. **model_metadata.json** - Model configuration and metadata ```json { "model_type": "LogisticRegression", "accuracy": 0.9286, "feature_extractor": "facebook/wav2vec2-base-960h", "embedding_dim": 768, "num_classes": 5, "class_labels": { "0": "class_0", "1": "class_1", "2": "class_2", "3": "class_3", "4": "class_4" } } ``` ## 📈 Detailed Results ### Training Configuration - **Training Samples**: 56 (80%) - **Testing Samples**: 14 (20%) - **Feature Dimension**: 768 - **Stratified Split**: Maintains class distribution ### Logistic Regression Performance (Best Model) ``` precision recall f1-score support class_0 1.00 0.67 0.80 3 class_1 1.00 1.00 1.00 2 class_2 1.00 1.00 1.00 3 class_3 0.75 1.00 0.86 3 class_4 1.00 1.00 1.00 3 accuracy 0.93 14 macro avg 0.95 0.93 0.93 14 weighted avg 0.95 0.93 0.93 14 ``` ### Key Metrics - **Macro Average Precision**: 0.95 - **Macro Average Recall**: 0.93 - **Macro Average F1-Score**: 0.93 - **Overall Accuracy**: 92.86% ## 🔧 Technical Details ### Dependencies ``` transformers>=4.30.0 torch>=2.0.0 librosa>=0.10.0 soundfile>=0.12.0 scikit-learn>=1.3.0 huggingface-hub>=0.16.0 requests>=2.31.0 numpy>=1.24.0 ``` ### Model Architecture ``` Input: Audio File (any format supported by soundfile) ↓ Preprocessing (16kHz, Mono) ↓ Wav2Vec2 Feature Extractor ↓ 768-dimensional Embedding ↓ StandardScaler Normalization ↓ Logistic Regression Classifier ↓ Output: Class Prediction + Confidence Scores ``` ### Supported Audio Formats - WAV - MP3 - FLAC - OGG - M4A ## 📊 Training Process 1. **Data Loading**: Load dataset with disabled auto-decoding 2. **Feature Extraction**: Extract Wav2Vec2 embeddings (768-dim vectors) 3. **Train-Test Split**: 80-20 stratified split 4. **Normalization**: StandardScaler on training data 5. **Model Training**: Train 3 classifiers (LR, RF, SVM) 6. **Evaluation**: Compare performance on test set 7. **Selection**: Choose best model (Logistic Regression) 8. **Export**: Save model, scaler, and metadata ## 🎯 Use Cases - Deepfake audio detection - Voice authentication systems - Media verification tools - Forensic audio analysis - Content moderation platforms ## 🤝 Contributing Contributions are welcome! Please feel free to submit a Pull Request. ## 📝 Citation If you use this model, please cite: ```bibtex @misc{deepfake_audio_classifier_2024, author = {Your Name}, title = {Deepfake Audio Detection Model}, year = {2024}, publisher = {Hugging Face}, howpublished = {\url{https://huggingface.co/hjsgfd/deepfake_audio_classifier}} } ``` ## 🙏 Acknowledgments - **Dataset**: [ud-nlp/real-vs-fake-human-voice-deepfake-audio](https://huggingface.co/datasets/ud-nlp/real-vs-fake-human-voice-deepfake-audio) - **Feature Extractor**: [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) - **Transformers Library**: Hugging Face ## 📧 Contact For questions or feedback, please open an issue on the repository. ---