---
license: mit
datasets:
- fixie-ai/librispeech_asr
language:
- en
base_model:
- facebook/wav2vec2-base
pipeline_tag: audio-classification
metrics:
- accuracy
library_name: transformers
tags:
- voice_phishing
- audio_classification
---
# Voice Detection AI - Real vs AI Audio Classifier
### **Model Overview**

This model is a fine-tuned Wav2Vec2-based audio classifier that distinguishes **real human voices** from **AI-generated voices**. It was trained on a dataset containing samples from various TTS models alongside real human audio recordings.

---
### **Model Details**

- **Architecture:** Wav2Vec2ForSequenceClassification
- **Fine-tuned on:** Custom dataset with real and AI-generated audio
- **Classes:**
  1. Real human voice
  2. AI-generated (e.g., MelGAN, DiffWave)
- **Input Requirements:**
  - Audio format: `.wav`, `.mp3`, etc.
  - Sample rate: 16 kHz
  - Max duration: 10 seconds (longer audio is truncated, shorter audio is padded)
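The truncate-or-pad rule above can be sketched in plain Python (`fix_length` is an illustrative helper, not part of the model's API; in practice the processor's own padding/truncation options would typically handle this):

```python
TARGET_SR = 16000                      # expected sample rate (Hz)
MAX_SECONDS = 10                       # maximum clip duration
TARGET_LEN = TARGET_SR * MAX_SECONDS   # 160,000 samples

def fix_length(samples, target_len=TARGET_LEN):
    """Truncate longer clips and zero-pad shorter ones to exactly target_len samples."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

# A 12-second clip is truncated; a 3-second clip is zero-padded.
long_clip = [0.1] * (TARGET_SR * 12)
short_clip = [0.1] * (TARGET_SR * 3)
print(len(fix_length(long_clip)), len(fix_length(short_clip)))  # 160000 160000
```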
---
### **Performance**

- **Robustness:** Classifies audio correctly across multiple AI-generation models.
- **Limitations:** Struggles with some unseen AI-generation models (e.g., ElevenLabs).

---
### **How to Use**

#### **1. Install Dependencies**

Make sure you have `transformers`, `torch`, and `torchaudio` installed:

```bash
pip install transformers torch torchaudio
```

#### **2. Run Inference**

Here's how to use VoiceGUARD for audio classification:
```python
import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torchaudio

# Load model and processor
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# Load audio
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Preprocess
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)

# Map to label
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")
```
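The final argmax-to-label step above can also report a confidence score. A minimal, dependency-free sketch (`softmax_confidence` is a hypothetical helper, not part of the model card's API; in the real snippet you would pass it `logits[0].tolist()`):

```python
import math

LABELS = ["Real Human Voice", "AI-generated"]

def softmax_confidence(logits, labels=LABELS):
    """Convert raw logits to (predicted_label, probability) via a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return labels[best], probs[best]

# Example: logits favoring the "AI-generated" class.
label, prob = softmax_confidence([-1.2, 2.3])
print(label, round(prob, 3))  # AI-generated 0.971
```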
## Training Procedure

- **Data Collection:** Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
- **Preprocessing:** Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
- **Fine-Tuning:** Fine-tuned the Wav2Vec2 sequence-classification architecture for 3 epochs with a learning rate of 1e-5.
## Evaluation

- **Metrics:** Accuracy, Precision, Recall
- **Results:** Achieved 99.8% accuracy on the held-out validation set.
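For reference, the listed metrics can be computed from binary predictions as follows (a generic sketch treating "AI-generated" as the positive class; this is not the project's actual evaluation script):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for binary labels (positive = AI-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy example: 0 = real human voice, 1 = AI-generated
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(binary_metrics(y_true, y_pred))  # all three come out to 2/3 here
```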
## Limitations and Future Work

- While VoiceGUARD performs robustly on known AI-generation models, it may struggle with novel or unseen ones.
- Future work includes expanding the training dataset with samples from emerging TTS technologies to improve generalization.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements

- Special thanks to the developers of the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model and the contributors to the datasets used in this project.
- View the complete project on [GitHub](https://github.com/Mrkomiljon/VoiceGUARD2).