# AI Voice Detection Model (Wav2Vec2)

Multi-language AI-generated voice detection model for Tamil, English, Hindi, Malayalam, and Telugu.

## Model Description

This model detects whether an audio clip is AI-generated or spoken by a human. It uses Facebook's Wav2Vec2-large-xlsr-53 as the backbone with a custom classification head.
## Performance
- Accuracy: 99.69%
- AUROC: 1.0
- EER (Equal Error Rate): 0.25%
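Metrics like those above can be reproduced from raw model scores on a held-out set. A minimal sketch using scikit-learn, with toy labels and scores standing in for the actual test set:

```python
# Sketch: computing AUROC and EER from detector scores.
# `labels` (0 = human, 1 = AI) and `scores` (probability of class 1)
# are toy stand-ins, not results from the real evaluation set.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.7, 0.9, 0.95])

auroc = roc_auc_score(labels, scores)

# EER: the operating point where false-positive rate equals false-negative rate
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]

print(f"AUROC: {auroc:.4f}, EER: {eer:.2%}")
```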
## Supported Languages
- Tamil
- English
- Hindi
- Malayalam
- Telugu
## Model Architecture

```
Wav2Vec2Model (facebook/wav2vec2-large-xlsr-53)
├── Dropout (0.1)
└── Linear (1024 → 2)
```
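The head's dimensions can be sanity-checked in isolation. A minimal sketch, assuming a batch of mean-pooled backbone features (the backbone itself is not loaded here):

```python
import torch
import torch.nn as nn

hidden_size = 1024  # wav2vec2-large-xlsr-53 hidden size
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, 2))

pooled = torch.randn(1, hidden_size)  # stand-in for mean-pooled features
logits = head(pooled)
print(logits.shape)  # torch.Size([1, 2])
```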
## Usage

### Installation

```bash
pip install torch transformers librosa pydub numpy
```
### Load the Model

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from pydub import AudioSegment
import librosa
import numpy as np

# Define the model architecture
class W2VBertDeepfakeDetector(nn.Module):
    def __init__(self, backbone, num_labels=2):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_values, attention_mask=None):
        outputs = self.backbone(input_values=input_values, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        # Mean-pool over the time dimension, then classify
        pooled = hidden_states.mean(dim=1)
        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits

# Load backbone
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Create model and load weights
model = W2VBertDeepfakeDetector(backbone, num_labels=2)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()
```
### Inference

```python
def load_audio(path, target_sr=16000):
    audio_segment = AudioSegment.from_file(path)
    samples = np.array(audio_segment.get_array_of_samples()).astype(np.float32)
    if audio_segment.channels > 1:
        samples = samples.reshape(-1, audio_segment.channels).mean(axis=1)
    # Normalize by the full-scale value for the clip's sample width
    # (32768.0 for the common 16-bit case)
    samples /= float(1 << (8 * audio_segment.sample_width - 1))
    if audio_segment.frame_rate != target_sr:
        samples = librosa.resample(samples, orig_sr=audio_segment.frame_rate, target_sr=target_sr)
    return torch.from_numpy(samples).float()

# Load and classify audio
waveform = load_audio("your_audio.mp3")
input_values = waveform.unsqueeze(0)

with torch.no_grad():
    logits = model(input_values)
    probs = torch.softmax(logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0, prediction].item()

result = "AI_GENERATED" if prediction == 1 else "HUMAN"
print(f"Classification: {result} (confidence: {confidence:.2%})")
```
## Training Details
- Base Model: facebook/wav2vec2-large-xlsr-53
- Training Data: IndicSynth + custom multilingual dataset
- Split: 90% train, 5% val, 5% test
- Batch Size: 8
- Learning Rate: 1e-5
- Epochs: 5
- Optimizer: AdamW with linear warmup
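The optimizer setup above can be sketched with the `get_linear_schedule_with_warmup` helper from `transformers`. The step counts below are illustrative assumptions, not the original training values:

```python
# Sketch of AdamW with a linear warmup then linear decay schedule.
# The stand-in model, warmup fraction, and total step count are assumptions.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

num_training_steps = 1000  # assumed: steps_per_epoch * 5 epochs
num_warmup_steps = 100     # assumed warmup fraction (10%)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# LR ramps up linearly during warmup, peaks at 1e-5, then decays to zero
lrs = []
for step in range(num_training_steps):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
```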
## Limitations
- Optimized for speech in supported languages
- Audio should be at least 0.5 seconds long
- Best performance with 16kHz sample rate
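A minimal sketch of enforcing the 0.5-second minimum before inference, assuming short clips are zero-padded (rejecting them outright would work equally well; `prepare_waveform` is a hypothetical helper, not part of the model):

```python
# Sketch: guard against clips shorter than the stated 0.5 s minimum
# by zero-padding them to length. Assumes 16 kHz mono input.
import torch
import torch.nn.functional as F

TARGET_SR = 16000
MIN_SAMPLES = int(0.5 * TARGET_SR)  # 0.5 s at 16 kHz

def prepare_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Zero-pad clips shorter than 0.5 s so inference has enough frames."""
    if waveform.shape[-1] < MIN_SAMPLES:
        pad = MIN_SAMPLES - waveform.shape[-1]
        waveform = F.pad(waveform, (0, pad))
    return waveform

short = torch.randn(4000)             # 0.25 s at 16 kHz
print(prepare_waveform(short).shape)  # torch.Size([8000])
```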
## License
MIT
## Citation

```bibtex
@misc{ai-voice-detection-2024,
  author = {Your Name},
  title = {Multilingual AI Voice Detection using Wav2Vec2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/kimnamjoon0007/lkht-v440}
}
```