AI Voice Detection Model (Wav2Vec2)

Multi-language AI-generated voice detection model for Tamil, English, Hindi, Malayalam, and Telugu.

Model Description

This model detects whether an audio clip is AI-generated or spoken by a human. It uses Facebook's Wav2Vec2-large-xlsr-53 as the backbone with a custom classification head.

Performance

  • Accuracy: 99.69%
  • AUROC: 1.0
  • EER (Equal Error Rate): 0.25%
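The equal error rate is the operating point where the false-acceptance rate equals the false-rejection rate. A minimal NumPy sketch of how it can be computed from scores (the labels and scores below are toy values for illustration, not from this model's evaluation):

```python
import numpy as np

def compute_eer(labels, scores):
    """Equal error rate: the threshold where FPR and FNR (approximately) meet."""
    best_gap, eer = np.inf, None
    for t in np.unique(scores):
        preds = scores >= t
        fpr = np.mean(preds[labels == 0])   # human clips accepted as AI
        fnr = np.mean(~preds[labels == 1])  # AI clips rejected as human
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer

# Toy example: label 1 = AI-generated, 0 = human
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.95])
print(f"EER: {compute_eer(labels, scores):.2%}")
```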

Supported Languages

  • Tamil
  • English
  • Hindi
  • Malayalam
  • Telugu

Model Architecture

Wav2Vec2Model (facebook/wav2vec2-large-xlsr-53)
    └── Dropout (0.1)
        └── Linear (1024 β†’ 2)

Usage

Installation

pip install torch transformers librosa pydub numpy

Load the Model

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from pydub import AudioSegment
import librosa
import numpy as np

# Define the model architecture
class W2VBertDeepfakeDetector(nn.Module):
    def __init__(self, backbone, num_labels=2):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_values, attention_mask=None):
        outputs = self.backbone(input_values=input_values, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        pooled = hidden_states.mean(dim=1)
        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits

# Load backbone
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Create model and load weights
model = W2VBertDeepfakeDetector(backbone, num_labels=2)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

Inference

def load_audio(path, target_sr=16000):
    audio_segment = AudioSegment.from_file(path)
    samples = np.array(audio_segment.get_array_of_samples()).astype(np.float32)
    if audio_segment.channels > 1:
        samples = samples.reshape(-1, audio_segment.channels).mean(axis=1)
    # Scale integer PCM to [-1, 1] based on the actual sample width
    samples /= float(1 << (8 * audio_segment.sample_width - 1))
    if audio_segment.frame_rate != target_sr:
        samples = librosa.resample(samples, orig_sr=audio_segment.frame_rate, target_sr=target_sr)
    return torch.from_numpy(samples).float()

# Load and classify audio
waveform = load_audio("your_audio.mp3")
input_values = waveform.unsqueeze(0)

with torch.no_grad():
    logits = model(input_values)
    probs = torch.softmax(logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0, prediction].item()

result = "AI_GENERATED" if prediction == 1 else "HUMAN"
print(f"Classification: {result} (confidence: {confidence:.2%})")
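For batched inference on clips of different lengths, the forward pass above also accepts an attention_mask. A sketch of building the padded batch with plain PyTorch (the clips below are random stand-ins; the commented call assumes `model` from the previous section):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical clips of different lengths (1.0 s and 0.5 s at 16 kHz)
clips = [torch.randn(16000), torch.randn(8000)]

# Right-pad to the longest clip and mark real samples in the mask
input_values = pad_sequence(clips, batch_first=True)
attention_mask = torch.zeros_like(input_values, dtype=torch.long)
for i, clip in enumerate(clips):
    attention_mask[i, : clip.shape[0]] = 1

# The padded batch can then be classified as:
# with torch.no_grad():
#     logits = model(input_values, attention_mask=attention_mask)
print(tuple(input_values.shape), attention_mask.sum(dim=1).tolist())
```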

Training Details

  • Base Model: facebook/wav2vec2-large-xlsr-53
  • Training Data: IndicSynth + custom multilingual dataset
  • Split: 90% train, 5% val, 5% test
  • Batch Size: 8
  • Learning Rate: 1e-5
  • Epochs: 5
  • Optimizer: AdamW with linear warmup
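The optimizer setup can be sketched with plain PyTorch. The total step count and 10% warmup fraction below are assumptions for illustration; the card does not state them:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 2)  # stand-in for the full detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

total_steps = 1000   # assumed: steps_per_epoch * 5 epochs
warmup_steps = 100   # assumed: 10% warmup

def lr_lambda(step):
    # Linear warmup to the peak LR, then linear decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
```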

Limitations

  • Optimized for speech in supported languages
  • Audio should be at least 0.5 seconds long
  • Best performance with 16kHz sample rate
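The 0.5-second minimum can be enforced before inference by zero-padding short clips. A sketch assuming the `load_audio` convention above (16 kHz mono float tensor):

```python
import torch

MIN_SECONDS = 0.5
SAMPLE_RATE = 16000

def ensure_min_length(waveform, min_seconds=MIN_SECONDS, sr=SAMPLE_RATE):
    """Zero-pad a 1-D waveform to at least min_seconds."""
    min_samples = int(min_seconds * sr)
    if waveform.shape[0] < min_samples:
        pad = min_samples - waveform.shape[0]
        waveform = torch.nn.functional.pad(waveform, (0, pad))
    return waveform

short = torch.randn(4000)                 # 0.25 s: below the minimum
print(ensure_min_length(short).shape[0])  # padded to 8000 samples
```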

License

MIT

Citation

@misc{ai-voice-detection-2024,
  author = {Your Name},
  title = {Multilingual AI Voice Detection using Wav2Vec2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/kimnamjoon0007/lkht-v440}
}