AI Voice Detection Model (Wav2Vec2)

Multi-language AI-generated voice detection model for Tamil, English, Hindi, Malayalam, and Telugu.

Model Description

This model detects whether an audio clip is AI-generated or spoken by a human. It uses Facebook's Wav2Vec2-large-xlsr-53 as the backbone with a custom classification head.

Performance

  • Accuracy: 99.69%
  • AUROC: 1.0
  • EER (Equal Error Rate): 0.25%
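The equal error rate is the operating point where the false-acceptance rate equals the false-rejection rate. A minimal NumPy sketch of how it can be computed from scores (the labels and scores below are toy values for illustration, not from this model's evaluation):

```python
import numpy as np

def compute_eer(labels, scores):
    """Equal error rate: the threshold where FPR and FNR (approximately) meet."""
    best_gap, eer = np.inf, None
    for t in np.unique(scores):
        preds = scores >= t
        fpr = np.mean(preds[labels == 0])   # human clips accepted as AI
        fnr = np.mean(~preds[labels == 1])  # AI clips rejected as human
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer

# Toy example: label 1 = AI-generated, 0 = human
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.95])
print(f"EER: {compute_eer(labels, scores):.2%}")
```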

Supported Languages

  • Tamil
  • English
  • Hindi
  • Malayalam
  • Telugu

Model Architecture

Wav2Vec2Model (facebook/wav2vec2-large-xlsr-53)
    └── Dropout (0.1)
        └── Linear (1024 β†’ 2)

Usage

Installation

pip install torch transformers librosa pydub numpy

Load the Model

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from pydub import AudioSegment
import librosa
import numpy as np

# Define the model architecture
class W2VBertDeepfakeDetector(nn.Module):
    def __init__(self, backbone, num_labels=2):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_values, attention_mask=None):
        outputs = self.backbone(input_values=input_values, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        pooled = hidden_states.mean(dim=1)
        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits

# Load backbone
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Create model and load weights
model = W2VBertDeepfakeDetector(backbone, num_labels=2)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()

Inference

def load_audio(path, target_sr=16000):
    audio_segment = AudioSegment.from_file(path)
    samples = np.array(audio_segment.get_array_of_samples()).astype(np.float32)
    if audio_segment.channels > 1:
        samples = samples.reshape(-1, audio_segment.channels).mean(axis=1)
    # Scale integer PCM to [-1, 1] based on the actual sample width
    samples /= float(1 << (8 * audio_segment.sample_width - 1))
    if audio_segment.frame_rate != target_sr:
        samples = librosa.resample(samples, orig_sr=audio_segment.frame_rate, target_sr=target_sr)
    return torch.from_numpy(samples).float()

# Load and classify audio
waveform = load_audio("your_audio.mp3")
input_values = waveform.unsqueeze(0)

with torch.no_grad():
    logits = model(input_values)
    probs = torch.softmax(logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0, prediction].item()

result = "AI_GENERATED" if prediction == 1 else "HUMAN"
print(f"Classification: {result} (confidence: {confidence:.2%})")
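For batched inference on clips of different lengths, the forward pass above also accepts an attention_mask. A sketch of building the padded batch with plain PyTorch (the clips below are random stand-ins; the commented call assumes `model` from the previous section):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical clips of different lengths (1.0 s and 0.5 s at 16 kHz)
clips = [torch.randn(16000), torch.randn(8000)]

# Right-pad to the longest clip and mark real samples in the mask
input_values = pad_sequence(clips, batch_first=True)
attention_mask = torch.zeros_like(input_values, dtype=torch.long)
for i, clip in enumerate(clips):
    attention_mask[i, : clip.shape[0]] = 1

# The padded batch can then be classified as:
# with torch.no_grad():
#     logits = model(input_values, attention_mask=attention_mask)
print(tuple(input_values.shape), attention_mask.sum(dim=1).tolist())
```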

Training Details

  • Base Model: facebook/wav2vec2-large-xlsr-53
  • Training Data: IndicSynth + custom multilingual dataset
  • Split: 90% train, 5% val, 5% test
  • Batch Size: 8
  • Learning Rate: 1e-5
  • Epochs: 5
  • Optimizer: AdamW with linear warmup
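The optimizer setup can be sketched with plain PyTorch. The total step count and 10% warmup fraction below are assumptions for illustration; the card does not state them:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 2)  # stand-in for the full detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

total_steps = 1000   # assumed: steps_per_epoch * 5 epochs
warmup_steps = 100   # assumed: 10% warmup

def lr_lambda(step):
    # Linear warmup to the peak LR, then linear decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
```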

Limitations

  • Optimized for speech in supported languages
  • Audio should be at least 0.5 seconds long
  • Best performance with 16kHz sample rate
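The 0.5-second minimum can be enforced before inference by zero-padding short clips. A sketch assuming the `load_audio` convention above (16 kHz mono float tensor):

```python
import torch

MIN_SECONDS = 0.5
SAMPLE_RATE = 16000

def ensure_min_length(waveform, min_seconds=MIN_SECONDS, sr=SAMPLE_RATE):
    """Zero-pad a 1-D waveform to at least min_seconds."""
    min_samples = int(min_seconds * sr)
    if waveform.shape[0] < min_samples:
        pad = min_samples - waveform.shape[0]
        waveform = torch.nn.functional.pad(waveform, (0, pad))
    return waveform

short = torch.randn(4000)                 # 0.25 s: below the minimum
print(ensure_min_length(short).shape[0])  # padded to 8000 samples
```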

License

MIT

Citation

@misc{ai-voice-detection-2024,
  author = {Your Name},
  title = {Multilingual AI Voice Detection using Wav2Vec2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/kimnamjoon0007/lkht-v440}
}