# 🇵🇰 Pakistani Multilingual LID (V3 SOTA)
This is a State-of-the-Art (SOTA) Language Identification (LID) model specifically fine-tuned for Pakistani languages and English spoken in the Pakistani context. It achieves 98.71% accuracy on unseen test data.
## 🚀 Key Features
- Target Languages: Urdu, Sindhi, Balochi, Pashto, and English.
- Base Model: `facebook/mms-lid-126` (Wav2Vec2).
- V3 Architecture Upgrade:
  - 1D-CNN Layer: Extracts local phonetic features (crucial for Sindhi & Pashto consonant clusters).
  - Attentive Statistics Pooling (ASP): Dynamically captures both the mean and the variance (rhythm/pitch) of the speech.
  - Label-Smoothed Focal Loss: Prevents overconfidence and handles noisy "in-the-wild" audio effectively.
- ⚡ ONNX Optimized: Includes an ONNX Runtime version for low-latency, production-ready edge deployment.
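To make the pooling step concrete, here is a minimal NumPy sketch of the Attentive Statistics Pooling idea: a tiny attention head scores each frame, and the utterance embedding is the concatenation of the attention-weighted mean and standard deviation. The weights below are random placeholders for illustration, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 8                # frames, feature dimension
H = rng.normal(size=(T, D)) # frame-level features from the encoder (stand-in)

# Tiny attention head (learned in the real model; random here)
W = rng.normal(size=(D, D)) * 0.1
v = rng.normal(size=(D,)) * 0.1

scores = np.tanh(H @ W) @ v            # one scalar score per frame, shape (T,)
alpha = np.exp(scores - scores.max())  # numerically stable softmax over time
alpha /= alpha.sum()

mu = (alpha[:, None] * H).sum(axis=0)               # attentive mean, (D,)
var = (alpha[:, None] * (H - mu) ** 2).sum(axis=0)  # attentive variance, (D,)
sigma = np.sqrt(np.clip(var, 1e-9, None))

pooled = np.concatenate([mu, sigma])   # utterance embedding, shape (2*D,)
print(pooled.shape)
```

Concatenating the second-order statistic is what lets the classifier see rhythm/pitch variability rather than just the average frame.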
## 📊 Performance Metrics
The model was trained on a balanced dataset of ~45,000 samples and achieved convergence in just 2 epochs.
### 🏆 Final Test Metrics (On Unseen Data)
- Test Accuracy: 98.71%
- Test Precision: 98.71%
- Test Recall: 98.71%
- Test F1-Score: 98.71%
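All four scores coinciding is consistent with macro averaging over a balanced test set. As a hypothetical illustration of how such macro-averaged metrics are computed (toy labels below, not the model's actual evaluation data):

```python
from collections import Counter

classes = ("balochi", "english", "pashto", "sindhi", "urdu")

# Toy labels for illustration only (not the actual evaluation data)
y_true = ["urdu", "sindhi", "pashto", "balochi", "english", "urdu"]
y_pred = ["urdu", "sindhi", "pashto", "balochi", "english", "sindhi"]

tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # predicted p, but it was wrong
        fn[t] += 1  # missed the true class t

def prec(c): return tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
def rec(c):  return tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
def f1(c):
    p, r = prec(c), rec(c)
    return 2 * p * r / (p + r) if p + r else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = lambda f: sum(f(c) for c in classes) / len(classes)
print(f"acc={accuracy:.4f}  P={macro(prec):.4f}  R={macro(rec):.4f}  F1={macro(f1):.4f}")
```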
## ⚡ How to Use (Fast ONNX Inference) - Recommended
For production, APIs, or CPU-based environments, use the optimized ONNX model. It does not require building the complex PyTorch architecture graph.
Prerequisites:

```bash
pip install onnxruntime torchaudio numpy huggingface_hub
```
ONNX Inference Code:

```python
import onnxruntime as ort
import torchaudio
import torch
import torch.nn.functional as F
import numpy as np
from huggingface_hub import hf_hub_download

# 1. Download model weights & structure
repo_id = "Hammad712/pakistani-lid-v3-sota"
print("Downloading model weights (.data file)...")
hf_hub_download(repo_id=repo_id, filename="pakistani_lid_v3.onnx.data")
print("Downloading model structure (.onnx file)...")
onnx_model_path = hf_hub_download(repo_id=repo_id, filename="pakistani_lid_v3.onnx")

# 2. Config & labels
labels = ("balochi", "english", "pashto", "sindhi", "urdu")
id2label = {i: label for i, label in enumerate(labels)}
sample_rate, max_duration = 16000, 15  # 16 kHz, 15-second window

# 3. Load session (CUDAExecutionProvider requires the onnxruntime-gpu package)
providers = ["CUDAExecutionProvider"] if torch.cuda.is_available() else ["CPUExecutionProvider"]
session = ort.InferenceSession(onnx_model_path, providers=providers)

# 4. Process audio & predict
def predict(audio_path):
    waveform, sr = torchaudio.load(audio_path)
    if waveform.shape[0] > 1:  # downmix stereo to mono
        waveform = waveform.mean(dim=0, keepdim=True)

    target_frames = int(sr * max_duration)
    if waveform.shape[1] > target_frames:  # trim before resampling
        waveform = waveform[:, :target_frames]
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)

    # Peak-normalize, then zero-center and scale to unit variance
    peak = waveform.abs().max().clamp(min=1e-6)
    waveform = waveform / peak
    waveform = waveform - waveform.mean()
    waveform = waveform / waveform.std().clamp(min=1e-6)

    # Pad/truncate to a fixed length and build the attention mask
    length = waveform.shape[1]
    max_length = sample_rate * max_duration
    mask = torch.zeros(max_length, dtype=torch.long)
    if length >= max_length:
        waveform = waveform[:, :max_length]
        mask[:] = 1
    else:
        mask[:length] = 1
        waveform = F.pad(waveform, (0, max_length - length))
    mask = mask.unsqueeze(0)  # add batch dim: ONNX expects a 2D mask

    ort_inputs = {"input_values": waveform.numpy(), "attention_mask": mask.numpy()}
    logits = session.run(None, ort_inputs)[0]

    # Numerically stable softmax over the 5 classes
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    pred_id = np.argmax(probs, axis=1)[0]
    return id2label[pred_id], probs[0][pred_id]

# Test it out!
# lang, confidence = predict("your_audio_file.wav")
# print(f"Predicted: {lang}, Confidence: {confidence:.2f}")
```