# 🇵🇰 Pakistani Multilingual LID (V3 SOTA)
This is a State-of-the-Art (SOTA) Language Identification (LID) model specifically fine-tuned for Pakistani languages and English spoken in the Pakistani context. It achieves 98.71% accuracy on unseen test data.
## 🚀 Key Features
- Target Languages: Urdu, Sindhi, Balochi, Pashto, and English.
- Base Model: `facebook/mms-lid-126` (Wav2Vec2).
- V3 Architecture Upgrade:
  - 1D-CNN Layer: Extracts local phonetic features (crucial for Sindhi & Pashto consonant clusters).
  - Attentive Statistics Pooling (ASP): Dynamically captures both the mean and the variance (rhythm/pitch) of the speech.
  - Label-Smoothed Focal Loss: Prevents overconfidence and handles noisy "in-the-wild" audio effectively.
- ⚡ ONNX Optimized: Includes an ONNX Runtime version for low-latency, production-ready edge deployment.
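To make the pooling step concrete, here is a minimal NumPy sketch of the Attentive Statistics Pooling idea: a tiny attention head scores each frame, and the utterance embedding is the concatenation of the attention-weighted mean and standard deviation. The weights below are random placeholders for illustration, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 8                # frames, feature dimension
H = rng.normal(size=(T, D)) # frame-level features from the encoder (stand-in)

# Tiny attention head (learned in the real model; random here)
W = rng.normal(size=(D, D)) * 0.1
v = rng.normal(size=(D,)) * 0.1

scores = np.tanh(H @ W) @ v            # one scalar score per frame, shape (T,)
alpha = np.exp(scores - scores.max())  # numerically stable softmax over time
alpha /= alpha.sum()

mu = (alpha[:, None] * H).sum(axis=0)               # attentive mean, (D,)
var = (alpha[:, None] * (H - mu) ** 2).sum(axis=0)  # attentive variance, (D,)
sigma = np.sqrt(np.clip(var, 1e-9, None))

pooled = np.concatenate([mu, sigma])   # utterance embedding, shape (2*D,)
print(pooled.shape)
```

Concatenating the second-order statistic is what lets the classifier see rhythm/pitch variability rather than just the average frame.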
## 📊 Performance Metrics
The model was trained on a balanced dataset of ~45,000 samples and achieved convergence in just 2 epochs.
### 🏆 Final Test Metrics (On Unseen Data)
- Test Accuracy: 98.71%
- Test Precision: 98.71%
- Test Recall: 98.71%
- Test F1-Score: 98.71%
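All four scores coinciding is consistent with macro averaging over a balanced test set. As a hypothetical illustration of how such macro-averaged metrics are computed (toy labels below, not the model's actual evaluation data):

```python
from collections import Counter

classes = ("balochi", "english", "pashto", "sindhi", "urdu")

# Toy labels for illustration only (not the actual evaluation data)
y_true = ["urdu", "sindhi", "pashto", "balochi", "english", "urdu"]
y_pred = ["urdu", "sindhi", "pashto", "balochi", "english", "sindhi"]

tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # predicted p, but it was wrong
        fn[t] += 1  # missed the true class t

def prec(c): return tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
def rec(c):  return tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
def f1(c):
    p, r = prec(c), rec(c)
    return 2 * p * r / (p + r) if p + r else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = lambda f: sum(f(c) for c in classes) / len(classes)
print(f"acc={accuracy:.4f}  P={macro(prec):.4f}  R={macro(rec):.4f}  F1={macro(f1):.4f}")
```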
## ⚡ How to Use (Fast ONNX Inference) - Recommended
For production, APIs, or CPU-based environments, use the optimized ONNX model. It does not require building the complex PyTorch architecture graph.
Prerequisites:

```bash
pip install onnxruntime torchaudio numpy huggingface_hub
```
ONNX Inference Code:

```python
import onnxruntime as ort
import torchaudio
import torch
import torch.nn.functional as F
import numpy as np
from huggingface_hub import hf_hub_download

# 1. Download model weights & structure
repo_id = "Hammad712/pakistani-lid-v3-sota"
print("Downloading model weights (.data file)...")
hf_hub_download(repo_id=repo_id, filename="pakistani_lid_v3.onnx.data")
print("Downloading model structure (.onnx file)...")
onnx_model_path = hf_hub_download(repo_id=repo_id, filename="pakistani_lid_v3.onnx")

# 2. Config & labels
labels = ("balochi", "english", "pashto", "sindhi", "urdu")
id2label = {i: label for i, label in enumerate(labels)}
sample_rate, max_duration = 16000, 15  # 16 kHz, 15-second window

# 3. Load session (CUDAExecutionProvider requires the onnxruntime-gpu package)
providers = ["CUDAExecutionProvider"] if torch.cuda.is_available() else ["CPUExecutionProvider"]
session = ort.InferenceSession(onnx_model_path, providers=providers)

# 4. Process audio & predict
def predict(audio_path):
    waveform, sr = torchaudio.load(audio_path)
    if waveform.shape[0] > 1:  # downmix stereo to mono
        waveform = waveform.mean(dim=0, keepdim=True)

    target_frames = int(sr * max_duration)
    if waveform.shape[1] > target_frames:  # trim before resampling
        waveform = waveform[:, :target_frames]
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)

    # Peak-normalize, then zero-center and scale to unit variance
    peak = waveform.abs().max().clamp(min=1e-6)
    waveform = waveform / peak
    waveform = waveform - waveform.mean()
    waveform = waveform / waveform.std().clamp(min=1e-6)

    # Pad/truncate to a fixed length and build the attention mask
    length = waveform.shape[1]
    max_length = sample_rate * max_duration
    mask = torch.zeros(max_length, dtype=torch.long)
    if length >= max_length:
        waveform = waveform[:, :max_length]
        mask[:] = 1
    else:
        mask[:length] = 1
        waveform = F.pad(waveform, (0, max_length - length))
    mask = mask.unsqueeze(0)  # add batch dim: ONNX expects a 2D mask

    ort_inputs = {"input_values": waveform.numpy(), "attention_mask": mask.numpy()}
    logits = session.run(None, ort_inputs)[0]

    # Numerically stable softmax over the 5 classes
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    pred_id = np.argmax(probs, axis=1)[0]
    return id2label[pred_id], probs[0][pred_id]

# Test it out!
# lang, confidence = predict("your_audio_file.wav")
# print(f"Predicted: {lang}, Confidence: {confidence:.2f}")
```