
Preserving Orang Asli Language Resources (POLAR)


Model: mds04/iban-bukar-malay-langid-lr
Task: Language Identification (3-class) - Iban, Bukar Sadong, Malay
Type: Logistic Regression classifier trained on SpeechBrain ECAPA embeddings (VoxLingua107)
Project: POLAR (Project ID: 47208)


Summary

mds04/iban-bukar-malay-langid-lr is a lightweight logistic-regression language identifier built for the POLAR project.
It distinguishes between Iban, Bukar Sadong, and Malay audio using embeddings extracted from the SpeechBrain speechbrain/lang-id-voxlingua107-ecapa encoder.

This model is simple and fast, making it well suited as a language router in multilingual ASR pipelines: it decides whether audio should be sent to the Iban or Bukar Sadong ASR system, or flagged as Malay.


Intended Use & Scope

Primary use:
Route short audio segments (speech) into one of three language buckets (Iban, Bukar Sadong, or Malay) so that the appropriate ASR or processing pipeline can be triggered.

Not intended for:

  • Fine-grained dialect identification beyond these three classes
  • Speaker recognition, emotion detection, or transcription
  • Audio with heavy noise, overlapping speech, or extreme compression

Note: Bukar Sadong has fewer training examples and shows lower accuracy than Iban and Malay. Treat its predictions as lower-confidence and consider human verification when possible.


How It Was Built

  1. Embedding extractor: SpeechBrain VoxLingua107 ECAPA
    → speechbrain/lang-id-voxlingua107-ecapa
    (audio: mono, 16 kHz)

  2. Classifier: scikit-learn LogisticRegression on fixed-size embeddings

  3. Imbalance handling: SMOTE (k_neighbors = 5) to oversample Bukar Sadong

  4. Class weighting: Computed from post-balancing frequencies
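The oversample-then-reweight recipe above can be sketched end to end. This is an illustrative, self-contained sketch on synthetic toy embeddings with a hand-rolled SMOTE-style interpolation; the actual training used imbalanced-learn's SMOTE (k_neighbors = 5) on real ECAPA embeddings, and the cluster shapes, dimensions, and counts below are made up for the demo:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

def smote_oversample(X, n_new, k_neighbors=5):
    """Minimal SMOTE-style oversampling: each synthetic sample is an
    interpolation between a real minority point and one of its
    k nearest minority-class neighbours."""
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)  # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = idx[i, rng.integers(1, k_neighbors + 1)]
        lam = rng.random()
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)

# Toy stand-ins for the ECAPA embeddings of two classes
X_maj = rng.normal(0.0, 1.0, size=(400, 16))  # majority class, e.g. Iban
X_min = rng.normal(2.0, 1.0, size=(50, 16))   # minority class, e.g. Bukar Sadong

# 1) Oversample the minority class with synthetic points
X_syn = smote_oversample(X_min, n_new=100)
X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * len(X_maj) + [1] * (len(X_min) + len(X_syn)))

# 2) Compute class weights from the post-balancing frequencies
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

# 3) Fit the logistic-regression classifier with those weights
clf = LogisticRegression(
    max_iter=1000,
    class_weight={0: weights[0], 1: weights[1]},
).fit(X, y)
```

In the real pipeline the same two knobs apply: SMOTE narrows the class-count gap without fully equalizing it, and the residual imbalance is handled by the class weights passed to LogisticRegression.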

Training data sources:


Data & Metrics

Samples collected:

| Language     | Count |
|--------------|-------|
| Iban         | 5,011 |
| Bukar Sadong | 680   |
| Malay        | 5,010 |

Training split (pre-balance):

| Language     | Train Samples |
|--------------|---------------|
| Iban         | 4,008         |
| Bukar Sadong | 544           |
| Malay        | 4,008         |

After SMOTE:

| Language     | Samples |
|--------------|---------|
| Iban         | 4,008   |
| Bukar Sadong | 1,603   |
| Malay        | 4,008   |

Final class weights:
{0: 0.79998, 1: 2.40025, 2: 0.79998}


Evaluation (Test Set Summary)

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Iban         | 0.94      | 0.95   | 0.94     | 1003    |
| Bukar Sadong | 0.73      | 0.74   | 0.74     | 136     |
| Malay        | 0.98      | 0.97   | 0.97     | 1002    |
| Accuracy     |           |        | 0.95     | 2141    |
| Macro Avg    | 0.88      | 0.89   | 0.89     | 2141    |
| Weighted Avg | 0.95      | 0.95   | 0.95     | 2141    |

Confusion Matrix (rows = true labels, columns = predicted)

| True \ Predicted | Iban | Bukar Sadong | Malay |
|------------------|------|--------------|-------|
| Iban             | 951  | 35           | 17    |
| Bukar Sadong     | 31   | 101          | 4     |
| Malay            | 28   | 2            | 972   |

Per-class accuracies:

  • Iban - 94.8 %
  • Bukar Sadong - 74.3 %
  • Malay - 97.0 %

Overall accuracy ≈ 95%. Bukar Sadong remains the weakest class due to limited and noisy data.
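The per-class accuracies above are simply the per-class recall values, which can be recovered directly from the confusion matrix:

```python
import numpy as np

# Confusion matrix from the evaluation above (rows = true, columns = predicted),
# class order: Iban, Bukar Sadong, Malay
cm = np.array([
    [951,  35,  17],   # Iban
    [ 31, 101,   4],   # Bukar Sadong
    [ 28,   2, 972],   # Malay
])

# Per-class accuracy = diagonal count / row total (i.e. recall)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.diagonal().sum() / cm.sum()

for name, r in zip(["Iban", "Bukar Sadong", "Malay"], per_class_recall):
    print(f"{name}: {r:.1%}")          # 94.8%, 74.3%, 97.0%
print(f"Overall: {overall_accuracy:.1%}")  # 94.5%
```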


Files Included

  • iban_bukar_malay_lr.joblib - trained Logistic Regression model
  • (optional) label_map.json - mapping of index ↔ label

The classifier expects embeddings from the same SpeechBrain ECAPA encoder used in training.


Inference Example

Requirements

pip install speechbrain torch torchaudio joblib numpy soundfile librosa huggingface_hub

Minimal Python example

import joblib
import numpy as np
import torch
import librosa
from huggingface_hub import hf_hub_download
from speechbrain.inference.classifiers import EncoderClassifier

REPO_ID = "mds04/iban-bukar-malay-langid-lr"
MODEL_FILE = "iban_bukar_malay_lr.joblib"
TARGET_SR = 16000

# Download and load the trained classifier bundle
local_joblib = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)
bundle = joblib.load(local_joblib)
clf = bundle["classifier"]
label_map = {
    int(k): v
    for k, v in bundle.get("label_map", {0: "iban", 1: "bukar_sadong", 2: "malay"}).items()
}

# Load the SpeechBrain encoder on the available device
device = "cuda" if torch.cuda.is_available() else "cpu"
vox = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    run_opts={"device": device},
)

def load_audio(path):
    # Resample to mono 16 kHz and return a (1, time) tensor
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return torch.tensor(y).unsqueeze(0)

def get_embedding(wav):
    # encode_batch returns (batch, 1, emb_dim); flatten to a 1-D numpy vector
    with torch.no_grad():
        emb = vox.encode_batch(wav.to(device))
        return emb.view(emb.size(0), -1).squeeze(0).cpu().numpy()

wav = load_audio("example.wav")
emb = get_embedding(wav)

probs = clf.predict_proba([emb])[0]
pred_idx = int(np.argmax(probs))
print(f"Predicted: {label_map[pred_idx]} (confidence={probs[pred_idx]:.3f})")

Notes:

  • Audio must be mono, 16 kHz
  • Using a different embedding model will degrade accuracy
  • See the accompanying application repository for examples that use FFmpeg-based loaders for compressed formats

Integration Tips

  • Ideal as a router in multilingual ASR pipelines
  • For Bukar Sadong predictions:
    • Aggregate results across segments
    • Or apply a lower confidence threshold before routing
  • Always reuse the same SpeechBrain ECAPA encoder for consistent performance
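The segment-aggregation and thresholding tips can be sketched as follows. This is an illustrative routing policy, not part of the model: the threshold value, the label order, and the `needs_review` fallback are hypothetical choices that would need tuning on held-out data.

```python
import numpy as np

LABELS = ["iban", "bukar_sadong", "malay"]  # assumed index order from the label map
BUKAR_THRESHOLD = 0.60                      # hypothetical; tune per deployment

def route(segment_probs, threshold=BUKAR_THRESHOLD):
    """Average per-segment class probabilities across a clip, then only
    route to the Bukar Sadong pipeline if its mean probability clears a
    confidence threshold; otherwise flag the clip for human review."""
    mean_probs = np.mean(segment_probs, axis=0)
    pred = int(np.argmax(mean_probs))
    if LABELS[pred] == "bukar_sadong" and mean_probs[pred] < threshold:
        return "needs_review", mean_probs
    return LABELS[pred], mean_probs

# Per-segment probabilities for one clip, e.g. from clf.predict_proba(...)
probs = np.array([
    [0.20, 0.55, 0.25],
    [0.30, 0.50, 0.20],
    [0.25, 0.45, 0.30],
])
decision, mean_probs = route(probs)
print(decision)  # "needs_review": Bukar Sadong wins but only averages 0.50
```

Averaging probabilities across segments smooths out single-segment noise, and the extra threshold on the weakest class trades a few automatic routings for fewer misdirected clips.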

Limitations & Risks

  • Data imbalance: Bukar Sadong performance lower due to fewer samples
  • Domain sensitivity: Microphone and noise variation can reduce accuracy
  • Scope: Recognizes only Iban, Bukar Sadong, and Malay; any other language will be misclassified as one of these three
  • Ethical note: Use data responsibly in accordance with Orang Asli community consent and governance

Citation / Attribution

If you use this model, please cite:

  • POLAR (Preserving Orang Asli Language Resources), Project ID 47208
  • Model: mds04/iban-bukar-malay-langid-lr

Datasets:
