Preserving Orang Asli Language Resources (POLAR)
Model: mds04/iban-bukar-malay-langid-lr
Task: Language Identification (3-class) - Iban, Bukar Sadong, Malay
Type: Logistic Regression classifier trained on SpeechBrain ECAPA embeddings (VoxLingua107)
Project: POLAR (Project ID: 47208)
Summary
mds04/iban-bukar-malay-langid-lr is a lightweight logistic-regression language identifier built for the POLAR project.
It distinguishes between Iban, Bukar Sadong, and Malay audio using embeddings extracted from the SpeechBrain speechbrain/lang-id-voxlingua107-ecapa encoder.
This model is simple, fast, and ideal as a language router in multilingual ASR pipelines: it decides whether an audio segment should be routed to the Iban ASR, the Bukar Sadong ASR, or handled as Malay.
Intended Use & Scope
Primary use:
Route short audio segments (speech) into one of three language buckets - Iban, Bukar Sadong, Malay - so that the appropriate ASR or processing pipeline can be triggered.
Not intended for:
- Fine-grained dialect identification beyond these three classes
- Speaker recognition, emotion detection, or transcription
- Audio with heavy noise, overlapping speech, or extreme compression
Note: Bukar Sadong has fewer training examples and shows lower accuracy than Iban and Malay. Treat its predictions as lower-confidence and consider human verification when possible.
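The verification advice above can be sketched as a small routing helper. This is an illustrative sketch only: the pipeline names and the 0.80 threshold are hypothetical, not part of the released model.

```python
# Hypothetical routing helper; pipeline names and the 0.80
# threshold are illustrative assumptions, not part of the model.
PIPELINES = {
    "iban": "iban_asr",
    "bukar_sadong": "bukar_sadong_asr",
    "malay": "malay_pipeline",
}

def route(label: str, confidence: float, threshold: float = 0.80) -> str:
    """Map a prediction to a pipeline, flagging low-confidence
    Bukar Sadong predictions for human verification."""
    if label == "bukar_sadong" and confidence < threshold:
        return "human_review"
    return PIPELINES[label]

print(route("bukar_sadong", 0.65))  # -> human_review
print(route("iban", 0.97))          # -> iban_asr
```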
How It Was Built
- Embedding extractor: SpeechBrain VoxLingua107 ECAPA (speechbrain/lang-id-voxlingua107-ecapa); audio must be mono, 16 kHz
- Classifier: scikit-learn LogisticRegression on fixed-size embeddings
- Imbalance handling: SMOTE (k_neighbors = 5) to oversample Bukar Sadong
- Class weighting: computed from post-balancing frequencies
Training data sources:
- Iban - mds04/iban-audio-datasets
- Bukar Sadong - mds04/bukar-sadong-conversational-audio-dataset-v3
- Malay - google/fleurs
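The oversampling step above can be illustrated with a minimal SMOTE-style sketch. The real pipeline used SMOTE with k_neighbors = 5 on ECAPA embeddings; the shapes and counts below are toy stand-ins, and this hand-rolled interpolation is a simplified assumption, not the exact imblearn implementation.

```python
# Minimal SMOTE-style oversampling sketch (numpy only). The actual
# pipeline used SMOTE with k_neighbors=5; embedding sizes and counts
# here are toy stand-ins.
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating a sampled
    minority point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)

rng = np.random.default_rng(1)
X_bukar = rng.normal(size=(20, 8))  # toy minority-class embeddings
X_synth = smote_oversample(X_bukar, n_new=40)
print(X_synth.shape)  # (40, 8)
```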
Data & Metrics
Samples collected:
| Language | Count |
|---|---|
| Iban | 5,011 |
| Bukar Sadong | 680 |
| Malay | 5,010 |
Training split (pre-balance):
| Language | Train Samples |
|---|---|
| Iban | 4,008 |
| Bukar Sadong | 544 |
| Malay | 4,008 |
After SMOTE:
| Language | Samples |
|---|---|
| Iban | 4,008 |
| Bukar Sadong | 1,603 |
| Malay | 4,008 |
Final class weights: {0: 0.79998, 1: 2.40025, 2: 0.79998} (0 = Iban, 1 = Bukar Sadong, 2 = Malay)
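As a sanity check, the standard "balanced" heuristic (weight = n_samples / (n_classes * count)) applied to the post-SMOTE counts reproduces the Iban and Malay weights exactly. The reported Bukar Sadong weight (2.40025) does not match this formula, so the exact stage at which the minority weight was computed is an assumption left unverified here.

```python
# "Balanced" class-weight heuristic checked against the post-SMOTE
# counts; it reproduces the Iban/Malay weight of 0.79998.
counts = {"iban": 4008, "bukar_sadong": 1603, "malay": 4008}
n_samples = sum(counts.values())  # 9619
weights = {c: n_samples / (len(counts) * k) for c, k in counts.items()}
print(round(weights["iban"], 5))   # 0.79998
print(round(weights["malay"], 5))  # 0.79998
```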
Evaluation (Test Set Summary)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Iban | 0.94 | 0.95 | 0.94 | 1003 |
| Bukar Sadong | 0.73 | 0.74 | 0.74 | 136 |
| Malay | 0.98 | 0.97 | 0.97 | 1002 |
| Accuracy | | | 0.95 | 2141 |
| Macro Avg | 0.88 | 0.89 | 0.89 | 2141 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 2141 |
Confusion Matrix (rows = true labels, columns = predicted)
| True \ Predicted | Iban | Bukar Sadong | Malay |
|---|---|---|---|
| Iban | 951 | 35 | 17 |
| Bukar Sadong | 31 | 101 | 4 |
| Malay | 28 | 2 | 972 |
Per-class accuracies:
- Iban - 94.8 %
- Bukar Sadong - 74.3 %
- Malay - 97.0 %
Overall accuracy ≈ 95%. Bukar Sadong remains the weakest class due to limited and noisy data.
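The per-class accuracies and overall accuracy can be recomputed directly from the confusion matrix above (per-class accuracy here equals recall, the diagonal divided by the row sum):

```python
# Recompute per-class and overall accuracy from the confusion
# matrix reported above (rows = true labels).
import numpy as np

cm = np.array([
    [951,  35,  17],   # Iban
    [ 31, 101,   4],   # Bukar Sadong
    [ 28,   2, 972],   # Malay
])

per_class = cm.diagonal() / cm.sum(axis=1)
overall = cm.diagonal().sum() / cm.sum()
print([round(float(x), 3) for x in per_class])  # [0.948, 0.743, 0.97]
print(round(float(overall), 3))                 # 0.945
```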
Files Included
- iban_bukar_malay_lr.joblib - trained Logistic Regression model
- label_map.json (optional) - mapping of index → label
The classifier expects embeddings from the same SpeechBrain ECAPA encoder used in training.
Inference Example
Requirements
pip install speechbrain torch torchaudio joblib numpy soundfile librosa huggingface_hub
Minimal Python example
```python
import joblib
import numpy as np
import torch
import librosa
from huggingface_hub import hf_hub_download
from speechbrain.inference.classifiers import EncoderClassifier

REPO_ID = "mds04/iban-bukar-malay-langid-lr"
MODEL_FILE = "iban_bukar_malay_lr.joblib"
TARGET_SR = 16000

# Load the trained classifier (and label map, if bundled)
local_joblib = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)
bundle = joblib.load(local_joblib)
clf = bundle["classifier"]
label_map = {int(k): v for k, v in bundle.get(
    "label_map", {0: "iban", 1: "bukar_sadong", 2: "malay"}).items()}

# Load the SpeechBrain encoder on the available device
device = "cuda" if torch.cuda.is_available() else "cpu"
vox = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    run_opts={"device": device},
)

def load_audio(path):
    # Resample to mono 16 kHz, matching the training setup
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return torch.tensor(y).unsqueeze(0)

def get_embedding(wav):
    with torch.no_grad():
        emb = vox.encode_batch(wav.to(device))
    if isinstance(emb, tuple):
        emb = emb[0]
    return emb.view(emb.size(0), -1).squeeze(0).cpu().numpy()

wav = load_audio("example.wav")
emb = get_embedding(wav)
probs = clf.predict_proba([emb])[0]
pred_idx = int(np.argmax(probs))
print(f"Predicted: {label_map[pred_idx]} (confidence={probs[pred_idx]:.3f})")
```
Notes:
- Audio must be mono, 16 kHz
- Using a different embedding model will degrade accuracy
- See your app repo for examples using FFmpeg loaders for compressed formats
Integration Tips
- Ideal as a router in multilingual ASR pipelines
- For Bukar Sadong predictions:
- Aggregate results across segments
- Or apply a lower confidence threshold before routing
- Always reuse the same SpeechBrain ECAPA encoder for consistent performance
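Aggregating across segments, as suggested above, can be as simple as averaging per-segment probability vectors before routing. The probabilities below are made-up stand-ins for the output of clf.predict_proba:

```python
# Sketch of segment-level aggregation: average per-segment class
# probabilities, then route on the mean. The probabilities here are
# illustrative stand-ins for clf.predict_proba outputs.
import numpy as np

labels = ["iban", "bukar_sadong", "malay"]
segment_probs = np.array([
    [0.55, 0.35, 0.10],
    [0.30, 0.60, 0.10],
    [0.25, 0.65, 0.10],
])
mean_probs = segment_probs.mean(axis=0)
decision = labels[int(np.argmax(mean_probs))]
print(decision)  # bukar_sadong
```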
Limitations & Risks
- Data imbalance: Bukar Sadong performance lower due to fewer samples
- Domain sensitivity: Microphone and noise variation can reduce accuracy
- Scope: Only recognizes Iban, Bukar Sadong, and Malay; audio in any other language will be mislabeled as one of these three
- Ethical note: Use data responsibly in accordance with Orang Asli community consent and governance
Citation / Attribution
If you use this model, please cite:
- POLAR (Preserving Orang Asli Language Resources), Project ID 47208
- Model: mds04/iban-bukar-malay-langid-lr
- Datasets: mds04/iban-audio-datasets, mds04/bukar-sadong-conversational-audio-dataset-v3, google/fleurs