Preserving Orang Asli Language Resources (POLAR)
Model: mds04/iban-bukar-malay-langid-lr
Task: Language Identification (3-class) - Iban, Bukar Sadong, Malay
Type: Logistic Regression classifier trained on SpeechBrain ECAPA embeddings (VoxLingua107)
Project: POLAR (Project ID: 47208)
Summary
mds04/iban-bukar-malay-langid-lr is a lightweight logistic-regression language identifier built for the POLAR project.
It distinguishes between Iban, Bukar Sadong, and Malay audio using embeddings extracted from the SpeechBrain speechbrain/lang-id-voxlingua107-ecapa encoder.
This model is simple, fast, and ideal as a language router in multilingual ASR pipelines: it decides whether an audio segment should be routed to the Iban ASR, the Bukar Sadong ASR, or handled as Malay.
Intended Use & Scope
Primary use:
Route short audio segments (speech) into one of three language buckets - Iban, Bukar Sadong, Malay - so that the appropriate ASR or processing pipeline can be triggered.
Not intended for:
- Fine-grained dialect identification beyond these three classes
- Speaker recognition, emotion detection, or transcription
- Audio with heavy noise, overlapping speech, or extreme compression
Note: Bukar Sadong has fewer training examples and shows lower accuracy than Iban and Malay. Treat its predictions as lower-confidence and consider human verification when possible.
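The verification advice above can be sketched as a small routing helper. This is an illustrative sketch only: the pipeline names and the 0.80 threshold are hypothetical, not part of the released model.

```python
# Hypothetical routing helper; pipeline names and the 0.80
# threshold are illustrative assumptions, not part of the model.
PIPELINES = {
    "iban": "iban_asr",
    "bukar_sadong": "bukar_sadong_asr",
    "malay": "malay_pipeline",
}

def route(label: str, confidence: float, threshold: float = 0.80) -> str:
    """Map a prediction to a pipeline, flagging low-confidence
    Bukar Sadong predictions for human verification."""
    if label == "bukar_sadong" and confidence < threshold:
        return "human_review"
    return PIPELINES[label]

print(route("bukar_sadong", 0.65))  # -> human_review
print(route("iban", 0.97))          # -> iban_asr
```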
How It Was Built
- Embedding extractor: SpeechBrain VoxLingua107 ECAPA (speechbrain/lang-id-voxlingua107-ecapa); audio must be mono, 16 kHz
- Classifier: scikit-learn LogisticRegression on fixed-size embeddings
- Imbalance handling: SMOTE (k_neighbors = 5) to oversample Bukar Sadong
- Class weighting: computed from post-balancing frequencies
Training data sources:
- Iban - mds04/iban-audio-datasets
- Bukar Sadong - mds04/bukar-sadong-conversational-audio-dataset-v3
- Malay - google/fleurs
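The oversampling step above can be illustrated with a minimal SMOTE-style sketch. The real pipeline used SMOTE with k_neighbors = 5 on ECAPA embeddings; the shapes and counts below are toy stand-ins, and this hand-rolled interpolation is a simplified assumption, not the exact imblearn implementation.

```python
# Minimal SMOTE-style oversampling sketch (numpy only). The actual
# pipeline used SMOTE with k_neighbors=5; embedding sizes and counts
# here are toy stand-ins.
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating a sampled
    minority point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)

rng = np.random.default_rng(1)
X_bukar = rng.normal(size=(20, 8))  # toy minority-class embeddings
X_synth = smote_oversample(X_bukar, n_new=40)
print(X_synth.shape)  # (40, 8)
```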
Data & Metrics
Samples collected:
| Language | Count |
|---|---|
| Iban | 5,011 |
| Bukar Sadong | 680 |
| Malay | 5,010 |
Training split (pre-balance):
| Language | Train Samples |
|---|---|
| Iban | 4,008 |
| Bukar Sadong | 544 |
| Malay | 4,008 |
After SMOTE:
| Language | Samples |
|---|---|
| Iban | 4,008 |
| Bukar Sadong | 1,603 |
| Malay | 4,008 |
Final class weights: {0: 0.79998, 1: 2.40025, 2: 0.79998} (0 = Iban, 1 = Bukar Sadong, 2 = Malay)
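As a sanity check, the standard "balanced" heuristic (weight = n_samples / (n_classes * count)) applied to the post-SMOTE counts reproduces the Iban and Malay weights exactly. The reported Bukar Sadong weight (2.40025) does not match this formula, so the exact stage at which the minority weight was computed is an assumption left unverified here.

```python
# "Balanced" class-weight heuristic checked against the post-SMOTE
# counts; it reproduces the Iban/Malay weight of 0.79998.
counts = {"iban": 4008, "bukar_sadong": 1603, "malay": 4008}
n_samples = sum(counts.values())  # 9619
weights = {c: n_samples / (len(counts) * k) for c, k in counts.items()}
print(round(weights["iban"], 5))   # 0.79998
print(round(weights["malay"], 5))  # 0.79998
```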
Evaluation (Test Set Summary)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Iban | 0.94 | 0.95 | 0.94 | 1003 |
| Bukar Sadong | 0.73 | 0.74 | 0.74 | 136 |
| Malay | 0.98 | 0.97 | 0.97 | 1002 |
| Accuracy | | | 0.95 | 2141 |
| Macro Avg | 0.88 | 0.89 | 0.89 | 2141 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 2141 |
Confusion Matrix (rows = true labels, columns = predicted)
| True \ Predicted | Iban | Bukar Sadong | Malay |
|---|---|---|---|
| Iban | 951 | 35 | 17 |
| Bukar Sadong | 31 | 101 | 4 |
| Malay | 28 | 2 | 972 |
Per-class accuracies:
- Iban - 94.8 %
- Bukar Sadong - 74.3 %
- Malay - 97.0 %
Overall accuracy ≈ 95%. Bukar Sadong remains the weakest class due to limited and noisy data.
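The per-class accuracies and overall accuracy can be recomputed directly from the confusion matrix above (per-class accuracy here equals recall, the diagonal divided by the row sum):

```python
# Recompute per-class and overall accuracy from the confusion
# matrix reported above (rows = true labels).
import numpy as np

cm = np.array([
    [951,  35,  17],   # Iban
    [ 31, 101,   4],   # Bukar Sadong
    [ 28,   2, 972],   # Malay
])

per_class = cm.diagonal() / cm.sum(axis=1)
overall = cm.diagonal().sum() / cm.sum()
print([round(float(x), 3) for x in per_class])  # [0.948, 0.743, 0.97]
print(round(float(overall), 3))                 # 0.945
```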
Files Included
- iban_bukar_malay_lr.joblib - trained Logistic Regression model
- label_map.json (optional) - mapping of index → label
The classifier expects embeddings from the same SpeechBrain ECAPA encoder used in training.
Inference Example
Requirements
pip install speechbrain torch torchaudio joblib numpy soundfile librosa huggingface_hub
Minimal Python example
```python
import joblib
import numpy as np
import torch
import librosa
from huggingface_hub import hf_hub_download
from speechbrain.inference.classifiers import EncoderClassifier

REPO_ID = "mds04/iban-bukar-malay-langid-lr"
MODEL_FILE = "iban_bukar_malay_lr.joblib"
TARGET_SR = 16000

# Load the trained classifier (and label map, if bundled)
local_joblib = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)
bundle = joblib.load(local_joblib)
clf = bundle["classifier"]
label_map = {int(k): v for k, v in bundle.get(
    "label_map", {0: "iban", 1: "bukar_sadong", 2: "malay"}).items()}

# Load the SpeechBrain encoder on the available device
device = "cuda" if torch.cuda.is_available() else "cpu"
vox = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    run_opts={"device": device},
)

def load_audio(path):
    # Resample to mono 16 kHz, matching the training setup
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return torch.tensor(y).unsqueeze(0)

def get_embedding(wav):
    with torch.no_grad():
        emb = vox.encode_batch(wav.to(device))
    if isinstance(emb, tuple):
        emb = emb[0]
    return emb.view(emb.size(0), -1).squeeze(0).cpu().numpy()

wav = load_audio("example.wav")
emb = get_embedding(wav)
probs = clf.predict_proba([emb])[0]
pred_idx = int(np.argmax(probs))
print(f"Predicted: {label_map[pred_idx]} (confidence={probs[pred_idx]:.3f})")
```
Notes:
- Audio must be mono, 16 kHz
- Using a different embedding model will degrade accuracy
- See your app repo for examples using FFmpeg loaders for compressed formats
Integration Tips
- Ideal as a router in multilingual ASR pipelines
- For Bukar Sadong predictions:
- Aggregate results across segments
- Or apply a lower confidence threshold before routing
- Always reuse the same SpeechBrain ECAPA encoder for consistent performance
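Aggregating across segments, as suggested above, can be as simple as averaging per-segment probability vectors before routing. The probabilities below are made-up stand-ins for the output of clf.predict_proba:

```python
# Sketch of segment-level aggregation: average per-segment class
# probabilities, then route on the mean. The probabilities here are
# illustrative stand-ins for clf.predict_proba outputs.
import numpy as np

labels = ["iban", "bukar_sadong", "malay"]
segment_probs = np.array([
    [0.55, 0.35, 0.10],
    [0.30, 0.60, 0.10],
    [0.25, 0.65, 0.10],
])
mean_probs = segment_probs.mean(axis=0)
decision = labels[int(np.argmax(mean_probs))]
print(decision)  # bukar_sadong
```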
Limitations & Risks
- Data imbalance: Bukar Sadong performance lower due to fewer samples
- Domain sensitivity: Microphone and noise variation can reduce accuracy
- Scope: Only recognizes Iban, Bukar Sadong, and Malay; audio in any other language will be mislabeled as one of these three
- Ethical note: Use data responsibly in accordance with Orang Asli community consent and governance
Citation / Attribution
If you use this model, please cite:
- POLAR (Preserving Orang Asli Language Resources), Project ID 47208
- Model: mds04/iban-bukar-malay-langid-lr
- Datasets: mds04/iban-audio-datasets, mds04/bukar-sadong-conversational-audio-dataset-v3, google/fleurs