# Preserving Orang Asli Language Resources (POLAR)

## mds04/nllb-iban2malay-600m

- Model: mds04/nllb-iban2malay-600m
- Task: Language Translation - Iban → Malay
- Type: Fine-tuned model based on facebook/nllb-200-distilled-600M
- Project: POLAR (Project ID: 47208)
## Summary

mds04/nllb-iban2malay-600m is a translation model fine-tuned from Facebook's NLLB-200-Distilled-600M, specialized for translating Iban → Malay.

Malay is used as a proxy language since Iban is not natively supported by NLLB, making this approach well suited to low-resource language adaptation.

The model helps bridge communication and documentation between Iban-speaking and Malay-speaking communities - a crucial step for language preservation and revitalization under the POLAR initiative.
## How It Was Built

1. Base model: facebook/nllb-200-distilled-600M (version-agnostic)

2. Tokenizer initialization:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained(
       "facebook/nllb-200-distilled-600M",
       src_lang="zsm_Latn",
       tgt_lang="zsm_Latn",
   )
   ```
3. Data split:

   | Split | Samples |
   |---|---|
   | Train | 5,462 |
   | Validation | 610 |
   | Test | 610 |
   | Total | 6,682 |
4. Preprocessing:

   - Both source and target use the "zsm_Latn" (Malay, Latin script) language code
   - Sentences tokenized with max_length = 256
5. Training setup:

   - Framework: 🤗 Transformers
   - Trainer: Seq2SeqTrainer
   - Training epochs: 5
   - Learning rate: 3e-05
   - Label smoothing: 0.0
   - Warmup ratio: 0.05
   - Effective batch size: 64
   - Early stopping applied:
     - Patience: 4 epochs
     - Threshold: 0.0005 BLEU improvement
   - Optimized for BLEU score on validation set
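The train/validation/test split in step 3 can be reproduced with a simple shuffle-and-slice helper. This is an illustrative sketch (the actual POLAR preprocessing script is not published); only the split sizes come from the table above, and the seed and helper name are assumptions.

```python
import random

def split_dataset(pairs, val_size=610, test_size=610, seed=42):
    """Shuffle parallel sentence pairs and carve off validation/test sets.

    Hypothetical helper, not the POLAR training code; the sizes match
    the splits reported in this model card.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle
    test = pairs[:test_size]
    val = pairs[test_size:test_size + val_size]
    train = pairs[test_size + val_size:]
    return train, val, test

# With 6,682 pairs this reproduces the 5,462 / 610 / 610 split.
corpus = [(f"iban_{i}", f"malay_{i}") for i in range(6682)]
train, val, test = split_dataset(corpus)
print(len(train), len(val), len(test))  # 5462 610 610
```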
## Metrics

### Evaluation (Test Set)
| Metric | Score |
|---|---|
| Test Loss | 1.3508 |
| BLEU | 35.18 |
| chrF | 62.24 |
| chrF++ | 59.93 |
- BLEU 35.18 indicates strong translation performance given the limited data, especially for low-resource Iban input.
- chrF and chrF++ scores show good character-level (and, for chrF++, word-level) agreement with the Malay references.
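For intuition about the chrF number above, here is a simplified pure-Python sketch of the metric (Popović, 2015): character n-gram precision and recall up to n=6, combined into an F-beta score with beta=2. It omits sentence-level smoothing and the word n-grams of chrF++; for reporting, use an established implementation such as sacreBLEU's chrF rather than this sketch.

```python
from collections import Counter

def char_ngrams(text, n):
    # Count character n-grams, ignoring spaces (as chrF does by default)
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: averaged character n-gram precision/recall
    combined into an F-beta score (0-100). Sketch only."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("nama saya Joshua", "nama saya Joshua"), 1))  # 100.0
```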
## How to Use

### Inference Example (Google Colab / Python)

Install dependencies:

```bash
pip install transformers torch sentencepiece
```

Example usage:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "mds04/nllb-iban2malay-600m"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zsm_Latn", tgt_lang="zsm_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example Iban sentence
iban_text = "Nama aku Joshua, aku udah pergi ke rumah kawan kemari."

# Tokenize input
inputs = tokenizer(iban_text, return_tensors="pt", padding=True)

# Generate translation
outputs = model.generate(**inputs, max_length=256)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("Iban:", iban_text)
print("Malay Translation:", translation)
```
Expected output:

```text
Iban: Nama aku Joshua, aku udah pergi ke rumah kawan kemari.
Malay Translation: Nama saya Joshua, saya sudah pergi ke rumah kawan semalam.
```
## Integration Tips

- Ideal for ASR → Translation pipelines (e.g., after transcribing Iban speech)
- Use the Malay proxy code (zsm_Latn) for consistent tokenization
- Works best on complete, conversational sentences
- Combine with mds04/iban-bukar-malay-langid-lr for automatic language routing before translation
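The routing tip above can be sketched as a small dispatch function: a language-ID step decides whether a sentence needs translation at all. Both helpers below are deliberately stubbed (the keyword-based detector is a placeholder, not the mds04/iban-bukar-malay-langid-lr model); swap in the real classifier and this model's generate() call from the inference example.

```python
def detect_language(text: str) -> str:
    """Stub: replace with the langid model's prediction.
    Here a single keyword stands in for a real classifier."""
    return "iban" if " udah " in f" {text} " else "malay"

def translate_iban_to_malay(text: str) -> str:
    """Stub: replace with tokenizer + model.generate() from the
    inference example above."""
    return f"<translated> {text}"

def route(text: str) -> str:
    # Only Iban input goes through the translation model;
    # Malay passes through unchanged.
    if detect_language(text) == "iban":
        return translate_iban_to_malay(text)
    return text

print(route("Aku udah makan."))    # routed through the translator
print(route("Saya sudah makan."))  # passed through unchanged
```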
## Limitations & Risks

- Low-resource bias: Limited Iban data may lead to literal or grammatically imperfect Malay outputs
- Malay proxy limitation: Iban tokens are internally treated as Malay (zsm_Latn), so some rare words may lose meaning
- Domain sensitivity: Performs best on conversational or daily-use Iban sentences
- Ethical note: Use with respect to Orang Asli data consent and community governance
## Citation / Attribution

If you use this model, please cite:

- POLAR (Preserving Orang Asli Language Resources), Project ID 47208
- Model: mds04/nllb-iban2malay-600m
- Based on: facebook/nllb-200-distilled-600M
## License
Refer to the repository's license file.
If none is provided, contact the model owner for permission details.
## Contact & Support

For inquiries, dataset info, or technical questions, please contact the POLAR project maintainers (owner: mds04) or open an issue on this model's repository.