# Preserving Orang Asli Language Resources (POLAR)

## mds04/nllb-iban2malay-600m

- Model: mds04/nllb-iban2malay-600m
- Task: Language Translation - Iban → Malay
- Type: Fine-tuned model based on facebook/nllb-200-distilled-600M
- Project: POLAR (Project ID: 47208)
## Summary

mds04/nllb-iban2malay-600m is a translation model fine-tuned from Facebook's NLLB-200-Distilled-600M, specialized for translating Iban → Malay.

Malay is used as a proxy language since Iban is not natively supported by NLLB, making this approach well suited to low-resource language adaptation.

The model helps bridge communication and documentation between Iban-speaking and Malay-speaking communities - a crucial step for language preservation and revitalization under the POLAR initiative.
## How It Was Built

1. Base model: facebook/nllb-200-distilled-600M (version-agnostic)

2. Tokenizer initialization:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained(
       "facebook/nllb-200-distilled-600M",
       src_lang="zsm_Latn",
       tgt_lang="zsm_Latn",
   )
   ```
3. Data split:

   | Split | Samples |
   |---|---|
   | Train | 5,462 |
   | Validation | 610 |
   | Test | 610 |
   | Total | 6,682 |
4. Preprocessing:

   - Both source and target use the "zsm_Latn" (Malay, Latin script) language code
   - Sentences tokenized with max_length = 256
5. Training setup:

   - Framework: 🤗 Transformers
   - Trainer: Seq2SeqTrainer
   - Training epochs: 5
   - Learning rate: 3e-05
   - Label smoothing: 0.0
   - Warmup ratio: 0.05
   - Effective batch size: 64
   - Early stopping applied:
     - Patience: 4 epochs
     - Threshold: 0.0005 BLEU improvement
   - Optimized for BLEU score on validation set
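The train/validation/test split in step 3 can be reproduced with a simple shuffle-and-slice helper. This is an illustrative sketch (the actual POLAR preprocessing script is not published); only the split sizes come from the table above, and the seed and helper name are assumptions.

```python
import random

def split_dataset(pairs, val_size=610, test_size=610, seed=42):
    """Shuffle parallel sentence pairs and carve off validation/test sets.

    Hypothetical helper, not the POLAR training code; the sizes match
    the splits reported in this model card.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle
    test = pairs[:test_size]
    val = pairs[test_size:test_size + val_size]
    train = pairs[test_size + val_size:]
    return train, val, test

# With 6,682 pairs this reproduces the 5,462 / 610 / 610 split.
corpus = [(f"iban_{i}", f"malay_{i}") for i in range(6682)]
train, val, test = split_dataset(corpus)
print(len(train), len(val), len(test))  # 5462 610 610
```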
## Metrics

### Evaluation (Test Set)
| Metric | Score |
|---|---|
| Test Loss | 1.3508 |
| BLEU | 35.18 |
| chrF | 62.24 |
| chrF++ | 59.93 |
- BLEU 35.18 indicates strong translation performance given the limited data, especially for low-resource Iban input.
- chrF and chrF++ scores show good character-level (and, for chrF++, word-level) agreement with the Malay references.
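For intuition about the chrF number above, here is a simplified pure-Python sketch of the metric (Popović, 2015): character n-gram precision and recall up to n=6, combined into an F-beta score with beta=2. It omits sentence-level smoothing and the word n-grams of chrF++; for reporting, use an established implementation such as sacreBLEU's chrF rather than this sketch.

```python
from collections import Counter

def char_ngrams(text, n):
    # Count character n-grams, ignoring spaces (as chrF does by default)
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: averaged character n-gram precision/recall
    combined into an F-beta score (0-100). Sketch only."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("nama saya Joshua", "nama saya Joshua"), 1))  # 100.0
```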
## How to Use

### Inference Example (Google Colab / Python)

Install dependencies:

```bash
pip install transformers torch sentencepiece
```

Example usage:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "mds04/nllb-iban2malay-600m"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zsm_Latn", tgt_lang="zsm_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example Iban sentence
iban_text = "Nama aku Joshua, aku udah pergi ke rumah kawan kemari."

# Tokenize input
inputs = tokenizer(iban_text, return_tensors="pt", padding=True)

# Generate translation
outputs = model.generate(**inputs, max_length=256)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("Iban:", iban_text)
print("Malay Translation:", translation)
```
Expected output:

```text
Iban: Nama aku Joshua, aku udah pergi ke rumah kawan kemari.
Malay Translation: Nama saya Joshua, saya sudah pergi ke rumah kawan semalam.
```
## Integration Tips

- Ideal for ASR → Translation pipelines (e.g., after transcribing Iban speech)
- Use the Malay proxy code (zsm_Latn) for consistent tokenization
- Works best on complete, conversational sentences
- Combine with mds04/iban-bukar-malay-langid-lr for automatic language routing before translation
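The routing tip above can be sketched as a small dispatch function: a language-ID step decides whether a sentence needs translation at all. Both helpers below are deliberately stubbed (the keyword-based detector is a placeholder, not the mds04/iban-bukar-malay-langid-lr model); swap in the real classifier and this model's generate() call from the inference example.

```python
def detect_language(text: str) -> str:
    """Stub: replace with the langid model's prediction.
    Here a single keyword stands in for a real classifier."""
    return "iban" if " udah " in f" {text} " else "malay"

def translate_iban_to_malay(text: str) -> str:
    """Stub: replace with tokenizer + model.generate() from the
    inference example above."""
    return f"<translated> {text}"

def route(text: str) -> str:
    # Only Iban input goes through the translation model;
    # Malay passes through unchanged.
    if detect_language(text) == "iban":
        return translate_iban_to_malay(text)
    return text

print(route("Aku udah makan."))    # routed through the translator
print(route("Saya sudah makan."))  # passed through unchanged
```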
## Limitations & Risks

- Low-resource bias: Limited Iban data may lead to literal or grammatically imperfect Malay outputs
- Malay proxy limitation: Iban tokens are internally treated as Malay (zsm_Latn), so some rare words may lose meaning
- Domain sensitivity: Performs best on conversational or daily-use Iban sentences
- Ethical note: Use with respect to Orang Asli data consent and community governance
## Citation / Attribution

If you use this model, please cite:

- POLAR (Preserving Orang Asli Language Resources), Project ID 47208
- Model: mds04/nllb-iban2malay-600m
- Based on: facebook/nllb-200-distilled-600M
## License
Refer to the repository's license file.
If none is provided, contact the model owner for permission details.
## Contact & Support

For inquiries, dataset info, or technical questions, please contact the POLAR project maintainers (owner: mds04) or open an issue on this model's repository.