Preserving Orang Asli Language Resources (POLAR)

mds04/nllb-iban2malay-600m

Model: mds04/nllb-iban2malay-600m
Task: Language Translation - Iban → Malay
Type: Fine-tuned model based on facebook/nllb-200-distilled-600M
Project: POLAR (Project ID: 47208)


Summary

mds04/nllb-iban2malay-600m is a translation model fine-tuned from Facebook's NLLB-200-Distilled-600M, specialized for translating Iban → Malay.
Malay is used as a proxy language since Iban is not natively supported by NLLB; reusing the tag of the closest supported language is a practical strategy for low-resource language adaptation.

The model helps bridge communication and documentation between Iban-speaking and Malay-speaking communities - a crucial step for language preservation and revitalization under the POLAR initiative.


How It Was Built

1. Base model:

facebook/nllb-200-distilled-600M (version-agnostic)

2. Tokenizer initialization:

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="zsm_Latn", 
    tgt_lang="zsm_Latn"
)

3. Data split:

Split        Samples
Train          5,462
Validation       610
Test             610
Total          6,682

4. Preprocessing:

  • Both source and target use "zsm_Latn" (Malay Latin) code
  • Sentences tokenized with max_length = 256
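
The preprocessing step can be sketched as below. This is a minimal, dependency-free illustration: `toy_tokenize` is a stand-in whitespace tokenizer so the sketch runs on its own, whereas the real pipeline uses the NLLB tokenizer loaded in step 2 with the same `max_length = 256` truncation.

```python
MAX_LENGTH = 256  # matches the max_length used during fine-tuning

def toy_tokenize(text, max_length=MAX_LENGTH):
    """Stand-in for the NLLB tokenizer: split on whitespace, truncate."""
    return text.split()[:max_length]

def preprocess_pair(src_text, tgt_text, max_length=MAX_LENGTH):
    """Turn one Iban/Malay pair into model inputs and labels."""
    return {
        "input_ids": toy_tokenize(src_text, max_length),  # Iban source
        "labels": toy_tokenize(tgt_text, max_length),     # Malay target
    }

example = preprocess_pair("Nama aku Joshua.", "Nama saya Joshua.")
print(example["input_ids"])  # ['Nama', 'aku', 'Joshua.']
```

In the real pipeline both sides are tokenized under the same "zsm_Latn" code, so Iban text is segmented with Malay subword statistics.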

5. Training setup:

  • Framework: 🤗 Transformers
  • Trainer: Seq2SeqTrainer
  • Training epochs: 5
  • Learning rate: 3e-05
  • Label smoothing: 0.0
  • Warmup ratio: 0.05
  • Effective batch size: 64
  • Early stopping applied:
    • Patience: 4 epochs
    • Threshold: 0.0005 BLEU improvement
  • Optimized for BLEU score on validation set
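
The hyperparameters above map onto a Seq2SeqTrainer configuration roughly like the sketch below. The `output_dir` name and the per-device/accumulation split (16 × 4 = 64 effective) are assumptions not stated on this card, and newer transformers releases spell `evaluation_strategy` as `eval_strategy`.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-iban2malay-600m",  # assumed output directory name
    num_train_epochs=5,
    learning_rate=3e-5,
    label_smoothing_factor=0.0,
    warmup_ratio=0.05,
    per_device_train_batch_size=16,     # 16 x 4 accumulation = 64 effective
    gradient_accumulation_steps=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,         # generate() during eval for BLEU
    metric_for_best_model="bleu",
    greater_is_better=True,
    load_best_model_at_end=True,
)

# Early stopping on validation BLEU, as described above
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=4,
    early_stopping_threshold=0.0005,
)
```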

Metrics

Evaluation (Test Set)

Metric       Score
Test Loss    1.3508
BLEU         35.18
chrF         62.24
chrF++       59.93
  • A BLEU of 35.18 indicates strong translation quality given the small training set, especially for low-resource Iban input.
  • The chrF and chrF++ scores show good character- and word-level agreement with the Malay references.
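
For intuition, chrF is an F-score over character n-grams (n = 1..6, recall-weighted with beta = 2), and chrF++ additionally mixes in word n-grams. The minimal, dependency-free sketch below illustrates the chrF idea; the scores reported above would come from a standard implementation such as sacrebleu, which this sketch does not exactly reproduce.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts; chrF ignores whitespace in the stream."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score averaged over n = 1..max_n, scaled to 0-100."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

print(round(chrf("nama saya Joshua", "nama saya Joshua"), 2))  # 100.0
```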

How to Use

Inference Example (Google Colab / Python)

Install dependencies:

pip install transformers torch sentencepiece

Example usage:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "mds04/nllb-iban2malay-600m"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="zsm_Latn", tgt_lang="zsm_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example Iban sentence
iban_text = "Nama aku Joshua, aku udah pergi ke rumah kawan kemari."

# Tokenize input
inputs = tokenizer(iban_text, return_tensors="pt", padding=True)

# Generate translation
outputs = model.generate(
    **inputs,
    max_length=256,
    # If outputs drift into another language, the usual NLLB fix is to
    # force the Malay target token:
    # forced_bos_token_id=tokenizer.convert_tokens_to_ids("zsm_Latn"),
)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("Iban:", iban_text)
print("Malay Translation:", translation)

Expected output:

Iban: Nama aku Joshua, aku udah pergi ke rumah kawan kemari.
Malay Translation: Nama saya Joshua, saya sudah pergi ke rumah kawan semalam.

Integration Tips

  • Ideal for ASR → Translation pipelines (e.g., after transcribing Iban speech)
  • Use Malay proxy (zsm_Latn) for consistent tokenization
  • Works best on complete, conversational sentences
  • Combine with mds04/iban-bukar-malay-langid-lr for automatic language routing before translation
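
The routing tip can be sketched as below. The detector and translator are passed in as callables; in practice they would wrap the langid classifier and this translation model, and the toy heuristic and placeholder translator here are illustrative stand-ins only.

```python
def route_and_translate(text, detect_lang, translate_iban_to_malay):
    """Translate only when the detector labels the input as Iban."""
    if detect_lang(text) == "iban":
        return translate_iban_to_malay(text)
    return text  # already Malay (or unknown): pass through unchanged

# Stand-in callables for demonstration:
def detector(text):
    # Toy heuristic; the real pipeline would call the langid model
    return "iban" if "aku" in text else "malay"

def translator(text):
    # Placeholder; the real pipeline would call the translation model
    return "[MS] " + text

print(route_and_translate("Nama aku Joshua.", detector, translator))
# [MS] Nama aku Joshua.
```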

Limitations & Risks

  • Low-resource bias: Limited Iban data may lead to literal or grammar-imperfect Malay outputs
  • Malay proxy limitation: Iban tokens are internally treated as Malay (zsm_Latn), so some rare words may lose meaning
  • Domain sensitivity: Performs best on conversational or daily-use Iban sentences
  • Ethical note: Use with respect to Orang Asli data consent and community governance

Citation / Attribution

If you use this model, please cite:

  • POLAR (Preserving Orang Asli Language Resources), Project ID 47208
  • Model: mds04/nllb-iban2malay-600m
  • Based on: facebook/nllb-200-distilled-600M

License

Refer to the repository's license file.
If none is provided, contact the model owner for permission details.


Contact & Support

For inquiries, dataset info, or technical questions,
please contact the POLAR project maintainers (owner: mds04)
or open an issue on this model's repository.
