NLLB-200 SENCOTEN to English Translation Model

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating from SENCOTEN (a Coast Salish language) to English.

Model Description

  • Base Model: facebook/nllb-200-distilled-600M
  • Custom Tokenizer: jwarrenbc/nllb-200-sencoten-tokenizer
  • Language Pair: SENCOTEN → English
  • Language Codes: sen_Latn (SENCOTEN), eng_Latn (English)

Training Data

The model was trained on a combined corpus of SENCOTEN Dictionary phrases and Grammar sentences.

  • Training samples: 16,783
  • Validation samples: 1,865

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("jwarrenbc/nllb-200-sencoten-english")
tokenizer = AutoTokenizer.from_pretrained("jwarrenbc/nllb-200-sencoten-english")

# Set source language to SENCOTEN
tokenizer.src_lang = "sen_Latn"

# Translate SENCOTEN to English
text = "kw'unnuhw sun kwsu sway'qu' i' kwsu slheni'."
inputs = tokenizer(text, return_tensors="pt")

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=128
)

translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translation)

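For translating several sentences at once, the snippet above can be wrapped in a small batched helper. This is a sketch, not part of the released model card: it assumes the `model` and `tokenizer` objects loaded above, and the function name `translate_batch` is ours.

```python
def translate_batch(sentences, model, tokenizer, max_length=128):
    """Translate a list of SENCOTEN sentences to English in a single batch."""
    tokenizer.src_lang = "sen_Latn"
    # Pad to the longest sentence in the batch so tensors line up
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Batching amortizes the per-call generation overhead, which matters when translating a whole dictionary or corpus rather than single phrases.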
Training Procedure

  • Epochs: 10
  • Batch Size: 4 (effective: 32)
  • Learning Rate: 5e-05
  • Warmup Steps: 500
  • FP16: True
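The hyperparameters above correspond roughly to a `Seq2SeqTrainingArguments` configuration like the following. This is a reconstruction, not the actual training script; in particular, `gradient_accumulation_steps=8` is inferred from the batch size of 4 and effective batch size of 32, and the `output_dir` name is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration implied by the reported hyperparameters.
# gradient_accumulation_steps=8 is an inference: 4 per step x 8 steps = 32 effective.
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-sencoten-english",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,
)
```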

Evaluation Results

  • Eval Loss: 1.2282
  • BLEU Score: 23.41

Acknowledgments

This model was developed to support SENCOTEN language preservation and revitalization efforts.
