NLLB-200 SENCOTEN to English Translation Model

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating from SENCOTEN (a Coast Salish language) to English.

Model Description

  • Base Model: facebook/nllb-200-distilled-600M
  • Custom Tokenizer: jwarrenbc/nllb-200-sencoten-tokenizer
  • Language Pair: SENCOTEN → English
  • Language Codes: sen_Latn (SENCOTEN), eng_Latn (English)

Training Data

The model was trained on a combined corpus of SENCOTEN Dictionary phrases and Grammar sentences.

  • Training samples: 16,783
  • Validation samples: 1,865

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("jwarrenbc/nllb-200-sencoten-english")
tokenizer = AutoTokenizer.from_pretrained("jwarrenbc/nllb-200-sencoten-english")

# Set source language to SENCOTEN
tokenizer.src_lang = "sen_Latn"

# Translate SENCOTEN to English
text = "kw'unnuhw sun kwsu sway'qu' i' kwsu slheni'."
inputs = tokenizer(text, return_tensors="pt")

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=128
)

translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translation)

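For translating several sentences at once, the snippet above can be wrapped in a small batched helper. This is a sketch, not part of the released model card: it assumes the `model` and `tokenizer` objects loaded above, and the function name `translate_batch` is ours.

```python
def translate_batch(sentences, model, tokenizer, max_length=128):
    """Translate a list of SENCOTEN sentences to English in a single batch."""
    tokenizer.src_lang = "sen_Latn"
    # Pad to the longest sentence in the batch so tensors line up
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Batching amortizes the per-call generation overhead, which matters when translating a whole dictionary or corpus rather than single phrases.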
Training Procedure

  • Epochs: 10
  • Batch Size: 4 (effective: 32)
  • Learning Rate: 5e-05
  • Warmup Steps: 500
  • FP16: True
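The hyperparameters above correspond roughly to a `Seq2SeqTrainingArguments` configuration like the following. This is a reconstruction, not the actual training script; in particular, `gradient_accumulation_steps=8` is inferred from the batch size of 4 and effective batch size of 32, and the `output_dir` name is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration implied by the reported hyperparameters.
# gradient_accumulation_steps=8 is an inference: 4 per step x 8 steps = 32 effective.
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-sencoten-english",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,
)
```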

Evaluation Results

  • Eval Loss: 1.2282
  • BLEU Score: 23.41

Acknowledgments

This model was developed to support SENCOTEN language preservation and revitalization efforts.
