Northern Frisian translation model

This is an NLLB-200-600M model fine-tuned for translating between German and the Northern Frisian dialect Mooring, following this great blog post.

Data

Version 3.0

The fine-tuning dataset consisted of 9336 sentence pairs in the Ååstermooring dialect of North Frisian with German translations. Most examples (roughly 5100) were taken directly from "Rüm Hart", published by the Nordfriisk Instituut. Sentences were split with the Python sentence-splitter library. The splitting wasn't perfect, especially for direct speech, so manual re-alignment and further splitting were necessary. A further >3500 examples were taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988), published by the Nordfriesische Wörterbuchstelle at Christian-Albrechts-Universität Kiel. Finally, a little under 180 very simple self-written examples served as the evaluation dataset.

Pre Version 3.0

The fine-tuning dataset consisted of 7194 sentence pairs in the Ååstermooring dialect of North Frisian with German translations. Most examples (roughly 5100) were taken directly from "Rüm Hart", published by the Nordfriisk Instituut. Sentences were split with the Python sentence-splitter library. The splitting wasn't perfect, especially for direct speech, so manual re-alignment and further splitting were necessary. A further roughly 2000 examples were taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988), published by the Nordfriesische Wörterbuchstelle at Christian-Albrechts-Universität Kiel. Finally, a little under 180 very simple self-written examples served as the evaluation dataset.

Usage

Version 3.0

From version 3.0 on, the token moo_Latn is used for Mooring; older versions used frr_Latn. Usage has become simpler because the new token is now baked into the model. For example:

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

def translate(
    text,
    tokenizer,
    model,
    src_lang='moo_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    # Tokenize, truncating overly long inputs.
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        # Force the target-language token as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Length cap: base allowance a plus b tokens per input token.
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

path = "CmdCody/nllb-deu-moo"
tokenizer = NllbTokenizer.from_pretrained(path)
model = AutoModelForSeq2SeqLM.from_pretrained(path)

translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
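The a and b defaults in translate() implement a simple output-length cap: generation stops after a + b tokens per input token, so short inputs still get some headroom while long inputs scale linearly. As a standalone sketch of that heuristic (names mirror the parameters above):

```python
def max_new_tokens(input_length, a=32, b=3):
    # Base allowance a plus b generated tokens per input token,
    # matching the defaults of translate() above.
    return int(a + b * input_length)

print(max_new_tokens(10))  # 62
```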

Pre Version 3.0

Earlier versions rely on an old version of the transformers library, so the tokenizer has to be patched every time the model is loaded. Also note that these versions use the token frr_Latn for North Frisian/Mooring rather than the moo_Latn used in newer versions.

!pip install transformers==4.33

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

def create_tokenizer_with_new_lang(model_id, new_lang):
    tokenizer = NllbTokenizer.from_pretrained(model_id)
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}

    return tokenizer

def translate(
    text,
    tokenizer,
    model,
    src_lang='frr_Latn',
    tgt_lang='deu_Latn',
    a=32,
    b=3,
    max_input_length=1024,
    num_beams=4,
    **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    # Tokenize, truncating overly long inputs.
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        # Force the target-language token as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Length cap: base allowance a plus b tokens per input token.
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

path = "CmdCody/nllb-deu-moo"
tokenizer = create_tokenizer_with_new_lang(path, 'frr_Latn')
model = AutoModelForSeq2SeqLM.from_pretrained(path)

translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)

Training

Version 3.0

The model was trained on an Nvidia L4 GPU (thanks to the FPS Niebüll) with a batch size of 8 for 3 epochs; in each epoch the model saw the full training dataset in randomized order.
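A training configuration matching these numbers could look roughly like the following sketch (the output directory is a hypothetical name; the actual training script is not part of this card and may differ):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical configuration mirroring the reported setup:
# batch size 8, 3 epochs, training data shuffled each epoch
# (the Trainer shuffles by default).
args = Seq2SeqTrainingArguments(
    output_dir='nllb-deu-moo-finetune',  # hypothetical path
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_strategy='epoch',
)
```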

Metrics on the evaluation data set:

            BLEU    ChrF++
Frr -> De   51.88   68.23
De -> Frr   48.25   65.66

Pre Version 3.0

The model was trained in a Google Colab notebook for 5000 steps with a batch size of 16, following the above-mentioned blog post.

Metrics on the evaluation data set:

            BLEU    ChrF++
Frr -> De   48.79   65.12
De -> Frr   47.56   65.03