Nagamese-to-English NMT
This model is a fine-tuned version of facebook/nllb-200-distilled-600M specifically adapted for Nagamese (Naga Creole) to English translation. It is designed to be the second stage of a speech processing pipeline, converting ASR-generated Nagamese text into English.
Model Details
Model Description
Nagamese is a creole language primarily spoken in Nagaland, India. Because it lacks a standardized orthography and large-scale parallel datasets, general-purpose translation models often handle it poorly. This model uses Parameter-Efficient Fine-Tuning (LoRA) to adapt the NLLB-200 architecture to the syntax and vocabulary of Nagamese using a small, curated dataset of roughly 500 parallel sentences.
- Developed by: Kenei
- Model type: Neural Machine Translation (Encoder-Decoder)
- Language(s): Nagamese (Source) to English (Target)
- Finetuned from model: facebook/nllb-200-distilled-600M
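To illustrate the LoRA idea mentioned in the description, here is a minimal pure-Python sketch of how a low-rank update is added to a frozen weight matrix. The matrix sizes, rank, and alpha value are toy assumptions chosen for illustration; the card does not state the rank, scaling, or target modules actually used for this model.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)  # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes (assumed for illustration, not the NLLB dimensions).
d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen pretrained weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero-initialized

# With B = 0 the adapted layer starts out identical to the pretrained one.
unchanged = lora_effective_weight(w, a, b, alpha=16) == w

full_params = d_out * d_in           # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters updated by LoRA
print(unchanged, full_params, lora_params)
```

Only A and B are trained while W stays frozen, which is why the approach suits a dataset of a few hundred sentences: far fewer parameters are free to overfit.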
Direct Use
The model is intended to translate Nagamese text, specifically the output of an Automatic Speech Recognition (ASR) system, into English. It is optimized for everyday conversational sentences and common phrases.
Out-of-Scope Use
- Not intended for legal, medical, or official government documentation.
- May struggle with highly technical or scientific jargon that does not appear in the training set of roughly 500 sentences.
Bias, Risks, and Limitations
Technical Limitations
- Dataset Size: The project is an extreme low-resource setting (trained on ~500 unique parallel sentences), so the model may overfit, reproducing memorized phrases rather than generalizing to unseen sentences.
- Orthography: Since Nagamese is often written phonetically, variations in spelling (e.g., "bosti" vs "boosti") may affect translation quality.
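One way to reduce the effect of spelling variation is to normalize known variants before translation. The sketch below is a hypothetical preprocessing step, not part of the released model: the function name and variant table are assumptions, only the "boosti"/"bosti" pair comes from this card, and treating "bosti" as the canonical form is also an assumption.

```python
import re

# Hypothetical variant table: only the "boosti"/"bosti" pair comes from the
# model card; a real table would be built from the ASR system's own output.
SPELLING_VARIANTS = {
    "boosti": "bosti",
}

def normalize_nagamese(text, variants=SPELLING_VARIANTS):
    """Map known spelling variants to one canonical form, word by word."""
    def replace(match):
        word = match.group(0)
        canonical = variants.get(word.lower())
        if canonical is None:
            return word  # unknown words pass through unchanged
        # Preserve a leading capital so sentence-initial words stay capitalized.
        return canonical.capitalize() if word[0].isupper() else canonical
    # [^\W\d_]+ matches runs of letters, leaving punctuation and digits alone.
    return re.sub(r"[^\W\d_]+", replace, text)

print(normalize_nagamese("Ami boosti te jai ase."))
```

The normalized string can then be passed to the tokenizer in place of the raw ASR output.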
Recommendations
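Users should pair this model with a robust ASR system. Label smoothing is a training-time technique rather than an inference setting, so it is recommended when fine-tuning the model further, where it can make the model more tolerant of the inherent noise in Nagamese transcriptions.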
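For reference, label smoothing changes the loss computed during training: the target distribution puts most of its mass on the correct token and spreads a small amount, epsilon, uniformly over the vocabulary. The sketch below computes that loss for a single token in pure Python. The logits and epsilon value are illustrative assumptions; the card does not state what value, if any, was used for this model.

```python
import math

def label_smoothed_nll(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution.

    The true class gets probability 1 - epsilon + epsilon / V and every
    other class gets epsilon / V, where V is the vocabulary size.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    v = len(logits)
    nll = -log_probs[target]       # ordinary cross-entropy term
    uniform = -sum(log_probs) / v  # cross-entropy against the uniform distribution
    return (1.0 - epsilon) * nll + epsilon * uniform

logits = [2.0, 0.5, -1.0, 0.1]  # toy 4-token vocabulary
plain = label_smoothed_nll(logits, target=0, epsilon=0.0)
smoothed = label_smoothed_nll(logits, target=0, epsilon=0.1)
print(plain, smoothed)
```

When fine-tuning with the Hugging Face Trainer, the same idea is exposed through the `label_smoothing_factor` training argument.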
How to Get Started with the Model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Replace the placeholder repo id with the actual model location.
tokenizer = AutoTokenizer.from_pretrained("your-username/nagamese-nmt")
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/nagamese-nmt")

text = "Ami bosti te jai ase."  # "I am going to the village."
inputs = tokenizer(text, return_tensors="pt")
# Force English as the target language. convert_tokens_to_ids is used here
# because tokenizer.lang_code_to_id is deprecated in recent transformers releases.
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"))
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))