
Nagamese-to-English NMT

This model is a fine-tuned version of facebook/nllb-200-distilled-600M specifically adapted for Nagamese (Naga Creole) to English translation. It is designed to be the second stage of a speech processing pipeline, converting ASR-generated Nagamese text into English.

Model Details

Model Description

Nagamese is a creole language primarily spoken in Nagaland, India. Because it lacks a standardized orthography and large-scale parallel datasets, standard translation models often fail. This model uses Parameter-Efficient Fine-Tuning (LoRA) to adapt the NLLB-200 architecture to the specific syntax and vocabulary of Nagamese using a highly curated, small-scale dataset.
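The parameter savings from LoRA can be sketched numerically: the pretrained weight matrix W stays frozen and only a low-rank update B·A is trained. The dimensions and rank below are illustrative, not the configuration actually used for this model.

```python
import numpy as np

# LoRA sketch: freeze W, train only the low-rank factors A and B.
# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
# All sizes here are illustrative assumptions.
rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = np.zeros((r, d_in))                  # standard LoRA init: A = 0
B = rng.standard_normal((d_out, r))      # so B @ A = 0 at the start

x = rng.standard_normal(d_in)
y = W @ x + B @ (A @ x)                  # adapted forward pass

trainable_ratio = (A.size + B.size) / W.size
print(f"trainable fraction: {trainable_ratio:.4%}")  # 1.5625% with these sizes
```

Because A is initialized to zero, the adapted model starts out identical to the base NLLB checkpoint and only gradually diverges during fine-tuning.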

  • Developed by: Kenei
  • Model type: Neural Machine Translation (Encoder-Decoder)
  • Language(s): Nagamese (Source) to English (Target)
  • Finetuned from model: facebook/nllb-200-distilled-600M

Direct Use

The model is intended to translate Nagamese text—specifically outputs from an Automatic Speech Recognition (ASR) system—into English. It is optimized for daily conversational sentences and common phrases.

Out-of-Scope Use

  • Not intended for legal, medical, or official government documentation.
  • May struggle with highly technical or scientific jargon not present in the 500-sentence training set.

Bias, Risks, and Limitations

Technical Limitations

  • Dataset Size: Due to the "extreme low-resource" nature of the project (trained on ~500 unique parallel sentences), the model may over-rely on memorized phrases.
  • Orthography: Since Nagamese is often written phonetically, variations in spelling (e.g., "bosti" vs "boosti") may affect translation quality.
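One way to mitigate the spelling-variation issue is a normalization pass before translation. The sketch below is a hypothetical pre-processing step, not part of the released model; the variant table and vowel-collapsing rule are illustrative only.

```python
import re

# Hypothetical spelling normalizer (illustrative, not part of this model):
# map common phonetic variants to a single form before translation.
VARIANTS = {
    "boosti": "bosti",   # example variant pair from the model card
}

def normalize(text: str) -> str:
    words = []
    for w in text.lower().split():
        # Collapse repeated vowels ("jaai" -> "jai"), then check the table.
        w = re.sub(r"([aeiou])\1+", r"\1", w)
        words.append(VARIANTS.get(w, w))
    return " ".join(words)

print(normalize("Ami boosti te jaai ase."))  # -> "ami bosti te jai ase."
```

In practice such a table would need to be built from the same curated data used for fine-tuning, so that normalization maps variants onto spellings the model has actually seen.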

Recommendations

Users should pair this model with a robust ASR system. Label smoothing during fine-tuning helps the model tolerate the inherent noise (spelling and transcription variation) in Nagamese text.
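For readers unfamiliar with label smoothing: instead of training against a one-hot target, the target distribution is mixed with a small uniform component, which discourages over-confident predictions on noisy data. The epsilon value below is a common default, not necessarily the one used for this model.

```python
import numpy as np

# Label smoothing sketch: mix the one-hot target with a uniform
# distribution. epsilon = 0.1 is a common default (illustrative here).
def smooth_labels(target_index: int, vocab_size: int, epsilon: float = 0.1):
    labels = np.full(vocab_size, epsilon / vocab_size)
    labels[target_index] += 1.0 - epsilon
    return labels

labels = smooth_labels(2, vocab_size=5)
print(labels)        # [0.02 0.02 0.92 0.02 0.02]
print(labels.sum())  # 1.0
```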

How to Get Started with the Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Replace with the actual repository id.
tokenizer = AutoTokenizer.from_pretrained("your-username/nagamese-nmt")
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/nagamese-nmt")

text = "Ami bosti te jai ase."  # "I am going to the village."
inputs = tokenizer(text, return_tensors="pt")

# Force English ("eng_Latn") as the target language. convert_tokens_to_ids
# works across transformers versions; the older lang_code_to_id attribute
# has been removed from recent releases.
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True))