---
license: apache-2.0
language:
- tgj
- en
datasets:
- repleeka/Tagin-KIFS-Corpus-v1.0
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
tags:
- mbart
- nmt
- lowres
- tagin
---

# mBART-tgj-final

`mBART-tgj-final` is a fine-tuned sequence-to-sequence model based on `mBART-50-M2M` for Tagin → English neural machine translation. The model is trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered with a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.

## Developer

- Name: Tungon Dugi
- Institution: National Institute of Technology Arunachal Pradesh

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("repleeka/mBART-tgj-final")

# Source-language code: check the model repository for the code assigned
# to Tagin input if it differs from the mBART-50 defaults.
tokenizer.src_lang = "en_XX"

text = "How are you?"  # replace with a Tagin source sentence
inputs = tokenizer(text, return_tensors="pt")

# The model translates into English, so force the decoder to start
# with the English language token.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```

## Intended Use

- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments with the Tani language family
- Domain adaptation and linguistic resource development

## Not intended for

- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation

## Model Architecture

| Component | Value |
| :--- | :--- |
| **Base Model** | mBART-50 Large |
| **Layers** | 12-layer encoder & decoder |
| **Hidden size** | 1024 |
| **FFN dimension** | 4096 |
| **Attention heads** | 16 |
| **Vocabulary size** | 250,055 |
| **Tokenizer** | MBart50Tokenizer |

These values can be checked against the checkpoint configuration (see the sketch near the end of this card).

## Evaluation Results

| Metric | Score |
| :--- | :--- |
| **BLEU** | 40.27 |
| **ChrF** | 59.38 |
| **METEOR** | 46.29 |
| **TER** | 44.42 |
| **Loss** | 0.0503 |

TER and loss are error measures (lower is better); the remaining metrics are higher-is-better. A scoring sketch appears near the end of this card.

### Statistical Significance Test Results

| Parameter | Detail |
| :--- | :--- |
| **Evaluation Set** | 890-sentence held-out test set |
| **Test Method** | Approximate Randomization |
| **Significance Level** | p < 0.01 |
| **Comparison** | Large performance gains over baseline `mBART-tgj-base` |

A sketch of a paired approximate randomization test is included near the end of this card.

## Limitations & Ethical Considerations

| Category | Details |
| :--- | :--- |
| **Linguistic Limitations** | The model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| **Cultural Sensitivity** | Sensitive cultural or mythological meanings may require expert review |
| **Data Bias** | Bias inherited from synthetic data cannot be fully eliminated |
| **Deployment Scope** | Not suitable for automated translation of critical documents |

## Citation

Dugi, T., & Sambyo, K. (2025). *A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts.* National Institute of Technology Arunachal Pradesh.

## License

Released under Apache-2.0 (as declared in the card metadata) for research openness and compatibility. Data licensing depends on source text availability.
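
## Checking the Architecture

The architecture table above can be verified directly from the checkpoint configuration without downloading the weights. A minimal sketch, assuming the standard `transformers` config attributes for mBART:

```python
from transformers import AutoConfig

# Fetches only the model configuration, not the weights.
config = AutoConfig.from_pretrained("repleeka/mBART-tgj-final")

print(config.encoder_layers, config.decoder_layers)  # encoder/decoder depth
print(config.d_model)                  # hidden size
print(config.encoder_ffn_dim)          # feed-forward dimension
print(config.encoder_attention_heads)  # attention heads
print(config.vocab_size)               # vocabulary size
```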
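
## Scoring Sketch

The card does not state which toolkit produced the evaluation scores. The sketch below assumes `sacrebleu` for BLEU, ChrF, and TER, and the Hugging Face `evaluate` package for METEOR; `hypotheses` and `references` are hypothetical placeholders for the model outputs and gold translations of the test set.

```python
import sacrebleu
import evaluate

# Hypothetical placeholders: one system output and one reference per test sentence.
hypotheses = ["a model translation"]
references = ["a gold reference translation"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
meteor = evaluate.load("meteor").compute(predictions=hypotheses, references=references)

print(f"BLEU {bleu.score:.2f}  ChrF {chrf.score:.2f}  TER {ter.score:.2f}  "
      f"METEOR {100 * meteor['meteor']:.2f}")
```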
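
## Approximate Randomization Sketch

The exact significance-testing procedure is described in the cited paper; the sketch below is a generic paired approximate randomization test on the corpus BLEU difference between two systems, not the authors' implementation. `sys_a` and `sys_b` are hypothetical lists holding the two systems' outputs for the same test sentences.

```python
import random
import sacrebleu

def bleu(hyps, refs):
    # Corpus-level BLEU; refs is a flat list with one reference per hypothesis.
    return sacrebleu.corpus_bleu(hyps, [refs]).score

def approximate_randomization(sys_a, sys_b, refs, trials=1000, seed=0):
    """Paired approximate randomization test on the corpus BLEU difference."""
    rng = random.Random(seed)
    observed = abs(bleu(sys_a, refs) - bleu(sys_b, refs))
    exceed = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:  # randomly swap the two systems' outputs
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(bleu(shuf_a, refs) - bleu(shuf_b, refs)) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)  # add-one smoothed p-value
```

A p-value below 0.01 from this test would support the claim that the gain over `mBART-tgj-base` is not due to chance.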
## Contact

For collaboration, feedback, or issues: *tungondugi@gmail.com*