---
license: apache-2.0
language:
- tgj
- en
datasets:
- repleeka/Tagin-KIFS-Corpus-v1.0
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
tags:
- mbart
- nmt
- lowres
- tagin
---

# mBART-tgj-final

`mBART-tgj-final` is a fine-tuned sequence-to-sequence model based on `mBART-50-M2M` for Tagin → English neural machine translation. The model is trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered with a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.

## Developer

- Name: Tungon Dugi
- Institution: National Institute of Technology Arunachal Pradesh

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("repleeka/mBART-tgj-final")

# Source-language code: check the model repository for the code assigned
# to Tagin input if it differs from the mBART-50 defaults.
tokenizer.src_lang = "en_XX"

text = "How are you?"  # replace with a Tagin source sentence
inputs = tokenizer(text, return_tensors="pt")

# The model translates into English, so force the decoder to start
# with the English language token.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```

## Intended Use

- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments with the Tani language family
- Domain adaptation and linguistic resource development

## Not intended for

- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation

## Model Architecture

| Component | Value |
| :--- | :--- |
| **Base Model** | mBART-50 Large |
| **Layers** | 12-layer encoder & decoder |
| **Hidden size** | 1024 |
| **FFN dimension** | 4096 |
| **Attention heads** | 16 |
| **Vocabulary size** | 250,055 |
| **Tokenizer** | MBart50Tokenizer |

These values can be checked against the checkpoint configuration (see the sketch near the end of this card).

## Evaluation Results

| Metric | Score |
| :--- | :--- |
| **BLEU** | 40.27 |
| **ChrF** | 59.38 |
| **METEOR** | 46.29 |
| **TER** | 44.42 |
| **Loss** | 0.0503 |

TER and loss are error measures (lower is better); the remaining metrics are higher-is-better. A scoring sketch appears near the end of this card.

### Statistical Significance Test Results

| Parameter | Detail |
| :--- | :--- |
| **Evaluation Set** | 890-sentence held-out test set |
| **Test Method** | Approximate Randomization |
| **Significance Level** | p < 0.01 |
| **Comparison** | Large performance gains over baseline `mBART-tgj-base` |

A sketch of a paired approximate randomization test is included near the end of this card.

## Limitations & Ethical Considerations

| Category | Details |
| :--- | :--- |
| **Linguistic Limitations** | The model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| **Cultural Sensitivity** | Sensitive cultural or mythological meanings may require expert review |
| **Data Bias** | Bias inherited from synthetic data cannot be fully eliminated |
| **Deployment Scope** | Not suitable for automated translation of critical documents |

## Citation

Dugi, T., & Sambyo, K. (2025). *A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts.* National Institute of Technology Arunachal Pradesh.

## License

Released under Apache-2.0 (as declared in the card metadata) for research openness and compatibility. Data licensing depends on source text availability.
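
## Checking the Architecture

The architecture table above can be verified directly from the checkpoint configuration without downloading the weights. A minimal sketch, assuming the standard `transformers` config attributes for mBART:

```python
from transformers import AutoConfig

# Fetches only the model configuration, not the weights.
config = AutoConfig.from_pretrained("repleeka/mBART-tgj-final")

print(config.encoder_layers, config.decoder_layers)  # encoder/decoder depth
print(config.d_model)                  # hidden size
print(config.encoder_ffn_dim)          # feed-forward dimension
print(config.encoder_attention_heads)  # attention heads
print(config.vocab_size)               # vocabulary size
```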
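
## Scoring Sketch

The card does not state which toolkit produced the evaluation scores. The sketch below assumes `sacrebleu` for BLEU, ChrF, and TER, and the Hugging Face `evaluate` package for METEOR; `hypotheses` and `references` are hypothetical placeholders for the model outputs and gold translations of the test set.

```python
import sacrebleu
import evaluate

# Hypothetical placeholders: one system output and one reference per test sentence.
hypotheses = ["a model translation"]
references = ["a gold reference translation"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
meteor = evaluate.load("meteor").compute(predictions=hypotheses, references=references)

print(f"BLEU {bleu.score:.2f}  ChrF {chrf.score:.2f}  TER {ter.score:.2f}  "
      f"METEOR {100 * meteor['meteor']:.2f}")
```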
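
## Approximate Randomization Sketch

The exact significance-testing procedure is described in the cited paper; the sketch below is a generic paired approximate randomization test on the corpus BLEU difference between two systems, not the authors' implementation. `sys_a` and `sys_b` are hypothetical lists holding the two systems' outputs for the same test sentences.

```python
import random
import sacrebleu

def bleu(hyps, refs):
    # Corpus-level BLEU; refs is a flat list with one reference per hypothesis.
    return sacrebleu.corpus_bleu(hyps, [refs]).score

def approximate_randomization(sys_a, sys_b, refs, trials=1000, seed=0):
    """Paired approximate randomization test on the corpus BLEU difference."""
    rng = random.Random(seed)
    observed = abs(bleu(sys_a, refs) - bleu(sys_b, refs))
    exceed = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:  # randomly swap the two systems' outputs
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(bleu(shuf_a, refs) - bleu(shuf_b, refs)) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)  # add-one smoothed p-value
```

A p-value below 0.01 from this test would support the claim that the gain over `mBART-tgj-base` is not due to chance.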
## Contact

For collaboration, feedback, or issues: *tungondugi@gmail.com*