---
license: apache-2.0
language:
- tgj
- en
datasets:
- repleeka/Tagin-KIFS-Corpus-v1.0
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
tags:
- mbart
- nmt
- lowres
- tagin
---
# mBART-tgj-final
`mBART-tgj-final` is a sequence-to-sequence model for Tagin → English neural machine translation, fine-tuned from `facebook/mbart-large-50-many-to-many-mmt`.
The model was trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered through a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.
## Developer
- Name: Tungon Dugi
- Institution: National Institute of Technology Arunachal Pradesh
## How to Get Started with the Model
Use the code below to get started with the model. Note that this snippet translates English → Tagin; for the primary Tagin → English direction, swap the language codes, as sketched after the snippet.
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned checkpoint and its tokenizer.
model = MBartForConditionalGeneration.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("repleeka/mBART-tgj-final")

# English -> Tagin: the source language is English ("en_XX"), and decoding
# is forced to start with the Tagin language token ("<tgj_IN>") added to
# the tokenizer during fine-tuning.
tokenizer.src_lang = "en_XX"
text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("<tgj_IN>"),
    num_beams=5,
    max_length=128,
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
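For the Tagin → English direction described above, the same pipeline applies with the language codes reversed. A minimal sketch, continuing from the snippet above (`model` and `tokenizer` already loaded) and assuming `<tgj_IN>` is the exact Tagin language-code token registered in this checkpoint's tokenizer:
```python
# Tagin -> English: set the source language to the Tagin code registered in
# this checkpoint's tokenizer (assumed to be "<tgj_IN>", as used above) and
# force decoding to start with English ("en_XX").
tokenizer.src_lang = "<tgj_IN>"

tagin_text = "..."  # placeholder: replace with a Tagin source sentence
inputs = tokenizer(tagin_text, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```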
## Intended Use
- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments within the Tani language family
- Domain adaptation and linguistic resource development
## Not Intended For
- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation
## Model Architecture
| Component | Value |
| :--- | :--- |
| **Base Model** | mBART-50 Large |
| **Layers** | 12-layer encoder & decoder |
| **Hidden size** | 1024 |
| **FFN dimension** | 4096 |
| **Attention heads** | 16 |
| **Vocabulary size** | 250,055 |
| **Tokenizer** | MBart50Tokenizer |
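These values mirror the standard mBART-50 Large configuration; the vocabulary is one entry larger than stock mBART-50, consistent with the added Tagin language token. As a quick sanity check, they can be read straight from the checkpoint; the attribute names below are the standard `MBartConfig` fields:
```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = AutoTokenizer.from_pretrained("repleeka/mBART-tgj-final")

print(config.encoder_layers, config.decoder_layers)  # layers: 12 encoder, 12 decoder
print(config.d_model)                                # hidden size: 1024
print(config.encoder_ffn_dim)                        # FFN dimension: 4096
print(config.encoder_attention_heads)                # attention heads: 16
print(len(tokenizer))                                # vocabulary size: 250,055
```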
## Evaluation Results
| Metric | Score |
| :--- | :--- |
| **BLEU** | 40.27 |
| **ChrF** | 59.38 |
| **METEOR** | 46.29 |
| **TER** (lower is better) | 44.42 |
| **Loss** | 0.0503 |
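The evaluation script is not included in this card; below is a minimal sketch of how BLEU, ChrF, and TER can be recomputed with `sacrebleu` (METEOR typically comes from separate tooling such as `nltk`). The hypothesis and reference lists are placeholders standing in for the 890-sentence test set:
```python
import sacrebleu

# Placeholders: one model output and one gold English reference per test sentence.
hypotheses = ["model output 1", "model output 2"]
references = [["gold reference 1", "gold reference 2"]]  # a single reference stream

print(sacrebleu.corpus_bleu(hypotheses, references).score)  # BLEU
print(sacrebleu.corpus_chrf(hypotheses, references).score)  # ChrF
print(sacrebleu.corpus_ter(hypotheses, references).score)   # TER
```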
### Statistical Significance Test Results
Evaluation was performed on an 890-sentence held-out test set; gains over the baseline `mBART-tgj-base` are statistically significant under Approximate Randomization testing (p < 0.01).
| Parameter | Detail |
| :--- | :--- |
| **Evaluation Set** | 890-sentence held-out test set |
| **Test Method** | Approximate Randomization |
| **Significance Level** | p < 0.01 |
| **Comparison** | Large performance gains over baseline `mBART-tgj-base` |
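For readers unfamiliar with the test: paired Approximate Randomization repeatedly swaps the two systems' outputs per sentence at random and checks how often the shuffled score gap is at least as large as the observed one. A minimal sketch (not the authors' script; `scores_final` and `scores_base` are hypothetical per-sentence metric values for the two systems):
```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired AR test; returns a p-value for the mean score gap."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sentence pair's scores
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one smoothing

# Hypothetical usage:
# p = approximate_randomization(scores_final, scores_base)  # significant if p < 0.01
```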
## Limitations & Ethical Considerations
| Category | Details |
| :--- | :--- |
| **Linguistic Limitations** | The model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| **Cultural Sensitivity** | Sensitive cultural or mythological meanings may require expert review |
| **Data Bias** | Bias inherited from synthetic data cannot be fully eliminated |
| **Deployment Scope** | Not suitable for automated translation of critical documents |
## Citation
Dugi, T., & Sambyo, K. (2025).
*A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts.*
National Institute of Technology Arunachal Pradesh.
## License
This model is released under the Apache-2.0 license, as declared in the metadata above.
(Data licensing depends on source text availability.)
## Contact
For collaboration, feedback, or issues:
*tungondugi@gmail.com*