---
license: apache-2.0
language:
- tgj
- en
datasets:
- repleeka/Tagin-KIFS-Corpus-v1.0
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
tags:
- mbart
- nmt
- lowres
- tagin
---
|
|
# mBART-tgj-final |
|
|
`mBART-tgj-final` is a fine-tuned sequence-to-sequence model based on `mbart-large-50-many-to-many-mmt` for Tagin → English neural machine translation (NMT).
|
|
The model was trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered with a Knowledge-Integrated Filtering System (KIFS), which yields substantial improvements in translation quality for an extremely low-resource language.
|
|
|
|
|
## Developer |
|
|
- Name: Tungon Dugi |
|
|
- Institution: National Institute of Technology Arunachal Pradesh |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("repleeka/mBART-tgj-final")

# Tagin -> English: the source language is Tagin.
# "tgj_IN" is the custom language code added for Tagin during fine-tuning;
# it is not part of the standard mBART-50 code set.
tokenizer.src_lang = "tgj_IN"

text = "..."  # replace with a Tagin input sentence
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to generate English.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)

print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
|
|
|
|
|
## Intended Use |
|
|
- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments within the Tani language family
- Domain adaptation and linguistic resource development
|
|
|
|
|
## Not Intended For

- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
| Component | Value |
| :--- | :--- |
| **Base Model** | mBART-50 Large |
| **Layers** | 12-layer encoder & decoder |
| **Hidden size** | 1024 |
| **FFN dimension** | 4096 |
| **Attention heads** | 16 |
| **Vocabulary size** | 250,055 |
| **Tokenizer** | MBart50Tokenizer |
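As a rough sanity check, the dimensions in the table can be turned into an approximate parameter count. This is a back-of-the-envelope sketch that ignores positional embeddings and layer norms and assumes tied input/output embeddings, so it slightly undercounts the real model:

```python
# Approximate parameter count from the architecture table above.
V, d, ffn, enc_layers, dec_layers = 250_055, 1024, 4096, 12, 12

embeddings = V * d                        # shared token embedding matrix
attn = 4 * (d * d + d)                    # Q, K, V, and output projections (+ biases)
ffn_block = d * ffn + ffn + ffn * d + d   # two FFN projections (+ biases)
encoder = enc_layers * (attn + ffn_block)        # self-attention + FFN per layer
decoder = dec_layers * (2 * attn + ffn_block)    # self- and cross-attention + FFN

total = embeddings + encoder + decoder
print(f"~{total / 1e6:.0f}M parameters")  # ~609M, in line with mBART-50 Large
```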
|
|
|
|
|
|
|
|
## Evaluation Results |
|
|
| Metric | Score |
| :--- | :--- |
| **BLEU** ↑ | 40.27 |
| **ChrF** ↑ | 59.38 |
| **METEOR** ↑ | 46.29 |
| **TER** ↓ | 44.42 |
| **Loss** ↓ | 0.0503 |

(↑ higher is better; ↓ lower is better)
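For intuition about the ChrF metric reported above, here is a simplified character n-gram F-score sketch. It is not the exact scorer used for these results; real evaluations should use `sacrebleu`, whose chrF implementation handles whitespace and edge cases differently:

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram precision/recall, n=1..max_n."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference scores 100; fully disjoint strings score 0.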
|
|
|
|
|
### Statistical Significance Test Results |
|
|
Evaluation was performed on an 890-sentence held-out test set with Approximate Randomization significance testing (p < 0.01), showing large performance gains over the baseline `mBART-tgj-base`.
|
|
|
|
|
| Parameter | Detail |
| :--- | :--- |
| **Evaluation Set** | 890-sentence held-out test set |
| **Test Method** | Approximate Randomization |
| **Significance Level** | p < 0.01 |
| **Comparison** | Large performance gains over baseline `mBART-tgj-base` |
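The Approximate Randomization test can be sketched as follows. This is a minimal re-implementation operating on sentence-level metric scores; the exact setup used for the results above may differ:

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=42):
    """Two-sided p-value for the difference in mean sentence-level
    scores between system A and system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(scores_a) - mean(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        # Randomly swap each aligned pair of scores between the systems.
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)
```

A p-value below 0.01 on the two systems' sentence-level scores would support a significance claim at that level.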
|
|
|
|
|
## Limitations & Ethical Considerations |
|
|
| Category | Details |
| :--- | :--- |
| **Linguistic Limitations** | Model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| **Cultural Sensitivity** | Sensitive cultural or mythological meanings may require expert review |
| **Data Bias** | Bias inherited from synthetic data cannot be fully eliminated |
| **Deployment Scope** | Not suitable for automated translation of critical documents |
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
Dugi, T., & Sambyo, K. (2025). |
|
|
*A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts.* |
|
|
National Institute of Technology Arunachal Pradesh. |
|
|
|
|
|
## License |
|
|
The model is released under the Apache-2.0 license (as stated in the metadata above). Data licensing depends on the availability terms of the source texts.
|
|
|
|
|
## Contact |
|
|
For collaboration, feedback, or issues: |
|
|
*tungondugi@gmail.com* |