---
license: apache-2.0
language:
- tgj
- en
datasets:
- repleeka/Tagin-KIFS-Corpus-v1.0
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
tags:
- mbart
- nmt
- lowres
- tagin
---
|
|
# mBART-tgj-final |
|
|
`mBART-tgj-final` is a fine-tuned sequence-to-sequence model based on `mbart-large-50-many-to-many-mmt` for Tagin → English neural machine translation (NMT).
|
|
The model was trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered with a Knowledge-Integrated Filtering System (KIFS), which yields substantial improvements in translation quality for an extremely low-resource language.
|
|
|
|
|
## Developer |
|
|
- Name: Tungon Dugi |
|
|
- Institution: National Institute of Technology Arunachal Pradesh |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("repleeka/mBART-tgj-final")

# Tagin -> English: the source language is Tagin.
# "tgj_IN" is the custom language code added for Tagin during fine-tuning;
# it is not part of the standard mBART-50 code set.
tokenizer.src_lang = "tgj_IN"

text = "..."  # replace with a Tagin input sentence
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to generate English.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)

print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
|
|
|
|
|
## Intended Use |
|
|
- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments within the Tani language family
- Domain adaptation and linguistic resource development
|
|
|
|
|
## Not Intended For

- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
| Component | Value |
| :--- | :--- |
| **Base Model** | mBART-50 Large |
| **Layers** | 12-layer encoder & decoder |
| **Hidden size** | 1024 |
| **FFN dimension** | 4096 |
| **Attention heads** | 16 |
| **Vocabulary size** | 250,055 |
| **Tokenizer** | MBart50Tokenizer |
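As a rough sanity check, the dimensions in the table can be turned into an approximate parameter count. This is a back-of-the-envelope sketch that ignores positional embeddings and layer norms and assumes tied input/output embeddings, so it slightly undercounts the real model:

```python
# Approximate parameter count from the architecture table above.
V, d, ffn, enc_layers, dec_layers = 250_055, 1024, 4096, 12, 12

embeddings = V * d                        # shared token embedding matrix
attn = 4 * (d * d + d)                    # Q, K, V, and output projections (+ biases)
ffn_block = d * ffn + ffn + ffn * d + d   # two FFN projections (+ biases)
encoder = enc_layers * (attn + ffn_block)        # self-attention + FFN per layer
decoder = dec_layers * (2 * attn + ffn_block)    # self- and cross-attention + FFN

total = embeddings + encoder + decoder
print(f"~{total / 1e6:.0f}M parameters")  # ~609M, in line with mBART-50 Large
```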
|
|
|
|
|
|
|
|
## Evaluation Results |
|
|
| Metric | Score |
| :--- | :--- |
| **BLEU** ↑ | 40.27 |
| **ChrF** ↑ | 59.38 |
| **METEOR** ↑ | 46.29 |
| **TER** ↓ | 44.42 |
| **Loss** ↓ | 0.0503 |

(↑ higher is better; ↓ lower is better)
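For intuition about the ChrF metric reported above, here is a simplified character n-gram F-score sketch. It is not the exact scorer used for these results; real evaluations should use `sacrebleu`, whose chrF implementation handles whitespace and edge cases differently:

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram precision/recall, n=1..max_n."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference scores 100; fully disjoint strings score 0.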
|
|
|
|
|
### Statistical Significance Test Results |
|
|
Evaluation was performed on an 890-sentence held-out test set with Approximate Randomization significance testing (p < 0.01), showing large performance gains over the baseline `mBART-tgj-base`.
|
|
|
|
|
| Parameter | Detail |
| :--- | :--- |
| **Evaluation Set** | 890-sentence held-out test set |
| **Test Method** | Approximate Randomization |
| **Significance Level** | p < 0.01 |
| **Comparison** | Large performance gains over baseline `mBART-tgj-base` |
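The Approximate Randomization test can be sketched as follows. This is a minimal re-implementation operating on sentence-level metric scores; the exact setup used for the results above may differ:

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=42):
    """Two-sided p-value for the difference in mean sentence-level
    scores between system A and system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(scores_a) - mean(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        # Randomly swap each aligned pair of scores between the systems.
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)
```

A p-value below 0.01 on the two systems' sentence-level scores would support a significance claim at that level.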
|
|
|
|
|
## Limitations & Ethical Considerations |
|
|
| Category | Details |
| :--- | :--- |
| **Linguistic Limitations** | Model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| **Cultural Sensitivity** | Sensitive cultural or mythological meanings may require expert review |
| **Data Bias** | Bias inherited from synthetic data cannot be fully eliminated |
| **Deployment Scope** | Not suitable for automated translation of critical documents |
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
Dugi, T., & Sambyo, K. (2025). |
|
|
*A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts.* |
|
|
National Institute of Technology Arunachal Pradesh. |
|
|
|
|
|
## License |
|
|
The model is released under the Apache-2.0 license (as stated in the metadata above). Data licensing depends on the availability terms of the source texts.
|
|
|
|
|
## Contact |
|
|
For collaboration, feedback, or issues: |
|
|
*tungondugi@gmail.com* |