---
library_name: transformers
license: apache-2.0
language:
- en
- wal
base_model: Helsinki-NLP/opus-mt-en-mul
tags:
- translation
- en-wal
- wolaytta
- ethiopian-languages
- low-resource
- marian
- opus-mt
- generated_from_trainer
datasets:
- michsethowusu/english-wolaytta_sentence-pairs_mt560
pipeline_tag: translation
model-index:
- name: opus-mt-en-wal
results: []
---
# English to Wolaytta Translation Model
A machine translation model for translating **English → Wolaytta** (an Ethiopian language spoken by 2-7 million people).
This is the first publicly available English-to-Wolaytta neural machine translation model.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) |
| **Architecture** | MarianMT (Transformer) |
| **Parameters** | 77M |
| **Training Data** | 120,608 sentence pairs |
| **Final Validation Loss** | 0.3485 |
| **License** | Apache 2.0 |
## Usage
```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "WellDunDun/opus-mt-en-wal"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Beam search (num_beams=4) usually gives better translations than greedy decoding.
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # Output: "Halo, neeni waanidee?"
```
## Example Translations
| English | Wolaytta |
|---------|----------|
| Hello, how are you? | Halo, neeni waanidee? |
| Thank you very much | Keehippe galatays |
| What is your name? | Ne sunttay aybee? |
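For translating several sentences at once, the usage snippet above generalizes to a batched helper. This is a minimal sketch: `translate_batch` and its defaults are illustrative names, and in practice you would pass in the `MarianTokenizer` and `MarianMTModel` loaded as shown in the Usage section.

```python
def translate_batch(texts, tokenizer, model, num_beams=4, max_length=128):
    """Translate a list of English sentences to Wolaytta in one generate() call.

    `tokenizer` and `model` are expected to be the MarianTokenizer and
    MarianMTModel loaded from WellDunDun/opus-mt-en-wal (see Usage above).
    """
    # Padding lets sentences of different lengths share one batch.
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    # batch_decode strips special tokens and returns one string per input.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example call (requires the model to be downloaded first):
# translate_batch(["Hello, how are you?", "Thank you very much"], tokenizer, model)
```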
## Training Data
This model was fine-tuned on the [michsethowusu/english-wolaytta_sentence-pairs_mt560](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) dataset, which contains 120,608 English-Wolaytta parallel sentences derived from [OPUS MT560](https://opus.nlpl.eu/MT560).
The training data primarily comes from:
- Bible translations
- JW.org publications
## Intended Uses
- Communication with Wolaytta speakers
- Language learning and education
- Research on low-resource language translation
- Building applications for the Wolaytta-speaking community
## Limitations
- **Domain bias**: Heavy religious/biblical content in training data
- **Casual speech**: May struggle with informal expressions or slang
- **Modern vocabulary**: Limited coverage of technology, contemporary topics
- **Low-resource language**: Wolaytta has limited digital resources; verify important translations with native speakers
## Training Procedure
### Training Hyperparameters
- **Learning rate**: 2e-05
- **Train batch size**: 16
- **Eval batch size**: 16
- **Seed**: 42
- **Optimizer**: AdamW (fused) with betas=(0.9,0.999) and epsilon=1e-08
- **LR scheduler**: Linear
- **Epochs**: 3
- **Mixed precision**: Native AMP
- **Hardware**: Google Colab (T4 GPU)
- **Training time**: ~3 hours
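The hyperparameters above roughly correspond to the following `Seq2SeqTrainingArguments` configuration. This is a hypothetical reconstruction, not the actual training script: `output_dir` and the eval/logging cadence are assumptions (the cadence is inferred from the 1,000-step rows in the results table below).

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the listed hyperparameters; output_dir,
# eval_strategy, and eval_steps are assumptions, not taken from the card.
args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-wal",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch_fused",     # AdamW (fused), default betas/epsilon
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,                     # native AMP on the T4
    eval_strategy="steps",
    eval_steps=1000,
    predict_with_generate=True,
)
```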
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 0.6944 | 0.14 | 1000 | 0.6297 |
| 0.5968 | 0.28 | 2000 | 0.5214 |
| 0.5329 | 0.42 | 3000 | 0.4742 |
| 0.5116 | 0.56 | 4000 | 0.4459 |
| 0.4747 | 0.70 | 5000 | 0.4255 |
| 0.4483 | 0.84 | 6000 | 0.4120 |
| 0.4501 | 0.98 | 7000 | 0.4021 |
| 0.4275 | 1.12 | 8000 | 0.3899 |
| 0.4174 | 1.26 | 9000 | 0.3833 |
| 0.4060 | 1.40 | 10000 | 0.3768 |
| 0.4145 | 1.54 | 11000 | 0.3727 |
| 0.3968 | 1.68 | 12000 | 0.3675 |
| 0.3930 | 1.82 | 13000 | 0.3635 |
| 0.4027 | 1.95 | 14000 | 0.3595 |
| 0.3778 | 2.09 | 15000 | 0.3573 |
| 0.3732 | 2.23 | 16000 | 0.3556 |
| 0.3695 | 2.37 | 17000 | 0.3535 |
| 0.3611 | 2.51 | 18000 | 0.3518 |
| 0.3605 | 2.65 | 19000 | 0.3504 |
| 0.3639 | 2.79 | 20000 | 0.3491 |
| 0.3680 | 2.93 | 21000 | 0.3485 |
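A quick arithmetic sanity check on this schedule (assuming the Epoch column is step divided by steps per epoch, with no gradient accumulation): the final row implies roughly 7,167 optimizer steps per epoch, i.e. about 114.7k training examples at batch size 16. That is somewhat below the 120,608 total pairs, consistent with a small held-out validation split, though the exact split is not stated in this card.

```python
# Back-of-the-envelope check of the schedule, using the last results row.
# Assumption: logged Epoch = step / steps_per_epoch, no gradient accumulation.
total_pairs = 120_608                 # full dataset size from the card
batch_size = 16                       # per-device train batch size
last_step, last_epoch = 21_000, 2.93  # final row of the results table

steps_per_epoch = last_step / last_epoch               # about 7,167
implied_train_examples = steps_per_epoch * batch_size  # about 114,700

print(round(steps_per_epoch), round(implied_train_examples))
```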
### Framework Versions
- Transformers 4.57.3
- PyTorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
## Related Models
- [Helsinki-NLP/opus-mt-wal-en](https://huggingface.co/Helsinki-NLP/opus-mt-wal-en) - Wolaytta → English (reverse direction)
## About Wolaytta
Wolaytta (also spelled Wolayta, Wolaitta, Welayta) is a North Omotic language spoken in the Wolaita Zone of Ethiopia's Southern Nations, Nationalities, and Peoples' Region by approximately 2-7 million people.
## Citation
```bibtex
@misc{opus_mt_en_wal_2026,
  title  = {English to Wolaytta Translation Model},
  author = {WellDunDun},
  year   = {2026},
  url    = {https://huggingface.co/WellDunDun/opus-mt-en-wal},
  note   = {Fine-tuned on michsethowusu/english-wolaytta_sentence-pairs_mt560 dataset, derived from OPUS MT560}
}
```
## Acknowledgments
- [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) for the base multilingual model
- [michsethowusu](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) for curating the parallel corpus
- [OPUS MT560](https://opus.nlpl.eu/MT560) for the original training data
- The Wolaytta language community