---
library_name: transformers
license: apache-2.0
language:
- en
- wal
base_model: Helsinki-NLP/opus-mt-en-mul
tags:
- translation
- en-wal
- wolaytta
- ethiopian-languages
- low-resource
- marian
- opus-mt
- generated_from_trainer
datasets:
- michsethowusu/english-wolaytta_sentence-pairs_mt560
pipeline_tag: translation
model-index:
- name: opus-mt-en-wal
  results: []
---

# English to Wolaytta Translation Model

A machine translation model for translating **English → Wolaytta** (an Ethiopian language spoken by 2-7 million people).

This is the first publicly available English-to-Wolaytta neural machine translation model.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) |
| **Architecture** | MarianMT (Transformer) |
| **Parameters** | 77M |
| **Training Data** | 120,608 sentence pairs |
| **Final Validation Loss** | 0.3485 |
| **License** | Apache 2.0 |

## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "WellDunDun/opus-mt-en-wal"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # Output: "Halo, neeni waanidee?"
```
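For quick experiments, the same checkpoint can also be driven through the Transformers `pipeline` API. This is a minimal sketch; the generation settings mirror the `generate()` call above, and the model is downloaded on first use.

```python
from transformers import pipeline

# Loads WellDunDun/opus-mt-en-wal from the Hub (downloads on first use).
translator = pipeline("translation", model="WellDunDun/opus-mt-en-wal")

# num_beams mirrors the generate() call in the example above.
result = translator("Thank you very much", max_length=128, num_beams=4)
print(result[0]["translation_text"])
```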

## Example Translations

| English | Wolaytta |
|---------|----------|
| Hello, how are you? | Halo, neeni waanidee? |
| Thank you very much | Keehippe galatays |
| What is your name? | Ne sunttay aybee? |

## Training Data

This model was fine-tuned on the [michsethowusu/english-wolaytta_sentence-pairs_mt560](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) dataset, which contains 120,608 English-Wolaytta parallel sentences derived from [OPUS MT560](https://opus.nlpl.eu/MT560).

The training data primarily comes from:
- Bible translations
- JW.org publications

## Intended Uses

- Communication with Wolaytta speakers
- Language learning and education
- Research on low-resource language translation
- Building applications for the Wolaytta-speaking community

## Limitations

- **Domain bias**: Heavy religious/biblical content in training data
- **Casual speech**: May struggle with informal expressions or slang
- **Modern vocabulary**: Limited coverage of technology, contemporary topics
- **Low-resource language**: Wolaytta has limited digital resources; verify important translations with native speakers

## Training Procedure

### Training Hyperparameters

- **Learning rate**: 2e-05
- **Train batch size**: 16
- **Eval batch size**: 16
- **Seed**: 42
- **Optimizer**: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
- **LR scheduler**: Linear
- **Epochs**: 3
- **Mixed precision**: Native AMP
- **Hardware**: Google Colab (T4 GPU)
- **Training time**: ~3 hours
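The linear scheduler decays the learning rate from its base value toward zero over the course of training. A minimal sketch of that decay, assuming no warmup steps (the hyperparameters above do not report any) and the 21,000 optimizer steps logged below:

```python
def linear_lr(step: int, total_steps: int = 21000, base_lr: float = 2e-5) -> float:
    """Linearly decay base_lr to 0 over total_steps (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps

print(linear_lr(0))      # 2e-05 at the start of training
print(linear_lr(10500))  # 1e-05 halfway through
print(linear_lr(21000))  # 0.0 at the final logged step
```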

### Training Results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 0.6944 | 0.14 | 1000 | 0.6297 |
| 0.5968 | 0.28 | 2000 | 0.5214 |
| 0.5329 | 0.42 | 3000 | 0.4742 |
| 0.5116 | 0.56 | 4000 | 0.4459 |
| 0.4747 | 0.70 | 5000 | 0.4255 |
| 0.4483 | 0.84 | 6000 | 0.4120 |
| 0.4501 | 0.98 | 7000 | 0.4021 |
| 0.4275 | 1.12 | 8000 | 0.3899 |
| 0.4174 | 1.26 | 9000 | 0.3833 |
| 0.4060 | 1.40 | 10000 | 0.3768 |
| 0.4145 | 1.54 | 11000 | 0.3727 |
| 0.3968 | 1.68 | 12000 | 0.3675 |
| 0.3930 | 1.82 | 13000 | 0.3635 |
| 0.4027 | 1.95 | 14000 | 0.3595 |
| 0.3778 | 2.09 | 15000 | 0.3573 |
| 0.3732 | 2.23 | 16000 | 0.3556 |
| 0.3695 | 2.37 | 17000 | 0.3535 |
| 0.3611 | 2.51 | 18000 | 0.3518 |
| 0.3605 | 2.65 | 19000 | 0.3504 |
| 0.3639 | 2.79 | 20000 | 0.3491 |
| 0.3680 | 2.93 | 21000 | 0.3485 |
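The step and epoch columns are mutually consistent. A rough back-of-the-envelope check, assuming the reported batch size of 16 and that a validation split was held out of the 120,608 pairs:

```python
steps = 21000      # optimizer steps at the last logged row
epoch = 2.93       # epoch reached at that step
batch_size = 16    # reported train batch size

steps_per_epoch = steps / epoch
train_examples = steps_per_epoch * batch_size

print(round(steps_per_epoch))  # roughly 7,167 steps per epoch
print(round(train_examples))   # roughly 114,700 training examples,
                               # consistent with 120,608 pairs minus a validation split
```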

### Framework Versions

- Transformers 4.57.3
- PyTorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1

## Related Models

- [Helsinki-NLP/opus-mt-wal-en](https://huggingface.co/Helsinki-NLP/opus-mt-wal-en) - Wolaytta → English (reverse direction)

## About Wolaytta

Wolaytta (also spelled Wolayta, Wolaitta, Welayta) is a North Omotic language spoken in the Wolaita Zone of Ethiopia's Southern Nations, Nationalities, and Peoples' Region by approximately 2-7 million people.

## Citation

```bibtex
@misc{opus_mt_en_wal_2026,
  title={English to Wolaytta Translation Model},
  author={WellDunDun},
  year={2026},
  url={https://huggingface.co/WellDunDun/opus-mt-en-wal},
  note={Fine-tuned on michsethowusu/english-wolaytta_sentence-pairs_mt560 dataset, derived from OPUS MT560}
}
```

## Acknowledgments

- [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) for the base multilingual model
- [michsethowusu](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) for curating the parallel corpus
- [OPUS MT560](https://opus.nlpl.eu/MT560) for the original training data
- The Wolaytta language community