WellDunDun committed
Commit aea525f · verified · 1 Parent(s): 9c42243

Update README.md

Files changed (1)
  1. README.md +126 -49
README.md CHANGED
@@ -1,84 +1,161 @@
  ---
  library_name: transformers
  license: apache-2.0
  base_model: Helsinki-NLP/opus-mt-en-mul
  tags:
  - translation
  - en-wal
  - wolaytta
  - marian
  - opus-mt
  - generated_from_trainer
  model-index:
  - name: opus-mt-en-wal
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # opus-mt-en-wal

- This model is a fine-tuned version of [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3485

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 16
- - eval_batch_size: 16
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 3
- - mixed_precision_training: Native AMP

- ### Training results

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:-----:|:---------------:|
- | 0.6944 | 0.1396 | 1000 | 0.6297 |
- | 0.5968 | 0.2793 | 2000 | 0.5214 |
- | 0.5329 | 0.4189 | 3000 | 0.4742 |
- | 0.5116 | 0.5585 | 4000 | 0.4459 |
- | 0.4747 | 0.6981 | 5000 | 0.4255 |
- | 0.4483 | 0.8378 | 6000 | 0.4120 |
- | 0.4501 | 0.9774 | 7000 | 0.4021 |
- | 0.4275 | 1.1170 | 8000 | 0.3899 |
- | 0.4174 | 1.2566 | 9000 | 0.3833 |
- | 0.406 | 1.3963 | 10000 | 0.3768 |
- | 0.4145 | 1.5359 | 11000 | 0.3727 |
- | 0.3968 | 1.6755 | 12000 | 0.3675 |
- | 0.393 | 1.8151 | 13000 | 0.3635 |
- | 0.4027 | 1.9548 | 14000 | 0.3595 |
- | 0.3778 | 2.0944 | 15000 | 0.3573 |
- | 0.3732 | 2.2340 | 16000 | 0.3556 |
- | 0.3695 | 2.3736 | 17000 | 0.3535 |
- | 0.3611 | 2.5133 | 18000 | 0.3518 |
- | 0.3605 | 2.6529 | 19000 | 0.3504 |
- | 0.3639 | 2.7925 | 20000 | 0.3491 |
- | 0.368 | 2.9321 | 21000 | 0.3485 |
-
-
- ### Framework versions

  - Transformers 4.57.3
- - Pytorch 2.9.0+cu126
  - Datasets 4.0.0
  - Tokenizers 0.22.1

  ---
  library_name: transformers
  license: apache-2.0
+ language:
+ - en
+ - wal
  base_model: Helsinki-NLP/opus-mt-en-mul
  tags:
  - translation
  - en-wal
  - wolaytta
+ - ethiopian-languages
+ - low-resource
  - marian
  - opus-mt
  - generated_from_trainer
+ datasets:
+ - michsethowusu/english-wolaytta_sentence-pairs_mt560
+ pipeline_tag: translation
  model-index:
  - name: opus-mt-en-wal
    results: []
  ---

+ # English to Wolaytta Translation Model

+ A machine translation model for translating **English → Wolaytta** (an Ethiopian language spoken by 2-7 million people).

+ This is the first publicly available English-to-Wolaytta neural machine translation model.

+ ## Model Details

+ | Property | Value |
+ |----------|-------|
+ | **Base Model** | [Helsinki-NLP/opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) |
+ | **Architecture** | MarianMT (Transformer) |
+ | **Parameters** | 77M |
+ | **Training Data** | 120,608 sentence pairs |
+ | **Final Validation Loss** | 0.3485 |
+ | **License** | Apache 2.0 |
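
+ The parameter count in the table can be verified straight from the checkpoint; a quick sanity check (assuming only that the repo loads as a standard MarianMT checkpoint):

+ ```python
+ from transformers import MarianMTModel

+ # Load the model and count all parameters in the checkpoint.
+ model = MarianMTModel.from_pretrained("WellDunDun/opus-mt-en-wal")
+ print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
+ ```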

+ ## Usage

+ ```python
+ from transformers import MarianMTModel, MarianTokenizer

+ model_name = "WellDunDun/opus-mt-en-wal"
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+ model = MarianMTModel.from_pretrained(model_name)

+ text = "Hello, how are you?"
+ inputs = tokenizer(text, return_tensors="pt", padding=True)
+ outputs = model.generate(**inputs, max_length=128, num_beams=4)
+ translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(translation)  # Output: "Halo, neeni waanidee?"
+ ```
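
+ For translating many sentences at once, batching the inputs is faster than looping one sentence at a time. A minimal sketch (the `translate` helper and the batch size of 32 are illustrative choices, not part of the original example):

+ ```python
+ import torch
+ from transformers import MarianMTModel, MarianTokenizer

+ model_name = "WellDunDun/opus-mt-en-wal"
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+ model = MarianMTModel.from_pretrained(model_name)

+ # Use a GPU when one is available; the model also runs on CPU.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)

+ def translate(sentences, batch_size=32):
+     translations = []
+     for i in range(0, len(sentences), batch_size):
+         batch = sentences[i:i + batch_size]
+         inputs = tokenizer(batch, return_tensors="pt",
+                            padding=True, truncation=True).to(device)
+         outputs = model.generate(**inputs, max_length=128, num_beams=4)
+         translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+     return translations

+ print(translate(["Thank you very much", "What is your name?"]))
+ ```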
 
+ ## Example Translations

+ | English | Wolaytta |
+ |---------|----------|
+ | Hello, how are you? | Halo, neeni waanidee? |
+ | Thank you very much | Keehippe galatays |
+ | What is your name? | Ne sunttay aybee? |
 
+ ## Training Data

+ This model was fine-tuned on the [michsethowusu/english-wolaytta_sentence-pairs_mt560](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) dataset, which contains 120,608 English-Wolaytta parallel sentences derived from [OPUS MT560](https://opus.nlpl.eu/MT560).

+ The training data primarily comes from:
+ - Bible translations
+ - JW.org publications
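
+ The corpus can be inspected with the `datasets` library. A minimal sketch (the split and column names are whatever the dataset card defines; the prints below simply reveal them):

+ ```python
+ from datasets import load_dataset

+ ds = load_dataset("michsethowusu/english-wolaytta_sentence-pairs_mt560")
+ print(ds)              # available splits and column names
+ print(ds["train"][0])  # one English-Wolaytta pair, assuming a default "train" split
+ ```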

+ ## Intended Uses

+ - Communication with Wolaytta speakers
+ - Language learning and education
+ - Research on low-resource language translation
+ - Building applications for the Wolaytta-speaking community

+ ## Limitations

+ - **Domain bias**: Heavy religious/biblical content in the training data
+ - **Casual speech**: May struggle with informal expressions or slang
+ - **Modern vocabulary**: Limited coverage of technology and other contemporary topics
+ - **Low-resource language**: Wolaytta has limited digital resources; verify important translations with native speakers

+ ## Training Procedure

+ ### Training Hyperparameters

+ - **Learning rate**: 2e-05
+ - **Train batch size**: 16
+ - **Eval batch size**: 16
+ - **Seed**: 42
+ - **Optimizer**: AdamW (fused) with betas=(0.9,0.999) and epsilon=1e-08
+ - **LR scheduler**: Linear
+ - **Epochs**: 3
+ - **Mixed precision**: Native AMP
+ - **Hardware**: Google Colab (T4 GPU)
+ - **Training time**: ~3 hours
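
+ These settings map one-to-one onto `Seq2SeqTrainingArguments`. A minimal sketch of that mapping (the `output_dir` name and the 1000-step eval cadence, read off the results table below, are assumptions; dataset preprocessing and the trainer loop are not shown):

+ ```python
+ from transformers import Seq2SeqTrainingArguments

+ args = Seq2SeqTrainingArguments(
+     output_dir="opus-mt-en-wal",   # assumed name, not confirmed by the card
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     seed=42,
+     optim="adamw_torch_fused",     # AdamW (fused); betas/epsilon as listed above are the defaults
+     lr_scheduler_type="linear",
+     num_train_epochs=3,
+     fp16=True,                     # Native AMP
+     eval_strategy="steps",
+     eval_steps=1000,               # matches the 1000-step cadence in the results table
+ )
+ ```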

+ ### Training Results

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:-----:|:---------------:|
+ | 0.6944 | 0.14 | 1000 | 0.6297 |
+ | 0.5968 | 0.28 | 2000 | 0.5214 |
+ | 0.5329 | 0.42 | 3000 | 0.4742 |
+ | 0.5116 | 0.56 | 4000 | 0.4459 |
+ | 0.4747 | 0.70 | 5000 | 0.4255 |
+ | 0.4483 | 0.84 | 6000 | 0.4120 |
+ | 0.4501 | 0.98 | 7000 | 0.4021 |
+ | 0.4275 | 1.12 | 8000 | 0.3899 |
+ | 0.4174 | 1.26 | 9000 | 0.3833 |
+ | 0.4060 | 1.40 | 10000 | 0.3768 |
+ | 0.4145 | 1.54 | 11000 | 0.3727 |
+ | 0.3968 | 1.68 | 12000 | 0.3675 |
+ | 0.3930 | 1.82 | 13000 | 0.3635 |
+ | 0.4027 | 1.95 | 14000 | 0.3595 |
+ | 0.3778 | 2.09 | 15000 | 0.3573 |
+ | 0.3732 | 2.23 | 16000 | 0.3556 |
+ | 0.3695 | 2.37 | 17000 | 0.3535 |
+ | 0.3611 | 2.51 | 18000 | 0.3518 |
+ | 0.3605 | 2.65 | 19000 | 0.3504 |
+ | 0.3639 | 2.79 | 20000 | 0.3491 |
+ | 0.3680 | 2.93 | 21000 | 0.3485 |

+ ### Framework Versions

  - Transformers 4.57.3
+ - PyTorch 2.9.0+cu126
  - Datasets 4.0.0
  - Tokenizers 0.22.1

+ ## Related Models

+ - [Helsinki-NLP/opus-mt-wal-en](https://huggingface.co/Helsinki-NLP/opus-mt-wal-en) - Wolaytta → English (reverse direction)

+ ## About Wolaytta

+ Wolaytta (also spelled Wolayta, Wolaitta, Welayta) is a North Omotic language spoken in the Wolaita Zone of Ethiopia's Southern Nations, Nationalities, and Peoples' Region by approximately 2-7 million people.

+ ## Citation

+ ```bibtex
+ @misc{opus_mt_en_wal_2026,
+   title={English to Wolaytta Translation Model},
+   author={WellDunDun},
+   year={2026},
+   url={https://huggingface.co/WellDunDun/opus-mt-en-wal},
+   note={Fine-tuned on michsethowusu/english-wolaytta_sentence-pairs_mt560 dataset, derived from OPUS MT560}
+ }
+ ```

+ ## Acknowledgments

+ - [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) for the base multilingual model
+ - [michsethowusu](https://huggingface.co/datasets/michsethowusu/english-wolaytta_sentence-pairs_mt560) for curating the parallel corpus
+ - [OPUS MT560](https://opus.nlpl.eu/MT560) for the original training data
+ - The Wolaytta language community