Hailay committed
Commit 4e21a45 · verified · 1 Parent(s): 0cbc581

Update README.md

Files changed (1):
1. README.md +81 -43
README.md CHANGED
@@ -5,67 +5,91 @@ language:
  tags:
  - tokenizer
  - machine-translation
  license: mit
  datasets:
- - nllb # NLLB training dataset
- - opus # OPUS parallel data for testing
  metrics:
  - bleu
  ---

- # English-Tigrinya Tokenizer
-
- This tokenizer is trained for English to Tigrinya machine translation tasks using the NLLB dataset for training and OPUS parallel data for testing.
-
- ## Model Details
-
- - **Languages:** English, Tigrinya
- - **Model type:** Tokenizer using SentencePiece
- - **License:** MIT License
- - **Training dataset:** NLLB
- - **Testing dataset:** OPUS parallel data
- - **Evaluation metric:** BLEU score
-
- ## Machine Translation Model: English ↔ Tigrinya
-
- This model is a fine-tuned machine translation model trained to translate between English and Tigrinya. It was trained on the parallel corpus of English and Tigrinya sentences.
-
- ### Model Overview
-
- - **Model Type**: MarianMT (Multilingual Transformer Model)
- - **Languages**: English ↔ Tigrinya
- - **Model Architecture**: MarianMT, fine-tuned for English ↔ Tigrinya translation
- - **Training Framework**: Hugging Face Transformers, PyTorch
-
- ### Training Details
-
  - **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- - **Training Epochs**: 3
  - **Batch Size**: 8
- - **Max Length**: 128 tokens
- - **Learning Rate**: Starts from `1.44e-07` and decays during training
- - **Training Loss**:
-   - Final training loss: 0.4756
-   - Per-epoch loss progress:
-     - Epoch 1: 0.443
-     - Epoch 2: 0.4077
-     - Epoch 3: 0.4379
- - **Gradient Norms**:
-   - Epoch 1: 1.14
-   - Epoch 2: 1.11
-   - Epoch 3: 1.06
- - **Training Time**: 43,376.7 seconds (~12 hours)
- - **Training Speed**:
-   - Training samples per second: 96.7
-   - Training steps per second: 12.08
-
- ## Model Usage
-
- This model can be used for translating English sentences to Tigrinya and vice versa.
-
- ### Example Usage (Python)

  ```python
  from transformers import MarianMTModel, MarianTokenizer
@@ -75,10 +99,24 @@ model_name = "Hailay/MachineT_TigEng"
  model = MarianMTModel.from_pretrained(model_name)
  tokenizer = MarianTokenizer.from_pretrained(model_name)

- # Translate an English sentence to Tigrinya
  english_text = "We must obey the Lord and leave them alone"
- encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
- translated = model.generate(**encoded_input)
  translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

- print(f"Translated text: {translated_text}")

  tags:
  - tokenizer
  - machine-translation
+ - low-resource
+ - geez-script
  license: mit
  datasets:
+ - nllb # NLLB training dataset
+ - opus # OPUS parallel data for testing
  metrics:
  - bleu
  ---

+ # English–Tigrinya Machine Translation & Tokenizer

+ ### 📌 Conference
+ Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
+ 📍 25–28 November 2025 | Vienna, Austria

+ **Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*

+ ---
+
+ ## 📝 Model Summary
+
+ This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English–Tigrinya machine translation**.
+ It leverages the NLLB dataset for training and OPUS parallel corpora for testing and evaluation, with BLEU as the primary metric.
+
+ - **Languages:** English (eng), Tigrinya (tir)
+ - **Tokenizer:** SentencePiece, customized for Geez-script representation
+ - **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
+ - **License:** MIT
+
+ ---

+ ## 🔍 Model Details

+ ### Tokenizer
+ - **Type**: SentencePiece-based subword tokenizer
+ - **Purpose**: Handles Geez-script-specific tokenization for Tigrinya
+ - **Training Data**: NLLB English–Tigrinya subset
+ - **Evaluation Data**: OPUS parallel corpus
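
As a rough illustration of what the Geez-script handling means in practice, the snippet below loads the tokenizer and prints the subword pieces it produces for a short Tigrinya string. This is a minimal sketch: the repo id `Hailay/MachineT_TigEng` is taken from the usage example further down, and the sample text is an arbitrary placeholder.

```python
from transformers import MarianTokenizer

# Load the SentencePiece-based tokenizer shipped with the checkpoint
# (repo id taken from the usage example below).
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Inspect how a short Geez-script string is split into subword pieces.
sample = "ሰላም ዓለም"  # arbitrary placeholder text
print(tokenizer.tokenize(sample))
print(tokenizer(sample)["input_ids"])
```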

+ ### Translation Model
+ - **Base Model**: MarianMT
+ - **Frameworks**: Hugging Face Transformers, PyTorch
+ - **Task**: Bidirectional English ↔ Tigrinya MT

+ ---

+ ## ⚙️ Training Details

  - **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
+ - **Testing Dataset**: OPUS Parallel Corpus
+ - **Epochs**: 3
  - **Batch Size**: 8
+ - **Max Sequence Length**: 128 tokens
+ - **Learning Rate**: `1.44e-07` with decay

+ **Training Loss**
+ - Epoch 1: 0.443
+ - Epoch 2: 0.4077
+ - Epoch 3: 0.4379
+ - Final Loss: 0.4756

+ **Gradient Norms**
+ - Epoch 1: 1.14
+ - Epoch 2: 1.11
+ - Epoch 3: 1.06

+ **Performance**
+ - Training Time: ~12 hours (43,376.7 s)
+ - Speed: 96.7 samples/sec | 12.08 steps/sec
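
As a hedged reconstruction of how the settings above could be expressed with Hugging Face `Seq2SeqTrainingArguments` — the original training script is not published here, so the output directory, the linear-decay assumption, and the dataset variables `train_ds` and `eval_ds` are placeholders:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hypothetical mapping of the reported hyperparameters; names marked
# "placeholder" are not from the model card.
args = Seq2SeqTrainingArguments(
    output_dir="marianmt-en-ti",      # placeholder path
    num_train_epochs=3,               # reported: 3 epochs
    per_device_train_batch_size=8,    # reported: batch size 8
    learning_rate=1.44e-07,           # reported starting learning rate
    lr_scheduler_type="linear",       # assumption: "with decay" read as linear decay
    predict_with_generate=True,       # generate translations during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,             # the MarianMT model from the usage example
    args=args,
    train_dataset=train_ds,  # placeholder: NLLB pairs tokenized to max length 128
    eval_dataset=eval_ds,    # placeholder: OPUS pairs
)
trainer.train()
```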

+ ---

+ ## 📊 Evaluation
+
+ - **Metric**: BLEU score
+ - **Evaluation Dataset**: OPUS parallel English–Tigrinya
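
The BLEU score can be reproduced with the `sacrebleu` package; the snippet below is a minimal sketch in which the hypothesis and reference strings are placeholders standing in for model outputs on the OPUS test split.

```python
import sacrebleu  # pip install sacrebleu

# Placeholder hypothesis/reference pairs; in practice, translate the
# OPUS test split with the model and compare against its references.
hypotheses = ["ሰላም ዓለም"]
references = [["ሰላም ዓለም"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```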
+
+ ---
+
+ ## 🚀 Usage
+
+ This model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation; a reverse-direction sketch follows the example below.
+
+ ### Example (Python)

  ```python
  from transformers import MarianMTModel, MarianTokenizer

  model_name = "Hailay/MachineT_TigEng"

  model = MarianMTModel.from_pretrained(model_name)
  tokenizer = MarianTokenizer.from_pretrained(model_name)

+ # Translate English → Tigrinya
  english_text = "We must obey the Lord and leave them alone"
+ inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
+ translated = model.generate(**inputs)
  translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

+ print("Translated text:", translated_text)
  ```
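
For the Tigrinya → English direction, here is a minimal sketch reusing the `model` and `tokenizer` objects from the block above; that a single checkpoint handles both directions follows the card's bidirectionality claim and is worth verifying on your own data.

```python
# Tigrinya → English, assuming the same checkpoint is bidirectional
# as described above.
tigrinya_text = "ሰላም ዓለም"  # arbitrary placeholder input
inputs = tokenizer(tigrinya_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print("Translated text:", tokenizer.decode(translated[0], skip_special_tokens=True))
```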
+ ## 📌 Citation
+
+ If you use this model or tokenizer in your work, please cite:
+
+     @inproceedings{hailay2025lowres,
+       title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
+       author    = {Hailay Kidu and collaborators},
+       booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
+       year      = {2025},
+       location  = {Vienna, Austria}
+     }