---
language:
  - eng     # English
  - tir     # Tigrinya
tags:
  - tokenizer
  - machine-translation
  - low-resource
  - geez-script
license: mit
datasets:
  - nllb        # NLLB training dataset
  - opus        # OPUS parallel data for testing
metrics:
  - bleu
---

# English–Tigrinya Machine Translation & Tokenizer

### 📌 Conference
Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**  
📍 25–28 November 2025 | Vienna, Austria  

**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*  

---

## 📝 Model Summary

This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**.  
It leverages the NLLB dataset for training and OPUS parallel corpora for testing and evaluation, with BLEU used as the primary metric.  

- **Languages:** English (eng), Tigrinya (tir)  
- **Tokenizer:** SentencePiece, customized for Geez-script representation  
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation  
- **License:** MIT  

---

## 🔍 Model Details

### Tokenizer
- **Type**: SentencePiece-based subword tokenizer  
- **Purpose**: Handles Geez-script-specific tokenization for Tigrinya  
- **Training Data**: NLLB English–Tigrinya subset  
- **Evaluation Data**: OPUS parallel corpus  
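
For a quick check of the segmentation, the tokenizer can be loaded on its own. This is a minimal sketch; the sample word is illustrative, and the exact subwords depend on the trained vocabulary:

```python
from transformers import MarianTokenizer

# Load only the tokenizer from the repository
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Inspect how a Geez-script word is split into subwords
print(tokenizer.tokenize("ሰላም"))  # illustrative input ("peace/hello")
```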

### Translation Model
- **Base Model**: MarianMT  
- **Frameworks**: Hugging Face Transformers, PyTorch  
- **Task**: Bidirectional English ↔ Tigrinya MT  

---

## ⚙️ Training Details

- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)  
- **Testing Dataset**: OPUS Parallel Corpus  
- **Epochs**: 3  
- **Batch Size**: 8  
- **Max Sequence Length**: 128 tokens  
- **Learning Rate**: `1.44e-07` with decay  
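
For reference, these settings map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the original training script; `output_dir` and the decay schedule are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate mapping of the reported hyperparameters
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tir",   # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=1.44e-7,           # "with decay" per the card
    lr_scheduler_type="linear",      # assumed decay schedule
    predict_with_generate=True,
)
# The 128-token max sequence length is applied at tokenization time,
# e.g. tokenizer(..., max_length=128, truncation=True)
```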

**Training Loss**  
- Epoch 1: 0.443  
- Epoch 2: 0.4077  
- Epoch 3: 0.4379  
- Final Loss: 0.4756  

**Gradient Norms**  
- Epoch 1: 1.14  
- Epoch 2: 1.11  
- Epoch 3: 1.06  

**Performance**  
- Training Time: ~12 hours (43,376.7s)  
- Speed: 96.7 samples/sec | 12.08 steps/sec  

---

## 📊 Evaluation

- **Metric**: BLEU score  
- **Evaluation Dataset**: OPUS parallel English–Tigrinya  
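
BLEU can be computed with the `evaluate` library's sacreBLEU wrapper. In this sketch, `predictions` and `references` are placeholders for decoded model outputs and OPUS reference translations:

```python
import evaluate  # pip install evaluate sacrebleu

# Score decoded model outputs against reference translations
bleu = evaluate.load("sacrebleu")
predictions = ["hypothesis translation"]   # model outputs (placeholder)
references = [["reference translation"]]   # one reference list per hypothesis
result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```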

---

## 🚀 Usage

The model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation.  

### Example (Python)

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the model and tokenizer
model_name = "Hailay/MachineT_TigEng"  
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print("Translated text:", translated_text)
```
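
The same checkpoint can be run in the opposite direction. This is a minimal sketch; the Tigrinya sentence is illustrative, and output quality depends on how the bidirectional checkpoint was trained:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate Tigrinya → English
tigrinya_text = "ሰላም"  # illustrative input ("peace/hello")
inputs = tokenizer(tigrinya_text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs)
print("Translated text:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```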



## 📌 Citation

If you use this model or tokenizer in your work, please cite:

```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  location  = {Vienna, Austria}
}
```