---
language:
- eng # English
- tir # Tigrinya
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb # NLLB training dataset
- opus # OPUS parallel data for testing
metrics:
- bleu
---
# English–Tigrinya Machine Translation & Tokenizer
### 📌 Conference
Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
📍 25–28 November 2025 | Vienna, Austria
**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*
---
## 📝 Model Summary
This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**.
It leverages the NLLB dataset for training and OPUS parallel corpora for testing and evaluation, with BLEU used as the primary metric.
- **Languages:** English (eng), Tigrinya (tir)
- **Tokenizer:** SentencePiece, customized for Geez-script representation
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
- **License:** MIT
---
## 🔍 Model Details
### Tokenizer
- **Type**: SentencePiece-based subword tokenizer
- **Purpose**: Handles Geez-script specific tokenization for Tigrinya
- **Training Data**: NLLB English–Tigrinya subset
- **Evaluation Data**: OPUS parallel corpus
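As a quick sanity check, the tokenizer can be loaded and inspected directly. A minimal sketch, assuming the SentencePiece vocabulary ships with this repository as the `MarianTokenizer` (the Geez-script sample sentence is illustrative):

```python
from transformers import MarianTokenizer

# Load the SentencePiece-based tokenizer bundled with the repository
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Tokenize a Geez-script sentence (illustrative sample)
tigrinya_text = "ሰላም ዓለም"
print(tokenizer.tokenize(tigrinya_text))      # subword pieces
print(tokenizer(tigrinya_text)["input_ids"])  # corresponding vocabulary ids
```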
### Translation Model
- **Base Model**: MarianMT
- **Frameworks**: Hugging Face Transformers, PyTorch
- **Task**: Bidirectional English ↔ Tigrinya MT
---
## ⚙️ Training Details
- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Testing Dataset**: OPUS Parallel Corpus
- **Epochs**: 3
- **Batch Size**: 8
- **Max Sequence Length**: 128 tokens
- **Learning Rate**: `1.44e-07` with decay
**Training Loss**
- Epoch 1: 0.443
- Epoch 2: 0.4077
- Epoch 3: 0.4379
- Final Loss: 0.4756
**Gradient Norms**
- Epoch 1: 1.14
- Epoch 2: 1.11
- Epoch 3: 1.06
**Performance**
- Training Time: ~12 hours (43,376.7s)
- Speed: 96.7 samples/sec | 12.08 steps/sec
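For orientation, here is a minimal sketch of how these hyperparameters could map onto a Hugging Face `Seq2SeqTrainer` run. The actual training script is not part of this card, so the checkpoint name and the two-sentence toy dataset below are placeholders, not the NLLB pipeline used in the paper:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder: substitute the MarianMT base checkpoint used for fine-tuning
checkpoint = "Hailay/MachineT_TigEng"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

# Toy stand-in for the NLLB English–Tigrinya subset (illustrative only)
pairs = Dataset.from_dict({
    "en": ["Hello.", "Thank you."],
    "ti": ["ሰላም።", "የቐንየለይ።"],
})

def preprocess(batch):
    # Truncate both sides to the reported 128-token maximum
    model_inputs = tokenizer(batch["en"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["ti"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["en", "ti"])

args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tir",
    num_train_epochs=3,             # epochs reported above
    per_device_train_batch_size=8,  # batch size 8
    learning_rate=1.44e-7,          # decays under the default linear schedule
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```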
---
## 📊 Evaluation
- **Metric**: BLEU score
- **Evaluation Dataset**: OPUS parallel English–Tigrinya
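Scoring follows the standard sacreBLEU recipe. A minimal sketch, with placeholder hypotheses and references standing in for the OPUS test set:

```python
import sacrebleu

# Placeholder system outputs and references (one reference per sentence)
hypotheses = ["model output for sentence one", "model output for sentence two"]
references = [["reference for sentence one", "reference for sentence two"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```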
---
## 🚀 Usage
The model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation.
### Example (Python)
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer
model_name = "Hailay/MachineT_TigEng"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("Translated text:", translated_text)
```
---
## 📌 Citation
If you use this model or tokenizer in your work, please cite:
```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English--Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  address   = {Vienna, Austria}
}
```