|
|
---
language:
- eng
- tir
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb
- opus
metrics:
- bleu
---
|
|
|
|
|
# English–Tigrinya Machine Translation & Tokenizer |
|
|
|
|
|
### 📌 Conference |
|
|
Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)** |
|
|
📍 25–28 November 2025 | Vienna, Austria |
|
|
|
|
|
**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks* |
|
|
|
|
|
--- |
|
|
|
|
|
## 📝 Model Summary |
|
|
|
|
|
This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**. |
|
|
The model is trained on the NLLB English–Tigrinya parallel corpus and evaluated on OPUS parallel data, with BLEU as the primary metric.
|
|
|
|
|
- **Languages:** English (eng), Tigrinya (tir)
|
|
- **Tokenizer:** SentencePiece, customized for Geez-script representation |
|
|
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation |
|
|
- **License:** MIT |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔍 Model Details |
|
|
|
|
|
### Tokenizer |
|
|
- **Type**: SentencePiece-based subword tokenizer |
|
|
- **Purpose**: Handles Geez-script-specific tokenization for Tigrinya (see the sketch after this list)
|
|
- **Training Data**: NLLB English–Tigrinya subset |
|
|
- **Evaluation Data**: OPUS parallel corpus |
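
To see how the tokenizer segments Geez script, it can be loaded and inspected on its own. This is a minimal sketch, assuming the tokenizer ships with the checkpoint named in the Usage section below; the Tigrinya sample sentence is ours, not from the training data.

```python
from transformers import MarianTokenizer

# Checkpoint name taken from the Usage section of this card.
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Inspect how a Geez-script sentence is split into subword pieces.
# "ሰላም ዓለም" is Tigrinya for "Hello, world" (illustrative sentence).
pieces = tokenizer.tokenize("ሰላም ዓለም")
print(pieces)

# Round-trip through token ids to confirm the vocabulary covers the script.
ids = tokenizer("ሰላም ዓለም")["input_ids"]
print(tokenizer.decode(ids, skip_special_tokens=True))
```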
|
|
|
|
|
### Translation Model |
|
|
- **Base Model**: MarianMT |
|
|
- **Frameworks**: Hugging Face Transformers, PyTorch (a quick `pipeline` check is sketched after this list)
|
|
- **Task**: Bidirectional English ↔ Tigrinya MT |
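
As a quick smoke test, the checkpoint can also be driven through the high-level Transformers `pipeline` API. A sketch, assuming the checkpoint name from the Usage section below:

```python
from transformers import pipeline

# Load the translation pipeline from the checkpoint named in this card.
translator = pipeline("translation", model="Hailay/MachineT_TigEng")

# Returns a list of dicts with a "translation_text" field.
print(translator("Good morning", max_length=128))
```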
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Training Details |
|
|
|
|
|
- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya) |
|
|
- **Testing Dataset**: OPUS Parallel Corpus |
|
|
- **Epochs**: 3 |
|
|
- **Batch Size**: 8 |
|
|
- **Max Sequence Length**: 128 tokens |
|
|
- **Learning Rate**: `1.44e-07` with decay (these hyperparameters are mirrored in the sketch below)
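
For reproducibility, the hyperparameters above can be expressed as Hugging Face `Seq2SeqTrainingArguments`. This is a minimal sketch, not the authors' exact configuration: the output directory is hypothetical, and "with decay" is interpreted here as a linear schedule.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tir",   # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=1.44e-7,
    lr_scheduler_type="linear",      # "with decay" assumed to mean linear decay
    predict_with_generate=True,
)

# Max sequence length is enforced at tokenization time, e.g.:
# tokenizer(batch, max_length=128, truncation=True, padding="max_length")
```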
|
|
|
|
|
**Training Loss** |
|
|
- Epoch 1: 0.443 |
|
|
- Epoch 2: 0.4077 |
|
|
- Epoch 3: 0.4379 |
|
|
- Final Loss: 0.4756 |
|
|
|
|
|
**Gradient Norms** |
|
|
- Epoch 1: 1.14 |
|
|
- Epoch 2: 1.11 |
|
|
- Epoch 3: 1.06 |
|
|
|
|
|
**Performance** |
|
|
- Training Time: ~12 hours (43,376.7s) |
|
|
- Speed: 96.7 samples/sec | 12.08 steps/sec |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Evaluation |
|
|
|
|
|
- **Metric**: BLEU score |
|
|
- **Evaluation Dataset**: OPUS parallel English–Tigrinya (a scoring sketch follows below)
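
Corpus-level BLEU can be computed with sacreBLEU. The toy example below reuses the sentence from the Usage section as both hypothesis and reference, purely for illustration; the paper's actual evaluation pipeline may differ.

```python
import sacrebleu

# One hypothesis and one reference stream (exact match scores 100).
hypotheses = ["We must obey the Lord and leave them alone"]
references = [["We must obey the Lord and leave them alone"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```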
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Usage |
|
|
|
|
|
This model can be directly used for **English → Tigrinya** and **Tigrinya → English** translation. |
|
|
|
|
|
### Example (Python) |
|
|
|
|
|
```python |
|
|
from transformers import MarianMTModel, MarianTokenizer |
|
|
|
|
|
# Load the model and tokenizer |
|
|
model_name = "Hailay/MachineT_TigEng" |
|
|
model = MarianMTModel.from_pretrained(model_name) |
|
|
tokenizer = MarianTokenizer.from_pretrained(model_name) |
|
|
|
|
|
# Translate English → Tigrinya |
|
|
english_text = "We must obey the Lord and leave them alone" |
|
|
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True) |
|
|
translated = model.generate(**inputs) |
|
|
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True) |
|
|
|
|
|
print("Translated text:", translated_text)
```
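
The summary above describes the model as bidirectional; under that assumption, Tigrinya → English works the same way by feeding Tigrinya input. A self-contained sketch, with a sample sentence of ours:

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate Tigrinya → English ("ሰላም ዓለም" means "Hello, world").
tigrinya_text = "ሰላም ዓለም"
inputs = tokenizer(tigrinya_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print("Translated text:", tokenizer.decode(translated[0], skip_special_tokens=True))
```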
|
|
|
|
|
|
|
|
|
|
|
## 📌 Citation
|
|
|
|
|
If you use this model or tokenizer in your work, please cite: |
|
|
|
|
|
```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Kidu, Hailay and others},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  address   = {Vienna, Austria}
}
```
|
|
|