---
language:
- eng # English
- tig # Tigrinya
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb # NLLB training dataset
- opus # OPUS parallel data for testing
metrics:
- bleu
---
# English–Tigrinya Machine Translation & Tokenizer
### 📌 Conference
Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
📍 25–28 November 2025 | Vienna, Austria
**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*
---
## 📝 Model Summary
This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**.
It is trained on the NLLB dataset and evaluated on OPUS parallel corpora, with BLEU as the primary metric.
- **Languages:** English (eng), Tigrinya (tig)
- **Tokenizer:** SentencePiece, customized for Geez-script representation
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
- **License:** MIT
---
## 🔍 Model Details
### Tokenizer
- **Type**: SentencePiece-based subword tokenizer
- **Purpose**: Handles Geez-script specific tokenization for Tigrinya
- **Training Data**: NLLB English–Tigrinya subset
- **Evaluation Data**: OPUS parallel corpus
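To see the Geez-script segmentation in practice, the snippet below loads the tokenizer and prints the subword split for a short Tigrinya sentence. This is a minimal sketch: it assumes the tokenizer is published under the same repo id used in the Usage section below, and the sample sentence is illustrative only.

```python
from transformers import MarianTokenizer

# Assumes the tokenizer ships in the same repo as the model (see Usage below).
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Inspect how a Geez-script sentence is split into subword pieces.
tigrinya_text = "ሰላም ዓለም"  # illustrative sample ("Hello, world")
print(tokenizer.tokenize(tigrinya_text))      # subword pieces
print(tokenizer(tigrinya_text)["input_ids"])  # corresponding vocabulary ids
```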
### Translation Model
- **Base Model**: MarianMT
- **Frameworks**: Hugging Face Transformers, PyTorch
- **Task**: Bidirectional English ↔ Tigrinya MT
---
## ⚙️ Training Details
- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Testing Dataset**: OPUS Parallel Corpus
- **Epochs**: 3
- **Batch Size**: 8
- **Max Sequence Length**: 128 tokens
- **Learning Rate**: `1.44e-07` with decay
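The exact training script is not included here; the sketch below only shows how the hyperparameters above would map onto the Hugging Face `Seq2SeqTrainingArguments` API. The output directory and the linear decay schedule are assumptions, and the 128-token limit is applied at tokenization time rather than in the arguments object.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the listed hyperparameters onto the Trainer API.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tig",      # placeholder path
    num_train_epochs=3,                 # Epochs: 3
    per_device_train_batch_size=8,      # Batch Size: 8
    learning_rate=1.44e-07,             # Learning Rate
    lr_scheduler_type="linear",         # assumption: "with decay" = linear schedule
    predict_with_generate=True,         # generate translations when evaluating
)

# Max Sequence Length is enforced when encoding the corpus, e.g.:
# tokenizer(batch, max_length=128, truncation=True, padding="max_length")
```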
**Training Loss**
- Epoch 1: 0.443
- Epoch 2: 0.4077
- Epoch 3: 0.4379
- Final Loss: 0.4756
**Gradient Norms**
- Epoch 1: 1.14
- Epoch 2: 1.11
- Epoch 3: 1.06
**Performance**
- Training Time: ~12 hours (43,376.7s)
- Speed: 96.7 samples/sec | 12.08 steps/sec
---
## 📊 Evaluation
- **Metric**: BLEU score
- **Evaluation Dataset**: OPUS parallel English–Tigrinya
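The evaluation script itself is not part of this card, but BLEU can be reproduced with the `evaluate` library and its sacrebleu backend, as sketched below. The prediction/reference pair is a placeholder, not actual model output.

```python
import evaluate  # requires the `evaluate` and `sacrebleu` packages

# Load the sacrebleu BLEU implementation.
bleu = evaluate.load("sacrebleu")

# Placeholders: in practice, `predictions` holds model translations of the
# OPUS test set and `references` the gold target-side sentences.
predictions = ["ሰላም ዓለም"]
references = [["ሰላም ዓለም"]]  # one list of reference strings per prediction

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.2f}")
```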
---
## 🚀 Usage
This model can be directly used for **English → Tigrinya** and **Tigrinya → English** translation.
### Example (Python)
```python
from transformers import MarianMTModel, MarianTokenizer
# Load the model and tokenizer
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("Translated text:", translated_text)
## 📌 Citation
If you use this model or tokenizer in your work, please cite:
```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  location  = {Vienna, Austria}
}
```