---
language:
- eng # English
- tir # Tigrinya
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb # NLLB training dataset
- opus # OPUS parallel data for testing
metrics:
- bleu
---

# English–Tigrinya Machine Translation & Tokenizer

### 📌 Conference

Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
📍 25–28 November 2025 | Vienna, Austria

**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*

---

## 📝 Model Summary

This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**. The model is trained on the NLLB English–Tigrinya parallel data and evaluated on OPUS parallel corpora, with BLEU as the primary metric.

- **Languages:** English (eng), Tigrinya (tir)
- **Tokenizer:** SentencePiece, customized for Geez-script representation
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
- **License:** MIT

---

## 🔍 Model Details

### Tokenizer

- **Type**: SentencePiece-based subword tokenizer
- **Purpose**: Handles Geez-script-specific tokenization for Tigrinya
- **Training Data**: NLLB English–Tigrinya subset
- **Evaluation Data**: OPUS parallel corpus

### Translation Model

- **Base Model**: MarianMT
- **Frameworks**: Hugging Face Transformers, PyTorch
- **Task**: Bidirectional English ↔ Tigrinya MT

---

## ⚙️ Training Details

- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Testing Dataset**: OPUS Parallel Corpus
- **Epochs**: 3
- **Batch Size**: 8
- **Max Sequence Length**: 128 tokens
- **Learning Rate**: `1.44e-07` with decay

**Training Loss**

- Epoch 1: 0.443
- Epoch 2: 0.4077
- Epoch 3: 0.4379
- Final Loss: 0.4756

**Gradient Norms**

- Epoch 1: 1.14
- Epoch 2: 1.11
- Epoch 3: 1.06

**Performance**

- Training Time: ~12 hours (43,376.7 s)
- Speed: 96.7 samples/sec | 12.08 steps/sec

---

## 📊 Evaluation

- **Metric**: BLEU score
- **Evaluation Dataset**: OPUS parallel English–Tigrinya corpus

A sketch for reproducing this evaluation appears under *Additional Examples* at the end of this card.

---

## 🚀 Usage

The model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation. A reverse-direction example and a tokenizer-inspection example appear under *Additional Examples* at the end of this card.

### Example (Python)

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("Translated text:", translated_text)
```

---

## 📌 Citation

If you use this model or tokenizer in your work, please cite:

```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  location  = {Vienna, Austria}
}
```
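
---

## 🧪 Additional Examples

### Inspecting the Tokenizer

The sketch below shows how to look at the subword pieces the custom SentencePiece tokenizer produces for Geez-script input; the Tigrinya sample sentence is illustrative, not taken from the training data:

```python
from transformers import MarianTokenizer

# Load the custom SentencePiece tokenizer from the Hub
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Print the subword segmentation of a short Geez-script sentence
# (illustrative input meaning roughly "hello world")
print(tokenizer.tokenize("ሰላም ዓለም"))
```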
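
### Tigrinya → English

The usage example above runs English → Tigrinya; the reverse direction follows the same pattern. A minimal sketch, assuming the single checkpoint serves both directions as stated in the summary (the input sentence is illustrative):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate Tigrinya → English
# (illustrative input meaning roughly "hello world")
tigrinya_text = "ሰላም ዓለም"
inputs = tokenizer(tigrinya_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```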
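
### Reproducing the BLEU Evaluation

A minimal sketch of scoring English → Tigrinya output against OPUS references with BLEU. It assumes `sacrebleu` is installed and that the test split has been exported to the hypothetical files `opus_test.en` and `opus_test.ti` (one aligned sentence per line); neither file name comes from this repository:

```python
import sacrebleu
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
model.eval()

# Hypothetical test files: one sentence per line, aligned across files
with open("opus_test.en", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("opus_test.ti", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate in batches of 8, matching the training batch size
hypotheses = []
for i in range(0, len(sources), 8):
    batch = sources[i:i + 8]
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs)
    hypotheses.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# corpus_bleu takes the hypothesis list and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```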