---
language:
- eng # English
- tir # Tigrinya
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb # NLLB training dataset
- opus # OPUS parallel data for testing
metrics:
- bleu
---

# English–Tigrinya Machine Translation & Tokenizer

### 📌 Conference

Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
📍 25–28 November 2025 | Vienna, Austria

**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*

---

## 📝 Model Summary

This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**. The model is trained on the NLLB English–Tigrinya parallel data and evaluated on OPUS parallel corpora, with BLEU as the primary metric.

- **Languages:** English (eng), Tigrinya (tir)
- **Tokenizer:** SentencePiece, customized for Geez-script representation
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
- **License:** MIT

---

## 🔍 Model Details

### Tokenizer

- **Type**: SentencePiece-based subword tokenizer
- **Purpose**: Handles Geez-script-specific tokenization for Tigrinya
- **Training Data**: NLLB English–Tigrinya subset
- **Evaluation Data**: OPUS parallel corpus

### Translation Model

- **Base Model**: MarianMT
- **Frameworks**: Hugging Face Transformers, PyTorch
- **Task**: Bidirectional English ↔ Tigrinya MT

---

## ⚙️ Training Details

- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Testing Dataset**: OPUS Parallel Corpus
- **Epochs**: 3
- **Batch Size**: 8
- **Max Sequence Length**: 128 tokens
- **Learning Rate**: `1.44e-07` with decay

**Training Loss**

- Epoch 1: 0.443
- Epoch 2: 0.4077
- Epoch 3: 0.4379
- Final Loss: 0.4756

**Gradient Norms**

- Epoch 1: 1.14
- Epoch 2: 1.11
- Epoch 3: 1.06

**Performance**

- Training Time: ~12 hours (43,376.7 s)
- Speed: 96.7 samples/sec | 12.08 steps/sec

---

## 📊 Evaluation

- **Metric**: BLEU score
- **Evaluation Dataset**: OPUS parallel English–Tigrinya corpus

A sketch for reproducing this evaluation appears under *Additional Examples* at the end of this card.

---

## 🚀 Usage

The model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation. A reverse-direction example and a tokenizer-inspection example appear under *Additional Examples* at the end of this card.

### Example (Python)

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate English → Tigrinya
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("Translated text:", translated_text)
```

---

## 📌 Citation

If you use this model or tokenizer in your work, please cite:

```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  location  = {Vienna, Austria}
}
```
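
---

## 🧪 Additional Examples

### Inspecting the Tokenizer

The sketch below shows how to look at the subword pieces the custom SentencePiece tokenizer produces for Geez-script input; the Tigrinya sample sentence is illustrative, not taken from the training data:

```python
from transformers import MarianTokenizer

# Load the custom SentencePiece tokenizer from the Hub
tokenizer = MarianTokenizer.from_pretrained("Hailay/MachineT_TigEng")

# Print the subword segmentation of a short Geez-script sentence
# (illustrative input meaning roughly "hello world")
print(tokenizer.tokenize("ሰላም ዓለም"))
```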
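
### Tigrinya → English

The usage example above runs English → Tigrinya; the reverse direction follows the same pattern. A minimal sketch, assuming the single checkpoint serves both directions as stated in the summary (the input sentence is illustrative):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate Tigrinya → English
# (illustrative input meaning roughly "hello world")
tigrinya_text = "ሰላም ዓለም"
inputs = tokenizer(tigrinya_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```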
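
### Reproducing the BLEU Evaluation

A minimal sketch of scoring English → Tigrinya output against OPUS references with BLEU. It assumes `sacrebleu` is installed and that the test split has been exported to the hypothetical files `opus_test.en` and `opus_test.ti` (one aligned sentence per line); neither file name comes from this repository:

```python
import sacrebleu
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
model.eval()

# Hypothetical test files: one sentence per line, aligned across files
with open("opus_test.en", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("opus_test.ti", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate in batches of 8, matching the training batch size
hypotheses = []
for i in range(0, len(sources), 8):
    batch = sources[i:i + 8]
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs)
    hypotheses.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# corpus_bleu takes the hypothesis list and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```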