# MT5 Singlish → Sinhala Transliteration
This model converts Singlish (romanized Sinhala) text into Sinhala script using a Transformer-based sequence-to-sequence architecture.
It is trained using parameter-efficient fine-tuning (LoRA) on top of google/mt5-small, following both one-phase and two-phase training strategies.
The released version contains the merged full model, ready for direct inference.
## Key Features
- Task: Singlish → Sinhala transliteration
- Model: `google/mt5-small`
- Training strategy:
  - One-phase phonetic training
  - Two-phase training (phonetic foundation → adhoc fine-tuning)
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Deployment: LoRA adapters merged into a single full model
- Optimized for: Low-resource Indic transliteration
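The merged-deployment point above can be illustrated numerically: LoRA replaces a full weight update with a low-rank one, and merging just folds it back in, W' = W + (α/r)·BA. This is a minimal pure-Python sketch with made-up shapes and values; in the real model the merge happens inside every adapted mT5 projection matrix.

```python
# LoRA merge sketch: W' = W + (alpha / r) * (B @ A)
# All dimensions and values below are illustrative, not from the model.

def matmul(B, A):
    """Multiply a (d x r) matrix by an (r x k) matrix (plain lists)."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

d, r, k = 2, 1, 2                     # rank r=1 keeps the trainable update tiny
alpha = 2.0                           # LoRA scaling factor
W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight
B = [[0.5], [0.25]]                   # trainable (d x r) factor
A = [[1.0, -1.0]]                     # trainable (r x k) factor

delta = matmul(B, A)
W_merged = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(k)]
            for i in range(d)]
print(W_merged)   # the merged weight is what ships for direct inference
```

Only `B` and `A` are trained, so the number of tuned parameters scales with the rank `r` rather than with the full weight matrix.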
## Training Overview

### One-Phase Training
The model is trained only on a large phonetic transliteration dataset, learning the core mapping between Singlish spellings and Sinhala characters.
### Two-Phase Training

#### Phase 1 – Phonetic Foundation
- Trained on a large, clean phonetic dataset
- Learns stable character-level and phonetic patterns
#### Phase 2 – Adhoc Fine-Tuning
- Fine-tuned on a smaller, real-world adhoc dataset
- Adapts to spelling variations, noise, and colloquial usage
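The two-phase schedule can be caricatured with a toy lookup table: phase 1 establishes broad coverage from clean data, and phase 2 layers corrections from a smaller noisy set on top without discarding the foundation. All pairs below are hypothetical, and the real training updates LoRA weights, not a dictionary.

```python
# Toy illustration of the two-phase idea (not the released training code).
phonetic = {"ma": "ම", "la": "ල", "ge": "ගෙ"}   # phase 1: large, clean pairs
adhoc = {"maa": "මා", "ge": "ගේ"}               # phase 2: small, noisy pairs

mapping = {}
mapping.update(phonetic)   # phase 1: phonetic foundation
mapping.update(adhoc)      # phase 2: adhoc fine-tuning overrides and extends

print(mapping["ma"], mapping["maa"], mapping["ge"])
```

The ordering matters: adhoc data refines what the phonetic phase learned, which mirrors why phase 2 is run second on the real model.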
## Datasets Used

- Phonetic dataset
  - Large-scale Singlish ↔ Sinhala phonetic pairs
- Adhoc dataset
  - `deshanksuman/SwaBhasha_Transliteration_Sinhala`
All datasets were cleaned, normalized, and deduplicated before training.
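A minimal sketch of that preprocessing, assuming simple `(singlish, sinhala)` string pairs. The actual cleaning pipeline is not published, so the specific rules here (trimming, lowercasing the source, NFC normalization, exact-pair deduplication) are illustrative assumptions.

```python
import unicodedata

def clean_pairs(pairs):
    """Normalize, trim, and deduplicate (singlish, sinhala) pairs.

    Illustrative preprocessing only; rules are assumed, not published.
    """
    seen, out = set(), []
    for src, tgt in pairs:
        src = unicodedata.normalize("NFC", src.strip().lower())
        tgt = unicodedata.normalize("NFC", tgt.strip())
        if src and tgt and (src, tgt) not in seen:   # drop empties and repeats
            seen.add((src, tgt))
            out.append((src, tgt))
    return out

raw = [("Mama ", "මම"), ("mama", "මම"), ("", "ම")]   # hypothetical raw rows
print(clean_pairs(raw))
```

Unicode NFC normalization matters for Sinhala in particular, since the same syllable can be encoded as different codepoint sequences.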
## Evaluation Metrics
The model was evaluated using standard sequence-to-sequence metrics:
- BLEU – n-gram overlap with the reference (higher is better)
- CER (Character Error Rate) – character-level edit distance normalized by reference length (lower is better)
- WER (Word Error Rate) – word-level edit distance normalized by reference length (lower is better)
The two-phase model consistently outperformed the one-phase model, especially on noisy real-world inputs.
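CER and WER are both normalized edit distances; a compact pure-Python reference implementation (no external metric library):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling DP row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref, hyp):
    """Character Error Rate: edits / reference characters."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word Error Rate: edits / reference words (whitespace tokens)."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

For transliteration, CER is usually the most sensitive of the three, since a single wrong Sinhala character already counts as a full word error under WER.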
## Usage
### Load the model and transliterate

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo = "Pudamya/mt5-singlish2sinhala"
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

# Example inference (input sentence is illustrative)
inputs = tokenizer("mama gedara yanawa", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```