mT5 Singlish → Sinhala Transliteration

This model converts Singlish (romanized Sinhala) text into Sinhala script using a Transformer-based sequence-to-sequence architecture.

It was trained with parameter-efficient fine-tuning (LoRA) on top of google/mt5-small, using both a one-phase and a two-phase training strategy.
The released version contains the merged full model, ready for direct inference.


Key Features

  • Task: Singlish → Sinhala transliteration
  • Model: google/mt5-small
  • Training strategy:
    • One-phase phonetic training
    • Two-phase training (phonetic foundation → ad hoc fine-tuning)
  • Fine-tuning method: LoRA (Low-Rank Adaptation)
  • Deployment: LoRA adapters merged into a single full model (see the merging sketch after this list)
  • Optimized for: Low-resource Indic transliteration
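
Because the adapters are already merged, no peft dependency is needed at inference time. For reference, a minimal sketch of how such a merge is typically done with peft (the adapter path below is hypothetical, not part of this repo):

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
# attach the trained LoRA adapter (path is hypothetical)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# fold the low-rank updates into the base weights for standalone use
merged = model.merge_and_unload()
merged.save_pretrained("mt5-singlish2sinhala-merged")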

Training Overview

One-Phase Training

The model is trained only on a large phonetic transliteration dataset, learning the core mapping between Singlish spellings and Sinhala characters.

Two-Phase Training

  1. Phase 1 – Phonetic Foundation

    • Trained on a large, clean phonetic dataset
    • Learns stable character-level and phonetic patterns
  2. Phase 2 – Ad Hoc Fine-Tuning

    • Fine-tuned on a smaller, real-world ad hoc dataset
    • Adapts to spelling variations, noise, and colloquial usage
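
A hedged sketch of the LoRA setup with peft, applicable to either phase (rank, alpha, dropout, and target modules are assumptions, not the released configuration):

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# hypothetical LoRA hyperparameters; the released run may differ
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # mT5 attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

In the two-phase setting, Phase 2 would continue training the same adapter (or a new one initialized from it) on the smaller ad hoc dataset.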

Datasets Used

  • Phonetic dataset
    • Large-scale Singlish ↔ Sinhala phonetic pairs
  • Ad hoc dataset
    • deshanksuman/SwaBhasha_Transliteration_Sinhala

All datasets were cleaned, normalized, and deduplicated before training.
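
The exact preprocessing pipeline is not published; a minimal sketch of Unicode normalization plus exact-pair deduplication along those lines:

import unicodedata

# toy pairs for illustration; the second is a duplicate after trimming
raw_pairs = [
    ("mama gedara yanawa", "මම ගෙදර යනවා"),
    ("mama gedara yanawa ", "මම ගෙදර යනවා"),
]

def normalize(text):
    # NFC-normalize Unicode and strip surrounding whitespace
    return unicodedata.normalize("NFC", text).strip()

seen = set()
cleaned = []
for src, tgt in raw_pairs:
    pair = (normalize(src), normalize(tgt))
    if all(pair) and pair not in seen:  # drop empty or repeated pairs
        seen.add(pair)
        cleaned.append(pair)

print(cleaned)  # one unique, normalized pair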


Evaluation Metrics

The model was evaluated using standard sequence-to-sequence metrics:

  • BLEU – overall sequence similarity (higher is better)
  • CER (Character Error Rate) – character-level edit distance relative to reference length (lower is better)
  • WER (Word Error Rate) – word-level edit distance relative to reference length (lower is better)
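
One way to compute these metrics is the Hugging Face evaluate library; a minimal sketch (the toy prediction/reference pair is illustrative):

import evaluate

bleu = evaluate.load("sacrebleu")
cer = evaluate.load("cer")   # requires the jiwer package
wer = evaluate.load("wer")

predictions = ["මම ගෙදර යනවා"]
references = ["මම ගෙදර යනවා"]

print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])
print(cer.compute(predictions=predictions, references=references))
print(wer.compute(predictions=predictions, references=references))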

The two-phase model consistently outperformed the one-phase model, especially on noisy real-world inputs.


Usage

Load the model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo = "Pudamya/mt5-singlish2sinhala"

# use_fast=False loads the slow SentencePiece tokenizer used by mT5
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
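
Run inference

A minimal generation call, assuming plain Singlish text as input (the example sentence, beam size, and token budget are illustrative, not tested settings):

text = "mama gedara yanawa"

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))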