mT5 Singlish → Sinhala Transliteration

This model converts Singlish (romanized Sinhala) text into Sinhala script using a Transformer-based sequence-to-sequence architecture.

It was trained with parameter-efficient fine-tuning (LoRA) on top of google/mt5-small, using both a one-phase and a two-phase training strategy.
The released version contains the merged full model, ready for direct inference.


Key Features

  • Task: Singlish → Sinhala transliteration
  • Model: google/mt5-small
  • Training strategy:
    • One-phase phonetic training
    • Two-phase training (phonetic foundation → ad hoc fine-tuning)
  • Fine-tuning method: LoRA (Low-Rank Adaptation)
  • Deployment: LoRA adapters merged into a single full model (see the merging sketch after this list)
  • Optimized for: Low-resource Indic transliteration
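
Because the adapters are already merged, no peft dependency is needed at inference time. For reference, a minimal sketch of how such a merge is typically done with peft (the adapter path below is hypothetical, not part of this repo):

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
# attach the trained LoRA adapter (path is hypothetical)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# fold the low-rank updates into the base weights for standalone use
merged = model.merge_and_unload()
merged.save_pretrained("mt5-singlish2sinhala-merged")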

Training Overview

One-Phase Training

The model is trained only on a large phonetic transliteration dataset, learning the core mapping between Singlish spellings and Sinhala characters.

Two-Phase Training

  1. Phase 1 – Phonetic Foundation

    • Trained on a large, clean phonetic dataset
    • Learns stable character-level and phonetic patterns
  2. Phase 2 – Ad Hoc Fine-Tuning

    • Fine-tuned on a smaller, real-world ad hoc dataset
    • Adapts to spelling variations, noise, and colloquial usage
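
A hedged sketch of the LoRA setup with peft, applicable to either phase (rank, alpha, dropout, and target modules are assumptions, not the released configuration):

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# hypothetical LoRA hyperparameters; the released run may differ
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # mT5 attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

In the two-phase setting, Phase 2 would continue training the same adapter (or a new one initialized from it) on the smaller ad hoc dataset.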

Datasets Used

  • Phonetic dataset
    • Large-scale Singlish ↔ Sinhala phonetic pairs
  • Ad hoc dataset
    • deshanksuman/SwaBhasha_Transliteration_Sinhala

All datasets were cleaned, normalized, and deduplicated before training.
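
The exact preprocessing pipeline is not published; a minimal sketch of Unicode normalization plus exact-pair deduplication along those lines:

import unicodedata

# toy pairs for illustration; the second is a duplicate after trimming
raw_pairs = [
    ("mama gedara yanawa", "මම ගෙදර යනවා"),
    ("mama gedara yanawa ", "මම ගෙදර යනවා"),
]

def normalize(text):
    # NFC-normalize Unicode and strip surrounding whitespace
    return unicodedata.normalize("NFC", text).strip()

seen = set()
cleaned = []
for src, tgt in raw_pairs:
    pair = (normalize(src), normalize(tgt))
    if all(pair) and pair not in seen:  # drop empty or repeated pairs
        seen.add(pair)
        cleaned.append(pair)

print(cleaned)  # one unique, normalized pair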


Evaluation Metrics

The model was evaluated using standard sequence-to-sequence metrics:

  • BLEU – overall sequence similarity (higher is better)
  • CER (Character Error Rate) – character-level edit distance relative to reference length (lower is better)
  • WER (Word Error Rate) – word-level edit distance relative to reference length (lower is better)
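
One way to compute these metrics is the Hugging Face evaluate library; a minimal sketch (the toy prediction/reference pair is illustrative):

import evaluate

bleu = evaluate.load("sacrebleu")
cer = evaluate.load("cer")   # requires the jiwer package
wer = evaluate.load("wer")

predictions = ["මම ගෙදර යනවා"]
references = ["මම ගෙදර යනවා"]

print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references])["score"])
print(cer.compute(predictions=predictions, references=references))
print(wer.compute(predictions=predictions, references=references))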

The two-phase model consistently outperformed the one-phase model, especially on noisy real-world inputs.


Usage

Load the model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo = "Pudamya/mt5-singlish2sinhala"

# use_fast=False loads the slow SentencePiece tokenizer used by mT5
tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
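
Run inference

A minimal generation call, assuming plain Singlish text as input (the example sentence, beam size, and token budget are illustrative, not tested settings):

text = "mama gedara yanawa"

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))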