# Gemma3 Singlish → Sinhala Transliteration Model

## Overview

This model performs Singlish (Romanized Sinhala) → Sinhala script transliteration. It is designed to correctly handle:
- phonetic Singlish
- code-mixed Sinhala-English text
- ad-hoc spellings
- rare Sinhala conjunct clusters
Examples of difficult conjunct clusters handled by the model:
- ඥ
- ක්ෂ
- ශ්ර
- ස්ථ
- මඤ්ඤ
Example:
| Singlish | Sinhala |
|---|---|
| gnathin | ඥාතින් |
| jnanaya | ඥානය |
| mannyokka | මඤ්ඤොක්කා |
| kshana | ක්ෂණ |
| shraddha | ශ්රද්ධා |
## Model Architecture

Base model: Gemma

Fine-tuning method: LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique for large language models.
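LoRA freezes the base weights and learns a small low-rank update: the effective weight becomes W + (α/r)·B·A, where B and A are rank-r matrices. A minimal pure-Python sketch of that arithmetic with toy matrices (illustrative only; the actual fine-tune applies this inside the model's attention layers via a library such as `peft`):

```python
# Toy LoRA update in pure Python: the frozen weight W stays untouched,
# and the effective weight is W + (alpha / r) * (B @ A).
# Illustrative only; real training uses the `peft` library on the base model.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), leaving the frozen W unchanged."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x2 weight with rank-1 adapters: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
print(lora_effective_weight(W, A, B, alpha=2, r=1))  # [[2.0, 1.0], [2.0, 3.0]]
```

Because only A and B are trained, the number of trainable parameters stays a small fraction of the full model.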
## Training Strategy

The model was trained with a 3-phase curriculum to improve performance on both common and rare transliteration patterns.
### Phase 1 — Phonetic Learning

Datasets used:
- Phonetic dataset (1M rows)

Goal: learn the general Singlish → Sinhala phonetic mapping.
Example:
amma → අම්මා
gama → ගම
ratak → රටක්
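At its simplest, the Phase 1 objective is to learn a phonetic mapping like the pairs above. A toy dictionary-based sketch built from those examples (`PHONETIC_MAP` is hypothetical; the model learns the mapping at subword level and generalizes to unseen words, unlike a fixed table):

```python
# Toy whole-word lookup built from the Phase 1 examples above.
# Illustrative only: the trained model generalizes beyond a fixed table.
PHONETIC_MAP = {
    "amma": "අම්මා",
    "gama": "ගම",
    "ratak": "රටක්",
}

def transliterate_word(word: str) -> str:
    """Return the Sinhala form if known, otherwise the input unchanged."""
    return PHONETIC_MAP.get(word, word)

print(transliterate_word("amma"))  # අම්මා
```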
### Phase 2 — Adhoc + Code-Mix Learning

Datasets used:
- Adhoc dataset
- Code-mixed Sinhala-English dataset

Goal: handle informal spellings, mixed-language sentences, and real-world Singlish usage.
Example:
mama office ekata yanawa → මම office එකට යනවා
today mama busy → අද මම busy
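The intended code-mix behaviour can be sketched as token-level routing: transliterate tokens recognised as Singlish, and pass other tokens (English words like "office") through unchanged. A toy illustration with a hypothetical `SINGLISH_WORDS` table (the model makes this decision from context, not from a word list):

```python
# Toy code-mix handling: tokens in the hypothetical SINGLISH_WORDS table
# are transliterated; anything else, e.g. English words like "office",
# passes through unchanged. The real model decides this contextually.
SINGLISH_WORDS = {
    "mama": "මම",
    "ekata": "එකට",
    "yanawa": "යනවා",
}

def transliterate_sentence(sentence: str) -> str:
    """Transliterate known Singlish tokens, keep the rest as-is."""
    return " ".join(SINGLISH_WORDS.get(tok, tok) for tok in sentence.split())

print(transliterate_sentence("mama office ekata yanawa"))  # මම office එකට යනවා
```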
### Phase 3 — Rare Conjunct Booster

Datasets used:
- Adjunct dataset
- Replay samples from the phonetic dataset
- Replay samples from the adhoc dataset

Goal: improve accuracy on difficult Sinhala conjunct clusters:
- ඥ
- ක්ෂ
- ශ්ර
- ස්ථ
- මඤ්ඤ
Example:
gnathin → ඥාතින්
kshana → ක්ෂණ
mannyokka → මඤ්ඤොක්කා
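The replay idea in Phase 3 can be sketched as mixing the conjunct booster set with a fraction of samples from the earlier phases, so previously learned patterns are not forgotten. The dataset contents and the 20% replay ratio below are illustrative, not the exact training recipe:

```python
import random

def build_phase3_mix(conjunct, phonetic, adhoc, replay_ratio=0.2, seed=0):
    """Combine the conjunct booster set with replayed samples from the
    earlier phonetic and adhoc phases (illustrative replay ratio)."""
    rng = random.Random(seed)
    n_replay = int(len(conjunct) * replay_ratio)
    mix = list(conjunct)
    mix += rng.sample(phonetic, min(n_replay, len(phonetic)))
    mix += rng.sample(adhoc, min(n_replay, len(adhoc)))
    rng.shuffle(mix)
    return mix

conjunct = [("gnathin", "ඥාතින්"), ("kshana", "ක්ෂණ")] * 5  # 10 pairs
phonetic = [("amma", "අම්මා"), ("gama", "ගම")] * 5          # 10 pairs
adhoc = [("machan", "මචං")] * 5                              # 5 pairs
mix = build_phase3_mix(conjunct, phonetic, adhoc)
print(len(mix))  # 10 conjunct + 2 phonetic replay + 2 adhoc replay = 14
```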
## Datasets Used

Training data consists of four dataset types:

### 1. Phonetic Dataset

Romanized Sinhala → Sinhala script word pairs.
Examples:
amma → අම්මා
ratak → රටක්
gama → ගම
### 2. Adhoc Dataset
Common Singlish spellings used in real communication.
Examples:
machan → මචං
mokadda → මොකද්ද
### 3. Code-Mixed Dataset
Mixed Sinhala + English sentences.
Examples:
mama meeting ekata yanawa → මම meeting එකට යනවා
api project eka finish karamu → අපි project එක finish කරමු
### 4. Adjunct Dataset
Synthetic dataset focused on rare Sinhala conjunct clusters.
## Training Details

| Parameter | Value |
|---|---|
| Model | Gemma |
| Fine-tuning | LoRA |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Learning Rate | 1.5e-4 |
| Scheduler | Cosine |
| Max Length | 256 |
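With batch size 2 and gradient accumulation 8, each optimizer step effectively sees 16 examples, and the cosine scheduler decays the learning rate from the 1.5e-4 peak toward zero. A minimal sketch of both (no warmup shown; the actual schedule may include one):

```python
import math

BATCH_SIZE = 2
GRAD_ACCUM = 8
PEAK_LR = 1.5e-4

# Gradients accumulate over 8 micro-batches before each optimizer step.
effective_batch = BATCH_SIZE * GRAD_ACCUM  # 16 examples per optimizer step

def cosine_lr(step: int, total_steps: int, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 down to 0 at `total_steps`."""
    return 0.5 * peak * (1 + math.cos(math.pi * step / total_steps))

print(effective_batch)        # 16
print(cosine_lr(0, 1000))     # 0.00015 (peak)
print(cosine_lr(1000, 1000))  # 0.0
```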
## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo_id = "Pudamya/small100-singlish-sinhala-3phase-final"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, trust_remote_code=True)

sentences = [
    "kohomada oyata",
    "mama bath kanawa",
    "api heta hamuwemu",
    "mama gnathin hambenna yanawa",
    "eyala ekka mannyokka kanna ymu",
    "kshana",
    "oyt gnanaya naha",
]

for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Singlish:", s)
    print("Sinhala :", result)
    print()
```
Output:

```text
Singlish: kohomada oyata
Sinhala : කොහොමද ඔයාට

Singlish: mama bath kanawa
Sinhala : මම බත් කනවා

Singlish: api heta hamuwemu
Sinhala : අපි හෙට හමුවෙමු

Singlish: mama gnathin hambenna yanawa
Sinhala : මම ඥාතීන් හම්බෙන්න යනවා

Singlish: eyala ekka mannyokka kanna ymu
Sinhala : එයාලා එක්ක මඤ්ඤොක්කා කන්න යමු

Singlish: kshana
Sinhala : ක්ෂණ

Singlish: oyt gnanaya naha
Sinhala : ඔයාට ඥානය නැහැ
```
## Applications

This model can be used for:
- Sinhala input systems
- Chat applications
- Social media text normalization
- Transliteration tools
- NLP preprocessing for Sinhala
## Author

Pudamya Vidusini Rathnayake
Singlish → Sinhala Transliteration Research
## Notes

Training time depends on dataset size and hardware configuration.

The model was trained on large phonetic datasets plus specialized conjunct boosters to improve accuracy on complex Sinhala orthography.