Gemma3 Singlish → Sinhala Transliteration Model

Overview

This model performs Singlish (Romanized Sinhala) → Sinhala script transliteration.

It is designed to correctly handle:

  • phonetic Singlish
  • code-mixed Sinhala-English text
  • ad hoc spellings
  • rare Sinhala conjunct clusters

Examples of difficult conjunct clusters handled by the model:

  • ක්ෂ
  • ශ්‍ර
  • ස්ථ
  • මඤ්ඤ

Example:

gnathin → ඥාතින්
jnanaya → ඥානය
mannyokka → මඤ්ඤොක්කා
kshana → ක්ෂණ
shraddha → ශ්‍රද්ධා

Model Architecture

Base model: Gemma

Fine-tuning method:

  • LoRA (Low-Rank Adaptation)
  • Parameter-efficient fine-tuning for large language models
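
The card does not list the LoRA hyperparameters. The sketch below shows how such an adapter is typically attached with the peft library; the rank, alpha, dropout, and target modules are illustrative assumptions, and the base checkpoint path is a placeholder.

# Minimal sketch of attaching a LoRA adapter with the peft library.
# Rank, alpha, dropout, and target modules are illustrative assumptions;
# "path/to/base-model" is a placeholder, not the actual base checkpoint.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("path/to/base-model")

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,                     # (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="SEQ_2_SEQ_LM",              # matches the seq2seq usage example below
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # only the LoRA adapter weights are trainable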

Training Strategy

The model was trained with a three-phase curriculum to improve performance on both common and rare transliteration patterns; a high-level sketch of the phase sequencing is shown below.
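
At a high level, curriculum training here means running fine-tuning three times in sequence, with each phase starting from the previous phase's checkpoint and using a different dataset mixture. In the sketch below, run_finetuning_phase is a hypothetical helper standing in for the tokenization and Trainer loop, and the file names are placeholders, not the actual training files.

# Sketch of the 3-phase curriculum: each phase fine-tunes on a different
# dataset mixture, starting from the previous phase's checkpoint.
# `run_finetuning_phase` is a hypothetical helper (tokenization + Trainer loop);
# the dataset file names are placeholders, not the actual training files.
from datasets import load_dataset

phases = [
    ("phase1_phonetic",      ["phonetic.jsonl"]),
    ("phase2_adhoc_codemix", ["adhoc.jsonl", "code_mixed.jsonl"]),
    ("phase3_conjuncts",     ["conjunct.jsonl", "replay_phonetic.jsonl", "replay_adhoc.jsonl"]),
]

checkpoint = "path/to/base-model"              # placeholder starting checkpoint
for output_dir, data_files in phases:
    train_data = load_dataset("json", data_files=data_files, split="train")
    checkpoint = run_finetuning_phase(         # hypothetical helper: trains and returns the new checkpoint dir
        model_path=checkpoint,
        dataset=train_data,
        output_dir=output_dir,
    )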

Phase 1 — Phonetic Learning

Datasets used:

  • Phonetic dataset (1M rows)

Goal:

Learn general Singlish → Sinhala phonetic mapping

Example:

amma → අම්මා
gama → ගම
ratak → රටක්

Phase 2 — Adhoc + Code-Mix Learning

Datasets used:

  • Adhoc dataset
  • Code-mixed Sinhala-English dataset

Goal:

Handle:

  • informal spellings
  • mixed language sentences
  • real-world Singlish usage

Example:

mama office ekata yanawa → මම office එකට යනවා
today mama busy → අද මම busy

Phase 3 — Rare Conjunct Booster

Datasets used:

  • Adjunct dataset
  • Replay samples from phonetic dataset
  • Replay samples from adhoc dataset

Goal:

Improve handling of difficult Sinhala conjunct clusters:

  • ක්ෂ
  • ශ්‍ර
  • ස්ථ
  • මඤ්ඤ

Example:

gnathin → ඥාතින්
kshana → ක්ෂණ
mannyokka → මඤ්ඤොක්කා
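
One way to assemble the Phase 3 mixture described above is to concatenate the conjunct booster data with small samples replayed from the earlier datasets, so the model keeps its general phonetic and ad hoc mappings while learning the rare clusters. A sketch with the datasets library follows; the file names and replay sizes are assumptions.

# Sketch of building the Phase 3 training mixture: conjunct booster data
# plus small replayed samples from earlier phases to limit forgetting.
# File names and replay sizes are illustrative assumptions.
from datasets import load_dataset, concatenate_datasets

conjunct = load_dataset("json", data_files="conjunct.jsonl", split="train")
phonetic = load_dataset("json", data_files="phonetic.jsonl", split="train")
adhoc    = load_dataset("json", data_files="adhoc.jsonl", split="train")

# Replay a small slice of the earlier data alongside the booster set.
phonetic_replay = phonetic.shuffle(seed=42).select(range(20_000))
adhoc_replay    = adhoc.shuffle(seed=42).select(range(20_000))

phase3_train = concatenate_datasets([conjunct, phonetic_replay, adhoc_replay]).shuffle(seed=42)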

Datasets Used

The training data consists of four dataset types:

1. Phonetic Dataset

Romanized Sinhala → Sinhala script

Examples:

amma → අම්මා
ratak → රටක්
gama → ගම

2. Adhoc Dataset

Common Singlish spellings used in real communication.

Examples:

machan → මචං
mokadda → මොකද්ද

3. Code-Mixed Dataset

Mixed Sinhala + English sentences.

Examples:

mama meeting ekata yanawa → මම meeting එකට යනවා
api project eka finish karamu → අපි project එක finish කරමු

4. Adjunct Dataset

Synthetic dataset focused on rare Sinhala conjunct clusters.
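
The card does not specify the on-disk format of these datasets. The sketches above assume simple records with a Romanized source and a Sinhala target, for example:

# Assumed record layout for the training pairs (field names not specified on the card).
example_records = [
    {"source": "amma",   "target": "අම්මා"},                                   # Phonetic
    {"source": "machan", "target": "මචං"},                                     # Adhoc
    {"source": "mama meeting ekata yanawa", "target": "මම meeting එකට යනවා"},  # Code-Mixed
    {"source": "kshana", "target": "ක්ෂණ"},                                    # Adjunct (conjunct booster)
]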


Training Details

Parameter               Value
Model                   Gemma
Fine-tuning             LoRA
Batch Size              2
Gradient Accumulation   8
Learning Rate           1.5e-4
Scheduler               Cosine
Max Length              256
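
These hyperparameters map onto a transformers training configuration roughly as sketched below. Arguments not listed in the table (epochs, warmup, precision, output directory) are assumptions.

# Sketch of the listed hyperparameters as transformers TrainingArguments.
# Values not in the table above (epochs, warmup, precision) are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="singlish-sinhala-phase",   # placeholder
    per_device_train_batch_size=2,         # Batch Size 2
    gradient_accumulation_steps=8,         # effective batch size 2 x 8 = 16
    learning_rate=1.5e-4,                  # Learning Rate 1.5e-4
    lr_scheduler_type="cosine",            # Scheduler: Cosine
    num_train_epochs=1,                    # assumed, not stated on the card
    warmup_ratio=0.03,                     # assumed
    bf16=True,                             # assumed
    logging_steps=100,                     # assumed
)

# Max Length 256 applies at tokenization time, e.g.:
# tokenizer(batch["source"], max_length=256, truncation=True)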

Example Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint from the Hugging Face Hub.
repo_id = "Pudamya/small100-singlish-sinhala-3phase-final"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, trust_remote_code=True)

# Singlish inputs covering phonetic, ad hoc, and rare-conjunct cases.
sentences = [
    "kohomada oyata",
    "mama bath kanawa",
    "api heta hamuwemu",
    "mama gnathin hambenna yanawa",
    "eyala ekka mannyokka kanna ymu",
    "kshana",
    "oyt gnanaya naha",
]

for s in sentences:
    # Tokenize, generate the Sinhala output, and decode it back to text.
    inputs = tokenizer(s, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print("Singlish:", s)
    print("Sinhala :", result)
    print()

Output:

Singlish: kohomada oyata
Sinhala : කොහොමද ඔයාට

Singlish: mama bath kanawa
Sinhala : මම බත් කනවා

Singlish: api heta hamuwemu
Sinhala : අපි හෙට හමුවෙමු

Singlish: mama gnathin hambenna yanawa
Sinhala : මම ඥාතීන් හම්බෙන්න යනවා

Singlish: eyala ekka mannyokka kanna ymu
Sinhala : එයාලා එක්ක මඤ්ඤොක්කා කන්න යමු

Singlish: kshana
Sinhala : ක්ෂණ

Singlish: oyt gnanaya naha
Sinhala : ඔයාට ඥානය නැහැ
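
The sentences can also be transliterated in a single batched generate() call. The following is a small variation on the example above, not part of the original snippet; padding behaviour depends on the tokenizer shipped with the checkpoint.

# Optional: transliterate all sentences in one batched generate() call.
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128)
for s, out in zip(sentences, outputs):
    print(s, "→", tokenizer.decode(out, skip_special_tokens=True))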


Applications

This model can be used for:

  • Sinhala input systems
  • Chat applications
  • Social media text normalization
  • Transliteration tools
  • NLP preprocessing for Sinhala

Author

Pudamya Vidusini Rathnayake

Singlish → Sinhala Transliteration Research


Notes

Training time depends on dataset size and hardware configuration.

The model was trained on large phonetic datasets together with a specialized conjunct booster set to improve accuracy on complex Sinhala orthography.
