# Gemma3 Singlish → Sinhala Transliteration Model

## Overview

This model performs Singlish (Romanized Sinhala) → Sinhala script transliteration. It is designed to correctly handle:
- phonetic Singlish
- code-mixed Sinhala-English text
- ad-hoc spellings
- rare Sinhala conjunct clusters
Examples of difficult conjunct clusters handled by the model:
- ඥ
- ක්ෂ
- ශ්ර
- ස්ථ
- මඤ්ඤ
Example:
| Singlish | Sinhala |
|---|---|
| gnathin | ඥාතින් |
| jnanaya | ඥානය |
| mannyokka | මඤ්ඤොක්කා |
| kshana | ක්ෂණ |
| shraddha | ශ්රද්ධා |
## Model Architecture

Base model: Gemma

Fine-tuning method: LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique for large language models.
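LoRA freezes the base weights and learns a small low-rank update: the effective weight becomes W + (α/r)·B·A, where B and A are rank-r matrices. A minimal pure-Python sketch of that arithmetic with toy matrices (illustrative only; the actual fine-tune applies this inside the model's attention layers via a library such as `peft`):

```python
# Toy LoRA update in pure Python: the frozen weight W stays untouched,
# and the effective weight is W + (alpha / r) * (B @ A).
# Illustrative only; real training uses the `peft` library on the base model.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), leaving the frozen W unchanged."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x2 weight with rank-1 adapters: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
print(lora_effective_weight(W, A, B, alpha=2, r=1))  # [[2.0, 1.0], [2.0, 3.0]]
```

Because only A and B are trained, the number of trainable parameters stays a small fraction of the full model.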
## Training Strategy

The model was trained with a 3-phase curriculum to improve performance on both common and rare transliteration patterns.
### Phase 1 — Phonetic Learning

Datasets used:
- Phonetic dataset (1M rows)

Goal: learn the general Singlish → Sinhala phonetic mapping.
Example:
amma → අම්මා
gama → ගම
ratak → රටක්
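At its simplest, the Phase 1 objective is to learn a phonetic mapping like the pairs above. A toy dictionary-based sketch built from those examples (`PHONETIC_MAP` is hypothetical; the model learns the mapping at subword level and generalizes to unseen words, unlike a fixed table):

```python
# Toy whole-word lookup built from the Phase 1 examples above.
# Illustrative only: the trained model generalizes beyond a fixed table.
PHONETIC_MAP = {
    "amma": "අම්මා",
    "gama": "ගම",
    "ratak": "රටක්",
}

def transliterate_word(word: str) -> str:
    """Return the Sinhala form if known, otherwise the input unchanged."""
    return PHONETIC_MAP.get(word, word)

print(transliterate_word("amma"))  # අම්මා
```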
### Phase 2 — Adhoc + Code-Mix Learning

Datasets used:
- Adhoc dataset
- Code-mixed Sinhala-English dataset

Goal: handle informal spellings, mixed-language sentences, and real-world Singlish usage.
Example:
mama office ekata yanawa → මම office එකට යනවා
today mama busy → අද මම busy
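The intended code-mix behaviour can be sketched as token-level routing: transliterate tokens recognised as Singlish, and pass other tokens (English words like "office") through unchanged. A toy illustration with a hypothetical `SINGLISH_WORDS` table (the model makes this decision from context, not from a word list):

```python
# Toy code-mix handling: tokens in the hypothetical SINGLISH_WORDS table
# are transliterated; anything else, e.g. English words like "office",
# passes through unchanged. The real model decides this contextually.
SINGLISH_WORDS = {
    "mama": "මම",
    "ekata": "එකට",
    "yanawa": "යනවා",
}

def transliterate_sentence(sentence: str) -> str:
    """Transliterate known Singlish tokens, keep the rest as-is."""
    return " ".join(SINGLISH_WORDS.get(tok, tok) for tok in sentence.split())

print(transliterate_sentence("mama office ekata yanawa"))  # මම office එකට යනවා
```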
### Phase 3 — Rare Conjunct Booster

Datasets used:
- Adjunct dataset
- Replay samples from the phonetic dataset
- Replay samples from the adhoc dataset

Goal: improve accuracy on difficult Sinhala conjunct clusters:
- ඥ
- ක්ෂ
- ශ්ර
- ස්ථ
- මඤ්ඤ
Example:
gnathin → ඥාතින්
kshana → ක්ෂණ
mannyokka → මඤ්ඤොක්කා
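The replay idea in Phase 3 can be sketched as mixing the conjunct booster set with a fraction of samples from the earlier phases, so previously learned patterns are not forgotten. The dataset contents and the 20% replay ratio below are illustrative, not the exact training recipe:

```python
import random

def build_phase3_mix(conjunct, phonetic, adhoc, replay_ratio=0.2, seed=0):
    """Combine the conjunct booster set with replayed samples from the
    earlier phonetic and adhoc phases (illustrative replay ratio)."""
    rng = random.Random(seed)
    n_replay = int(len(conjunct) * replay_ratio)
    mix = list(conjunct)
    mix += rng.sample(phonetic, min(n_replay, len(phonetic)))
    mix += rng.sample(adhoc, min(n_replay, len(adhoc)))
    rng.shuffle(mix)
    return mix

conjunct = [("gnathin", "ඥාතින්"), ("kshana", "ක්ෂණ")] * 5  # 10 pairs
phonetic = [("amma", "අම්මා"), ("gama", "ගම")] * 5          # 10 pairs
adhoc = [("machan", "මචං")] * 5                              # 5 pairs
mix = build_phase3_mix(conjunct, phonetic, adhoc)
print(len(mix))  # 10 conjunct + 2 phonetic replay + 2 adhoc replay = 14
```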
## Datasets Used

Training data consists of four dataset types:

### 1. Phonetic Dataset

Romanized Sinhala → Sinhala script word pairs.
Examples:
amma → අම්මා
ratak → රටක්
gama → ගම
### 2. Adhoc Dataset
Common Singlish spellings used in real communication.
Examples:
machan → මචං
mokadda → මොකද්ද
### 3. Code-Mixed Dataset
Mixed Sinhala + English sentences.
Examples:
mama meeting ekata yanawa → මම meeting එකට යනවා
api project eka finish karamu → අපි project එක finish කරමු
### 4. Adjunct Dataset
Synthetic dataset focused on rare Sinhala conjunct clusters.
## Training Details

| Parameter | Value |
|---|---|
| Model | Gemma |
| Fine-tuning | LoRA |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Learning Rate | 1.5e-4 |
| Scheduler | Cosine |
| Max Length | 256 |
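With batch size 2 and gradient accumulation 8, each optimizer step effectively sees 16 examples, and the cosine scheduler decays the learning rate from the 1.5e-4 peak toward zero. A minimal sketch of both (no warmup shown; the actual schedule may include one):

```python
import math

BATCH_SIZE = 2
GRAD_ACCUM = 8
PEAK_LR = 1.5e-4

# Gradients accumulate over 8 micro-batches before each optimizer step.
effective_batch = BATCH_SIZE * GRAD_ACCUM  # 16 examples per optimizer step

def cosine_lr(step: int, total_steps: int, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 down to 0 at `total_steps`."""
    return 0.5 * peak * (1 + math.cos(math.pi * step / total_steps))

print(effective_batch)        # 16
print(cosine_lr(0, 1000))     # 0.00015 (peak)
print(cosine_lr(1000, 1000))  # 0.0
```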
## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo_id = "Pudamya/small100-singlish-sinhala-3phase-final"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, trust_remote_code=True)

sentences = [
    "kohomada oyata",
    "mama bath kanawa",
    "api heta hamuwemu",
    "mama gnathin hambenna yanawa",
    "eyala ekka mannyokka kanna ymu",
    "kshana",
    "oyt gnanaya naha",
]

for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Singlish:", s)
    print("Sinhala :", result)
    print()
```
Output:

```text
Singlish: kohomada oyata
Sinhala : කොහොමද ඔයාට

Singlish: mama bath kanawa
Sinhala : මම බත් කනවා

Singlish: api heta hamuwemu
Sinhala : අපි හෙට හමුවෙමු

Singlish: mama gnathin hambenna yanawa
Sinhala : මම ඥාතීන් හම්බෙන්න යනවා

Singlish: eyala ekka mannyokka kanna ymu
Sinhala : එයාලා එක්ක මඤ්ඤොක්කා කන්න යමු

Singlish: kshana
Sinhala : ක්ෂණ

Singlish: oyt gnanaya naha
Sinhala : ඔයාට ඥානය නැහැ
```
## Applications

This model can be used for:
- Sinhala input systems
- Chat applications
- Social media text normalization
- Transliteration tools
- NLP preprocessing for Sinhala
## Author

Pudamya Vidusini Rathnayake
Singlish → Sinhala Transliteration Research
## Notes

Training time depends on dataset size and hardware configuration.

The model was trained on large phonetic datasets plus specialized conjunct boosters to improve accuracy on complex Sinhala orthography.