# Small100 — Singlish → Sinhala Transliteration
Fine-tuned version of alirezamsh/small100 for the task of Singlish-to-Sinhala transliteration, developed as part of the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.
This is the merged (LoRA weights absorbed) final model.
## Task
Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.
| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |
## Training Pipeline
Trained using a three-phase curriculum strategy with LoRA applied to all attention projection and feed-forward layers.
### Data
| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | phonetic_train_1M.csv | 1,000,000 samples |
| Adhoc fine-tuning | adhoc.csv | 11,937 samples |
| Phonetic validation | phonetic_test.csv | 10,003 samples |
| Adhoc validation | adhoc_test.csv | 5,003 samples |
### Synthetic Augmentation
Adhoc data was expanded with a rule-based Singlish augmenter simulating natural romanisation variation:

- **Vowel dropping**: randomly drops non-boundary vowels (e.g. `kohomada` → `khmada`)
- **Cluster simplification**: collapses common digraphs (`th` → `t`, `sh` → `s`, `nd` → `n`, etc.)
- **Vowel swapping**: substitutes phonetically similar vowels (`a` ↔ `e`, `i` ↔ `e`, `o` ↔ `u`)

Aggression factor: 0.5, applied at 15% / 20% / 15% augmentation rates across the three phases.
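The rules above can be sketched as follows. This is an illustrative re-implementation, not the project's actual augmenter: the function name, the exact cluster and swap tables, and the 50/50 drop-vs-swap split are all assumptions.

```python
import random

# Illustrative rule tables; the real augmenter's rule sets may differ.
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": "e", "i": "e", "o": "u"}
VOWELS = set("aeiou")

def augment(word: str, aggression: float = 0.5, rng: random.Random = None) -> str:
    """Apply cluster simplification, then vowel dropping/swapping,
    each with probability `aggression` per candidate site."""
    rng = rng or random.Random()
    # Cluster simplification: collapse common digraphs.
    for src, dst in CLUSTERS.items():
        if src in word and rng.random() < aggression:
            word = word.replace(src, dst)
    # Vowel dropping / swapping on non-boundary characters only.
    chars = list(word)
    for i in range(1, len(chars) - 1):
        if chars[i] in VOWELS and rng.random() < aggression:
            chars[i] = "" if rng.random() < 0.5 else VOWEL_SWAPS.get(chars[i], chars[i])
    return "".join(chars)
```

With `aggression=0.0` the input passes through unchanged, which makes the per-phase application rates (15% / 20% / 15%) easy to layer on top as a separate sampling decision.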
### Three-Phase Curriculum
| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
Each phase resumes from the previous phase's LoRA adapter. Early stopping: patience=5, metric=CER.
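The Phase 2 data mix described above (remaining phonetic data, the adhoc set oversampled 5×, plus an 80K replay sample) can be sketched as below. The function name and signature are illustrative, not the project's actual code:

```python
import random

def build_phase2_mix(phonetic_rest, adhoc, phase1_pool,
                     adhoc_mult=5, replay_k=80_000, rng=None):
    """Illustrative Phase 2 mix: remaining phonetic samples, the adhoc
    set repeated `adhoc_mult` times, and a `replay_k`-sized replay
    sample drawn from the Phase 1 pool, shuffled together."""
    rng = rng or random.Random()
    mix = list(phonetic_rest) + list(adhoc) * adhoc_mult
    mix += rng.sample(list(phase1_pool), min(replay_k, len(phase1_pool)))
    rng.shuffle(mix)
    return mix
```

The replay sample is the piece that guards against forgetting the Phase 1 distribution while the later phases shift weight toward the adhoc data.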
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable params | 19,267,584 / 352,003,072 (5.47%) |
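In PEFT terms, the table above corresponds to a `LoraConfig` roughly like the following (a sketch; the argument names follow PEFT's public API, but the exact config used in training is not published here):

```python
from peft import LoraConfig, TaskType

# LoRA setup matching the table above: rank 64, alpha 128, dropout 0.05,
# applied to all attention projections and feed-forward layers.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
)
```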
### Training Arguments
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
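Expressed as `transformers` training arguments, the table maps onto roughly the following (a sketch under the assumption of a standard `Seq2SeqTrainer` setup; `output_dir` and the bf16/fp16 choice are placeholders):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",                # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 32
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    optim="adamw_torch_fused",       # fused AdamW
    bf16=True,                       # or fp16=True on GPUs without bfloat16
    predict_with_generate=True,      # needed for CER/WER validation
)
```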
## Evaluation Results
| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0211 | 0.0970 | 0.7723 | 0.9906 |
| Adhoc | 0.0461 | 0.1653 | 0.6439 | 0.9899 |
BERTScore computed using Ransaka/sinhala-bert-medium-v2.
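CER, the headline and early-stopping metric, is character-level Levenshtein distance divided by reference length. A minimal self-contained sketch (libraries such as `jiwer` or `evaluate` provide the same metric in practice):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)
```

A phonetic-test CER of 0.0211 therefore means roughly 2 character edits per 100 reference characters.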
Among the models evaluated in this shared task, Small100 achieved the best average CER (0.0336).
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/Small100-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

tokenizer.src_lang = "en"
tokenizer.tgt_lang = "si"

inputs = tokenizer("mama giya", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```