# Small100 — Singlish → Sinhala Transliteration
Fine-tuned version of alirezamsh/small100 for the task of Singlish-to-Sinhala transliteration, developed as part of the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.
This is the merged (LoRA weights absorbed) final model.
## Task
Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.
| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |
## Training Pipeline
Trained using a three-phase curriculum strategy with LoRA applied to all attention projection and feed-forward layers.
### Data
| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | phonetic_train_1M.csv | 1,000,000 samples |
| Adhoc fine-tuning | adhoc.csv | 11,937 samples |
| Phonetic validation | phonetic_test.csv | 10,003 samples |
| Adhoc validation | adhoc_test.csv | 5,003 samples |
### Synthetic Augmentation
Adhoc data was expanded with a rule-based Singlish augmenter simulating natural romanisation variation:

- **Vowel dropping**: randomly drops non-boundary vowels (e.g. `kohomada` → `khmada`)
- **Cluster simplification**: collapses common digraphs (`th` → `t`, `sh` → `s`, `nd` → `n`, etc.)
- **Vowel swapping**: substitutes phonetically similar vowels (`a` ↔ `e`, `i` ↔ `e`, `o` ↔ `u`)

Aggression factor: 0.5, applied at 15% / 20% / 15% augmentation rates across the three phases.
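The rules above can be sketched as follows. This is an illustrative re-implementation, not the project's actual augmenter: the function name, the exact cluster and swap tables, and the 50/50 drop-vs-swap split are all assumptions.

```python
import random

# Illustrative rule tables; the real augmenter's rule sets may differ.
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": "e", "i": "e", "o": "u"}
VOWELS = set("aeiou")

def augment(word: str, aggression: float = 0.5, rng: random.Random = None) -> str:
    """Apply cluster simplification, then vowel dropping/swapping,
    each with probability `aggression` per candidate site."""
    rng = rng or random.Random()
    # Cluster simplification: collapse common digraphs.
    for src, dst in CLUSTERS.items():
        if src in word and rng.random() < aggression:
            word = word.replace(src, dst)
    # Vowel dropping / swapping on non-boundary characters only.
    chars = list(word)
    for i in range(1, len(chars) - 1):
        if chars[i] in VOWELS and rng.random() < aggression:
            chars[i] = "" if rng.random() < 0.5 else VOWEL_SWAPS.get(chars[i], chars[i])
    return "".join(chars)
```

With `aggression=0.0` the input passes through unchanged, which makes the per-phase application rates (15% / 20% / 15%) easy to layer on top as a separate sampling decision.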
### Three-Phase Curriculum
| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
Each phase resumes from the previous phase's LoRA adapter. Early stopping: patience=5, metric=CER.
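The Phase 2 data mix described above (remaining phonetic data, the adhoc set oversampled 5×, plus an 80K replay sample) can be sketched as below. The function name and signature are illustrative, not the project's actual code:

```python
import random

def build_phase2_mix(phonetic_rest, adhoc, phase1_pool,
                     adhoc_mult=5, replay_k=80_000, rng=None):
    """Illustrative Phase 2 mix: remaining phonetic samples, the adhoc
    set repeated `adhoc_mult` times, and a `replay_k`-sized replay
    sample drawn from the Phase 1 pool, shuffled together."""
    rng = rng or random.Random()
    mix = list(phonetic_rest) + list(adhoc) * adhoc_mult
    mix += rng.sample(list(phase1_pool), min(replay_k, len(phase1_pool)))
    rng.shuffle(mix)
    return mix
```

The replay sample is the piece that guards against forgetting the Phase 1 distribution while the later phases shift weight toward the adhoc data.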
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable params | 19,267,584 / 352,003,072 (5.47%) |
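In PEFT terms, the table above corresponds to a `LoraConfig` roughly like the following (a sketch; the argument names follow PEFT's public API, but the exact config used in training is not published here):

```python
from peft import LoraConfig, TaskType

# LoRA setup matching the table above: rank 64, alpha 128, dropout 0.05,
# applied to all attention projections and feed-forward layers.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
)
```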
### Training Arguments
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
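Expressed as `transformers` training arguments, the table maps onto roughly the following (a sketch under the assumption of a standard `Seq2SeqTrainer` setup; `output_dir` and the bf16/fp16 choice are placeholders):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",                # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 32
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.03,
    optim="adamw_torch_fused",       # fused AdamW
    bf16=True,                       # or fp16=True on GPUs without bfloat16
    predict_with_generate=True,      # needed for CER/WER validation
)
```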
## Evaluation Results
| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0211 | 0.0970 | 0.7723 | 0.9906 |
| Adhoc | 0.0461 | 0.1653 | 0.6439 | 0.9899 |
BERTScore computed using Ransaka/sinhala-bert-medium-v2.
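CER, the headline and early-stopping metric, is character-level Levenshtein distance divided by reference length. A minimal self-contained sketch (libraries such as `jiwer` or `evaluate` provide the same metric in practice):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)
```

A phonetic-test CER of 0.0211 therefore means roughly 2 character edits per 100 reference characters.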
Among the models evaluated in this shared task, Small100 achieved the best average CER (0.0336).
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/Small100-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

tokenizer.src_lang = "en"
tokenizer.tgt_lang = "si"

inputs = tokenizer("mama giya", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```