# Singlish → Sinhala Transliteration (SMaLL-100, Two-Phase)

This model is a two-phase fine-tuned version of [alirezamsh/small100](https://huggingface.co/alirezamsh/small100) for Singlish-to-Sinhala transliteration.
## Training Strategy

### Phase 1: Phonetic Foundation

Trained on a large phonetic dataset (500k Singlish–Sinhala pairs) to learn stable phonetic mappings.

### Phase 2: Ad-hoc Fine-tuning

Fine-tuned on [deshanksuman/SwaBhasha_Transliteration_Sinhala](https://huggingface.co/datasets/deshanksuman/SwaBhasha_Transliteration_Sinhala) to adapt to real-world, noisy Singlish inputs.
## Evaluation
| Model | BLEU (char) ↑ | CER ↓ | WER ↓ | Exact Match ↑ |
|---|---|---|---|---|
| Phase-1 (Phonetic) | 82.03 | 0.114 | 0.311 | 10.6% |
| Phase-2 (Final, Ad-hoc) | 81.40 | 0.115 | 0.379 | 6.0% |
> **Note:** Phase-1 scores best on the clean benchmark, while Phase-2 trades some benchmark accuracy for robustness on noisy, real-world Singlish inputs.
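The exact evaluation script is not part of this card, but the table's metrics are standard and can be reproduced along these lines: CER is character-level edit distance normalized by reference length, WER is the same computed over whitespace-split tokens, and exact match is the fraction of predictions identical to the reference. A minimal sketch (function names here are illustrative, not the actual pipeline):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via the classic dynamic-programming recurrence.

    Works on strings (character level) or lists of tokens (word level).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    # Character Error Rate: edits / reference length
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # Word Error Rate: same recurrence over whitespace-split tokens
    ref_toks = ref.split()
    return edit_distance(ref_toks, hyp.split()) / max(len(ref_toks), 1)

def exact_match(refs, hyps):
    # Fraction of predictions that match the reference exactly
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

These definitions mean a single wrong character in a four-character word yields CER 0.25 but exact match 0, which is why exact match is much stricter than CER in the table above.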
## How to Use
```python
import torch
from transformers import M2M100ForConditionalGeneration

# SMaLL-100 uses a custom tokenizer; tokenization_small100.py is
# distributed with the original alirezamsh/small100 repository.
from tokenization_small100 import SMALL100Tokenizer

repo_id = "Pudamya/small100-singlish-sinhala-transliteration"

tokenizer = SMALL100Tokenizer.from_pretrained(repo_id)
tokenizer.tgt_lang = "si"  # generate Sinhala script
model = M2M100ForConditionalGeneration.from_pretrained(repo_id)

text = "mama gedara yanawa"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```