Singlish → Sinhala Transliteration (SMaLL-100, Two-Phase)

This model is a two-phase fine-tuned version of alirezamsh/small100 for Singlish-to-Sinhala transliteration.

Training Strategy

Phase 1: Phonetic Foundation

Trained on a large phonetic dataset (500k Singlish–Sinhala pairs) to learn stable phonetic mappings.

Phase 2: Ad-hoc Fine-tuning

Fine-tuned on deshanksuman/SwaBhasha_Transliteration_Sinhala to adapt the model to noisy, real-world Singlish inputs.
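For illustration, the two phases differ mainly in how clean the Singlish side of the training pairs is. The sketch below shows the (Singlish, Sinhala) pair format with made-up examples; these are hypothetical and not drawn from the actual datasets.

```python
# Hypothetical illustration of the (Singlish, Sinhala) pair format;
# the examples are invented, not taken from the real training data.
phase1_pairs = [  # Phase 1: clean, regular phonetic spellings
    ("mama gedara yanawa", "මම ගෙදර යනවා"),
    ("oyaa kohomada", "ඔයා කොහොමද"),
]
phase2_pairs = [  # Phase 2: noisy, abbreviated real-world spellings
    ("mama gedr yanwa", "මම ගෙදර යනවා"),
]

# Both phases train the same seq2seq objective: Singlish in, Sinhala out.
for src, tgt in phase1_pairs + phase2_pairs:
    assert isinstance(src, str) and isinstance(tgt, str)
```

Phase 1 gives the model stable grapheme-to-grapheme mappings; Phase 2 then exposes it to the spelling variation typical of user-generated text.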

Evaluation

| Model | BLEU (char) ↑ | CER ↓ | WER ↓ | Exact Match ↑ |
|---|---|---|---|---|
| Phase-1 (Phonetic) | 82.03 | 0.114 | 0.311 | 10.6% |
| Phase-2 (Final, Ad-hoc) | 81.40 | 0.115 | 0.379 | 6.0% |

Note: Phase-1 performs best on clean benchmark data, while Phase-2 is optimized for robustness on noisy, real-world Singlish inputs.
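For reference, CER and exact match can be computed as below. This is a minimal sketch using a plain Levenshtein edit distance on toy strings; it assumes the standard definitions of these metrics, not the exact evaluation script used for the table above.

```python
# Minimal sketch of two of the metrics above: character error rate (CER)
# and exact match. Toy strings only; not the real test set or script.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def exact_match(refs, hyps) -> float:
    """Fraction of hypotheses that match the reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

refs = ["මම ගෙදර යනවා", "ඔයා කොහොමද"]
hyps = ["මම ගෙදර යනවා", "ඔයා කොහොමෙද"]
print(cer(refs[1], hyps[1]))
print(exact_match(refs, hyps))  # → 0.5
```

A lower CER means fewer character-level edits are needed to turn the model output into the reference, which is why Phase-2's slightly higher CER and WER reflect its trade-off toward noisy-input robustness.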

How to Use

import torch
from transformers import M2M100ForConditionalGeneration

# SMaLL-100 uses a custom tokenizer: download tokenization_small100.py
# from the alirezamsh/small100 repository into your working directory first.
from tokenization_small100 import SMALL100Tokenizer

repo_id = "Pudamya/small100-singlish-sinhala-transliteration"

tokenizer = SMALL100Tokenizer.from_pretrained(repo_id)
tokenizer.tgt_lang = "si"  # target language: Sinhala

device = "cuda" if torch.cuda.is_available() else "cpu"
model = M2M100ForConditionalGeneration.from_pretrained(repo_id).to(device)

text = "mama gedara yanawa"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))