---
language:
- km
tags:
- khmer
- ipa
- phonemization
- seq2seq
- text2text-generation
- encoder-decoder
license: apache-2.0
---

# Khmer IPA

A sequence-to-sequence model that converts Khmer script to IPA (International Phonetic Alphabet) transcriptions. It was trained from scratch using a BERT-based encoder-decoder architecture with character-level tokenizers for both input (Khmer) and output (IPA).

## Model Details

| Property | Value |
|---|---|
| Architecture | `EncoderDecoderModel` (BERT encoder + BERT decoder) |
| Hidden size | 512 |
| Layers (enc + dec) | 6 each |
| Attention heads | 8 |
| Feed-forward size | 1024 |
| Encoder vocab size | 1000 (Khmer characters) |
| Decoder vocab size | 1000 (IPA characters) |
| Max sequence length | 128 |
| Best eval loss | 0.1736 (checkpoint 26000, ~11 epochs) |

## Usage

This model uses **two separate tokenizers**, one for Khmer input and one for IPA output, stored in subfolders of the model repository.

```python
from transformers import EncoderDecoderModel, AutoTokenizer

model = EncoderDecoderModel.from_pretrained("byumatrixlab/khmer-ipa")

# The Khmer (input) and IPA (output) tokenizers live in separate subfolders.
encoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="decoder_tokenizer")

def khmer_to_ipa(text, num_beams=4):
    # Tokenize the Khmer input; anything beyond 128 tokens is truncated.
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )
    # Beam-search decoding of the IPA sequence.
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(khmer_to_ipa("សួស្តី"))        # → suǝsdǝy
print(khmer_to_ipa("ខ្ញុំជាសិស្ស"))  # → kɲomciesəh
```

### Batched inference

```python
def khmer_to_ipa_batch(texts, num_beams=4):
    # Pad to the longest input in the batch so the inputs form one tensor.
    inputs = encoder_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode each output sequence separately, dropping padding and other special tokens.
    return [decoder_tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```

## Training Data and Repository

Developed by the BYU MATRIX Lab.

Training code and data processing scripts: [MekongPhon](https://github.com/byu-matrix-lab/MekongPhon)
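
### Handling inputs longer than 128 tokens

Both helpers in the Usage section truncate input at 128 tokens, so anything past that limit is dropped without warning. One way to handle longer text is to split it before transcription. The following is a minimal sketch, not part of the released code: `split_for_model` and `MAX_CHARS` are hypothetical names, and the 126-character budget assumes the character-level tokenizer produces about one token per character plus two special tokens, which has not been verified against the actual tokenizer.

```python
import re

# Assumption: roughly one token per character plus two special tokens,
# so 126 characters should fit within the 128-token limit.
MAX_CHARS = 126

def split_for_model(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters.

    Prefers to break after whitespace or the Khmer sentence mark (U+17D4)
    and falls back to a hard cut when a single segment is still too long.
    """
    # Zero-width split: each delimiter stays attached to the segment before it.
    segments = [s for s in re.split(r"(?<=[\s\u17d4])", text) if s]
    chunks, current = [], ""
    for seg in segments:
        # Hard-cut any single segment that exceeds the budget on its own.
        while len(seg) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(seg[:max_chars])
            seg = seg[max_chars:]
        if len(current) + len(seg) > max_chars:
            # Adding this segment would overflow, so start a new chunk.
            chunks.append(current)
            current = seg
        else:
            current += seg
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `khmer_to_ipa`, or the whole list to `khmer_to_ipa_batch`. Because the model sees each chunk in isolation, transcriptions near chunk boundaries may differ from what it would produce with the full context.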