---
language:
- km
tags:
- khmer
- ipa
- phonemization
- seq2seq
- text2text-generation
- encoder-decoder
license: apache-2.0
---

# Khmer IPA

A sequence-to-sequence model that converts Khmer script to IPA (International Phonetic Alphabet) transcriptions.
It was trained from scratch using a BERT-based encoder-decoder architecture with character-level tokenizers for both the input (Khmer) and the output (IPA).
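
To make "character-level" concrete, here is a minimal sketch of such a tokenizer in plain Python. It is not the tokenizer shipped with this model: the `CharTokenizer` class, its special tokens, and its vocabulary order are assumptions chosen for illustration. The point is only that every Unicode code point, including Khmer subscript signs and dependent vowels, becomes its own token.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer, for illustration only."""

    # Assumed BERT-style special tokens; the real tokenizers may differ.
    SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]

    def __init__(self, corpus):
        # One vocabulary entry per distinct code point seen in the corpus.
        chars = sorted({ch for text in corpus for ch in text})
        self.id_to_token = self.SPECIALS + chars
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        unk = self.token_to_id["[UNK]"]
        ids = [self.token_to_id.get(ch, unk) for ch in text]
        # Wrap the sequence in [CLS] ... [SEP], as a BERT encoder expects.
        return [self.token_to_id["[CLS]"]] + ids + [self.token_to_id["[SEP]"]]

    def decode(self, ids):
        # Drop special tokens and join the remaining characters.
        return "".join(self.id_to_token[i] for i in ids if i >= len(self.SPECIALS))


word = "សួស្តី"
tokenizer = CharTokenizer([word])
ids = tokenizer.encode(word)

# The encoded length is the number of code points plus the two special tokens.
print(len(word), len(ids))
print(tokenizer.decode(ids) == word)
```

Because sequence length grows with the number of code points rather than the number of words, the 128-token limit is reached sooner than it would be with a subword tokenizer.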

## Model Details

| Property | Value |
|---|---|
| Architecture | `EncoderDecoderModel` (BERT encoder + BERT decoder) |
| Hidden size | 512 |
| Layers (enc + dec) | 6 each |
| Attention heads | 8 |
| Feed-forward size | 1024 |
| Encoder vocab size | 1000 (Khmer characters) |
| Decoder vocab size | 1000 (IPA characters) |
| Max sequence length | 128 |
| Best eval loss | 0.1736 (checkpoint 26000, ~11 epochs) |
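
For readers who want to reproduce the architecture, the table can be expressed as a `transformers` configuration roughly as follows. This is a sketch reconstructed from the values above, not the configuration file shipped with the checkpoint; in particular, setting `max_position_embeddings` to 128 is an assumption based on the listed maximum sequence length, and all other `BertConfig` fields are left at their library defaults.

```python
from transformers import BertConfig, EncoderDecoderConfig

# Hyperparameters shared by the encoder and the decoder, taken from the table.
shared = dict(
    vocab_size=1000,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=1024,
    max_position_embeddings=128,  # assumption: equals the max sequence length
)

encoder_config = BertConfig(**shared)
decoder_config = BertConfig(**shared)

# This helper marks the decoder config as a decoder with cross-attention.
config = EncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

print(config.encoder.hidden_size, config.decoder.add_cross_attention)
```

Passing this object to `EncoderDecoderModel(config=config)` would build a randomly initialised model of the same shape, which is the starting point for training from scratch.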

## Usage

This model uses **two separate tokenizers**, one for Khmer input and one for IPA output, stored in the `encoder_tokenizer` and `decoder_tokenizer` subfolders of the repository.

```python
from transformers import EncoderDecoderModel, AutoTokenizer

# Load the model and both tokenizers from the same repository.
model = EncoderDecoderModel.from_pretrained("byumatrixlab/khmer-ipa")
encoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="decoder_tokenizer")

def khmer_to_ipa(text, num_beams=4):
    # Tokenize the Khmer input; anything beyond 128 tokens is truncated.
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )
    # Decode with beam search.
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Convert the generated IDs back to an IPA string.
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(khmer_to_ipa("សួស្តី"))
# → suǝsdǝy

print(khmer_to_ipa("ខ្ញុំជាសិស្ស"))
# → kɲomciesəh
```

### Batched inference

```python
def khmer_to_ipa_batch(texts, num_beams=4):
    # Pad to the longest input in the batch so the tensors are rectangular.
    inputs = encoder_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode each generated sequence separately.
    return [decoder_tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```
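
Both functions truncate input at 128 tokens, so longer text loses its tail. One way around this is to split the text into pieces before transcription. The helper below is a hypothetical sketch, not part of the released code: `split_khmer_text` and `MAX_CHARS` are names introduced here, and the 126-character budget assumes one token per character plus two special tokens.

```python
import re

# Assumption: one token per character plus two special tokens,
# so 126 characters fit into the 128-token limit.
MAX_CHARS = 126

def split_khmer_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters."""
    chunks = []
    # Split on whitespace and zero-width spaces (U+200B), which Khmer text
    # commonly uses between phrases and words.
    for piece in re.split(r"[\s\u200b]+", text):
        # Hard-split pieces that are still too long. This can cut a
        # syllable in half, so treat it as a last resort.
        for start in range(0, len(piece), max_chars):
            chunks.append(piece[start:start + max_chars])
    return chunks

# A 300-character run with no delimiters is cut into three chunks.
print([len(chunk) for chunk in split_khmer_text("ក" * 300)])
# → [126, 126, 48]
```

The chunks can then be passed to `khmer_to_ipa_batch` and the results joined, for example with `" ".join(khmer_to_ipa_batch(split_khmer_text(long_text)))`. Whether transcribing pieces in isolation matches the model's output on the full text has not been verified here.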

## Training Data and Repository

Developed by the BYU MATRIX Lab.

Training code and data processing scripts: [MekongPhon](https://github.com/byu-matrix-lab/MekongPhon)