---
language:
- km
tags:
- khmer
- ipa
- phonemization
- seq2seq
- text2text-generation
- encoder-decoder
license: apache-2.0
---
# Khmer IPA
A sequence-to-sequence model that converts Khmer script to IPA (International Phonetic Alphabet) transcriptions.
The model was trained from scratch with a BERT-based encoder-decoder architecture and uses character-level tokenizers for both the input (Khmer) and the output (IPA).
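
To make "character-level" concrete, the sketch below splits a Khmer word into Unicode code points using only the Python standard library. It assumes the tokenizers treat each code point as one token, which this card does not confirm. The helper name `to_code_points` is hypothetical and is not the shipped tokenizer.

```python
import unicodedata


def to_code_points(text):
    """Return (character, Unicode name) pairs, one per code point."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text]


# The greeting "សួស្តី", written with escapes so the code points are unambiguous.
word = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"

for ch, name in to_code_points(word):
    print(f"U+{ord(ch):04X}  {name}")
```

Under that assumption, this word would occupy six positions of the input sequence, before any special tokens the tokenizer adds.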
## Model Details
| Property | Value |
|---|---|
| Architecture | `EncoderDecoderModel` (BERT encoder + BERT decoder) |
| Hidden size | 512 |
| Layers (enc + dec) | 6 each |
| Attention heads | 8 |
| Feed-forward size | 1024 |
| Encoder vocab size | 1000 (Khmer characters) |
| Decoder vocab size | 1000 (IPA characters) |
| Max sequence length | 128 |
| Best eval loss | 0.1736 (checkpoint 26000, ~11 epochs) |
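
The table rows correspond closely to keyword arguments of `transformers.BertConfig`. The sketch below records the values in a small dataclass and derives those keyword arguments. `KhmerIpaHParams` and `to_bert_kwargs` are hypothetical names, and mapping the maximum sequence length to `max_position_embeddings` is an assumption rather than something this card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KhmerIpaHParams:
    """Hyperparameters copied from the Model Details table."""

    hidden_size: int = 512
    num_layers: int = 6  # same depth for encoder and decoder
    num_heads: int = 8
    feed_forward_size: int = 1024
    encoder_vocab_size: int = 1000
    decoder_vocab_size: int = 1000
    max_length: int = 128

    @property
    def head_dim(self):
        # The hidden size must divide evenly across the attention heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def to_bert_kwargs(self, vocab_size):
        """Keyword arguments using the parameter names of BertConfig."""
        return {
            "vocab_size": vocab_size,
            "hidden_size": self.hidden_size,
            "num_hidden_layers": self.num_layers,
            "num_attention_heads": self.num_heads,
            "intermediate_size": self.feed_forward_size,
            "max_position_embeddings": self.max_length,  # assumption
        }


hp = KhmerIpaHParams()
encoder_kwargs = hp.to_bert_kwargs(hp.encoder_vocab_size)
decoder_kwargs = hp.to_bert_kwargs(hp.decoder_vocab_size)
print(hp.head_dim)  # per-head dimension implied by the table
```

With `transformers` installed, `BertConfig(**encoder_kwargs)` and `BertConfig(**decoder_kwargs)` can be combined through `EncoderDecoderConfig.from_encoder_decoder_configs` to describe a randomly initialised model of the same shape. The published weights should still be loaded with `from_pretrained`.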
## Usage
This model uses **two separate tokenizers**, one for Khmer input and one for IPA output, stored in the `encoder_tokenizer` and `decoder_tokenizer` subfolders of the repository.
```python
from transformers import EncoderDecoderModel, AutoTokenizer
model = EncoderDecoderModel.from_pretrained("byumatrixlab/khmer-ipa")
encoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("byumatrixlab/khmer-ipa", subfolder="decoder_tokenizer")
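def khmer_to_ipa(text, num_beams=4):
    """Transcribe a single Khmer string to IPA using beam search."""
    # Tokenize the Khmer input; anything beyond 128 tokens is truncated.
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )
    # Generate IPA token ids with the encoder-decoder model.
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode with the IPA tokenizer, dropping special tokens.
    return decoder_tokenizer.decode(output_ids[0], skip_special_tokens=True)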
print(khmer_to_ipa("សួស្តី"))
# → suǝsdǝy
print(khmer_to_ipa("ខ្ញុំជាសិស្ស"))
# → kɲomciesəh
```
### Batched inference
```python
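def khmer_to_ipa_batch(texts, num_beams=4):
    """Transcribe a list of Khmer strings to IPA in one generate call."""
    # Pad to the longest item so the batch forms a rectangular tensor.
    inputs = encoder_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )
    output_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode each sequence separately with the IPA tokenizer.
    return [decoder_tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]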
```
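
Because inputs are truncated at 128 tokens, longer passages lose everything past the limit. One possible workaround, sketched below, is to split the text into shorter chunks before calling `khmer_to_ipa_batch`. The helper `split_for_model` is hypothetical. The 100-character budget and the choice of break characters (space, zero-width space, and the Khmer full stop `។`) are assumptions, not settings taken from the training setup.

```python
# Characters after which a chunk may end:
# space, zero-width space, and KHMER SIGN KHAN (the Khmer full stop).
BREAK_CHARS = {" ", "\u200b", "\u17d4"}


def split_for_model(text, max_chars=100):
    """Split text into chunks of at most max_chars characters.

    Prefers to cut just after the last break character inside the window
    and falls back to a hard cut when the window contains none.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Search backwards for a break character inside the window.
            for i in range(end - 1, start, -1):
                if text[i] in BREAK_CHARS:
                    end = i + 1
                    break
        chunks.append(text[start:end])
        start = end
    return chunks
```

The resulting list can be passed directly to `khmer_to_ipa_batch`, and the outputs concatenated afterwards. A hard cut may fall inside a consonant cluster, so transcriptions near chunk boundaries deserve a manual check.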
## Training Data and Repository
Developed by the BYU MATRIX Lab.
Training code and data processing scripts: [MekongPhon](https://github.com/byu-matrix-lab/MekongPhon)