---
tags:
- khmer
- nlp
- punctuation-restoration
- inverse-text-normalization
- asr
- xlm-roberta
license: mit
language:
- km
---

# KhmerTagger: Inverse Text Normalization for Khmer ASR

KhmerTagger is a model for inverse text normalization (ITN) of Khmer Automatic Speech Recognition (ASR) output. It restores punctuation and recognizes numbers to make raw ASR text more readable.

## Model Description

The model uses XLM-RoBERTa as the encoder, followed by a bidirectional LSTM layer and two classification heads:

- **Punctuation head**: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
- **Number head**: Identifies and tags numeric entities in the text

## Usage

```python
from transformers import XLMRobertaTokenizer
import torch
from model import KhmerTagger  # local module that provides the KhmerTagger class

# Load tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

# Load model: 5 punctuation classes and 3 number classes
model = KhmerTagger(n_punct_features=5, n_num_features=3)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()  # switch to inference mode

# Your inference code here...
```

## Training

The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.

## Citation

```bibtex
@misc{khmertagger2025,
  author = {Seanghay Yath},
  title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/seanghay/khmertagger}},
  note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
}
```
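
## Example: Applying Punctuation Labels

The punctuation head described under Model Description predicts one class per token, and those classes still have to be turned back into readable text. The snippet below is a minimal sketch of that last step, not the project's actual inference code. The label table `PUNCT_LABELS` and the helper `apply_punctuation` are hypothetical names introduced here. The id-to-mark order is an assumption, chosen only so that it has five classes like `n_punct_features=5`.

```python
# Hypothetical label table: the real id-to-mark order used by KhmerTagger is
# not documented in this README, so treat this mapping as an assumption.
PUNCT_LABELS = {
    0: "",   # no punctuation after the token
    1: " ",  # space
    2: ",",  # comma
    3: "?",  # question mark
    4: "!",  # exclamation mark
}


def apply_punctuation(tokens, label_ids, labels=PUNCT_LABELS):
    """Append the predicted mark after each token and join the pieces."""
    if len(tokens) != len(label_ids):
        raise ValueError("tokens and label_ids must have the same length")
    return "".join(token + labels[label] for token, label in zip(tokens, label_ids))


# Label ids as they might come from an argmax over the punctuation head's logits.
tokens = ["សួស្តី", "អ្នក", "សុខសប្បាយ", "ទេ"]
label_ids = [1, 0, 0, 3]
restored = apply_punctuation(tokens, label_ids)
print(restored)
```

In practice the label ids would be computed per subword token, so a real pipeline would also need to merge subwords back into words before or after this step.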