---
tags:
- khmer
- nlp
- punctuation-restoration
- inverse-text-normalization
- asr
- xlm-roberta
license: mit
language:
- km
---


# KhmerTagger: Inverse Text Normalization for Khmer ASR

KhmerTagger is a model for inverse text normalization (ITN) of Khmer Automatic Speech Recognition (ASR) outputs. It performs punctuation restoration and number recognition to improve readability of raw ASR text.

## Model Description

The model uses XLM-RoBERTa as the encoder, followed by a bidirectional LSTM layer and two token-level classification heads:
- **Punctuation head**: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
- **Number head**: Identifies and tags numeric entities in the text
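
The architecture described above can be sketched roughly as follows. This is a minimal illustration, not the released `KhmerTagger` implementation: the class name `TaggerSketch`, the layer sizes, and the use of a small `nn.Embedding` as a stand-in for the XLM-RoBERTa encoder are all assumptions made so the example runs without downloading weights. Only the two head sizes (5 and 3) are taken from the usage example in this card.

```python
import torch
from torch import nn


class TaggerSketch(nn.Module):
    """Illustrative encoder + BiLSTM + two-head tagger (hypothetical, not the real model)."""

    def __init__(self, vocab_size=1000, hidden_size=64, lstm_size=32,
                 n_punct_features=5, n_num_features=3):
        super().__init__()
        # Assumption: a small embedding stands in for the XLM-RoBERTa encoder.
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        # Bidirectional LSTM over the encoder outputs.
        self.lstm = nn.LSTM(hidden_size, lstm_size,
                            batch_first=True, bidirectional=True)
        # Two independent per-token classification heads.
        self.punct_head = nn.Linear(2 * lstm_size, n_punct_features)
        self.num_head = nn.Linear(2 * lstm_size, n_num_features)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)   # (batch, seq, hidden_size)
        context, _ = self.lstm(hidden)     # (batch, seq, 2 * lstm_size)
        return self.punct_head(context), self.num_head(context)


model = TaggerSketch()
model.eval()

# Dummy batch: 2 sequences of 7 token ids each.
input_ids = torch.randint(0, 1000, (2, 7))
with torch.no_grad():
    punct_logits, num_logits = model(input_ids)

print(punct_logits.shape, num_logits.shape)
```

Each head emits one logit vector per input token, so the punctuation and number tags can be read off with an `argmax` over the last dimension.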

## Usage

```python
from transformers import XLMRobertaTokenizer
import torch
from model import KhmerTagger

# Load tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

# Load model
model = KhmerTagger(n_punct_features=5, n_num_features=3)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()

# Your inference code here...
```
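
Once per-token punctuation labels are available, turning them back into readable text is a simple post-processing step. The sketch below shows one way this might look; the label map `PUNCT_LABELS`, the helper `apply_punctuation`, and the sample tokens are hypothetical and chosen only for illustration. The actual id-to-mark mapping depends on how the model was trained and is not specified in this card.

```python
# Hypothetical label map: the real id-to-mark mapping is defined by the trained model.
PUNCT_LABELS = {0: "", 1: " ", 2: ",", 3: "?", 4: "!"}


def apply_punctuation(tokens, label_ids):
    """Append the predicted mark (possibly empty) after each token and join the result."""
    if len(tokens) != len(label_ids):
        raise ValueError("tokens and label_ids must have the same length")
    return "".join(token + PUNCT_LABELS[label_id]
                   for token, label_id in zip(tokens, label_ids))


# Illustrative input: unsegmented word pieces with made-up predicted labels.
tokens = ["សួស្តី", "អ្នក", "សុខសប្បាយ", "ទេ"]
label_ids = [1, 0, 0, 3]

restored = apply_punctuation(tokens, label_ids)
print(restored)
```

In practice the label ids would come from an `argmax` over the punctuation head's logits, after mapping subword predictions back to whole tokens.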

## Training

The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.

## Citation

```bibtex
@misc{khmertagger2025,
  author = {Seanghay Yath},
  title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/seanghay/khmertagger}},
  note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
}
```