Commit 67ac7d1 by Panavath · verified · 1 parent: d116022

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +58 -0
  2. config.json +29 -0
  3. pytorch_model.bin +3 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ tags:
+ - khmer
+ - nlp
+ - punctuation-restoration
+ - inverse-text-normalization
+ - asr
+ - xlm-roberta
+ license: mit
+ language:
+ - km
+ ---
+
+ # KhmerTagger: Inverse Text Normalization for Khmer ASR
+
+ KhmerTagger is a model for inverse text normalization (ITN) of Khmer automatic speech recognition (ASR) output. It performs punctuation restoration and number recognition to improve the readability of raw ASR text.
+
+ ## Model Description
+
+ The model uses XLM-RoBERTa as the encoder, followed by a bidirectional LSTM layer and two classification heads:
+ - **Punctuation head**: Predicts punctuation marks (space, question mark, exclamation mark, Khmer full stop ។)
+ - **Number head**: Identifies and tags numeric entities in the text
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import XLMRobertaTokenizer
+ from model import KhmerTagger  # custom class from a local model.py, not part of transformers
+
+ # Load tokenizer
+ tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
+
+ # Load model
+ model = KhmerTagger(n_punct_features=5, n_num_features=3)  # tag counts match config.json
+ model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
+ model.eval()
+
+ # Your inference code here...
+ ```
+
+ ## Training
+
+ The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.
+
+ ## Citation
+
+ ```bibtex
+ @misc{khmertagger2025,
+   author = {Seanghay Yath},
+   title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
+   year = {2025},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/seanghay/khmertagger}},
+   note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
+ }
+ ```
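Note on the architecture described in README.md: the encoder, BiLSTM and two heads could be wired together roughly as below. This is a minimal sketch, not the code behind `KhmerTagger`. The class names `TaggerSketch` and `ToyEncoder` are hypothetical, the encoder width of 768 is assumed from `xlm-roberta-base`, and the LSTM and head sizes are taken from config.json. The toy encoder stands in for XLM-RoBERTa so the example runs without downloading weights.

```python
import torch
from torch import nn


class TaggerSketch(nn.Module):
    """Encoder -> BiLSTM -> two per-token classification heads (illustrative only)."""

    def __init__(self, encoder, encoder_dim=768, lstm_hidden=1024,
                 n_punct_features=5, n_num_features=3):
        super().__init__()
        self.encoder = encoder  # any module returning (batch, seq, encoder_dim)
        self.lstm = nn.LSTM(encoder_dim, lstm_hidden, num_layers=1,
                            bidirectional=True, batch_first=True)
        # A bidirectional LSTM concatenates both directions: 2 * lstm_hidden.
        self.punct_head = nn.Linear(2 * lstm_hidden, n_punct_features)
        self.num_head = nn.Linear(2 * lstm_hidden, n_num_features)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)
        hidden, _ = self.lstm(hidden)
        # One logit vector per token from each head.
        return self.punct_head(hidden), self.num_head(hidden)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for XLM-RoBERTa: a plain embedding lookup."""

    def __init__(self, vocab_size=100, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)
        # Zero out padded positions.
        return hidden * attention_mask.unsqueeze(-1).to(hidden.dtype)


model = TaggerSketch(ToyEncoder())
input_ids = torch.randint(0, 100, (2, 7))
attention_mask = torch.ones(2, 7, dtype=torch.long)
punct_logits, num_logits = model(input_ids, attention_mask)
print(punct_logits.shape, num_logits.shape)
```

With the real encoder, `ToyEncoder` would be replaced by a thin wrapper that returns the `last_hidden_state` of a `transformers` `XLMRobertaModel`. Whether the released weights load into a layout like this is not confirmed by the commit.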
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "model_type": "khmer_tagger",
+   "base_model": "FacebookAI/xlm-roberta-base",
+   "architecture": {
+     "encoder": "XLM-RoBERTa",
+     "lstm": {
+       "hidden_size": 1024,
+       "num_layers": 1,
+       "bidirectional": true
+     },
+     "num_punct_features": 5,
+     "num_num_features": 3
+   },
+   "tags": {
+     "punctuation": [
+       "0",
+       "!",
+       "?",
+       "SPACE",
+       "។"
+     ],
+     "numbers": [
+       "0",
+       "NUMBER_B",
+       "NUMBER_I"
+     ]
+   },
+   "sequence_length": 256
+ }
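Note on the `tags` lists in config.json: the commit does not show how predicted class indices are turned back into text, so the following is only a sketch under stated assumptions. It assumes that a class index is the position of the tag in its list, that `"0"` means "no label", that a punctuation tag is emitted after the token it is predicted for, and that `NUMBER_B` / `NUMBER_I` mark the beginning and continuation of a number span. The function names are hypothetical.

```python
# Tag lists copied from config.json.
PUNCT_TAGS = ["0", "!", "?", "SPACE", "។"]
NUMBER_TAGS = ["0", "NUMBER_B", "NUMBER_I"]


def apply_punctuation(tokens, punct_ids):
    """Append the predicted mark after each token (assumed convention)."""
    out = []
    for token, idx in zip(tokens, punct_ids):
        tag = PUNCT_TAGS[idx]
        out.append(token)
        if tag == "SPACE":
            out.append(" ")
        elif tag != "0":
            out.append(tag)
    return "".join(out)


def number_spans(num_ids):
    """Return (start, end) token ranges, end exclusive, for each number span."""
    spans, start = [], None
    for i, idx in enumerate(num_ids):
        tag = NUMBER_TAGS[idx]
        if tag == "NUMBER_B":
            if start is not None:
                spans.append((start, i))  # a new B closes the previous span
            start = i
        elif tag == "NUMBER_I":
            if start is None:
                start = i  # tolerate an I tag without a preceding B
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(num_ids)))
    return spans


print(apply_punctuation(["a", "b", "c"], [0, 3, 2]))  # prints "ab c?"
print(number_spans([0, 1, 2, 0, 1]))                  # prints [(1, 3), (4, 5)]
```

ASCII placeholder tokens are used here only to keep the example easy to read; real inputs would be Khmer subword tokens from the tokenizer.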
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ae9610c1e1d25b538979e98c705122b9a4dd90d2908dbc5f92d1ced37e9860b
+ size 1179470814
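Note on the pointer file: what is committed for pytorch_model.bin is a Git LFS pointer, which records the SHA-256 digest and byte size of the real weights file. A downloaded copy can be checked against it with the standard library alone. The sketch below is one way to do that; the function names and the local path in the usage comment are hypothetical.

```python
import hashlib
import os

# Pointer contents copied from this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:9ae9610c1e1d25b538979e98c705122b9a4dd90d2908dbc5f92d1ced37e9860b
size 1179470814
"""


def parse_lfs_pointer(text):
    """Return (hash_algorithm, hex_digest, size_in_bytes) from a pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return algo, digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 digest against an LFS pointer."""
    algo, digest, size = parse_lfs_pointer(pointer_text)
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    if os.path.getsize(path) != size:
        return False  # cheap check first, avoids hashing a truncated download
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a file of over 1 GB is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == digest


# Usage, assuming the weights were downloaded to the working directory:
# matches_pointer("pytorch_model.bin", POINTER)
```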