Upload folder using huggingface_hub

Files changed (3)

README.md +58 -0
config.json +29 -0
pytorch_model.bin +3 -0

README.md ADDED

	@@ -0,0 +1,58 @@

+---
+tags:
+- khmer
+- nlp
+- punctuation-restoration
+- inverse-text-normalization
+- asr
+- xlm-roberta
+license: mit
+language:
+- km
+---
+# KhmerTagger: Inverse Text Normalization for Khmer ASR
+KhmerTagger is a model for inverse text normalization (ITN) of Khmer automatic speech recognition (ASR) output. It restores punctuation and recognizes numbers to make raw ASR text more readable.
+## Model Description
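+The model uses XLM-RoBERTa (`FacebookAI/xlm-roberta-base`) as the encoder, followed by a single bidirectional LSTM layer (hidden size 1024) and two classification heads:
+- **Punctuation head**: Predicts one of five tags per token: none (`0`), exclamation mark (`!`), question mark (`?`), space (`SPACE`), or the Khmer full stop (`។`)
+- **Number head**: Tags each token as outside a number (`0`), the beginning of a number (`NUMBER_B`), or the continuation of a number (`NUMBER_I`)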
+## Usage
+```python
+from transformers import XLMRobertaTokenizer
+import torch
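+from model import KhmerTagger  # assumes model.py from the KhmerTagger source repository is on the import path
+# Load tokenizer
+tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
+# Load model (5 punctuation tags and 3 number tags, matching config.json)
+model = KhmerTagger(n_punct_features=5, n_num_features=3)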
+model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
+model.eval()
+# Your inference code here...
+```
+## Training
+The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.
+## Citation
+```bibtex
+@misc{khmertagger2025,
+  author = {Seanghay Yath},
+  title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
+  year = {2025},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/seanghay/khmertagger}},
+  note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
+}
+```
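
Note on the README's usage snippet: it stops before any post-processing. The following is a minimal sketch of how per-token punctuation labels might be turned back into text. It assumes one label per token, that each label describes what follows its token, that `0` means "no mark", and that tokens are otherwise joined without spaces (as is usual for Khmer script). The function name `apply_punctuation` and the sample tokens are hypothetical and are not taken from the KhmerTagger repository.

```python
def apply_punctuation(tokens, labels):
    """Join tokens, emitting the mark predicted after each one.

    Assumption: labels use the tag names from this commit's config.json
    ("0", "!", "?", "SPACE", "។"), one label per token.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pieces = []
    for token, label in zip(tokens, labels):
        pieces.append(token)
        if label == "SPACE":
            pieces.append(" ")   # SPACE tag -> literal space
        elif label != "0":
            pieces.append(label)  # "!", "?" or "។" are emitted as-is
    return "".join(pieces)


# Hypothetical example input: four tokens and their predicted tags.
tokens = ["សួស្តី", "អ្នក", "សុខសប្បាយ", "ទេ"]
labels = ["SPACE", "0", "0", "?"]
restored = apply_punctuation(tokens, labels)
print(restored)
```

Whether the actual model attaches each label to the preceding or the following token is not stated in the README, so the direction used here should be checked against the source repository.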

config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "model_type": "khmer_tagger",
+  "base_model": "FacebookAI/xlm-roberta-base",
+  "architecture": {
+    "encoder": "XLM-RoBERTa",
+    "lstm": {
+      "hidden_size": 1024,
+      "num_layers": 1,
+      "bidirectional": true
+    },
+    "num_punct_features": 5,
+    "num_num_features": 3
+  },
+  "tags": {
+    "punctuation": [
+      "0",
+      "!",
+      "?",
+      "SPACE",
+      "។"
+    ],
+    "numbers": [
+      "0",
+      "NUMBER_B",
+      "NUMBER_I"
+    ]
+  },
+  "sequence_length": 256
+}
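
Note on the `tags` block: the two lists presumably map class indices to labels, in list order. A minimal sketch of that decoding step follows, assuming the index order matches the heads' output classes and that `NUMBER_B` / `NUMBER_I` follow a begin/inside (BIO-style) scheme. The helper names `decode` and `number_spans` are hypothetical, and the relevant part of the config is inlined so the snippet is self-contained.

```python
import json

# Relevant subset of config.json from this commit, inlined for the example.
CONFIG_JSON = """
{
  "tags": {
    "punctuation": ["0", "!", "?", "SPACE", "។"],
    "numbers": ["0", "NUMBER_B", "NUMBER_I"]
  },
  "sequence_length": 256
}
"""
config = json.loads(CONFIG_JSON)
PUNCT_TAGS = config["tags"]["punctuation"]
NUM_TAGS = config["tags"]["numbers"]


def decode(class_ids, tags):
    """Map predicted class indices to tag names (assumes list order = class order)."""
    return [tags[i] for i in class_ids]


def number_spans(labels):
    """Return (start, end) pairs, end exclusive, for NUMBER_B/NUMBER_I runs.

    Assumption: NUMBER_B opens a span and NUMBER_I continues it; a stray
    NUMBER_I with no open span is ignored.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "NUMBER_B":
            if start is not None:
                spans.append((start, i))  # back-to-back numbers
            start = i
        elif label == "NUMBER_I" and start is not None:
            continue
        else:
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Hypothetical argmax outputs of the number head for six tokens.
num_labels = decode([0, 1, 2, 2, 0, 1], NUM_TAGS)
spans = number_spans(num_labels)
print(num_labels, spans)
```

The resulting spans mark which token ranges a downstream step would convert from spelled-out numbers to digits; that conversion is outside the scope of this sketch.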

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9ae9610c1e1d25b538979e98c705122b9a4dd90d2908dbc5f92d1ced37e9860b
+size 1179470814
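
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. After downloading the real `pytorch_model.bin`, its size and SHA-256 digest can be checked against the pointer. A minimal sketch using only the standard library is shown here; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib

# Pointer contents copied from this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9ae9610c1e1d25b538979e98c705122b9a4dd90d2908dbc5f92d1ced37e9860b
size 1179470814
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so a file of over 1 GB is never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` before `torch.load` is a cheap way to catch a truncated download or an un-fetched pointer file.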