Msok99 committed on
Commit b85bc9f (verified) · 1 Parent(s): 65b6773

Create README.md

Files changed (1)
  1. README.md +67 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ language: ["khm"]
+ license: mit
+ tags: ["tokenizer", "khmer", "Unigram", "general-purpose", "sentencepiece"]
+ ---
+
+ # 🇰🇭 KM Improved 22K Tokenizer
+
+ A **general-purpose Khmer tokenizer** optimized for both accuracy and speed.
+ It provides a stable backbone for Khmer NLP applications such as classification,
+ question answering, translation, and summarization.
+
+ ---
+
+ ## 🧠 Model Details
+
+ ### Model Description
+ - **Developer:** Sok Meas (@Msok99)
+ - **Model type:** SentencePiece Unigram Tokenizer
+ - **Language:** Khmer (khm)
+ - **License:** MIT
+ - **Finetuned from:** None (trained from scratch)
+
+ ### Model Sources
+ - **Repository:** [https://huggingface.co/Msok99/km-improved-22k](https://huggingface.co/Msok99/km-improved-22k)
+
+ ---
+
+ ## ⚙️ Uses
+
+ ### Direct Use
+ - Tokenizing Khmer text for downstream NLP models
+ - Preparing training data for transformer-based fine-tuning
+ - Segmenting sentences for analysis or embedding generation
+
+ ### Downstream Use
+ - Integration into Khmer LLMs or chatbots
+ - Pre- and post-processing for summarization or translation systems
+
+ ### Out-of-Scope Use
+ - Not designed for English or heavily mixed Khmer–English content
+ - Not an inference or generation model itself
+
+ ---
+
+ ## ⚖️ Bias, Risks & Limitations
+ - Very long or compound words may still split into several sub-tokens
+ - Limited exposure to informal slang or non-standard Khmer orthography
+
+ ### Recommendations
+ For code-switched text (Khmer + English), use the merged model
+ [`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
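
Reviewer sketch (not part of this commit): the recommendation above implies choosing between two tokenizer repositories depending on how much English a text contains. One possible way to automate that choice is to measure the share of Latin letters among Latin and Khmer characters. The helper names `latin_share` and `pick_tokenizer_repo` and the 0.1 cutoff are hypothetical, chosen only for illustration; the README does not define any threshold.

```python
# Repository IDs are taken from the README; everything else is an assumption.
KHMER_REPO = "Msok99/km-improved-22k"
MIXED_REPO = "Msok99/lfm2-khmer-merged-18k"


def latin_share(text: str) -> float:
    """Return the fraction of Latin letters among Latin + Khmer characters."""
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # U+1780..U+17FF is the main Khmer Unicode block.
    khmer = sum(1 for ch in text if "\u1780" <= ch <= "\u17ff")
    total = latin + khmer
    return latin / total if total else 0.0


def pick_tokenizer_repo(text: str, max_latin_share: float = 0.1) -> str:
    """Pick a repo ID; the 0.1 default is an illustrative cutoff, not a tuned value."""
    if latin_share(text) > max_latin_share:
        return MIXED_REPO
    return KHMER_REPO


if __name__ == "__main__":
    # "\u1781\u17d2\u1798\u17c2\u179a" is the word "Khmer" written in Khmer script.
    print(pick_tokenizer_repo("I use Python \u1781\u17d2\u1798\u17c2\u179a"))
```

The returned string can be passed to `AutoTokenizer.from_pretrained`, as in the getting-started snippet. A character-ratio heuristic is crude, so the cutoff would need checking against real data before relying on it.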
+
+ ---
+
+ ## 🚀 How to Get Started
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")
+
+ text = "ក្នុងឆ្នាំ២០២៥ កម្ពុជានឹងអភិវឌ្ឍន៍បច្ចេកវិទ្យាថ្មី។"
+ tokens = tokenizer.tokenize(text)
+ print(tokens)
+ print(tokenizer.decode(tokenizer.encode(text)))