---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
---
# πŸ‡°πŸ‡­ KM Improved 32K Tokenizer
**KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage
across diverse domains, including technical, cultural, historical, and academic texts.
It aims to reduce subword fragmentation and improve contextual understanding for large-scale
Khmer and multilingual language models.
---
## 🧠 Model Details
### Model Description
- **Developer:** Sok Meas (@Msok99)
- **Model Type:** SentencePiece Unigram
- **Language:** Khmer (khm)
- **License:** MIT
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
- **Vocabulary Size:** 32,000
- **Goal:** Maximize coverage and minimize over-segmentation
### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)
---
## βš™οΈ Key Features
| Feature | Description |
|----------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |
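The decode-fidelity claim is easy to spot-check against your own data. Below is a minimal round-trip sketch, assuming the tokenizer loads via `AutoTokenizer` as in the quickstart; the sample sentences are placeholders, and special tokens are excluded since they would otherwise appear in the decoded string:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder samples; swap in lines from your own corpus.
samples = [
    "αžŸαž½αžŸαŸ’αžαžΈ αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

for s in samples:
    # Encode without special tokens so decode() can reproduce the input verbatim.
    ids = tokenizer.encode(s, add_special_tokens=False)
    assert tokenizer.decode(ids) == s, f"round-trip mismatch: {s!r}"
print("all samples decoded losslessly")
```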
---
## πŸ“Š Performance Overview
| Category | Avg Tokens | Chars/Token |
|-----------|-------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | β€” | **β‰ˆ4.0 chars/token** |
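To estimate a comparable chars/token figure for your own domain, divide total characters by total tokens over a text sample. This is a minimal sketch under that assumption; the sample texts are placeholders, and the table above may have been computed with a different corpus or weighting:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder sample; substitute documents from your target domain.
texts = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

total_chars = sum(len(t) for t in texts)
total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
print(f"β‰ˆ{total_chars / total_tokens:.2f} chars/token")
```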
---
## 🧩 Use Cases
### Direct Use
- Pretraining and fine-tuning Khmer LLMs
- Large-scale corpus tokenization for RAG or embedding generation (see the batching sketch after this list)
- Tokenization for Khmer–English mixed datasets (with limited English words)
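For the corpus-tokenization use case, batch encoding with padding and truncation is the usual pattern before embedding or indexing. A minimal sketch; the documents and the 512-token cap are placeholder assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder documents; replace with your corpus.
docs = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαž˜αžΆαž“αž”αŸ’αžšαžœαžαŸ’αžαž·αž™αžΌαžšαž’αž„αŸ’αžœαŸ‚αž„αŸ”",
]

# Pad to the longest document in the batch and cap sequence length at 512 tokens.
batch = tokenizer(docs, padding=True, truncation=True, max_length=512)
print([len(ids) for ids in batch["input_ids"]])
```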
### Downstream Use
- RAG systems and document retrieval
- Knowledge base construction and summarization pipelines
- Academic and research-oriented text analysis
### Out-of-Scope Use
- Mobile or latency-sensitive applications (consider `18k` or `22k` models)
- Tokenizing purely English text
---
## βš–οΈ Bias, Risks, and Limitations
- Larger vocabulary may increase model size slightly (~5–8%)
- Some rare or domain-specific words might be underrepresented in informal text
- Heavier memory usage during training and inference
### Recommendations
For smaller models or chatbots prioritizing speed, use
[`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4).
For mixed Khmer–English systems, use
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
---
## πŸš€ How to Get Started
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War had an influence on world politics."
text = "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”"

# Inspect the subword segmentation, then verify the round trip.
tokens = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.decode(tokenizer.encode(text)))
```
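Continuing from the snippet above, you can also inspect how mixed Khmer–English input segments. English coverage is intentionally limited, so expect English words to split into more pieces; the sentence below is a hypothetical sample:

```python
# "I am studying machine learning at university."
mixed = "αžαŸ’αž‰αž»αŸ†αž€αŸ†αž–αž»αž„αžšαŸ€αž“ machine learning αž“αŸ…αžŸαžΆαž€αž›αžœαž·αž‘αŸ’αž™αžΆαž›αŸ័αž™αŸ”"
pieces = tokenizer.tokenize(mixed)
print(pieces)
print(len(pieces), "tokens")
```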