---
library_name: transformers
language: ["khm"]
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "compact", "efficient"]
---
# πŸ‡°πŸ‡­ Khmer Tokenizer V2 – 18K Vocabulary
A **compact and efficient Khmer tokenizer** designed for use in NLP pipelines such as
classification, translation, summarization, and text generation.
Trained on diverse Khmer text sources, this tokenizer focuses on **efficiency**,
**morphological accuracy**, and **perfect reconstruction** during decoding.
---
## Model Details
### Model Description
- **Developed by:** Sok Meas (@Msok99)
- **Model type:** SentencePiece Unigram Tokenizer
- **Language(s):** Khmer
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)
### Model Sources
- **Repository:** [https://huggingface.co/Msok99/18k_tokenizer_v2](https://huggingface.co/Msok99/18k_tokenizer_v2)
---
## Uses
### Direct Use
- Tokenization for Khmer NLP models
- Embedding generation
- Text preprocessing for machine learning or fine-tuning tasks
### Downstream Use
- Suitable for use with any Khmer-based LLM, classifier, or translation model
- Can be paired with encoder-decoder architectures (e.g., T5, mBART), as sketched below
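
A minimal sketch of preparing batched inputs for a downstream encoder-decoder model. The example sentences, the `max_length` value, and the pad-token fallback are illustrative assumptions, not part of the shipped tokenizer configuration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# Illustrative Khmer sentences to preprocess for a seq2seq model.
sentences = [
    "αž€αŸ’αžšαžŸαž½αž„αž’αž”αŸ‹αžšαŸ†αž”αžΆαž“αž…αŸαž‰αžŸαŸαž…αž€αŸ’αžαžΈαž‡αžΌαž“αžŠαŸ†αžŽαžΉαž„αŸ”",
    "αžŸαž½αžŸαŸ’αžαžΈ",
]

# Assumption: if no padding token is defined, fall back to the unknown token
# so padded tensors can be built. Adjust to match your downstream model.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token

batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (batch_size, sequence_length)
```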
### Out-of-Scope Use
- Not designed for semantic similarity or embedding search directly
- Not a model for language generation by itself
---
## Bias, Risks, and Limitations
- May not perfectly segment highly colloquial or dialectal Khmer
- Some rare archaic terms could be split into smaller subwords
- The tokenizer is purely statistical (no semantic understanding)
### Recommendations
Users fine-tuning Khmer models should keep corpus cleaning consistent with the tokenizer's training data
and consider domain-specific retraining when working with technical or code-mixed datasets.
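
For domain-specific retraining, a minimal sketch using the `sentencepiece` library is shown below. Here `corpus_km.txt` is a hypothetical one-sentence-per-line corpus, and the parameters mirror this release's general setup (18K Unigram vocabulary) rather than the exact command used to train it:

```python
import sentencepiece as spm

# Retrain a Unigram tokenizer on a domain-specific Khmer corpus.
# "corpus_km.txt" is a hypothetical plain-text file, one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus_km.txt",
    model_prefix="khmer_domain_18k",
    vocab_size=18000,
    model_type="unigram",
    character_coverage=1.0,  # keep full coverage of the Khmer script
)

# Quick sanity check of the resulting model.
sp = spm.SentencePieceProcessor(model_file="khmer_domain_18k.model")
print(sp.encode("αž€αŸ’αžšαžŸαž½αž„αž’αž”αŸ‹αžšαŸ†αž”αžΆαž“αž…αŸαž‰αžŸαŸαž…αž€αŸ’αžαžΈαž‡αžΌαž“αžŠαŸ†αžŽαžΉαž„αŸ”", out_type=str))
```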
---
## How to Get Started with the Model
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# Tokenize a Khmer sentence into subword pieces
text = "αž€αŸ’αžšαžŸαž½αž„αž’αž”αŸ‹αžšαŸ†αž”αžΆαž“αž…αŸαž‰αžŸαŸαž…αž€αŸ’αžαžΈαž‡αžΌαž“αžŠαŸ†αžŽαžΉαž„αŸ”"
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: encode to token IDs and decode back to the original text
print(tokenizer.decode(tokenizer.encode(text)))
```