LaoNLP-Enhanced Tokenizer
This is an enhanced Lao SentencePiece/WordLevel tokenizer, built upon the original work of savath/laonlp-enhanced.
🙏 Special thanks to Savath for providing the base tokenizer.
🔹 Update Notes
- Cleaned vocab using dictionary-based validation.
- Preserved encoding order of existing tokens.
- Added support for new words not present in the dictionary.
- Fully compatible with Hugging Face PreTrainedTokenizerFast.
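The notes above do not include the cleanup script itself, so the following is only a minimal sketch of how such a vocabulary rebuild could look. The helper name `rebuild_vocab`, its parameters, the special tokens `[PAD]` and `[UNK]`, and the sample data are all hypothetical and are not taken from this repository. It assumes that "preserved encoding order" means kept tokens retain their relative order, and that new words are appended at the end.

```python
def rebuild_vocab(old_vocab, dictionary, extra_words=(), special_tokens=("[PAD]", "[UNK]")):
    """Return a cleaned token -> id mapping.

    Hypothetical helper for illustration; not the maintainer's actual script.
    """
    # Walk the old tokens in ascending id order so their relative order is kept.
    ordered = sorted(old_vocab, key=old_vocab.get)

    # Keep special tokens and any token validated by the dictionary.
    kept = [tok for tok in ordered if tok in special_tokens or tok in dictionary]

    # Append new words (assumed to be supplied by hand) that are not yet present.
    seen = set(kept)
    for word in extra_words:
        if word not in seen:
            kept.append(word)
            seen.add(word)

    # Re-index compactly; ids of kept tokens can shift when earlier tokens are dropped.
    return {tok: idx for idx, tok in enumerate(kept)}


# Made-up sample data, for illustration only.
old_vocab = {"[PAD]": 0, "[UNK]": 1, "ສະບາຍດີ": 2, "xx#": 3, "ແມ່ນ": 4}
dictionary = {"ສະບາຍດີ", "ແມ່ນ", "ໃຜ"}
new_vocab = rebuild_vocab(old_vocab, dictionary, extra_words=["ເຈົ້າ"])
print(new_vocab)
```

In this sketch the noise token `xx#` is dropped because it fails the dictionary check, the surviving tokens stay in their original order, and the extra word is added last.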
🔹 Installation
pip install transformers
🔹 Usage
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LuoYiSULIXAY/laonlp-enhanced-update")
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# Tokenize a sentence that mixes spaced and unspaced Lao text
tokens = tokenizer.tokenize(" ເປັນແນວໃດ ສະບາຍດີ ບໍເຈົ້າຮູ້ບໍວ່າຂ້ອຍ ແມ່ນ ໃຜ")
print(tokens)

# Tokenize a sentence written without spaces
tokens = tokenizer.tokenize("ນີ້ແມ່ນການທົດສອບ")
print(tokens)
🔹 Citation
If you use this tokenizer, please cite both the original repository and this updated version:
@misc{savath2024laonlp,
title = {LaoNLP-Enhanced},
author = {Savath},
year = {2024},
url = {https://huggingface.co/savath/laonlp-enhanced}
}
@misc{luoyi2025laonlpupdate,
title = {LaoNLP-Enhanced-Update},
author = {Sulixay Vilaiphone (LuoYi)},
year = {2025},
url = {https://huggingface.co/LuoYiSULIXAY/laonlp-enhanced-update}
}
✍️ Maintainer: Sulixay Vilaiphone (LuoYi)
📧 Contact: Sulixay2001@gmail.com