LaoNLP-Enhanced Tokenizer

This is an enhanced Lao SentencePiece/WordLevel tokenizer, built upon the original work of savath/laonlp-enhanced.
🙏 Special thanks to Savath for providing the base tokenizer.


🔹 Update Notes

  • Cleaned vocab using dictionary-based validation.
  • Preserved encoding order of existing tokens.
  • Added support for new words not present in the dictionary.
  • Fully compatible with Hugging Face PreTrainedTokenizerFast.
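
The notes above can be read as a filter-then-append pass over the vocabulary. The following is a minimal sketch of one possible interpretation, not the maintainer's actual script: the function `clean_vocab`, the sample vocabulary, the special-token names, and the choice to re-index kept tokens contiguously are all assumptions made for illustration.

```python
def clean_vocab(old_vocab, dictionary, new_words=(), specials=("[UNK]", "[PAD]")):
    """Return a cleaned token -> id mapping (hypothetical helper).

    Assumptions for this sketch:
    - old_vocab maps token -> integer id
    - dictionary is a set of words considered valid
    - special tokens are always kept, even if not in the dictionary
    """
    # Walk existing tokens in ascending id order so their relative order is kept.
    ordered = sorted(old_vocab, key=old_vocab.get)
    kept = [tok for tok in ordered if tok in specials or tok in dictionary]

    # Append new words that are not already present, in the order given.
    seen = set(kept)
    for word in new_words:
        if word not in seen:
            kept.append(word)
            seen.add(word)

    # Re-index contiguously; kept tokens stay in their original relative order.
    return {tok: idx for idx, tok in enumerate(kept)}


# Toy data: "xx##" stands in for a noisy entry that fails dictionary validation.
old_vocab = {"[UNK]": 0, "ສະບາຍດີ": 1, "xx##": 2, "ຂ້ອຍ": 3}
dictionary = {"ສະບາຍດີ", "ຂ້ອຍ"}
new_vocab = clean_vocab(old_vocab, dictionary, new_words=["ທົດສອບ"])
print(new_vocab)
```

In this sketch, tokens that survive validation come first in their original order and newly added words are placed at the end, so earlier tokens are never reordered relative to one another.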

🔹 Installation

pip install transformers

🔹 Usage

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tokenizer = AutoTokenizer.from_pretrained("LuoYiSULIXAY/laonlp-enhanced-update")
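# Set the pre-tokenizer on the underlying `tokenizers` object,
# which is what actually performs the tokenization.
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()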
tokens = tokenizer.tokenize(" ເປັນແນວໃດ ສະບາຍດີ ບໍເຈົ້າຮູ້ບໍວ່າຂ້ອຍ ແມ່ນ ໃຜ")
print(tokens)


tokens = tokenizer.tokenize("ນີ້ແມ່ນການທົດສອບ")
print(tokens)

🔹 Citation

If you use this tokenizer, please cite both the original repository and this updated version:

@misc{savath2024laonlp,
  title   = {LaoNLP-Enhanced},
  author  = {Savath},
  year    = {2024},
  url     = {https://huggingface.co/savath/laonlp-enhanced}
}

@misc{luoyi2025laonlpupdate,
  title   = {LaoNLP-Enhanced-Update},
  author  = {Sulixay Vilaiphone (LuoYi)},
  year    = {2025},
  url     = {https://huggingface.co/LuoYiSULIXAY/laonlp-enhanced-update}
}

✍️ Maintainer: Sulixay Vilaiphone (LuoYi)

📧 Contact: Sulixay2001@gmail.com
