---
language: th
license: apache-2.0
tags:
- thai
- tokenizer
- nlp
- subword
model_type: unigram
library_name: tokenizers
pretty_name: Advanced Thai Tokenizer V3
datasets:
- ZombitX64/Thai-corpus-word
metrics:
- accuracy
- character
---

# Advanced Thai Tokenizer V3

## Overview

Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. It handles Thai, mixed Thai-English, numbers, and modern vocabulary, and is designed for LLM/NLP use with robust roundtrip accuracy and no byte-level artifacts.

## Performance

- **Overall Accuracy:** 24/24 test sentences (100.0%)
- **Vocabulary Size:** 35,590 tokens
- **Average Compression:** 3.45 chars/token
- **UNK Ratio:** 0%
- **Thai Character Coverage:** 100%
- **Tested on:** Real-world, mixed, and edge-case sentences
- **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

## Key Features

- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust

## Quick Start

```python
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")

    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```

## Files

- `tokenizer.json` — Main tokenizer file (HuggingFace format)
- `vocab.json` — Vocabulary mapping
- `tokenizer_config.json` — Transformers config
- `metadata.json` — Performance and configuration details
- `usage_examples.json` — Code examples
- `README.md` — This file
- `combined_thai_corpus.txt` — Training corpus (not included in repo; see dataset card)

Created: July 2025

---

# Model Card for Advanced Thai Tokenizer V3

## Model Details

- **Developed by:** [ZombitX64](https://huggingface.co/ZombitX64)
- **Model type:** Unigram (subword) tokenizer
- **Language(s):** th (Thai), mixed Thai-English
- **License:** Apache-2.0
- **Finetuned from model:** N/A (trained from scratch)

### Model Sources

- **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer

## Uses

### Direct Use

- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content

### Downstream Use

- Plug into HuggingFace Transformers pipelines
- Use as the tokenizer for Thai LLM pretraining/fine-tuning
- Integrate with spaCy, PyThaiNLP, or custom pipelines

### Out-of-Scope Use

- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks

## Bias, Risks, and Limitations

- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic/biased content in the corpus
- The tokenizer does not understand context/meaning (no disambiguation)

### Recommendations

- For best results, use with LLMs or models trained on a similar corpus
- For sensitive/critical applications, review the corpus and test thoroughly
- For word-level tasks, use with context-aware models (NER, POS)

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")

    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```
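Under the hood, a Unigram tokenizer assigns each vocabulary piece a log-probability and selects the segmentation that maximizes the total score via a Viterbi search. The following is a self-contained sketch of that search with a toy vocabulary — the tokens and log-probabilities here are illustrative only, not this tokenizer's actual entries:

```python
import math

# Toy Unigram vocabulary: piece -> log-probability (illustrative values only).
VOCAB = {"กิน": -1.5, "ข้าว": -1.5, "ก": -4.0, "ิ": -4.0, "น": -4.0}

def segment(text):
    """Viterbi search for the highest-probability segmentation under VOCAB."""
    n = len(text)
    best = [(-math.inf, -1)] * (n + 1)  # (best score, backpointer) per position
    best[0] = (0.0, -1)
    for i in range(n):
        if best[i][0] == -math.inf:
            continue  # position unreachable with this vocabulary
        for j in range(i + 1, n + 1):
            piece = text[i:j]
            if piece in VOCAB:
                score = best[i][0] + VOCAB[piece]
                if score > best[j][0]:
                    best[j] = (score, i)
    # Walk backpointers to recover the winning token sequence
    tokens, k = [], n
    while k > 0:
        i = best[k][1]
        tokens.append(text[i:k])
        k = i
    return tokens[::-1]

print(segment("กินข้าว"))  # → ['กิน', 'ข้าว']
```

Because whole-word pieces carry higher log-probability than character sequences, the search prefers `['กิน', 'ข้าว']` over a character-by-character split — the same mechanism that gives this tokenizer its 3.45 chars/token compression.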
## Training Details

### Training Data

- **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
- **Size:** 71.7M
- **Preprocessing:** Remove duplicates, normalize encoding, minimal cleaning; no normalization, no byte fallback

### Training Procedure

- **Tokenizer:** HuggingFace Tokenizers (Unigram)
- **Vocab size:** 35,590
- **Special tokens:**
- **Pre-tokenizer:** Punctuation only
- **No normalization, no post-processor, no decoder**
- **Training regime:** CPU, Python 3.11, single run; see script for details

### Speeds, Sizes, Times

- **Training time:** -
- **Checkpoint size:** tokenizer.json ~[size] KB

## Evaluation

### Testing Data, Factors & Metrics

- **Testing data:** Real-world Thai sentences, mixed content, edge cases
- **Metrics:** Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- **Results:** 100% roundtrip, 0% UNK, 100% Thai character coverage, 3.45 chars/token

## Environmental Impact

- Trained on CPU with low energy usage
- No large-scale GPU/TPU compute required

## Technical Specifications

- **Model architecture:** Unigram (subword) tokenizer
- **Software:** tokenizers==0.15+, Python 3.11
- **Hardware:** Standard CPU (no GPU required)

## Citation

If you use this tokenizer, please cite:

```bibtex
@misc{zombitx64_thaitokenizer_v3_2025,
  author       = {ZombitX64},
  title        = {Advanced Thai Tokenizer V3},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
```

## Model Card Authors

- [ZombitX64](https://huggingface.co/ZombitX64)

## Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.
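As an appendix, the evaluation numbers reported in this card (roundtrip accuracy, UNK ratio, chars/token) can be reproduced with a small harness. This is a sketch written against plain encode/decode callables so it is tokenizer-agnostic; the character-level "tokenizer" at the bottom is a stand-in for demonstration only, not this model:

```python
def evaluate(encode, decode, unk_id, texts):
    """Compute roundtrip accuracy, UNK ratio, and chars/token over `texts`.

    encode: str -> list[int] of token ids; decode: list[int] -> str.
    """
    roundtrip_ok = unk_count = token_count = char_count = 0
    for text in texts:
        ids = encode(text)
        roundtrip_ok += decode(ids) == text  # exact roundtrip check
        unk_count += sum(1 for i in ids if i == unk_id)
        token_count += len(ids)
        char_count += len(text)
    return {
        "roundtrip_accuracy": roundtrip_ok / len(texts),
        "unk_ratio": unk_count / token_count,
        "chars_per_token": char_count / token_count,
    }

# Demonstration with a toy character-level identity "tokenizer":
stats = evaluate(
    encode=lambda t: [ord(c) for c in t],
    decode=lambda ids: "".join(map(chr, ids)),
    unk_id=0,
    texts=["กินข้าว", "Thai NLP 101"],
)
print(stats)
```

With the real tokenizer, `encode`/`decode` would wrap `AutoTokenizer`'s `__call__` and `decode`, and `unk_id` would be its `unk_token_id`.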