JonusNattapong committed on
Commit 1630c96 · verified · 1 Parent(s): 55f2143

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +150 -13
README.md CHANGED
@@ -1,19 +1,38 @@
+ ---
+ language: th
+ license: apache-2.0
+ tags:
+ - thai
+ - tokenizer
+ - nlp
+ - subword
+ model_type: unigram
+ library_name: tokenizers
+ pretty_name: Advanced Thai Tokenizer V3
+ ---
+
 # Advanced Thai Tokenizer V3

 ## Overview
- Advanced Thai language tokenizer with improved handling of Thai text, mixed content, and modern vocabulary.
+ Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

 ## Performance
- - Overall Accuracy: 24/24 (100.0%)
- - Vocabulary Size: 35,590 tokens
- - Average Compression: 3.45 chars/token
+ - **Overall Accuracy:** 24/24 (100.0%)
+ - **Vocabulary Size:** 35,590 tokens
+ - **Average Compression:** 3.45 chars/token
+ - **UNK Ratio:** 0%
+ - **Thai Character Coverage:** 100%
+ - **Tested on:** Real-world, mixed, and edge-case sentences
+ - **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

 ## Key Features
- - ✅ No Thai character corruption
- - ✅ Handles mixed Thai-English content
- - ✅ Modern vocabulary (internet, technology terms)
- - ✅ Efficient compression
+ - ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
+ - ✅ Handles mixed Thai-English, numbers, and symbols
+ - ✅ Modern vocabulary (internet, technology, social, business)
+ - ✅ Efficient compression (subword, not word-level)
 - ✅ Clean decoding without artifacts
+ - ✅ HuggingFace-compatible (`tokenizer.json`, `vocab.json`, config)
+ - ✅ Production-ready: tested, documented, and robust

 ## Quick Start
 ```python
@@ -26,13 +45,131 @@ encoding = tokenizer.encode(text)
 # Best decoding method
 decoded = "".join(token for token in encoding.tokens
                   if not (token.startswith('<') and token.endswith('>')))
+ print(f"Original: {text}")
+ print(f"Tokens: {encoding.tokens}")
+ print(f"Decoded: {decoded}")
 ```
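+
+ For completeness, a self-contained version of the snippet above (a minimal sketch; it assumes `tokenizer.json` has been downloaded into the working directory, as in How to Get Started below):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load the tokenizer shipped in this repo
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ text = "นั่งตากลมริมทะเล"
+ encoding = tokenizer.encode(text)
+
+ # Join tokens, skipping special tokens such as <unk>
+ decoded = "".join(token for token in encoding.tokens
+                   if not (token.startswith('<') and token.endswith('>')))
+ assert decoded == text  # clean roundtrip, no artifacts
+ ```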

 ## Files
- - `tokenizer.json` - Main tokenizer file
- - `vocab.json` - Vocabulary mapping
- - `metadata.json` - Performance and configuration details
- - `usage_examples.json` - Code examples
- - `README.md` - This file
+ - `tokenizer.json` - Main tokenizer file (HuggingFace format)
+ - `vocab.json` - Vocabulary mapping
+ - `tokenizer_config.json` - Transformers config
+ - `metadata.json` - Performance and configuration details
+ - `usage_examples.json` - Code examples
+ - `README.md` - This file
+ - `combined_thai_corpus.txt` - Training corpus (not included in repo; see the dataset card)
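+
+ A quick sanity check on the downloaded files (a sketch; it assumes `vocab.json` is a flat token-to-id mapping, so adjust if the export layout differs):
+
+ ```python
+ import json
+
+ with open("vocab.json", encoding="utf-8") as f:
+     vocab = json.load(f)
+
+ print(len(vocab))        # expected: 35590 entries
+ print("<unk>" in vocab)  # the only special token
+ ```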

 Created: July 2025
+
+ ---
+
+ # Model Card for Advanced Thai Tokenizer V3
+
+ ## Model Details
+
+ - **Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)
+ - **Model type:** Unigram (subword) tokenizer
+ - **Language(s):** th (Thai), mixed Thai-English
+ - **License:** Apache-2.0
+ - **Finetuned from model:** N/A (trained from scratch)
+
+ ### Model Sources
+ - **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer
+
+ ## Uses
+
+ ### Direct Use
+ - Tokenization for Thai LLMs, NLP, and downstream tasks
+ - Preprocessing for text classification, NER, QA, summarization, etc. (see the batch-encoding sketch after this list)
+ - Robust for mixed Thai-English, numbers, and social content
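+
+ A batch-encoding sketch for such preprocessing (the documents and feature use are illustrative; `encode_batch` is the standard `tokenizers` batch API):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ docs = ["ขอสั่งกาแฟเย็นสองแก้ว", "ประชุม Zoom ตอน 10 โมง"]
+
+ encodings = tokenizer.encode_batch(docs)   # one Encoding per document
+ features = [enc.ids for enc in encodings]  # token-id sequences for a downstream model
+ print(features)
+ ```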
+
+ ### Downstream Use
+ - Plug into HuggingFace Transformers pipelines (see the sketch after this list)
+ - Use as the tokenizer for Thai LLM pretraining/fine-tuning
+ - Integrate with spaCy, PyThaiNLP, or custom pipelines
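+
+ A minimal sketch of the Transformers hookup (assumes `transformers` is installed and `tokenizer.json` is available locally; `<unk>` is the only special token, per Training Procedure below):
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ # Wrap the raw tokenizers file to expose the familiar HF tokenizer API
+ hf_tokenizer = PreTrainedTokenizerFast(
+     tokenizer_file="tokenizer.json",
+     unk_token="<unk>",
+ )
+
+ batch = hf_tokenizer(["นั่งตากลมริมทะเล", "Thai NLP 101"])
+ print(batch["input_ids"])
+ ```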
+
+ ### Out-of-Scope Use
+ - Not a language model (no text generation by itself)
+ - Not suitable for non-Thai-centric tasks
+
+ ## Bias, Risks, and Limitations
+
+ - Trained on public Thai web/corpus data; may reflect real-world bias
+ - Not guaranteed to cover rare dialects, slang, or OCR errors
+ - No explicit filtering for toxic/biased content in the corpus
+ - The tokenizer does not understand context/meaning (no disambiguation)
+
+ ### Recommendations
+
+ - For best results, use with LLMs or models trained on a similar corpus
+ - For sensitive/critical applications, review the corpus and test thoroughly
+ - For word-level tasks, pair with context-aware models (NER, POS)
+
+ ## How to Get Started with the Model
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ text = "นั่งตากลมริมทะเล"
+ tokens = tokenizer.encode(text).tokens
+ print(tokens)
+ ```
+
+ ## Training Details
+
+ ### Training Data
+ - **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
+ - **Size:** [Add number of lines/size if known]
+ - **Preprocessing:** duplicate removal, encoding normalization, minimal cleaning; no tokenizer-level normalization, no byte fallback
+
+ ### Training Procedure
+ - **Tokenizer:** HuggingFace Tokenizers (Unigram); a minimal training sketch follows this list
+ - **Vocab size:** 35,590
+ - **Special tokens:** `<unk>`
+ - **Pre-tokenizer:** Punctuation only
+ - **No normalizer, no post-processor, no decoder**
+ - **Training regime:** CPU, Python 3.11, single run, see script for details
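+
+ A minimal sketch of a training run matching the settings above (illustrative, not the original script; only the corpus file name comes from this card):
+
+ ```python
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+ # Unigram model with punctuation-only pre-tokenization;
+ # no normalizer, post-processor, or decoder is attached.
+ tokenizer = Tokenizer(models.Unigram())
+ tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()
+
+ trainer = trainers.UnigramTrainer(
+     vocab_size=35590,
+     special_tokens=["<unk>"],
+     unk_token="<unk>",
+ )
+ tokenizer.train(["combined_thai_corpus.txt"], trainer)
+ tokenizer.save("tokenizer.json")
+ ```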
+
+ ### Speeds, Sizes, Times
+ - **Training time:** [Add time if known]
+ - **Checkpoint size:** `tokenizer.json` ~[size] KB
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+ - **Testing data:** Real-world Thai sentences, mixed content, edge cases
+ - **Metrics:** roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio (computed as in the sketch below)
+ - **Results:** 100% roundtrip, 0% UNK, 100% Thai character coverage, 3.45 chars/token
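+
+ How these metrics can be computed (a sketch; the sample sentences are illustrative stand-ins for the unpublished 24-sentence test set):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ samples = ["นั่งตากลมริมทะเล", "ราคา 1,500 บาท", "ส่งอีเมลหาฉันที"]
+
+ roundtrip_ok = unk_count = total_tokens = total_chars = 0
+ for text in samples:
+     enc = tokenizer.encode(text)
+     decoded = "".join(t for t in enc.tokens
+                       if not (t.startswith('<') and t.endswith('>')))
+     roundtrip_ok += decoded == text
+     unk_count += enc.tokens.count("<unk>")
+     total_tokens += len(enc.tokens)
+     total_chars += len(text)
+
+ print(f"Roundtrip accuracy: {roundtrip_ok}/{len(samples)}")
+ print(f"UNK ratio: {unk_count / total_tokens:.2%}")
+ print(f"Compression: {total_chars / total_tokens:.2f} chars/token")
+ ```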
+
+ ## Environmental Impact
+
+ - Trained on CPU; low energy usage
+ - No large-scale GPU/TPU compute required
+
+ ## Technical Specifications
+
+ - **Model architecture:** Unigram (subword) tokenizer
+ - **Software:** tokenizers >= 0.15, Python 3.11
+ - **Hardware:** standard CPU (no GPU required)
+
+ ## Citation
+
+ If you use this tokenizer, please cite:
+
+ ```bibtex
+ @misc{zombitx64_thaitokenizer_v3_2025,
+   author       = {ZombitX64},
+   title        = {Advanced Thai Tokenizer V3},
+   year         = {2025},
+   howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
+ }
+ ```
+
+ ## Model Card Authors
+
+ - ZombitX64 (https://huggingface.co/ZombitX64)
+ - [Add contributors if any]
+
+ ## Model Card Contact
+
+ For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.