---
language: th
license: apache-2.0
tags:
- thai
- tokenizer
- nlp
- subword
model_type: unigram
library_name: tokenizers
pretty_name: Advanced Thai Tokenizer V3
datasets:
- ZombitX64/Thai-corpus-word
metrics:
- accuracy
- character
---
# Advanced Thai Tokenizer V3
## Overview
Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.
## Performance
- **Overall Accuracy:** 24/24 (100.0%)
- **Vocabulary Size:** 35,590 tokens
- **Average Compression:** 3.45 chars/token
- **UNK Ratio:** 0%
- **Thai Character Coverage:** 100%
- **Tested on:** Real-world, mixed, and edge-case sentences
- **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)
## Key Features
- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust
## Quick Start
```python
from transformers import AutoTokenizer
# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```
## Files
- `tokenizer.json` — Main tokenizer file (HuggingFace format)
- `vocab.json` — Vocabulary mapping
- `tokenizer_config.json` — Transformers config
- `metadata.json` — Performance and configuration details
- `usage_examples.json` — Code examples
- `README.md` — This file
- `combined_thai_corpus.txt` — Training corpus (not included in repo, see dataset card)
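The `vocab.json` file is a plain token-to-id mapping, so it can be inspected without any tokenizer library. A minimal sketch, using a toy three-entry vocabulary as a stand-in for the real 35,590-token file:

```python
import json
import os
import tempfile

# Toy stand-in for the real vocab.json (token -> id mapping).
toy_vocab = {"<unk>": 0, "สวัสดี": 1, "ครับ": 2}

path = os.path.join(tempfile.mkdtemp(), "vocab.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy_vocab, f, ensure_ascii=False)

# Reload and map tokens to ids, falling back to <unk> for unknown tokens.
with open(path, encoding="utf-8") as f:
    vocab = json.load(f)
ids = [vocab.get(tok, vocab["<unk>"]) for tok in ["สวัสดี", "ครับ", "unseen"]]
print(ids)  # [1, 2, 0]
```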
Created: July 2025
---
# Model Card for Advanced Thai Tokenizer V3
## Model Details
**Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)
**Model type:** Unigram (subword) tokenizer
**Language(s):** th (Thai), mixed Thai-English
**License:** Apache-2.0
**Finetuned from model:** N/A (trained from scratch)
### Model Sources
- **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer
## Uses
### Direct Use
- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content
### Downstream Use
- Plug into HuggingFace Transformers pipelines
- Use as tokenizer for Thai LLM pretraining/fine-tuning
- Integrate with spaCy, PyThaiNLP, or custom pipelines
### Out-of-Scope Use
- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks
## Bias, Risks, and Limitations
- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic/biased content in corpus
- Tokenizer does not understand context/meaning (no disambiguation)
### Recommendations
- For best results, use with LLMs or models trained on similar corpus
- For sensitive/critical applications, review corpus and test thoroughly
- For word-level tasks, use with context-aware models (NER, POS)
## How to Get Started with the Model
```python
from transformers import AutoTokenizer
# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```
## Training Details
### Training Data
- **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
- **Size:** 71.7 MB
- **Preprocessing:** Deduplication, encoding cleanup, minimal cleaning; no tokenizer-level Unicode normalization and no byte-level fallback
### Training Procedure
- **Tokenizer:** HuggingFace Tokenizers (Unigram)
- **Vocab size:** 35,590
- **Special tokens:** `<unk>`
- **Pre-tokenizer:** Punctuation only
- **No normalization, no post-processor, no decoder**
- **Training regime:** CPU, Python 3.11, single run, see script for details
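The settings above can be reproduced with the `tokenizers` library. A minimal sketch, not the author's actual training script: a tiny in-memory corpus and a small vocab size stand in for the real 71.7 MB corpus and 35,590-token vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model, punctuation-only pre-tokenizer, no normalizer,
# <unk> as the only special token -- matching the settings listed above.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()

trainer = trainers.UnigramTrainer(
    vocab_size=200,  # real run: 35,590
    special_tokens=["<unk>"],
    unk_token="<unk>",
)

# Toy corpus; the real run used combined_thai_corpus.txt.
corpus = [
    "ภาษาไทยเป็นภาษาที่สวยงาม",
    "Thai NLP ต้องการ tokenizer ที่ดี",
    "ราคา 1,250 บาท ลดพิเศษวันนี้",
] * 50

tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
print(tokenizer.encode("ภาษาไทย").tokens)
```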
### Speeds, Sizes, Times
- **Training time:** -
- **Checkpoint size:** tokenizer.json ~[size] KB
## Evaluation
### Testing Data, Factors & Metrics
- **Testing data:** Real-world Thai sentences, mixed content, edge cases
- **Metrics:** Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- **Results:** 100% roundtrip, 0% UNK, 100% Thai char coverage, 3.45 chars/token
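The metrics above are straightforward to recompute for any tokenizer. A sketch of the three checks, with a trivial character-level tokenizer standing in for the real one, so the printed numbers are illustrative only:

```python
def evaluate(tokenize, detokenize, sentences, unk_token="<unk>"):
    """Compute roundtrip accuracy, UNK ratio, and chars/token."""
    roundtrips = 0
    total_tokens = 0
    unk_tokens = 0
    total_chars = 0
    for text in sentences:
        tokens = tokenize(text)
        total_tokens += len(tokens)
        unk_tokens += sum(1 for t in tokens if t == unk_token)
        total_chars += len(text)
        if detokenize(tokens) == text:
            roundtrips += 1
    return {
        "roundtrip_accuracy": roundtrips / len(sentences),
        "unk_ratio": unk_tokens / total_tokens,
        "chars_per_token": total_chars / total_tokens,
    }

# Stand-in tokenizer: one token per character (list), joined back with "".join.
sentences = ["สวัสดีครับ", "Thai NLP 101"]
report = evaluate(list, "".join, sentences)
print(report)  # roundtrip 1.0, unk_ratio 0.0, 1.0 chars/token
```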
## Environmental Impact
- Training on CPU, low energy usage
- No large-scale GPU/TPU compute required
## Technical Specifications
- **Model architecture:** Unigram (subword) tokenizer
- **Software:** tokenizers==0.15+, Python 3.11
- **Hardware:** Standard CPU (no GPU required)
## Citation
If you use this tokenizer, please cite:
```
@misc{zombitx64_thaitokenizer_v3_2025,
  author       = {ZombitX64},
  title        = {Advanced Thai Tokenizer V3},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
```
## Model Card Authors
- ZombitX64 (https://huggingface.co/ZombitX64)
## Model Card Contact
For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.