---
language: th
license: apache-2.0
tags:
- thai
- tokenizer
- nlp
- subword
model_type: unigram
library_name: tokenizers
pretty_name: Advanced Thai Tokenizer V3
datasets:
- ZombitX64/Thai-corpus-word
metrics:
- accuracy
- character
---

# Advanced Thai Tokenizer V3

## Overview
Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

## Performance
- **Overall Accuracy:** 24/24 (100.0%)
- **Vocabulary Size:** 35,590 tokens
- **Average Compression:** 3.45 chars/token
- **UNK Ratio:** 0%
- **Thai Character Coverage:** 100%
- **Tested on:** Real-world, mixed, and edge-case sentences
- **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

## Key Features
- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust

## Quick Start
```python
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```

## Files
- `tokenizer.json` — Main tokenizer file (HuggingFace format)
- `vocab.json` — Vocabulary mapping
- `tokenizer_config.json` — Transformers config
- `metadata.json` — Performance and configuration details
- `usage_examples.json` — Code examples
- `README.md` — This file
- `combined_thai_corpus.txt` — Training corpus (not included in repo, see dataset card)

Created: July 2025

---

# Model Card for Advanced Thai Tokenizer V3

## Model Details

**Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)  
**Model type:** Unigram (subword) tokenizer  
**Language(s):** th (Thai), mixed Thai-English  
**License:** Apache-2.0  
**Finetuned from model:** N/A (trained from scratch)

### Model Sources
- **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer

## Uses

### Direct Use
- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content

### Downstream Use
- Plug into HuggingFace Transformers pipelines
- Use as tokenizer for Thai LLM pretraining/fine-tuning
- Integrate with spaCy, PyThaiNLP, or custom pipelines
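
For custom pipelines, the `tokenizers` library can be driven directly. The sketch below uses a toy in-memory vocabulary (not the released one) that mirrors this tokenizer's configuration: a Unigram model, a punctuation-only pre-tokenizer, and `<unk>` at id 0.

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Toy (token, log-probability) vocabulary standing in for the released
# 35,590-entry vocab; <unk> sits at id 0, matching the config described here.
vocab = [("<unk>", -10.0), ("hello", -1.0), (" ", -2.0), ("world", -1.5)]

tok = Tokenizer(models.Unigram(vocab=vocab, unk_id=0))
tok.pre_tokenizer = pre_tokenizers.Punctuation()  # punctuation-only, as in this tokenizer

enc = tok.encode("hello world")
print(enc.tokens)  # segmentation chosen by Viterbi over the unigram scores
```

The same `Tokenizer` object, loaded via `Tokenizer.from_file("tokenizer.json")`, slots into spaCy or PyThaiNLP wrappers that accept an encode/decode callable.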

### Out-of-Scope Use
- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks

## Bias, Risks, and Limitations

- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic/biased content in corpus
- Tokenizer does not understand context/meaning (no disambiguation)

### Recommendations

- For best results, use with LLMs or models trained on a similar corpus
- For sensitive/critical applications, review corpus and test thoroughly
- For word-level tasks, use with context-aware models (NER, POS)

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```

## Training Details

### Training Data
- **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
- **Size:** 71.7 MB
- **Preprocessing:** Deduplication, encoding cleanup, minimal cleaning; no Unicode normalization, no byte-level fallback

### Training Procedure
- **Tokenizer:** HuggingFace Tokenizers (Unigram)
- **Vocab size:** 35,590
- **Special tokens:** `<unk>`
- **Pre-tokenizer:** Punctuation only
- **Normalizer, post-processor, decoder:** none
- **Training regime:** CPU, Python 3.11, single run, see script for details
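
The procedure above can be sketched with the `tokenizers` training API. The corpus lines and vocab size below are stand-ins for illustration; the real run streamed `combined_thai_corpus.txt` and targeted 35,590 tokens.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real run used combined_thai_corpus.txt.
corpus = [
    "ภาษาไทยเป็นภาษาราชการของประเทศไทย",
    "โทเคไนเซอร์ที่ดีช่วยให้โมเดลเข้าใจข้อความ",
    "ราคา 1,500 บาท ลดเหลือ 990 บาท",
    "Mixed Thai-English content เป็นเรื่องปกติบนอินเทอร์เน็ต",
]

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()  # punctuation-only pre-tokenizer
trainer = trainers.UnigramTrainer(
    vocab_size=200,              # released tokenizer: 35,590
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab_size())
print(tokenizer.encode("ภาษาไทย").tokens)
```

On a tiny corpus the trainer stops below the requested vocab size; on the full corpus it runs to the configured 35,590 tokens.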

### Speeds, Sizes, Times
- **Training time:** -
- **Checkpoint size:** tokenizer.json ~[size] KB

## Evaluation

### Testing Data, Factors & Metrics
- **Testing data:** Real-world Thai sentences, mixed content, edge cases
- **Metrics:** Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- **Results:** 100% roundtrip, 0% UNK, 100% Thai char coverage, 3.45 chars/token
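
The three metrics can be reproduced with a small helper. `encode`, `decode`, and `roundtrip_metrics` are hypothetical names for illustration, not part of the released API; wrap whatever tokenizer is under test in the two callables.

```python
def roundtrip_metrics(encode, decode, unk_id, sentences):
    """Roundtrip accuracy, UNK ratio, and chars/token for a tokenizer.

    encode: str -> list[int]; decode: list[int] -> str.
    """
    exact = unk = total_tokens = total_chars = 0
    for s in sentences:
        ids = encode(s)
        total_tokens += len(ids)
        total_chars += len(s)
        unk += sum(1 for i in ids if i == unk_id)
        exact += int(decode(ids) == s)  # exact roundtrip: decode(encode(s)) == s
    return {
        "roundtrip_accuracy": exact / len(sentences),
        "unk_ratio": unk / total_tokens if total_tokens else 0.0,
        "chars_per_token": total_chars / total_tokens if total_tokens else 0.0,
    }
```

Driving this with the released tokenizer over the 24-sentence test set should reproduce the figures above.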

## Environmental Impact

- Trained on CPU with low energy usage
- No large-scale GPU/TPU compute required

## Technical Specifications

- **Model architecture:** Unigram (subword) tokenizer
- **Software:** tokenizers >= 0.15, Python 3.11
- **Hardware:** Standard CPU (no GPU required)

## Citation

If you use this tokenizer, please cite:

```
@misc{zombitx64_thaitokenizer_v3_2025,
  author = {ZombitX64},
  title = {Advanced Thai Tokenizer V3},
  year = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
```

## Model Card Authors

- ZombitX64 (https://huggingface.co/ZombitX64)

## Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.