JonusNattapong committed on
Commit 1630c96 · verified · 1 Parent(s): 55f2143

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +150 -13
README.md CHANGED
@@ -1,19 +1,38 @@
+ ---
+ language: th
+ license: apache-2.0
+ tags:
+ - thai
+ - tokenizer
+ - nlp
+ - subword
+ model_type: unigram
+ library_name: tokenizers
+ pretty_name: Advanced Thai Tokenizer V3
+ ---
+
 # Advanced Thai Tokenizer V3

 ## Overview
- Advanced Thai language tokenizer with improved handling of Thai text, mixed content, and modern vocabulary.
+ Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

 ## Performance
- - Overall Accuracy: 24/24 (100.0%)
- - Vocabulary Size: 35,590 tokens
- - Average Compression: 3.45 chars/token
+ - **Overall Accuracy:** 24/24 (100.0%)
+ - **Vocabulary Size:** 35,590 tokens
+ - **Average Compression:** 3.45 chars/token
+ - **UNK Ratio:** 0%
+ - **Thai Character Coverage:** 100%
+ - **Tested on:** Real-world, mixed, and edge-case sentences
+ - **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

 ## Key Features
- - ✅ No Thai character corruption
- - ✅ Handles mixed Thai-English content
- - ✅ Modern vocabulary (internet, technology terms)
- - ✅ Efficient compression
+ - ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
+ - ✅ Handles mixed Thai-English, numbers, and symbols
+ - ✅ Modern vocabulary (internet, technology, social, business)
+ - ✅ Efficient compression (subword, not word-level)
 - ✅ Clean decoding without artifacts
+ - ✅ HuggingFace-compatible (`tokenizer.json`, `vocab.json`, config)
+ - ✅ Production-ready: tested, documented, and robust

 ## Quick Start
 ```python
@@ -26,13 +45,131 @@ encoding = tokenizer.encode(text)
 # Best decoding method
 decoded = "".join(token for token in encoding.tokens
                   if not (token.startswith('<') and token.endswith('>')))
+ print(f"Original: {text}")
+ print(f"Tokens: {encoding.tokens}")
+ print(f"Decoded: {decoded}")
 ```
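+
+ For completeness, a self-contained version of the snippet above (a minimal sketch; it assumes `tokenizer.json` has been downloaded into the working directory, as in How to Get Started below):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load the tokenizer shipped in this repo
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ text = "นั่งตากลมริมทะเล"
+ encoding = tokenizer.encode(text)
+
+ # Join tokens, skipping special tokens such as <unk>
+ decoded = "".join(token for token in encoding.tokens
+                   if not (token.startswith('<') and token.endswith('>')))
+ assert decoded == text  # clean roundtrip, no artifacts
+ ```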

 ## Files
- - `tokenizer.json` - Main tokenizer file
- - `vocab.json` - Vocabulary mapping
- - `metadata.json` - Performance and configuration details
- - `usage_examples.json` - Code examples
- - `README.md` - This file
+ - `tokenizer.json` - Main tokenizer file (HuggingFace format)
+ - `vocab.json` - Vocabulary mapping
+ - `tokenizer_config.json` - Transformers config
+ - `metadata.json` - Performance and configuration details
+ - `usage_examples.json` - Code examples
+ - `README.md` - This file
+ - `combined_thai_corpus.txt` - Training corpus (not included in repo; see the dataset card)
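+
+ A quick sanity check on the downloaded files (a sketch; it assumes `vocab.json` is a flat token-to-id mapping, so adjust if the export layout differs):
+
+ ```python
+ import json
+
+ with open("vocab.json", encoding="utf-8") as f:
+     vocab = json.load(f)
+
+ print(len(vocab))        # expected: 35590 entries
+ print("<unk>" in vocab)  # the only special token
+ ```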

 Created: July 2025
+
+ ---
+
+ # Model Card for Advanced Thai Tokenizer V3
+
+ ## Model Details
+
+ - **Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)
+ - **Model type:** Unigram (subword) tokenizer
+ - **Language(s):** th (Thai), mixed Thai-English
+ - **License:** Apache-2.0
+ - **Finetuned from model:** N/A (trained from scratch)
+
+ ### Model Sources
+ - **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer
+
+ ## Uses
+
+ ### Direct Use
+ - Tokenization for Thai LLMs, NLP, and downstream tasks
+ - Preprocessing for text classification, NER, QA, summarization, etc. (see the batch-encoding sketch after this list)
+ - Robust for mixed Thai-English, numbers, and social content
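+
+ A batch-encoding sketch for such preprocessing (the documents and feature use are illustrative; `encode_batch` is the standard `tokenizers` batch API):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ docs = ["ขอสั่งกาแฟเย็นสองแก้ว", "ประชุม Zoom ตอน 10 โมง"]
+
+ encodings = tokenizer.encode_batch(docs)   # one Encoding per document
+ features = [enc.ids for enc in encodings]  # token-id sequences for a downstream model
+ print(features)
+ ```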
+
+ ### Downstream Use
+ - Plug into HuggingFace Transformers pipelines (see the sketch after this list)
+ - Use as the tokenizer for Thai LLM pretraining/fine-tuning
+ - Integrate with spaCy, PyThaiNLP, or custom pipelines
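+
+ A minimal sketch of the Transformers hookup (assumes `transformers` is installed and `tokenizer.json` is available locally; `<unk>` is the only special token, per Training Procedure below):
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ # Wrap the raw tokenizers file to expose the familiar HF tokenizer API
+ hf_tokenizer = PreTrainedTokenizerFast(
+     tokenizer_file="tokenizer.json",
+     unk_token="<unk>",
+ )
+
+ batch = hf_tokenizer(["นั่งตากลมริมทะเล", "Thai NLP 101"])
+ print(batch["input_ids"])
+ ```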
+
+ ### Out-of-Scope Use
+ - Not a language model (no text generation by itself)
+ - Not suitable for non-Thai-centric tasks
+
+ ## Bias, Risks, and Limitations
+
+ - Trained on public Thai web/corpus data; may reflect real-world bias
+ - Not guaranteed to cover rare dialects, slang, or OCR errors
+ - No explicit filtering for toxic/biased content in the corpus
+ - The tokenizer does not understand context/meaning (no disambiguation)
+
+ ### Recommendations
+
+ - For best results, use with LLMs or models trained on a similar corpus
+ - For sensitive/critical applications, review the corpus and test thoroughly
+ - For word-level tasks, pair with context-aware models (NER, POS)
+
+ ## How to Get Started with the Model
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ text = "นั่งตากลมริมทะเล"
+ tokens = tokenizer.encode(text).tokens
+ print(tokens)
+ ```
+
+ ## Training Details
+
+ ### Training Data
+ - **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
+ - **Size:** [Add number of lines/size if known]
+ - **Preprocessing:** duplicate removal, encoding normalization, minimal cleaning; no tokenizer-level normalization, no byte fallback
+
+ ### Training Procedure
+ - **Tokenizer:** HuggingFace Tokenizers (Unigram); a minimal training sketch follows this list
+ - **Vocab size:** 35,590
+ - **Special tokens:** `<unk>`
+ - **Pre-tokenizer:** Punctuation only
+ - **No normalizer, no post-processor, no decoder**
+ - **Training regime:** CPU, Python 3.11, single run, see script for details
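+
+ A minimal sketch of a training run matching the settings above (illustrative, not the original script; only the corpus file name comes from this card):
+
+ ```python
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+ # Unigram model with punctuation-only pre-tokenization;
+ # no normalizer, post-processor, or decoder is attached.
+ tokenizer = Tokenizer(models.Unigram())
+ tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()
+
+ trainer = trainers.UnigramTrainer(
+     vocab_size=35590,
+     special_tokens=["<unk>"],
+     unk_token="<unk>",
+ )
+ tokenizer.train(["combined_thai_corpus.txt"], trainer)
+ tokenizer.save("tokenizer.json")
+ ```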
+
+ ### Speeds, Sizes, Times
+ - **Training time:** [Add time if known]
+ - **Checkpoint size:** `tokenizer.json` ~[size] KB
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+ - **Testing data:** Real-world Thai sentences, mixed content, edge cases
+ - **Metrics:** roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio (computed as in the sketch below)
+ - **Results:** 100% roundtrip, 0% UNK, 100% Thai character coverage, 3.45 chars/token
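+
+ How these metrics can be computed (a sketch; the sample sentences are illustrative stand-ins for the unpublished 24-sentence test set):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+ samples = ["นั่งตากลมริมทะเล", "ราคา 1,500 บาท", "ส่งอีเมลหาฉันที"]
+
+ roundtrip_ok = unk_count = total_tokens = total_chars = 0
+ for text in samples:
+     enc = tokenizer.encode(text)
+     decoded = "".join(t for t in enc.tokens
+                       if not (t.startswith('<') and t.endswith('>')))
+     roundtrip_ok += decoded == text
+     unk_count += enc.tokens.count("<unk>")
+     total_tokens += len(enc.tokens)
+     total_chars += len(text)
+
+ print(f"Roundtrip accuracy: {roundtrip_ok}/{len(samples)}")
+ print(f"UNK ratio: {unk_count / total_tokens:.2%}")
+ print(f"Compression: {total_chars / total_tokens:.2f} chars/token")
+ ```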
+
+ ## Environmental Impact
+
+ - Trained on CPU; low energy usage
+ - No large-scale GPU/TPU compute required
+
+ ## Technical Specifications
+
+ - **Model architecture:** Unigram (subword) tokenizer
+ - **Software:** tokenizers >= 0.15, Python 3.11
+ - **Hardware:** standard CPU (no GPU required)
+
+ ## Citation
+
+ If you use this tokenizer, please cite:
+
+ ```bibtex
+ @misc{zombitx64_thaitokenizer_v3_2025,
+   author       = {ZombitX64},
+   title        = {Advanced Thai Tokenizer V3},
+   year         = {2025},
+   howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
+ }
+ ```
+
+ ## Model Card Authors
+
+ - ZombitX64 (https://huggingface.co/ZombitX64)
+ - [Add contributors if any]
+
+ ## Model Card Contact
+
+ For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.