---
license: mit
language:
- tt
tags:
- tokenizer
- tatar-language
- wordpiece
- unigram
- bpe
- bbpe
- huggingface
metrics:
- unknown_rate
- compression_ratio
- word_coverage
- tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository provides a comprehensive collection of pre-trained tokenizers for the Tatar language: **four tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) at **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus. All but one tokenizer achieve a **0% unknown rate** on held-out test data, and every tokenizer is ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer          | Type      | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|--------------------|-----------|------------|-------------------|---------------------|-------|
| `wp_50k`           | WordPiece | 50,000     | 4.67              | 378,751             | Best overall balance |
| `wp_25k`           | WordPiece | 25,000     | 4.36              | **496,273**         | Fastest tokenizer |
| `uni_50k`          | Unigram   | 50,000     | 4.59              | 189,623             | Probabilistic model |
| `uni_25k`          | Unigram   | 25,000     | 4.30              | 260,403             | Good for smaller vocab |
| `bpe_50k`          | BPE       | 50,000     | 4.60              | 247,421             | Standard BPE |
| `bpe_50k_freq5`    | BPE       | 50,000     | 4.60              | 226,591             | Higher frequency threshold |
| `bbpe_50k`         | BBPE      | 50,000     | 4.60              | 227,322             | Byte-level BPE |
| `bbpe_25k`         | BBPE      | 25,000     | 4.28              | 257,104             | Compact byte-level |
| `bbpe_fixed_50k`   | BBPE*     | 50,000     | **5.17**          | 315,922             | Best compression ratio |
| `bpe_fixed_50k`    | BPE*      | 50,000     | 4.75              | 337,247             | Fast BPE variant |

\* *Fixed versions with improved Unicode handling*

**Key observations:**
- All tokenizers except `bpe_fixed_50k` achieve **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models provide the most **human-readable tokens**
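For intuition, the compression ratios above translate directly into the token budget for a corpus. The snippet below is a back-of-the-envelope illustration (not part of the repository; `tokens_needed` is a hypothetical helper), using the 19.5M-character test corpus size reported in the evaluation section:

```python
# Rough sketch: what the compression ratios in the table mean in practice.
# The corpus size comes from the evaluation setup described in this card.

TEST_CORPUS_CHARS = 19_500_000

def tokens_needed(chars: int, chars_per_token: float) -> int:
    """Approximate token count for a corpus at a given compression ratio."""
    return round(chars / chars_per_token)

# bbpe_fixed_50k (5.17 chars/token) vs. bbpe_25k (4.28 chars/token):
best = tokens_needed(TEST_CORPUS_CHARS, 5.17)   # ~3.77M tokens
worst = tokens_needed(TEST_CORPUS_CHARS, 4.28)  # ~4.56M tokens
print(f"Best compression saves ~{worst - best:,} tokens on the test corpus")
```

A higher chars-per-token ratio means fewer tokens per document, which lowers both training cost and inference latency for any model built on top of the tokenizer.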

## ๐Ÿ“ Repository Structure

The files are organized in subdirectories for each tokenizer type and size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/          # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.
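For scripting downloads, the layout above maps a tokenizer name to its file path mechanically. The helper below is hypothetical (not shipped in this repo) and covers the standard `<family>_<size>` names; note that the `bpe_50k_freq5` variant lives in `bpe/50k_freq5/` and would need special-casing:

```python
# Hypothetical helper: build the Hub file path for a tokenizer from the
# directory layout shown above. Handles names of the form <family>_<size>
# (e.g. 'wp_50k', 'bbpe_fixed_50k'); 'bpe_50k_freq5' is an exception.

FAMILY_DIRS = {
    "wp": "wordpiece",
    "uni": "unigram",
    "bpe": "bpe",
    "bbpe": "bbpe",
    "bpe_fixed": "bpe_fixed",
    "bbpe_fixed": "bbpe_fixed",
}

def tokenizer_path(name: str) -> str:
    """Map a tokenizer name like 'wp_50k' to its file path in the repo."""
    prefix, _, size = name.rpartition("_")   # 'wp_50k' -> ('wp', '50k')
    return f"tokenizers/{FAMILY_DIRS[prefix]}/{size}/{name}.json"
```

For example, `tokenizer_path("wp_50k")` yields `tokenizers/wordpiece/50k/wp_50k.json`, the same path used in the download example below.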

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"  # "Kazan is the capital of Tatarstan"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

### Using with Hugging Face Transformers

Any of these tokenizers can be wrapped in a `PreTrainedTokenizerFast` for use with Hugging Face Transformers:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|----------|--------|-------|
| **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token |
| **Fastest** | `wp_25k` | 496,273 tokens/sec |
| **Best Overall** | `wp_50k` | Balanced performance |
| **Most Readable** | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:
- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17
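The card reports these metrics without spelling out their definitions. The sketch below shows one standard way such metrics are computed; treat it as illustrative (the actual evaluation scripts may differ):

```python
# Sketch of the evaluation metrics, assuming conventional definitions.
# These are illustrative, not the repo's actual evaluation code.

def unknown_rate(tokens: list[str], unk: str = "[UNK]") -> float:
    """Fraction of produced tokens that are the unknown token."""
    return tokens.count(unk) / len(tokens)

def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters of input per produced token (higher = better compression)."""
    return len(text) / len(tokens)

def word_coverage(words: list[str], encode) -> float:
    """Fraction of words that encode without any unknown token."""
    covered = sum(1 for w in words if "[UNK]" not in encode(w))
    return covered / len(words)

# Toy usage with a hand-made token list:
toy = ["кала", "##сы", "[UNK]"]
print(f"unknown rate: {unknown_rate(toy):.2%}")  # prints "unknown rate: 33.33%"
```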

### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:
- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between different metrics
- **Top-10 rankings** by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) |
|-------|------|--------------|-------------|---------------|-------------------|
| wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** |
| uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| bbpe_fixed_50k | BBPE_fixed | 0.0000 | **5.17** | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. **For BERT-like models**: Use `wp_50k` (WordPiece) - best balance of readability and performance
2. **For maximum speed**: Use `wp_25k` - fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: Use `bbpe_fixed_50k` - most efficient tokenization
4. **For GPT-like models**: Use `bpe_50k` or `bbpe_50k` - compatible with modern LLM architectures
5. **For research**: All tokenizers are provided for comparative studies

## ๐Ÿ“ License

All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## ๐Ÿค Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
    title = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
    author = {Arabov, Mullosharaf Kurbonvoich},
    year = {2026},
    publisher = {Kazan Federal University},
    url = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## ๐ŸŒ Language

All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They fully support Tatar-specific characters (`ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`).

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.