---
license: mit
language:
- tt
tags:
- tokenizer
- tatar-language
- wordpiece
- unigram
- bpe
- bbpe
- huggingface
metrics:
- unknown_rate
- compression_ratio
- word_coverage
- tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide **four different tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) with **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus. All but one of the tokenizers achieve a **0% unknown rate** on test data, and every tokenizer is ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| | Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes | |
| |--------------------|-----------|------------|-------------------|---------------------|-------| |
| | `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance | |
| | `wp_25k` | WordPiece | 25,000 | 4.36 | **496,273** | Fastest tokenizer | |
| | `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model | |
| | `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab | |
| | `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE | |
| | `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold | |
| | `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE | |
| | `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level | |
| | `bbpe_fixed_50k` | BBPE* | 50,000 | **5.17** | 315,922 | Best compression ratio | |
| | `bpe_fixed_50k` | BPE* | 50,000 | 4.75 | 337,247 | Fast BPE variant | |

\* *Fixed versions with improved Unicode handling*

**Key observations:**

- All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models provide the most **human-readable tokens**

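The reported metrics are straightforward to recompute for any corpus. As a minimal sketch, the snippet below applies the definitions used here (unknown rate as the share of `[UNK]` tokens, compression ratio in characters per token, and throughput in tokens per second) to a toy whitespace splitter standing in for a trained tokenizer; the `evaluate` helper is our own illustration, not part of any library.

```python
import time

def evaluate(tokens, text, elapsed, unk_token="[UNK]"):
    """Compute the three headline metrics for one tokenized text."""
    unknown_rate = tokens.count(unk_token) / len(tokens)        # share of [UNK]
    compression_ratio = len(text) / len(tokens)                 # chars per token
    tokens_per_second = len(tokens) / max(elapsed, 1e-12)       # throughput
    return unknown_rate, compression_ratio, tokens_per_second

# Toy stand-in: whitespace splitting instead of a trained subword model
text = "Казан Татарстанның башкаласы"
start = time.perf_counter()
tokens = text.split()
elapsed = time.perf_counter() - start

unk, ratio, _ = evaluate(tokens, text, elapsed)
print(f"unknown_rate={unk:.4f}  compression={ratio:.2f} chars/token")
```

With a real tokenizer you would pass `encoding.tokens` from `tokenizer.encode(text)` instead of the whitespace split.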
## 📁 Repository Structure

The files are organized into subdirectories by tokenizer type and vocabulary size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/        # wp_50k.json
│   │   └── 25k/        # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/        # uni_50k.json
│   │   └── 25k/        # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/        # bpe_50k.json
│   │   └── 50k_freq5/  # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/        # bbpe_50k.json
│   │   └── 25k/        # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/        # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/        # bbpe_fixed_50k.json
└── test_results/       # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.

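Because each tokenizer ships as one JSON file in the `tokenizers` serialization format, basic facts such as the algorithm type and vocabulary size can be read with nothing but the standard library. The snippet below builds a minimal stand-in rather than downloading a real file, so the exact fields shown are illustrative; real files carry many more sections (normalizer, pre_tokenizer, post_processor, added_tokens, and so on).

```python
import json

# Minimal stand-in for a serialized tokenizer file (illustrative fields only)
stand_in = json.dumps({
    "version": "1.0",
    "model": {
        "type": "WordPiece",
        "unk_token": "[UNK]",
        "vocab": {"[UNK]": 0, "[PAD]": 1, "казан": 2, "##да": 3},
    },
}, ensure_ascii=False)

# With a downloaded file you would use json.load(open(path)) instead
data = json.loads(stand_in)
print(data["model"]["type"])        # WordPiece
print(len(data["model"]["vocab"]))  # vocabulary size: 4
```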
## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"  # "Kazan is the capital of Tatarstan"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

### Using with Hugging Face Transformers

You can easily convert any of the tokenizers to Hugging Face format:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| | Category | Winner | Value | |
| |----------|--------|-------| |
| | **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token | |
| | **Fastest** | `wp_25k` | 496,273 tokens/sec | |
| | **Best Overall** | `wp_50k` | Balanced performance | |
| | **Most Readable** | WordPiece family | Human-readable tokens | |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:

- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17 characters per token

### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:

- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between the different metrics
- **Top-10 rankings** by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| | Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) | |
| |-------|------|--------------|-------------|---------------|-------------------| |
| | wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 | |
| | wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** | |
| | uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 | |
| | uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 | |
| | bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 | |
| | bbpe_fixed_50k | BBPE_fixed | 0.0000 | **5.17** | 1.0000 | 315,922 | |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. **For BERT-like models**: Use `wp_50k` (WordPiece) - best balance of readability and performance
2. **For maximum speed**: Use `wp_25k` - fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: Use `bbpe_fixed_50k` - most efficient tokenization
4. **For GPT-like models**: Use `bpe_50k` or `bbpe_50k` - compatible with modern LLM architectures
5. **For research**: All tokenizers are provided for comparative studies

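If you select a tokenizer programmatically, one option is a small lookup from use case to the corresponding file path (paths as listed under Repository Structure). The `tokenizer_path` helper below is our own convenience sketch, not part of this repository:

```python
# Hypothetical helper mapping a use case to a file in this repository
RECOMMENDED = {
    "bert": "tokenizers/wordpiece/50k/wp_50k.json",
    "speed": "tokenizers/wordpiece/25k/wp_25k.json",
    "compression": "tokenizers/bbpe_fixed/50k/bbpe_fixed_50k.json",
    "gpt": "tokenizers/bpe/50k/bpe_50k.json",
}

def tokenizer_path(use_case):
    """Return the repo-relative path for a recommended use case."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise ValueError(f"unknown use case {use_case!r}; pick from {sorted(RECOMMENDED)}")

print(tokenizer_path("bert"))  # tokenizers/wordpiece/50k/wp_50k.json
```

The returned path can be passed directly as the `filename` argument of `hf_hub_download`, as in the usage example above.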
## 📄 License

All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤗 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
  title     = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  url       = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌍 Language

All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They fully support the Tatar-specific characters `ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`.

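The byte-level (BBPE) variants in particular avoid unknown tokens by construction: every character, Tatar-specific letters included, decomposes into UTF-8 bytes, and all 256 byte values sit in the base vocabulary. A standard-library sketch of that decomposition:

```python
# Each Tatar-specific letter is a short UTF-8 byte sequence, so a byte-level
# vocabulary (256 base symbols) covers all of them without any [UNK] fallback.
tatar_letters = "әӘүҮҗҖңҢһҺөӨ"
for ch in tatar_letters:
    encoded = ch.encode("utf-8")
    assert all(0 <= b <= 255 for b in encoded)  # always within the byte vocab
    print(f"{ch} -> {[hex(b) for b in encoded]}")

# The byte representation round-trips losslessly
assert tatar_letters.encode("utf-8").decode("utf-8") == tatar_letters
```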
## 🙏 Acknowledgements

These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.