---
license: mit
language:
  - tt
tags:
  - tokenizer
  - tatar-language
  - wordpiece
  - unigram
  - bpe
  - bbpe
  - huggingface
metrics:
  - unknown_rate
  - compression_ratio
  - word_coverage
  - tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide **four different tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) with **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus.

All tokenizers achieve **0% unknown rate** on test data and are ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|--------------------|-----------|------------|-------------------|---------------------|-------|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | **496,273** | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE\* | 50,000 | **5.17** | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE\* | 50,000 | 4.75 | 337,247 | Fast BPE variant |

\* *Fixed versions with improved Unicode handling*

**Key observations:**

- All tokenizers except `bpe_fixed_50k` achieve **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models provide the most **human-readable tokens**

## 📁 Repository Structure

The files are organized in subdirectories for each tokenizer type and size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/         # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.
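The layout above follows a predictable naming convention, so the Hub path of any tokenizer can be derived from its name. The helper below is illustrative only (it is not shipped in this repository); the `tokenizer_path` function and `TYPE_DIRS` mapping are assumptions based on the directory tree shown above:

```python
# Hypothetical helper: derive the Hub file path for a tokenizer name,
# following the repository layout documented above.
TYPE_DIRS = {
    "wp": "wordpiece",
    "uni": "unigram",
    "bpe": "bpe",
    "bbpe": "bbpe",
    "bpe_fixed": "bpe_fixed",
    "bbpe_fixed": "bbpe_fixed",
}

def tokenizer_path(name: str) -> str:
    """Map a name like 'wp_50k' or 'bbpe_fixed_50k' to its repo path."""
    if name == "bpe_50k_freq5":  # the only non-standard directory name
        return "tokenizers/bpe/50k_freq5/bpe_50k_freq5.json"
    prefix, size = name.rsplit("_", 1)          # e.g. ("bbpe_fixed", "50k")
    return f"tokenizers/{TYPE_DIRS[prefix]}/{size}/{name}.json"

print(tokenizer_path("wp_50k"))  # tokenizers/wordpiece/50k/wp_50k.json
```

The resulting string can be passed directly as the `filename` argument of `hf_hub_download` in the usage examples below.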
## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)
tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

### Using with Hugging Face Transformers

You can easily convert any tokenizer to Hugging Face format:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)
# Now you can use it with any transformer model
```

### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters).
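The reported metrics follow from simple definitions: compression ratio is characters per token, and unknown rate is the share of `[UNK]` tokens in the output. A minimal sketch of these two definitions (the `evaluate` helper and the hand-made token list are illustrative, not the repository's actual test harness):

```python
# Illustrative metric definitions; not the repository's evaluation code.
def evaluate(text: str, tokens: list[str], unk_token: str = "[UNK]") -> dict:
    """Compute compression ratio (chars/token) and unknown-token rate."""
    return {
        "compression_ratio": len(text) / len(tokens),
        "unknown_rate": tokens.count(unk_token) / len(tokens),
    }

# Hand-made example: 8 characters split into 2 tokens -> 4.0 chars/token.
metrics = evaluate("башкала!", ["башкала", "!"])
print(metrics)  # {'compression_ratio': 4.0, 'unknown_rate': 0.0}
```

Applied to the real tokenizers, `text` would be the held-out corpus and `tokens` the output of `tokenizer.encode(text).tokens`.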
Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|----------|--------|-------|
| **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token |
| **Fastest** | `wp_25k` | 496,273 tokens/sec |
| **Best Overall** | `wp_50k` | Balanced performance |
| **Most Readable** | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:

- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17

### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:

- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between different metrics
- **Top-10 rankings** by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) |
|-------|------|--------------|-------------|---------------|-------------------|
| wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** |
| uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| bbpe_fixed_50k | BBPE_fixed | 0.0000 | **5.17** | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. **For BERT-like models**: use `wp_50k` (WordPiece), the best balance of readability and performance
2. **For maximum speed**: use `wp_25k`, the fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: use `bbpe_fixed_50k`, the most efficient tokenization
4. **For GPT-like models**: use `bpe_50k` or `bbpe_50k`, compatible with modern LLM architectures
5. **For research**: all tokenizers are provided for comparative studies

## 📝 License

All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤝 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
  title     = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  url       = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌍 Language

All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They correctly handle Tatar-specific characters (`ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`).

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible. Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.