---
license: mit
language:
  - tt
tags:
  - tokenizer
  - tatar-language
  - wordpiece
  - unigram
  - bpe
  - bbpe
  - huggingface
metrics:
  - unknown_rate
  - compression_ratio
  - word_coverage
  - tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide four tokenization algorithms (WordPiece, Unigram, BPE, and BBPE) at multiple vocabulary sizes (25k and 50k), trained on a large Tatar corpus. All tokenizers except `bpe_fixed_50k` achieve a 0% unknown rate on test data, and all are ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|---|---|---|---|---|---|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | 496,273 | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE\* | 50,000 | 5.17 | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE\* | 50,000 | 4.75 | 337,247 | Fast BPE variant |

\* Fixed versions with improved Unicode handling

Key observations:

- All tokenizers except `bpe_fixed_50k` achieve a 0% unknown rate on test data
- `bbpe_fixed_50k` offers the best compression (5.17 chars/token)
- `wp_25k` is the fastest (nearly 500k tokens/second)
- WordPiece models produce the most human-readable tokens
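As background for why WordPiece tokens read so naturally, here is a minimal sketch of the greedy longest-match-first matching that WordPiece applies to each word at inference time. The toy vocabulary below is purely illustrative, not taken from `wp_50k`:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary: "Татарстанның" splits into readable subwords
vocab = {"татар", "##стан", "##ның"}
print(wordpiece_tokenize("татарстанның", vocab))  # ['татар', '##стан', '##ның']
```

Because every continuation piece begins with `##`, the original word can be recovered by simple concatenation, which is what makes WordPiece output easy to inspect.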

## 📁 Repository Structure

The files are organized in subdirectories for each tokenizer type and size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/          # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

### Using with Hugging Face Transformers

You can convert any of the tokenizers to a Hugging Face `transformers` tokenizer:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|---|---|---|
| Best Compression | `bbpe_fixed_50k` | 5.17 chars/token |
| Fastest | `wp_25k` | 496,273 tokens/sec |
| Best Overall | `wp_50k` | Balanced performance |
| Most Readable | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:

- 0% unknown rate on test data
- 100% word coverage for common vocabulary
- Compression ratios between 4.28 and 5.17 chars/token
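These metrics are straightforward to recompute for any tokenizer. The sketch below assumes only a callable that maps a string to a list of tokens; the whitespace splitter at the end is a stand-in used just to exercise the function, not one of the released tokenizers:

```python
def evaluate(tokenize, texts, unk_token="[UNK]"):
    """Compute unknown rate, compression ratio (chars/token), and word coverage."""
    all_tokens = [tok for text in texts for tok in tokenize(text)]
    n_unk = sum(tok == unk_token for tok in all_tokens)
    n_chars = sum(len(text) for text in texts)
    words = {w for text in texts for w in text.split()}
    covered = sum(unk_token not in tokenize(w) for w in words)
    return {
        "unknown_rate": n_unk / len(all_tokens),
        "compression_ratio": n_chars / len(all_tokens),
        "word_coverage": covered / len(words),
    }

# Stand-in "tokenizer": plain whitespace splitting
stats = evaluate(str.split, ["казан зур шәһәр", "татар теле"])
print(stats)  # unknown_rate 0.0, compression_ratio 5.0, word_coverage 1.0
```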

### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:

- Comparison plots showing unknown rate, compression ratio, and speed by tokenizer type
- Token length distributions for each best-in-class tokenizer
- Correlation matrices between different metrics
- Top-10 rankings by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) |
|---|---|---|---|---|---|
| `wp_50k` | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| `wp_25k` | WordPiece | 0.0000 | 4.36 | 1.0000 | 496,273 |
| `uni_50k` | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| `uni_25k` | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| `bpe_50k` | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| `bbpe_fixed_50k` | BBPE (fixed) | 0.0000 | 5.17 | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. For BERT-like models: use `wp_50k` (WordPiece) - best balance of readability and performance
2. For maximum speed: use `wp_25k` - the fastest tokenizer, ideal for high-throughput applications
3. For maximum compression: use `bbpe_fixed_50k` - the most efficient tokenization
4. For GPT-like models: use `bpe_50k` or `bbpe_50k` - compatible with modern LLM architectures
5. For research: all tokenizers are provided for comparative studies
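For scripted pipelines, these recommendations can be condensed into a small lookup table. The mapping mirrors the list above; `pick_tokenizer` is our own illustrative helper, not part of the repository:

```python
# Recommended tokenizer per use case (from the evaluation above)
RECOMMENDED = {
    "bert": "wp_50k",                  # best balance of readability and performance
    "speed": "wp_25k",                 # fastest (~496k tokens/sec)
    "compression": "bbpe_fixed_50k",   # best compression (5.17 chars/token)
    "gpt": "bpe_50k",                  # BPE family for LLM-style architectures
}

def pick_tokenizer(use_case: str) -> str:
    """Return the recommended tokenizer name, defaulting to the all-rounder wp_50k."""
    return RECOMMENDED.get(use_case.lower(), "wp_50k")

print(pick_tokenizer("gpt"))    # bpe_50k
print(pick_tokenizer("other"))  # wp_50k
```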

## 📝 License

All tokenizers are released under the MIT License. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤝 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
    title = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
    author = {Arabov, Mullosharaf Kurbonvoich},
    year = {2026},
    publisher = {Kazan Federal University},
    url = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌐 Language

All tokenizers are trained on Tatar text and are intended for the Tatar language (ISO 639-1 code `tt`). They handle the Tatar-specific Cyrillic letters (ә, Ә, ү, Ү, җ, Җ, ң, Ң, һ, Һ, ө, Ө) without falling back to unknown tokens.
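For the byte-level (BBPE) variants this is guaranteed by construction: every character decomposes into UTF-8 bytes, and all 256 byte values are in the base vocabulary. A quick plain-Python check of the lowercase Tatar-specific letters:

```python
# Each Tatar-specific Cyrillic letter encodes to exactly two UTF-8 bytes,
# so a byte-level vocabulary (256 base symbols) covers it with no [UNK].
for ch in "әүҗңһө":
    print(ch, list(ch.encode("utf-8")))  # e.g. ә -> [211, 153]
```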

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by TatarNLPWorld as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the tokenizers library and the Hugging Face Hub platform.