---
license: mit
language:
  - tt
tags:
  - tokenizer
  - tatar-language
  - wordpiece
  - unigram
  - bpe
  - bbpe
  - huggingface
metrics:
  - unknown_rate
  - compression_ratio
  - word_coverage
  - tokens_per_second
---

# TatarTokenizer: Tokenizers for the Tatar Language

This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide four tokenization algorithms (WordPiece, Unigram, BPE, and BBPE) at multiple vocabulary sizes (25k and 50k), trained on a large Tatar corpus. All tokenizers except `bpe_fixed_50k` achieve a 0% unknown rate on test data, and all are ready to use with the `tokenizers` library or Hugging Face Transformers.

## 📦 Available Tokenizers

The following tokenizers are included:

| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|---|---|---|---|---|---|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | 496,273 | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE\* | 50,000 | 5.17 | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE\* | 50,000 | 4.75 | 337,247 | Fast BPE variant |

\* Fixed versions with improved Unicode handling

Key observations:

- All tokenizers except `bpe_fixed_50k` achieve a 0% unknown rate on test data
- `bbpe_fixed_50k` offers the best compression (5.17 chars/token)
- `wp_25k` is the fastest (nearly 500k tokens/second)
- WordPiece models produce the most human-readable tokens
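As background for why WordPiece tokens read so naturally, here is a minimal sketch of the greedy longest-match-first matching that WordPiece applies to each word at inference time. The toy vocabulary below is purely illustrative, not taken from `wp_50k`:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary: "Татарстанның" splits into readable subwords
vocab = {"татар", "##стан", "##ның"}
print(wordpiece_tokenize("татарстанның", vocab))  # ['татар', '##стан', '##ның']
```

Because every continuation piece begins with `##`, the original word can be recovered by simple concatenation, which is what makes WordPiece output easy to inspect.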

## 📁 Repository Structure

The files are organized in subdirectories for each tokenizer type and size:

```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/          # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```

Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub tokenizers
```

### Load a Tokenizer

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)

tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```

### Using with Hugging Face Transformers

You can convert any of the tokenizers to a Hugging Face `transformers` tokenizer:

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# Now you can use it with any transformer model
```

### Download All Files for a Specific Tokenizer

```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```

## 📊 Evaluation Results

We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:

### Best Tokenizers by Category

| Category | Winner | Value |
|---|---|---|
| Best Compression | `bbpe_fixed_50k` | 5.17 chars/token |
| Fastest | `wp_25k` | 496,273 tokens/sec |
| Best Overall | `wp_50k` | Balanced performance |
| Most Readable | WordPiece family | Human-readable tokens |

### Performance Summary

All tokenizers (except `bpe_fixed_50k`) achieve:

- 0% unknown rate on test data
- 100% word coverage for common vocabulary
- Compression ratios between 4.28 and 5.17 chars/token
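These metrics are straightforward to recompute for any tokenizer. The sketch below assumes only a callable that maps a string to a list of tokens; the whitespace splitter at the end is a stand-in used just to exercise the function, not one of the released tokenizers:

```python
def evaluate(tokenize, texts, unk_token="[UNK]"):
    """Compute unknown rate, compression ratio (chars/token), and word coverage."""
    all_tokens = [tok for text in texts for tok in tokenize(text)]
    n_unk = sum(tok == unk_token for tok in all_tokens)
    n_chars = sum(len(text) for text in texts)
    words = {w for text in texts for w in text.split()}
    covered = sum(unk_token not in tokenize(w) for w in words)
    return {
        "unknown_rate": n_unk / len(all_tokens),
        "compression_ratio": n_chars / len(all_tokens),
        "word_coverage": covered / len(words),
    }

# Stand-in "tokenizer": plain whitespace splitting
stats = evaluate(str.split, ["казан зур шәһәр", "татар теле"])
print(stats)  # unknown_rate 0.0, compression_ratio 5.0, word_coverage 1.0
```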

### Visualizations

The repository includes comprehensive evaluation visualizations in the `test_results/` folder:

- Comparison plots showing unknown rate, compression ratio, and speed by tokenizer type
- Token length distributions for each best-in-class tokenizer
- Correlation matrices between different metrics
- Top-10 rankings by composite score

Both Russian and English versions of all plots are available.

## 🧪 Test Results Summary

| Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) |
|---|---|---|---|---|---|
| `wp_50k` | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| `wp_25k` | WordPiece | 0.0000 | 4.36 | 1.0000 | 496,273 |
| `uni_50k` | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| `uni_25k` | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| `bpe_50k` | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| `bbpe_fixed_50k` | BBPE (fixed) | 0.0000 | 5.17 | 1.0000 | 315,922 |

## 🎯 Recommendations

Based on our evaluation, we recommend:

1. For BERT-like models: use `wp_50k` (WordPiece) - best balance of readability and performance
2. For maximum speed: use `wp_25k` - the fastest tokenizer, ideal for high-throughput applications
3. For maximum compression: use `bbpe_fixed_50k` - the most efficient tokenization
4. For GPT-like models: use `bpe_50k` or `bbpe_50k` - compatible with modern LLM architectures
5. For research: all tokenizers are provided for comparative studies
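For scripted pipelines, these recommendations can be condensed into a small lookup table. The mapping mirrors the list above; `pick_tokenizer` is our own illustrative helper, not part of the repository:

```python
# Recommended tokenizer per use case (from the evaluation above)
RECOMMENDED = {
    "bert": "wp_50k",                  # best balance of readability and performance
    "speed": "wp_25k",                 # fastest (~496k tokens/sec)
    "compression": "bbpe_fixed_50k",   # best compression (5.17 chars/token)
    "gpt": "bpe_50k",                  # BPE family for LLM-style architectures
}

def pick_tokenizer(use_case: str) -> str:
    """Return the recommended tokenizer name, defaulting to the all-rounder wp_50k."""
    return RECOMMENDED.get(use_case.lower(), "wp_50k")

print(pick_tokenizer("gpt"))    # bpe_50k
print(pick_tokenizer("other"))  # wp_50k
```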

## 📝 License

All tokenizers are released under the MIT License. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 🤝 Citation

If you use these tokenizers in your research, please cite:

```bibtex
@software{tatartokenizer_2026,
    title = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
    author = {Arabov, Mullosharaf Kurbonvoich},
    year = {2026},
    publisher = {Kazan Federal University},
    url = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```

## 🌐 Language

All tokenizers are trained on Tatar text and are intended for the Tatar language (ISO 639-1 code `tt`). They handle the Tatar-specific Cyrillic letters (ә, Ә, ү, Ү, җ, Җ, ң, Ң, һ, Һ, ө, Ө) without falling back to unknown tokens.
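For the byte-level (BBPE) variants this is guaranteed by construction: every character decomposes into UTF-8 bytes, and all 256 byte values are in the base vocabulary. A quick plain-Python check of the lowercase Tatar-specific letters:

```python
# Each Tatar-specific Cyrillic letter encodes to exactly two UTF-8 bytes,
# so a byte-level vocabulary (256 base symbols) covers it with no [UNK].
for ch in "әүҗңһө":
    print(ch, list(ch.encode("utf-8")))  # e.g. ә -> [211, 153]
```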

## 🙌 Acknowledgements

These tokenizers were trained and evaluated by TatarNLPWorld as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.

Special thanks to the Hugging Face team for the tokenizers library and the Hugging Face Hub platform.