TatarTokenizers - Tatar Subword Tokenizers
High-quality pretrained tokenizers for the Tatar language
This repository contains four specialized subword tokenizers for Tatar, each trained on the same cleaned 103M-token corpus with a different algorithm. They segment Tatar text far more efficiently than generic multilingual tokenizers and are intended for Tatar NLP tasks and language model training.
Model Performance
Tokenizer Comparison
| Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
|---|---|---|---|
| BPE | 8000 | General purpose, fast inference | Yes |
| WordPiece | 8000 | Stable behavior, balanced performance | Yes |
| Unigram | 16000 | LLM training, smooth distributions | Yes |
| SentencePiece | 32000 | Morphological coverage, OOV handling | No (use T5Tokenizer) |
Training Results
Final Training Metrics
============================================================
BPE | Run: v8000_mf2 | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
WORDPIECE | Run: v8000_mf1 | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
UNIGRAM | Run: v16000 | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
SPM | Run: v32000 | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
============================================================
Metric explanation:
- OOV: out-of-vocabulary rate on the test corpus (0% = every test token maps to known vocabulary pieces)
- AvgLen: average tokenized sequence length on the test texts (lower = better compression)
- Time: training time in seconds
A sketch for reproducing the OOV and AvgLen numbers follows the key findings below.
Key Findings
- All tokenizers achieved 0% OOV on the test corpus, demonstrating complete vocabulary coverage
- SentencePiece provides best compression (lowest AvgLen) due to larger vocabulary
- BPE is fastest to train while maintaining excellent performance
- Unigram offers balanced compression despite longer training time
- All models show consistent behavior across different text domains
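The OOV and AvgLen figures above can be reproduced with a rough evaluation loop like the one below. This is a minimal sketch rather than the original evaluation script: the file name tatar_test.txt and the line-level granularity are assumptions.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="bpe"
)
unk_id = tokenizer.unk_token_id  # may be None if the tokenizer defines no [UNK] token

total_tokens = 0
unk_tokens = 0
lengths = []
with open("tatar_test.txt", encoding="utf-8") as f:  # hypothetical evaluation file
    for line in f:
        ids = tokenizer.encode(line.strip(), add_special_tokens=False)
        lengths.append(len(ids))
        total_tokens += len(ids)
        unk_tokens += sum(1 for i in ids if i == unk_id)

print(f"OOV: {100 * unk_tokens / max(total_tokens, 1):.2f}%")
print(f"AvgLen: {sum(lengths) / max(len(lengths), 1):.1f}")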
Model Details
BPE Tokenizer
Vocabulary: 8000 | Compression: 2.8x
- Architecture: Byte-Pair Encoding
- Best for: General purpose NLP, fast inference
- Format: tokenizer.json (Hugging Face compatible)
WordPiece Tokenizer
Vocabulary: 8000 | Compression: 2.7x
- Architecture: WordPiece (BERT-style)
- Best for: Stable training, balanced performance
- Format: tokenizer.json (Hugging Face compatible)
Unigram Tokenizer
Vocabulary: 16000 | Compression: 3.1x
- Architecture: Unigram Language Model
- Best for: LLM training, smooth length distributions
- Format: tokenizer.json (Hugging Face compatible)
SentencePiece Tokenizer
Vocabulary: 32000 | Compression: 3.4x
- Architecture: SentencePiece with Unigram
- Best for: Morphological coverage, OOV handling
- Format: spiece.model (requires T5Tokenizer; see the loading sketch below)
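Outside of transformers, the raw files can also be loaded with the underlying libraries, which makes the format distinction concrete. A minimal sketch, assuming the files have been downloaded locally (the relative paths mirror the repository structure shown further below):

from tokenizers import Tokenizer
import sentencepiece as spm

# BPE / WordPiece / Unigram each ship a single tokenizer.json for the tokenizers library
tok = Tokenizer.from_file("bpe/tokenizer.json")
print(tok.encode("Татар теле").tokens)

# SentencePiece ships a standard spiece.model for the sentencepiece library
sp = spm.SentencePieceProcessor()
sp.load("sentencepiece/spiece.model")
print(sp.encode("Татар теле", out_type=str))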
Training Corpus
- Total Tokens: 207.02M
- Unique Words: 2.1M
- Vocabulary: 637.7K words
- Models Analyzed: 22
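The token and unique-word counts above can be approximated with a simple whitespace-level pass over the cleaned corpus. A minimal sketch, assuming the corpus is available as a plain-text file tatar_corpus.txt (the file name is an assumption; the corpus itself is not among the files listed in the repository structure below):

from collections import Counter

token_count = 0
word_counts = Counter()
with open("tatar_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus dump
    for line in f:
        words = line.split()
        token_count += len(words)
        word_counts.update(w.lower() for w in words)

print(f"Total tokens: {token_count / 1e6:.2f}M")
print(f"Unique words: {len(word_counts) / 1e6:.2f}M")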
Corpus Domains
| Domain | Documents |
|---|---|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |
Quick Start
Installation
pip install transformers huggingface_hub sentencepiece
Load BPE/WordPiece/Unigram Tokenizers
from transformers import AutoTokenizer
# Load BPE tokenizer (recommended for general use)
tokenizer = AutoTokenizer.from_pretrained(
"arabovs-ai-lab/TatarTokenizers",
subfolder="bpe"
)
# Or load WordPiece
tokenizer = AutoTokenizer.from_pretrained(
"arabovs-ai-lab/TatarTokenizers",
subfolder="wordpiece"
)
# Or load Unigram
tokenizer = AutoTokenizer.from_pretrained(
"arabovs-ai-lab/TatarTokenizers",
subfolder="unigram"
)
Load SentencePiece Tokenizer
from transformers import T5Tokenizer
# SentencePiece requires T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(
"arabovs-ai-lab/TatarTokenizers",
subfolder="sentencepiece"
)
Usage Examples
Basic Text Processing
text = "Татарча текстларны эшкәртү — кызыклы бурыч."
# Encode text
ids = tokenizer.encode(text)
print("Token IDs:", ids)
# Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
# Get tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
Batch Processing
texts = [
    "Мин татарча сөйлим.",
    "Без модельләр төзибез.",
    "Тел эшкәртү технологияләре үсеш ала."
]
# Batch encode with padding
batch = tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
)
print("Batch input IDs:", batch["input_ids"])
print("Attention mask:", batch["attention_mask"])
Vocabulary Analysis
# Check vocabulary size
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")
# Get special tokens
special_tokens = tokenizer.special_tokens_map
print("Special tokens:", special_tokens)
# Look up the ID of a specific token
# (returns the unknown-token ID if the string is not a single vocabulary entry)
token_id = tokenizer.convert_tokens_to_ids("татарча")
print(f"'татарча' token ID: {token_id}")
Different Tokenizers Comparison
from transformers import AutoTokenizer, T5Tokenizer
def compare_tokenizers(text):
    """Compare the different tokenizers on the same text"""
    tokenizers = {
        "BPE": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="bpe"),
        "WordPiece": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="wordpiece"),
        "Unigram": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="unigram"),
        "SentencePiece": T5Tokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="sentencepiece")
    }
    print(f"Text: {text}")
    print("=" * 50)
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        ids = tok.encode(text)
        print(f"{name:12} | Tokens: {len(tokens):2d} | IDs: {ids}")
        print(f"{'':12} | {tokens}")

# Test with different texts
test_texts = [
    "Татар теле морфологик бай тел.",
    "Безнең модельләр яхшы эшли.",
    "Синтетик телләрдә токенизация катлаулырак."
]
for text in test_texts:
    compare_tokenizers(text)
    print("\n")
Advanced Features
# Save and load local copy
tokenizer.save_pretrained("./my-tatar-tokenizer")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my-tatar-tokenizer")
# Add new tokens
new_tokens = ["GPT", "Transformer", "BERT"]
tokenizer.add_tokens(new_tokens)
print(f"Added {len(new_tokens)} new tokens")
# Text generation preparation
prompt = "Татарстанда "
inputs = tokenizer(prompt, return_tensors="pt")
print("Generation inputs:", inputs)
Language Model Training Ready
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
tokenizer = AutoTokenizer.from_pretrained(
"arabovs-ai-lab/TatarTokenizers",
subfolder="unigram" # Recommended for LLM training
)
# Data collator for masked language modeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False, # Set to True for BERT-style training
return_tensors="pt"
)
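# With mlm=False the collator copies input_ids into labels (padding positions set to -100),
# which is what causal language model training expects; set mlm=True for BERT-style masking.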
# Example training batch
batch = data_collator([{"input_ids": [0, 1, 2, 3, 4]}] * 8)
print("Training batch ready:", batch.keys())
Model Recommendations
| Use Case | Recommended Tokenizer | Reason |
|---|---|---|
| General NLP | BPE | Balanced performance, fast |
| BERT-style Training | WordPiece | Stable, proven architecture |
| LLM Training | Unigram | Smooth distributions, 16K vocab |
| Research | SentencePiece | Best morphological coverage |
| Production | BPE/WordPiece | HF native, easy deployment |
Repository Structure
TatarTokenizers/
├── bpe/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── wordpiece/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── unigram/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── sentencepiece/
    ├── spiece.model
    ├── spiece.vocab
    └── tokenizer_config.json
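To fetch just one of these subfolders locally (for example for offline use), huggingface_hub can filter the download. A minimal sketch; the allow_patterns filter shown here is one reasonable way to restrict the download to the BPE files:

from huggingface_hub import snapshot_download

# Download only the bpe/ subfolder of the repository
local_dir = snapshot_download(
    repo_id="arabovs-ai-lab/TatarTokenizers",
    allow_patterns=["bpe/*"]
)
print("Files downloaded to:", local_dir)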
Citation
@misc{TatarTokenizers2025,
title = {TatarTokenizers: High-quality Tatar Subword Tokenizers},
author = {Arabovs AI Lab},
year = 2025,
publisher = {Hugging Face},
url = {https://huggingface.co/arabovs-ai-lab/TatarTokenizers}
}
License
Apache 2.0 License
Last updated: 2025-11-20
Training corpus: 103M tokens
OOV rate: 0% on test data
Best for: Tatar NLP and LLM training