TatarTokenizers - Tatar Subword Tokenizers

High-quality pretrained tokenizers for the Tatar language

This repository contains four specialized tokenizers for Tatar, each trained on a cleaned 103M-token corpus with a different subword algorithm. They tokenize Tatar text far more efficiently than generic multilingual tokenizers and are optimized for Tatar NLP tasks and language model training.

๐Ÿ† Model Performance

Tokenizer Comparison

Algorithm     | Vocabulary Size | Best For                               | HF AutoTokenizer
BPE           | 8000            | General purpose, fast inference        | ✅ Yes
WordPiece     | 8000            | Stable behavior, balanced performance  | ✅ Yes
Unigram       | 16000           | LLM training, smooth distributions     | ✅ Yes
SentencePiece | 32000           | Morphological coverage, OOV handling   | ⚠️ T5Tokenizer

📈 Training Results

Final Training Metrics

============================================================
BPE          | Run: v8000_mf2       | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
WORDPIECE    | Run: v8000_mf1       | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
UNIGRAM      | Run: v16000          | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
SPM          | Run: v32000          | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
============================================================

Metric Explanation:

  • OOV: Out-of-Vocabulary rate (0% = perfect coverage)
  • AvgLen: Average sequence length in tokens (lower = better compression)
  • Time: Training time in seconds
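
The exact evaluation script behind these numbers is not included in this card; the snippet below is only a minimal sketch of how OOV and AvgLen can be approximated for any of the tokenizers, using arbitrary sample texts and assumed metric definitions.

from transformers import AutoTokenizer

# Sketch: approximate OOV rate and average sequence length on your own documents.
# The sample texts and the exact metric definitions are assumptions, not the original script.
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", subfolder="bpe"
)

docs = ["Мин татарча сөйлим.", "Тел эшкәртү технологияләре үсеш ала."]

unk_id = tokenizer.unk_token_id
total_tokens = 0
unk_count = 0
for doc in docs:
    ids = tokenizer.encode(doc, add_special_tokens=False)
    total_tokens += len(ids)
    if unk_id is not None:
        unk_count += sum(1 for i in ids if i == unk_id)

avg_len = total_tokens / len(docs)
oov_rate = 100 * unk_count / total_tokens
print(f"OOV: {oov_rate:.2f}% | AvgLen: {avg_len:.1f}")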

Key Findings

  • All tokenizers achieved 0% OOV on the test corpus, demonstrating complete vocabulary coverage
  • SentencePiece provides the best compression (lowest AvgLen), thanks to its larger vocabulary
  • BPE is the fastest to train while maintaining excellent performance
  • Unigram offers balanced compression despite a longer training time
  • All models show consistent behavior across different text domains

📊 Model Details

BPE Tokenizer

Vocabulary: 8000 | Compression: 2.8x

  • Architecture: Byte-Pair Encoding
  • Best for: General purpose NLP, fast inference
  • Format: tokenizer.json + HuggingFace compatible

WordPiece Tokenizer

Vocabulary: 8000 | Compression: 2.7x

  • Architecture: WordPiece (BERT-style)
  • Best for: Stable training, balanced performance
  • Format: tokenizer.json + HuggingFace compatible

Unigram Tokenizer

Vocabulary: 16000 | Compression: 3.1x

  • Architecture: Unigram Language Model
  • Best for: LLM training, smooth length distributions
  • Format: tokenizer.json + HuggingFace compatible

SentencePiece Tokenizer

Vocabulary: 32000 | Compression: 3.4x

  • Architecture: SentencePiece with Unigram
  • Best for: Morphological coverage, OOV handling
  • Format: spiece.model (requires T5Tokenizer)
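
The Compression figures above are interpreted here as average characters per token. The sketch below shows one way to estimate that ratio on your own text; the sample sentence is arbitrary and this is not the original benchmarking script.

from transformers import AutoTokenizer, T5Tokenizer

# Sketch: estimate compression as characters per token (assumed definition).
sample = "Татар теле морфологик бай тел."

tokenizers = {
    "BPE": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="bpe"),
    "SentencePiece": T5Tokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="sentencepiece"),
}

for name, tok in tokenizers.items():
    n_tokens = len(tok.encode(sample, add_special_tokens=False))
    print(f"{name:13} | {len(sample) / n_tokens:.2f} chars/token")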

📚 Training Corpus

  • Total Tokens: 207.02M
  • Unique Words: 2.1M
  • Vocabulary: 637.7K words
  • Models Analyzed: 22

Corpus Domains

Domain             | Documents
belgech.ru         | 46
intertat.tatar     | 19.5K
matbugat.ru        | 44.9K
azatliq.org        | 8.1K
tatar-inform.tatar | 1.5K
mamadysh-rt        | 1.2K
vk.com             | 6.5K
shahrikazan.ru     | 2.4K
vatantat.ru        | 119
Wikipedia          | 456.1K
Books              | 876

🚀 Quick Start

Installation

pip install transformers huggingface_hub sentencepiece  # sentencepiece is needed for the T5Tokenizer-based model

Load BPE/WordPiece/Unigram Tokenizers

from transformers import AutoTokenizer

# Load BPE tokenizer (recommended for general use)
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="bpe"
)

# Or load WordPiece
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", 
    subfolder="wordpiece"
)

# Or load Unigram
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram" 
)

Load SentencePiece Tokenizer

from transformers import T5Tokenizer

# SentencePiece requires T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="sentencepiece"
)

💡 Usage Examples

Basic Text Processing

text = "ะขะฐั‚ะฐั€ั‡ะฐ ั‚ะตะบัั‚ะปะฐั€ะฝั‹ ััˆะบำ™ั€ั‚าฏ โ€” ะบั‹ะทั‹ะบะปั‹ ะฑัƒั€ั‹ั‡."

# Encode text
ids = tokenizer.encode(text)
print("Token IDs:", ids)

# Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)

# Get tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

Batch Processing

texts = [
    "ะœะธะฝ ั‚ะฐั‚ะฐั€ั‡ะฐ ัำฉะนะปะธะผ.",
    "ะ‘ะตะท ะผะพะดะตะปัŒะปำ™ั€ ั‚ำฉะทะธะฑะตะท.",
    "ะขะตะป ััˆะบำ™ั€ั‚าฏ ั‚ะตั…ะฝะพะปะพะณะธัะปำ™ั€ะต าฏัะตัˆ ะฐะปะฐ."
]

# Batch encode with padding
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print("Batch input IDs:", batch["input_ids"])
print("Attention mask:", batch["attention_mask"])

Vocabulary Analysis

# Check vocabulary size
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

# Get special tokens
special_tokens = tokenizer.special_tokens_map
print("Special tokens:", special_tokens)

# Look up the ID of a specific token (falls back to the unknown-token ID if it is not a vocabulary entry)
token_id = tokenizer.convert_tokens_to_ids("татарча")
print(f"'татарча' token ID: {token_id}")

Comparing the Tokenizers

from transformers import AutoTokenizer, T5Tokenizer

def compare_tokenizers(text):
    """Compare different tokenizers on the same text"""
    
    tokenizers = {
        "BPE": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="bpe"),
        "WordPiece": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="wordpiece"), 
        "Unigram": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="unigram"),
        "SentencePiece": T5Tokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="sentencepiece")
    }
    
    print(f"Text: {text}")
    print("=" * 50)
    
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        ids = tok.encode(text)
        print(f"{name:12} | Tokens: {len(tokens):2d} | IDs: {ids}")
        print(f"{'':12} | {tokens}")

# Test with different texts
test_texts = [
    "ะขะฐั‚ะฐั€ ั‚ะตะปะต ะผะพั€ั„ะพะปะพะณะธะบ ะฑะฐะน ั‚ะตะป.",
    "ะ‘ะตะทะฝะตาฃ ะผะพะดะตะปัŒะปำ™ั€ ัั…ัˆั‹ ััˆะปะธ.",
    "ะกะธะฝั‚ะตั‚ะธะบ ั‚ะตะปะปำ™ั€ะดำ™ ั‚ะพะบะตะฝะธะทะฐั†ะธั ะบะฐั‚ะปะฐัƒะปั‹ั€ะฐะบ."
]

for text in test_texts:
    compare_tokenizers(text)
    print("\n")

Advanced Features

# Save and load local copy
tokenizer.save_pretrained("./my-tatar-tokenizer")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my-tatar-tokenizer")

# Add new tokens
new_tokens = ["GPT", "Transformer", "BERT"]
num_added = tokenizer.add_tokens(new_tokens)  # tokens already in the vocabulary are skipped
print(f"Added {num_added} new tokens")

# Text generation preparation
prompt = "ะขะฐั‚ะฐั€ัั‚ะฐะฝะดะฐ "
inputs = tokenizer(prompt, return_tensors="pt")
print("Generation inputs:", inputs)

Preparing for Language Model Training

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram"  # Recommended for LLM training
)

# Data collator for masked language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Set to True for BERT-style training
    return_tensors="pt"
)

# Example training batch
batch = data_collator([{"input_ids": [0, 1, 2, 3, 4]}] * 8)
print("Training batch ready:", batch.keys())

🎯 Model Recommendations

Use Case            | Recommended Tokenizer | Reason
General NLP         | BPE                   | Balanced performance, fast
BERT-style Training | WordPiece             | Stable, proven architecture
LLM Training        | Unigram               | Smooth distributions, 16K vocab
Research            | SentencePiece         | Best morphological coverage
Production          | BPE / WordPiece       | HF native, easy deployment

📦 Repository Structure

TatarTokenizers/
├── bpe/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── wordpiece/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── unigram/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── sentencepiece/
    ├── spiece.model
    ├── spiece.vocab
    └── tokenizer_config.json
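
For tooling that does not go through transformers, the raw spiece.model can also be downloaded and used with the sentencepiece package directly; a minimal sketch, assuming sentencepiece is installed:

from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Sketch: load the raw SentencePiece model without transformers.
model_path = hf_hub_download(
    repo_id="arabovs-ai-lab/TatarTokenizers",
    subfolder="sentencepiece",
    filename="spiece.model",
)
sp = spm.SentencePieceProcessor(model_file=model_path)
print(sp.encode("Мин татарча сөйлим.", out_type=str))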

📜 Citation

@misc{TatarTokenizers2025,
  title = {TatarTokenizers: High-quality Tatar Subword Tokenizers},
  author = {Arabovs AI Lab},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/TatarTokenizers}
}

📄 License

Apache 2.0 License


Last updated: 2025-11-20
Training corpus: 103M tokens
OOV rate: 0% on test data
Best for: Tatar NLP and LLM training
