---
license: mit
language:
- tt
tags:
- tokenizer
- tatar-language
- wordpiece
- unigram
- bpe
- bbpe
- huggingface
metrics:
- unknown_rate
- compression_ratio
- word_coverage
- tokens_per_second
---
# TatarTokenizer: Tokenizers for the Tatar Language
This repository contains a comprehensive collection of pre-trained tokenizers for the Tatar language. We provide **four different tokenization algorithms** (WordPiece, Unigram, BPE, and BBPE) with **multiple vocabulary sizes** (25k and 50k), trained on a large Tatar corpus. All tokenizers except `bpe_fixed_50k` achieve a **0% unknown rate** on test data, and all are ready to use with the `tokenizers` library or Hugging Face Transformers.
## 📦 Available Tokenizers
The following tokenizers are included:
| Tokenizer | Type | Vocab Size | Compression Ratio | Speed (tokens/sec) | Notes |
|--------------------|-----------|------------|-------------------|---------------------|-------|
| `wp_50k` | WordPiece | 50,000 | 4.67 | 378,751 | Best overall balance |
| `wp_25k` | WordPiece | 25,000 | 4.36 | **496,273** | Fastest tokenizer |
| `uni_50k` | Unigram | 50,000 | 4.59 | 189,623 | Probabilistic model |
| `uni_25k` | Unigram | 25,000 | 4.30 | 260,403 | Good for smaller vocab |
| `bpe_50k` | BPE | 50,000 | 4.60 | 247,421 | Standard BPE |
| `bpe_50k_freq5` | BPE | 50,000 | 4.60 | 226,591 | Higher frequency threshold |
| `bbpe_50k` | BBPE | 50,000 | 4.60 | 227,322 | Byte-level BPE |
| `bbpe_25k` | BBPE | 25,000 | 4.28 | 257,104 | Compact byte-level |
| `bbpe_fixed_50k` | BBPE* | 50,000 | **5.17** | 315,922 | Best compression ratio |
| `bpe_fixed_50k` | BPE* | 50,000 | 4.75 | 337,247 | Fast BPE variant |
\* *Fixed versions with improved Unicode handling*
**Key observations:**
- All tokenizers except `bpe_fixed_50k` achieve **0% unknown rate** on test data
- `bbpe_fixed_50k` offers the **best compression** (5.17 chars/token)
- `wp_25k` is the **fastest** (nearly 500k tokens/second)
- WordPiece models provide the most **human-readable tokens**
## ๐Ÿ“ Repository Structure
The files are organized in subdirectories for each tokenizer type and size:
```
TatarTokenizer/
├── tokenizers/
│   ├── wordpiece/
│   │   ├── 50k/          # wp_50k.json
│   │   └── 25k/          # wp_25k.json
│   ├── unigram/
│   │   ├── 50k/          # uni_50k.json
│   │   └── 25k/          # uni_25k.json
│   ├── bpe/
│   │   ├── 50k/          # bpe_50k.json
│   │   └── 50k_freq5/    # bpe_50k_freq5.json
│   ├── bbpe/
│   │   ├── 50k/          # bbpe_50k.json
│   │   └── 25k/          # bbpe_25k.json
│   ├── bpe_fixed/
│   │   └── 50k/          # bpe_fixed_50k.json
│   └── bbpe_fixed/
│       └── 50k/          # bbpe_fixed_50k.json
└── test_results/         # Evaluation reports and visualizations
    ├── tokenizer_test_report.csv
    ├── test_summary_*.txt
    ├── comparison_*.png
    ├── token_length_dist_*.png
    ├── correlation_*.png
    └── top10_score_*.png
```
Each tokenizer is saved as a single `.json` file compatible with the Hugging Face `tokenizers` library.
## 🚀 Usage
### Installation
First, install the required libraries:
```bash
pip install huggingface_hub tokenizers
```
### Load a Tokenizer
```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download and load the WordPiece 50k tokenizer
tokenizer_file = hf_hub_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    filename="tokenizers/wordpiece/50k/wp_50k.json"
)
tokenizer = Tokenizer.from_file(tokenizer_file)

# Test it
text = "Казан - Татарстанның башкаласы"  # "Kazan is the capital of Tatarstan"
encoding = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"Token IDs: {encoding.ids}")
print(f"Decoded: {tokenizer.decode(encoding.ids)}")
```
### Using with Hugging Face Transformers
You can wrap any of the tokenizers in a `PreTrainedTokenizerFast` for use with Hugging Face Transformers:
```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)
# Now you can use it with any transformer model
```
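As a sanity check, here is a self-contained toy version of this wrapping. The tiny vocabulary and token IDs below are invented for illustration and are not the real Tatar vocabularies; the real tokenizers define the same `[UNK]`/`[PAD]`/`[CLS]`/`[SEP]`/`[MASK]` special tokens.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy WordPiece vocabulary standing in for a real Tatar tokenizer
vocab = {"[UNK]": 0, "[PAD]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "казан": 5, "татарстан": 6}
toy = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
toy.pre_tokenizer = Whitespace()

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=toy,
    unk_token="[UNK]", pad_token="[PAD]",
    cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]",
)

# Batch encoding with padding works like with any fast tokenizer
batch = hf_tokenizer(["казан", "казан татарстан"], padding=True)
print(batch["input_ids"])       # [[5, 1], [5, 6]]  (1 is the [PAD] id)
print(batch["attention_mask"])  # [[1, 0], [1, 1]]
```

The same call works unchanged with a downloaded tokenizer such as `wp_50k`; the IDs then simply come from its real 50k vocabulary.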
### Download All Files for a Specific Tokenizer
```python
from huggingface_hub import snapshot_download

# Download all files for WordPiece 50k
model_path = snapshot_download(
    repo_id="TatarNLPWorld/TatarTokenizer",
    allow_patterns="tokenizers/wordpiece/50k/*",
    local_dir="./tatar_tokenizer_wp50k"
)
```
## 📊 Evaluation Results
We conducted extensive testing on a held-out corpus of 10,000 documents (19.5 million characters). Here are the key findings:
### Best Tokenizers by Category
| Category | Winner | Value |
|----------|--------|-------|
| **Best Compression** | `bbpe_fixed_50k` | 5.17 chars/token |
| **Fastest** | `wp_25k` | 496,273 tokens/sec |
| **Best Overall** | `wp_50k` | Balanced performance |
| **Most Readable** | WordPiece family | Human-readable tokens |
### Performance Summary
All tokenizers (except `bpe_fixed_50k`) achieve:
- **0% unknown rate** on test data
- **100% word coverage** for common vocabulary
- Compression ratios between 4.28 and 5.17
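The two corpus-level metrics are straightforward to reproduce. Below is a minimal sketch; the toy text and hand-made tokenization are invented for illustration, while the reports in `test_results/` were computed on the full held-out corpus.

```python
def evaluate(tokens, text, unk_token="[UNK]"):
    """Compute unknown rate and compression ratio for one document.

    tokens: list of token strings produced by a tokenizer
    text:   the original document the tokens came from
    """
    unknown_rate = tokens.count(unk_token) / len(tokens)
    compression_ratio = len(text) / len(tokens)  # characters per token
    return unknown_rate, compression_ratio

# Toy example with a hand-made WordPiece-style tokenization of "казан"
unk, ratio = evaluate(["ка", "##зан"], "казан")
print(unk)    # 0.0 -> 0% unknown rate
print(ratio)  # 2.5 -> 2.5 characters per token
```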
### Visualizations
The repository includes comprehensive evaluation visualizations in the `test_results/` folder:
- **Comparison plots** showing unknown rate, compression ratio, and speed by tokenizer type
- **Token length distributions** for each best-in-class tokenizer
- **Correlation matrices** between different metrics
- **Top-10 rankings** by composite score
Both Russian and English versions of all plots are available.
## 🧪 Test Results Summary
| Model | Type | Unknown Rate | Compression | Word Coverage | Speed (tokens/sec) |
|-------|------|--------------|-------------|---------------|-------------------|
| wp_50k | WordPiece | 0.0000 | 4.67 | 1.0000 | 378,751 |
| wp_25k | WordPiece | 0.0000 | 4.36 | 1.0000 | **496,273** |
| uni_50k | Unigram | 0.0000 | 4.59 | 1.0000 | 189,623 |
| uni_25k | Unigram | 0.0000 | 4.30 | 1.0000 | 260,403 |
| bpe_50k | BPE | 0.0000 | 4.60 | 1.0000 | 247,421 |
| bbpe_fixed_50k | BBPE_fixed | 0.0000 | **5.17** | 1.0000 | 315,922 |
## 🎯 Recommendations
Based on our evaluation, we recommend:
1. **For BERT-like models**: Use `wp_50k` (WordPiece) - best balance of readability and performance
2. **For maximum speed**: Use `wp_25k` - fastest tokenizer, ideal for high-throughput applications
3. **For maximum compression**: Use `bbpe_fixed_50k` - most efficient tokenization
4. **For GPT-like models**: Use `bpe_50k` or `bbpe_50k` - compatible with modern LLM architectures
5. **For research**: All tokenizers are provided for comparative studies
## ๐Ÿ“ License
All tokenizers are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.
## ๐Ÿค Citation
If you use these tokenizers in your research, please cite:
```bibtex
@software{tatartokenizer_2026,
  title     = {TatarTokenizer: A Comprehensive Collection of Tokenizers for the Tatar Language},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  url       = {https://huggingface.co/TatarNLPWorld/TatarTokenizer}
}
```
## ๐ŸŒ Language
All tokenizers are trained on Tatar text and are intended for use with the Tatar language (language code `tt`). They fully support the Tatar-specific Cyrillic characters (`ә`, `Ә`, `ү`, `Ү`, `җ`, `Җ`, `ң`, `Ң`, `һ`, `Һ`, `ө`, `Ө`).
## 🙌 Acknowledgements
These tokenizers were trained and evaluated by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.
Special thanks to the Hugging Face team for the `tokenizers` library and the Hugging Face Hub platform.