Update README.md
README.md (changed)
- wordpiece
---
# TatarTokenizers - Tatar Subword Tokenizers
**High-quality pretrained tokenizers for the Tatar language**

This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned 103M-token corpus using different algorithms. These tokenizers significantly outperform generic multilingual tokenizers and are optimized for Tatar NLP tasks and language model training.
### Tokenizer Comparison
| Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
|-----------|-----------------|----------|------------------|
| **BPE** | 8000 | General purpose, fast inference | ✅ Yes |
| **WordPiece** | 8000 | Stable behavior, balanced performance | ✅ Yes |
| **Unigram** | 16000 | LLM training, smooth distributions | ✅ Yes |
| **SentencePiece** | 32000 | Morphological coverage, OOV handling | ⚠️ T5Tokenizer |
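
As a quick-start sketch for the compatibility column above: the BPE, WordPiece, and Unigram tokenizers load through `AutoTokenizer`, while the SentencePiece model is loaded through `T5Tokenizer`. The repository ID and subfolder names below are illustrative assumptions, not necessarily the published layout.

```python
from transformers import AutoTokenizer, T5Tokenizer

# NOTE: the repo ID and subfolder names are placeholders -- adjust them to the
# actual repository layout.
REPO_ID = "your-org/TatarTokenizers"

# BPE / WordPiece / Unigram tokenizers are AutoTokenizer-compatible.
bpe_tok = AutoTokenizer.from_pretrained(REPO_ID, subfolder="bpe")

# The 32k SentencePiece model goes through T5Tokenizer instead.
spm_tok = T5Tokenizer.from_pretrained(REPO_ID, subfolder="sentencepiece")

text = "Татарстан Республикасы"
print(bpe_tok.tokenize(text))  # subword pieces from the 8k BPE vocabulary
print(spm_tok.tokenize(text))  # pieces from the 32k SentencePiece vocabulary
```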
## Training Results
### Final Training Metrics
```
============================================================
BPE       | Run: v8000_mf2 | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
WORDPIECE | Run: v8000_mf1 | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
UNIGRAM   | Run: v16000    | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
SPM       | Run: v32000    | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
============================================================
```
**Metric Explanation:**
- **OOV**: Out-of-Vocabulary rate (0% = perfect coverage)
- **AvgLen**: Average sequence length in tokens (lower = better compression; a measurement sketch follows this list)
- **Time**: Training time in seconds
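
The following is a minimal sketch of how the OOV and AvgLen numbers above can be reproduced for a `tokenizers`-library tokenizer. The file paths and the `[UNK]` symbol are assumptions, and AvgLen is taken here as tokens per test sample.

```python
from tokenizers import Tokenizer

# Placeholder paths -- substitute the actual tokenizer file and held-out test corpus.
tok = Tokenizer.from_file("bpe/tokenizer.json")
samples = open("test_corpus.txt", encoding="utf-8").read().splitlines()

unk_id = tok.token_to_id("[UNK]")  # assumed unknown-token symbol
total_tokens, unk_tokens = 0, 0

for text in samples:
    ids = tok.encode(text).ids
    total_tokens += len(ids)
    unk_tokens += sum(1 for i in ids if i == unk_id)

oov = 100.0 * unk_tokens / max(total_tokens, 1)  # OOV rate in percent
avg_len = total_tokens / max(len(samples), 1)    # average tokens per sample
print(f"OOV: {oov:.2f}% | AvgLen: {avg_len:.1f}")
```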
### Key Findings
- **All tokenizers achieved 0% OOV** on the test corpus, demonstrating perfect vocabulary coverage
- **SentencePiece provides the best compression** (lowest AvgLen), owing to its larger vocabulary
- **BPE is the fastest to train** while maintaining excellent performance (a training sketch follows this list)
- **Unigram offers balanced compression** despite longer training time
- **All models show consistent behavior** across different text domains
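
The run names in the metrics above (e.g. `v8000_mf2`) appear to encode the vocabulary size and minimum token frequency. As a rough sketch only (the actual training script, normalizer, and special tokens are not documented here), a comparable BPE tokenizer could be trained with the Hugging Face `tokenizers` library as follows:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumption: vocab_size=8000 and min_frequency=2 are inferred from the run name
# "v8000_mf2"; the corpus path and special tokens are placeholders.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["tatar_corpus.txt"], trainer=trainer)
tokenizer.save("bpe/tokenizer.json")
```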
## Model Details