ArabovMK committed
Commit dfec518 · verified · 1 Parent(s): f88f727

Update README.md

Files changed (1):
  1. README.md +37 -11
README.md CHANGED
@@ -13,6 +13,13 @@ tags:
  - wordpiece
  ---

+ # TatarTokenizers - Tatar Subword Tokenizers
+
+ **High-quality pretrained tokenizers for the Tatar language**
+
+ This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned 103M-token corpus using different algorithms. These tokenizers significantly outperform generic multilingual tokenizers and are optimized for Tatar NLP tasks and language model training.
+
+
  # TatarTokenizers - Tatar Subword Tokenizers

  **High-quality pretrained tokenizers for the Tatar language**
@@ -23,20 +30,39 @@ This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned

  ### Tokenizer Comparison

- | Algorithm | Vocabulary Size | Best For | Compression Ratio | HF AutoTokenizer |
- |-----------|-----------------|----------|-------------------|------------------|
- | **BPE** | 8000 | General purpose, fast inference | 2.8x | ✅ Yes |
- | **WordPiece** | 8000 | Stable behavior, balanced performance | 2.7x | ✅ Yes |
- | **Unigram** | 16000 | LLM training, smooth distributions | 3.1x | ✅ Yes |
- | **SentencePiece** | 32000 | Morphological coverage, OOV handling | 3.4x | ⚠️ T5Tokenizer |
+ | Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
+ |-----------|-----------------|----------|------------------|
+ | **BPE** | 8000 | General purpose, fast inference | ✅ Yes |
+ | **WordPiece** | 8000 | Stable behavior, balanced performance | ✅ Yes |
+ | **Unigram** | 16000 | LLM training, smooth distributions | ✅ Yes |
+ | **SentencePiece** | 32000 | Morphological coverage, OOV handling | ⚠️ T5Tokenizer |
+
+ ## 📈 Training Results
+
+ ### Final Training Metrics
+
+ ```
+ ============================================================
+ BPE       | Run: v8000_mf2 | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
+ WORDPIECE | Run: v8000_mf1 | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
+ UNIGRAM   | Run: v16000    | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
+ SPM       | Run: v32000    | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
+ ============================================================
+ ```
+
+ **Metric Explanation:**
+ - **OOV**: Out-of-Vocabulary rate (0% = perfect coverage)
+ - **AvgLen**: Average sequence length in tokens (lower = better compression)
+ - **Time**: Training time in seconds

  ### Key Findings

- - **BPE and WordPiece** achieve the best compression ratio and stable behavior
- - **Unigram** provides smoother sequence-length distribution, ideal for LLM training
- - **SentencePiece** offers the highest morphological coverage of Tatar
- - All tokenizers show **0% OOV rate** on test corpus
- - **FastText-style subword information** in SentencePiece handles rare words effectively
+ - **All tokenizers achieved 0% OOV** on the test corpus, demonstrating perfect vocabulary coverage
+ - **SentencePiece provides the best compression** (lowest AvgLen) due to its larger vocabulary
+ - **BPE is the fastest to train** while maintaining excellent performance
+ - **Unigram offers balanced compression** despite a longer training time
+ - **All models show consistent behavior** across different text domains
+

  ## 📊 Model Details

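
To illustrate the HF AutoTokenizer column in the updated table, here is a minimal loading sketch. The repo id `ArabovMK/TatarTokenizers` and the per-algorithm `subfolder` names are assumptions for illustration, not the confirmed repository layout; only the distinction the table itself draws (AutoTokenizer for BPE/WordPiece/Unigram, T5Tokenizer for the SentencePiece model) is taken from the diff.

```python
# Hedged sketch: repo id and subfolder names below are hypothetical.
# Requires: pip install transformers sentencepiece
from transformers import AutoTokenizer, T5Tokenizer

REPO = "ArabovMK/TatarTokenizers"  # hypothetical repo id

# BPE, WordPiece, and Unigram load via AutoTokenizer per the table above.
bpe = AutoTokenizer.from_pretrained(REPO, subfolder="bpe")          # assumed path

# The SentencePiece model is flagged as T5Tokenizer-compatible.
spm = T5Tokenizer.from_pretrained(REPO, subfolder="sentencepiece")  # assumed path

text = "Мин татарча сөйләшәм."      # "I speak Tatar."
print(bpe.tokenize(text))           # subword pieces from the 8k BPE vocabulary
print(spm.tokenize(text))           # pieces from the 32k SentencePiece vocabulary
```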
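The OOV and AvgLen figures in the metrics block can be reproduced in spirit with a short evaluation loop. This is a sketch under stated assumptions (the test-corpus file name and per-line averaging are mine), not the author's actual evaluation script.

```python
# Hedged sketch: file name, subfolder, and per-line averaging are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ArabovMK/TatarTokenizers", subfolder="bpe"  # hypothetical path
)

total_tokens = 0
unk_tokens = 0
lengths = []

with open("tatar_test.txt", encoding="utf-8") as f:  # assumed test-corpus file
    for line in f:
        ids = tokenizer.encode(line.strip(), add_special_tokens=False)
        if not ids:
            continue  # skip empty lines
        lengths.append(len(ids))
        total_tokens += len(ids)
        # Count tokens that fell back to <unk>; some tokenizers define no unk id.
        if tokenizer.unk_token_id is not None:
            unk_tokens += sum(1 for i in ids if i == tokenizer.unk_token_id)

print(f"OOV: {100.0 * unk_tokens / total_tokens:.2f}%")    # 0.00% = full coverage
print(f"AvgLen: {sum(lengths) / len(lengths):.1f}")        # mean tokens per line
```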