ArabovMK committed
Commit dfec518 · verified · 1 Parent(s): f88f727

Update README.md

Files changed (1):
  1. README.md +37 -11
README.md CHANGED
@@ -13,6 +13,13 @@ tags:
  - wordpiece
  ---

+ # TatarTokenizers - Tatar Subword Tokenizers
+
+ **High-quality pretrained tokenizers for the Tatar language**
+
+ This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned 103M-token corpus using different algorithms. These tokenizers significantly outperform generic multilingual tokenizers and are optimized for Tatar NLP tasks and language model training.
+
+
  # TatarTokenizers - Tatar Subword Tokenizers

  **High-quality pretrained tokenizers for the Tatar language**
@@ -23,20 +30,39 @@ This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned

  ### Tokenizer Comparison

- | Algorithm | Vocabulary Size | Best For | Compression Ratio | HF AutoTokenizer |
- |-----------|-----------------|----------|-------------------|------------------|
- | **BPE** | 8000 | General purpose, fast inference | 2.8x | ✅ Yes |
- | **WordPiece** | 8000 | Stable behavior, balanced performance | 2.7x | ✅ Yes |
- | **Unigram** | 16000 | LLM training, smooth distributions | 3.1x | ✅ Yes |
- | **SentencePiece** | 32000 | Morphological coverage, OOV handling | 3.4x | ⚠️ T5Tokenizer |
+ | Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
+ |-----------|-----------------|----------|------------------|
+ | **BPE** | 8000 | General purpose, fast inference | ✅ Yes |
+ | **WordPiece** | 8000 | Stable behavior, balanced performance | ✅ Yes |
+ | **Unigram** | 16000 | LLM training, smooth distributions | ✅ Yes |
+ | **SentencePiece** | 32000 | Morphological coverage, OOV handling | ⚠️ T5Tokenizer |
+
+ ## 📈 Training Results
+
+ ### Final Training Metrics
+
+ ```
+ ============================================================
+ BPE       | Run: v8000_mf2 | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
+ WORDPIECE | Run: v8000_mf1 | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
+ UNIGRAM   | Run: v16000    | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
+ SPM       | Run: v32000    | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
+ ============================================================
+ ```
+
+ **Metric Explanation:**
+ - **OOV**: Out-of-Vocabulary rate (0% = perfect coverage)
+ - **AvgLen**: Average sequence length in tokens (lower = better compression)
+ - **Time**: Training time in seconds

  ### Key Findings

- - **BPE and WordPiece** achieve the best compression ratio and stable behavior
- - **Unigram** provides smoother sequence-length distribution, ideal for LLM training
- - **SentencePiece** offers the highest morphological coverage of Tatar
- - All tokenizers show **0% OOV rate** on test corpus
- - **FastText-style subword information** in SentencePiece handles rare words effectively
+ - **All tokenizers achieved 0% OOV** on the test corpus, demonstrating perfect vocabulary coverage
+ - **SentencePiece provides the best compression** (lowest AvgLen) due to its larger vocabulary
+ - **BPE is the fastest to train** while maintaining excellent performance
+ - **Unigram offers balanced compression** despite a longer training time
+ - **All models show consistent behavior** across different text domains
+

  ## 📊 Model Details

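
To illustrate the HF AutoTokenizer column in the updated table, here is a minimal loading sketch. The repo id `ArabovMK/TatarTokenizers` and the per-algorithm `subfolder` names are assumptions for illustration, not the confirmed repository layout; only the distinction the table itself draws (AutoTokenizer for BPE/WordPiece/Unigram, T5Tokenizer for the SentencePiece model) is taken from the diff.

```python
# Hedged sketch: repo id and subfolder names below are hypothetical.
# Requires: pip install transformers sentencepiece
from transformers import AutoTokenizer, T5Tokenizer

REPO = "ArabovMK/TatarTokenizers"  # hypothetical repo id

# BPE, WordPiece, and Unigram load via AutoTokenizer per the table above.
bpe = AutoTokenizer.from_pretrained(REPO, subfolder="bpe")          # assumed path

# The SentencePiece model is flagged as T5Tokenizer-compatible.
spm = T5Tokenizer.from_pretrained(REPO, subfolder="sentencepiece")  # assumed path

text = "Мин татарча сөйләшәм."      # "I speak Tatar."
print(bpe.tokenize(text))           # subword pieces from the 8k BPE vocabulary
print(spm.tokenize(text))           # pieces from the 32k SentencePiece vocabulary
```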
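The OOV and AvgLen figures in the metrics block can be reproduced in spirit with a short evaluation loop. This is a sketch under stated assumptions (the test-corpus file name and per-line averaging are mine), not the author's actual evaluation script.

```python
# Hedged sketch: file name, subfolder, and per-line averaging are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ArabovMK/TatarTokenizers", subfolder="bpe"  # hypothetical path
)

total_tokens = 0
unk_tokens = 0
lengths = []

with open("tatar_test.txt", encoding="utf-8") as f:  # assumed test-corpus file
    for line in f:
        ids = tokenizer.encode(line.strip(), add_special_tokens=False)
        if not ids:
            continue  # skip empty lines
        lengths.append(len(ids))
        total_tokens += len(ids)
        # Count tokens that fell back to <unk>; some tokenizers define no unk id.
        if tokenizer.unk_token_id is not None:
            unk_tokens += sum(1 for i in ids if i == tokenizer.unk_token_id)

print(f"OOV: {100.0 * unk_tokens / total_tokens:.2f}%")    # 0.00% = full coverage
print(f"AvgLen: {sum(lengths) / len(lengths):.1f}")        # mean tokens per line
```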