Update README.md
README.md
CHANGED
|
@@ -64,31 +64,26 @@ Together, these datasets amount to approximately **4.5 million Kannada sentences
|
|
| 64 |
|
| 65 |
These metrics evaluate tokenizer quality independent of any NLP model.
|
| 66 |
|
| 67 |
-
##
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
## **2. Fertility Score (FS)**
|
| 71 |
-
Lower = better (#tokens produced per input unit)
|
| 72 |
-
|
| 73 |
-
## **3. Tokenization Speed**
|
| 74 |
-
Measured in chars/sec
|
| 75 |
|
|
|
|
| 76 |
### **Results Across Vocab Sizes**
|
| 77 |
|
| 78 |
-
| Tokenizer | Vocab | CR
|
| 79 |
|-----------|-------|------|------|--------------------|
|
| 80 |
-
| **GAT (ours)** | 8k |
|
| 81 |
-
| SentencePiece | 8k | 3.
|
| 82 |
-
| BPE | 8k | 3.300 | 2.
|
| 83 |
-
| WordPiece | 8k |
|
| 84 |
-
| **GAT (ours)** | 16k | 2.400 | 1.986 |
|
| 85 |
-
| SentencePiece | 16k | 3.
|
| 86 |
-
| BPE | 16k | 3.840 |
|
| 87 |
-
| WordPiece | 16k |
|
| 88 |
-
| **GAT (ours)** | 32k |
|
| 89 |
-
| SentencePiece | 32k |
|
| 90 |
-
| BPE | 32k |
|
| 91 |
-
| WordPiece | 32k |
|
| 92 |
|
| 93 |
# Usage Example
|
| 94 |
|
|
|
|
| 64 |
|
| 65 |
These metrics evaluate tokenizer quality independent of any NLP model.
|
| 66 |
|
| 67 |
+
## **Compression Ratio (CR)**
|
| 68 |
+
Higher = better
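
The README names the metrics but does not give formulas, so the following is only a minimal sketch of how they could be computed. It assumes CR is characters per token (consistent with "higher = better"), FS is tokens per whitespace-separated word (one reading of "#tokens produced per input unit"), and speed is characters tokenized per second. The function names and the `tokenize` callable are hypothetical and are not part of the GAT codebase.

```python
import time
from typing import Callable, List, Sequence

# Hypothetical type: any function mapping a string to a list of tokens.
Tokenize = Callable[[str], List[str]]


def compression_ratio(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Assumed definition: total characters / total tokens (higher = better)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens if total_tokens else 0.0


def fertility_score(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Assumed definition: total tokens / total whitespace words (lower = better)."""
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_tokens / total_words if total_words else 0.0


def tokenization_speed(texts: Sequence[str], tokenize: Tokenize) -> float:
    """Characters tokenized per second of wall-clock time."""
    total_chars = sum(len(t) for t in texts)
    start = time.perf_counter()
    for t in texts:
        tokenize(t)
    elapsed = time.perf_counter() - start
    return total_chars / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Toy corpus and a whitespace tokenizer standing in for a real one.
    corpus = ["ನಮಸ್ಕಾರ ಕರ್ನಾಟಕ", "ಕನ್ನಡ ಭಾಷೆ"]
    print("CR:", compression_ratio(corpus, str.split))
    print("FS:", fertility_score(corpus, str.split))
    print("chars/sec:", tokenization_speed(corpus, str.split))
```

Passing the tokenizer as a plain callable keeps the sketch independent of any particular library, so the same functions could in principle wrap each tokenizer being compared. Note that `len()` counts Unicode code points, not grapheme clusters, which matters for Kannada text if the reported numbers use a different unit.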
|
| 69 |
|
| 70 |
+
|
| 71 |
### **Results Across Vocab Sizes**
|
| 72 |
|
| 73 |
+
| Tokenizer | Vocab | CR | FS | Speed (chars/sec) |
|
| 74 |
|-----------|-------|------|------|--------------------|
|
| 75 |
+
| **GAT (ours)** | 8k | 3.588 | 2.168 |
|
| 76 |
+
| SentencePiece | 8k | 3.100 | 2.445 |
|
| 77 |
+
| BPE | 8k | 3.300 | 2.711 | 16,081 |
|
| 78 |
+
| WordPiece | 8k | 2.343 | 3.486 |
|
| 79 |
+
| **GAT (ours)** | 16k | 2.400 | 1.986 |
|
| 80 |
+
| SentencePiece | 16k | 3.78 | 1.917 |
|
| 81 |
+
| BPE | 16k | 3.840 | 3.940 | 347,656 |
|
| 82 |
+
| WordPiece | 16k | 3.243 | 2.676 |
|
| 83 |
+
| **GAT (ours)** | 32k | 4.806 | 1.827 |
|
| 84 |
+
| SentencePiece | 32k | 3.855 | 1.675 |
|
| 85 |
+
| BPE | 32k | 3.512 | 1.769 |
|
| 86 |
+
| WordPiece | 32k | 3.143 | 1.708 |
|
| 87 |
|
| 88 |
# Usage Example
|
| 89 |
|