varuni committed · Commit ba86d5d · verified · 1 Parent(s): 917fe37

Update README.md

Files changed (1)
  1. README.md +16 -21
README.md CHANGED
@@ -64,31 +64,26 @@ Together, these datasets amount to approximately **4.5 million Kannada sentences
 
  These metrics evaluate tokenizer quality independent of any NLP model.
 
- ## **1. Compression Ratio (CR)**
- Lower = better (fewer bytes used)
-
- ## **2. Fertility Score (FS)**
- Lower = better (#tokens produced per input unit)
-
- ## **3. Tokenization Speed**
- Measured in chars/sec
+ ## Compression Ratio (CR)**
+ Higher = better
 
+
  ### **Results Across Vocab Sizes**
 
- | Tokenizer | Vocab | CR | FS| Speed (chars/s) ↑ |
+ | Tokenizer | Vocab | CR | FS |
  |-----------|-------|------|------|--------------------|
- | **GAT (ours)** | 8k | 2.188 | 2.168 | 14,970 |
- | SentencePiece | 8k | 3.500 | 2.445 | 12,300 |
- | BPE | 8k | 3.300 | 2.311 | 16,081 |
- | WordPiece | 8k | 4.743 | 3.486 | 17,647 |
- | **GAT (ours)** | 16k | 2.400 | 1.986 | 764,647 |
- | SentencePiece | 16k | 3.978 | 1.917 | 970,981 |
- | BPE | 16k | 3.840 | 2.340 | 347,656 |
- | WordPiece | 16k | 4.743 | 2.676 | 401,430 |
- | **GAT (ours)** | 32k | 2.606 | 1.827 | 1,145,656 |
- | SentencePiece | 32k | 4.555 | 1.675 | 620,047 |
- | BPE | 32k | 4.312 | 1.769 | 994,434 |
- | WordPiece | 32k | 4.743 | 1.708 | 1,025,616 |
+ | **GAT (ours)** | 8k | 3.588 | 2.168 |
+ | SentencePiece | 8k | 3.100 | 2.445 |
+ | BPE | 8k | 3.300 | 2.711 | 16,081 |
+ | WordPiece | 8k | 2.343 | 3.486 |
+ | **GAT (ours)** | 16k | 2.400 | 1.986 |
+ | SentencePiece | 16k | 3.78 | 1.917 |
+ | BPE | 16k | 3.840 | 3.940 | 347,656 |
+ | WordPiece | 16k | 3.243 | 2.676 |
+ | **GAT (ours)** | 32k | 4.806 | 1.827 |
+ | SentencePiece | 32k | 3.855 | 1.675 |
+ | BPE | 32k | 3.512 | 1.769 |
+ | WordPiece | 32k | 3.143 | 1.708 |
 
  # Usage Example
 
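The README touched by this commit does not spell out how CR and FS are computed, so the following is only a minimal sketch under stated assumptions, not the author's evaluation code. It assumes CR means characters per token (which fits the new "Higher = better" wording) and FS means tokens per whitespace-delimited word (which fits the earlier "Lower = better" wording). The function names and the toy `chunk_tokenize` stand-in are hypothetical; any callable that maps a string to a list of tokens could be passed instead.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def compression_ratio(texts: Sequence[str], tokenize: Tokenizer) -> float:
    """Assumed definition: total characters / total tokens (higher = better).

    Note: len() counts Unicode code points, not bytes or grapheme clusters,
    which matters for scripts such as Kannada.
    """
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def fertility(texts: Sequence[str], tokenize: Tokenizer) -> float:
    """Assumed definition: total tokens / total whitespace words (lower = better)."""
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def chunk_tokenize(text: str, size: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: split each word into fixed-size chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


if __name__ == "__main__":
    sample = ["abcdef ghi", "jk"]
    # 12 characters (the space counts), 4 tokens, 3 words
    print(compression_ratio(sample, chunk_tokenize))  # 12 / 4 = 3.0
    print(fertility(sample, chunk_tokenize))          # 4 / 3
```

Under these assumed definitions the two metrics move in opposite directions as a tokenizer produces fewer tokens for the same text: CR rises while FS falls.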