aloobun committed on
Commit
a713ddf
·
verified ·
1 Parent(s): 23ea9e4

Update README.md

  1. README.md +46 -2
README.md CHANGED
@@ -1,7 +1,51 @@
1
  ---
2
  library_name: transformers
3
- tags: []
4
  ---
5
 
6
 
7
- **WE are COOKED**
1
  ---
2
  library_name: transformers
3
+ license: apache-2.0
4
  ---
5
 
6
 
7
+ **WE are COOKED**
8
+
9
+ # Test Log 08 March 2025
10
+
11
+ ### First Test
12
+ Mean perplexity on `wikitext-2-raw-v1` (~2k English samples): `1420.7414870547489`
13
+
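The README does not say how the mean perplexity was computed. As a rough sketch only, assuming it is the average of per-sample perplexities and that each sample's perplexity is the exponential of its mean negative log-likelihood per token, the arithmetic could look like this. The function names and the toy log-probabilities are hypothetical and are not taken from the repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sample: exp of the mean negative log-likelihood.

    `token_log_probs` holds the natural-log probability the model assigned
    to each target token (hypothetical input; the README does not show it).
    """
    if not token_log_probs:
        raise ValueError("sample has no tokens")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def mean_perplexity(samples):
    """Average the per-sample perplexities (assumed aggregation)."""
    if not samples:
        raise ValueError("no samples given")
    return sum(perplexity(s) for s in samples) / len(samples)


# Toy data: one sample where every token has probability 0.5,
# one where every token has probability 0.25.
toy_samples = [
    [math.log(0.5)] * 4,
    [math.log(0.25)] * 3,
]
result = mean_perplexity(toy_samples)
```

Averaging per-sample perplexities is not the same as exponentiating the corpus-level mean loss, so the two aggregations can give noticeably different numbers on the same data.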
14
+ ### Second Test
15
+ Evaluated the tokenizer's performance on:
16
+ - Unicode coverage.
17
+ - Token distribution.
18
+ - Tokenization complexity across different scripts.
19
+ - Encoding and decoding capabilities.
20
+ - Edge cases (e.g., special characters and numbers).
21
+ - Test set: 1k samples (500 Hindi, 500 English).
22
+
23
+ ### 1. Edge Case Handling
24
+
25
+ | **Language** | **Test Type** | **Token Count** | **Unique Tokens** |
26
+ |--------------|--------------------|-----------------|-------------------|
27
+ | **Hindi** | Script Test | 14 | 13 |
28
+ | | Unicode Test | 21 | 21 |
29
+ | | Special Characters | 19 | 19 |
30
+ | **English** | Script Test | 16 | 15 |
31
+ | | Unicode Test | 14 | 14 |
32
+ | | Special Characters | 18 | 18 |
33
+
34
+ ### 2. Unicode Coverage
35
+
36
+ | **Language** | **Coverage Ratio** | **Token Count** | **Unique Tokens** |
37
+ |--------------|--------------------|-----------------|-------------------|
38
+ | **Hindi** | 100% | 21 | 21 |
39
+ | **English** | 100% | 14 | 14 |
40
+
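The README does not define "Coverage Ratio". One plausible reading, offered here only as an assumption, is the fraction of distinct characters in a test string that survive an encode/decode round trip. The sketch below uses hypothetical stand-in tokenizers (a UTF-8 byte tokenizer and a lossy ASCII-only one), not the tokenizer from this repository.

```python
def coverage_ratio(text, encode, decode):
    """Fraction of distinct characters in `text` recovered after a round trip.

    `encode` maps a string to token ids and `decode` maps ids back to a
    string. Both are supplied by the caller (hypothetical interface).
    """
    chars = set(text)
    if not chars:
        raise ValueError("text is empty")
    recovered = set(decode(encode(text)))
    return len(chars & recovered) / len(chars)


# Stand-in tokenizer 1: UTF-8 bytes, lossless for any Unicode text.
def byte_encode(text):
    return list(text.encode("utf-8"))


def byte_decode(ids):
    return bytes(ids).decode("utf-8")


# Stand-in tokenizer 2: silently drops every non-ASCII character.
def ascii_encode(text):
    return [ord(c) for c in text if ord(c) < 128]


def ascii_decode(ids):
    return "".join(chr(i) for i in ids)


full = coverage_ratio("नमस्ते hello", byte_encode, byte_decode)
partial = coverage_ratio("aé", ascii_encode, ascii_decode)
```

With a real `transformers` tokenizer, the same function could be called with the tokenizer's own encode and decode methods, provided special tokens are excluded from the decoded string.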
41
+ ### 3. Complexity
42
+
43
+ | **Language** | **Original Length** | **Token Count** | **Avg Token Length** | **Token Diversity** |
44
+ |--------------|---------------------|-----------------|----------------------|---------------------|
45
+ | **Hindi** | 49 | 14 | 9.07 | 0.928 |
46
+ | **English** | 65 | 16 | 4.06 | 0.937 |
47
+
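The Token Diversity column matches unique tokens divided by total tokens from the edge-case table (13/14 ≈ 0.929 and 15/16 = 0.9375), and the English Avg Token Length matches original length divided by token count (65/16 ≈ 4.06). The Hindi value of 9.07 does not equal 49/14 = 3.5, so it was presumably measured differently, possibly on byte-level token strings, though the README does not say. A minimal sketch of these ratios, with hypothetical function and key names:

```python
def complexity_metrics(text, tokens):
    """Compute simple tokenization statistics for one string.

    `tokens` is the list of token strings produced for `text`.
    Lengths are counted in Python characters (an assumption; the
    README may have counted bytes or byte-level symbols instead).
    """
    if not tokens:
        raise ValueError("token list is empty")
    token_count = len(tokens)
    return {
        "original_length": len(text),
        "token_count": token_count,
        "avg_token_length": sum(len(t) for t in tokens) / token_count,
        "token_diversity": len(set(tokens)) / token_count,
    }


# Toy example: four tokens, three of them distinct.
metrics = complexity_metrics("abacd", ["a", "b", "a", "cd"])
```

Because `avg_token_length` here sums the token strings rather than dividing the original length by the token count, the two definitions diverge whenever tokens carry prefix markers or byte-level encodings.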
48
+
49
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650a93c23449d9a49c356aab/QDI1ZPXPzQNARatnQkLmU.png)
50
+
51
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650a93c23449d9a49c356aab/Ppn4fCMqhc9Oy5_zxgpkn.png)