---
library_name: transformers
license: apache-2.0
---

**WE are COOKED**

# Test Log

08 March 2025

### First Test

Mean perplexity on `wikitext-2-raw-v1` (~2k English samples): `1420.7414870547489`

### Second Test

Evaluated the tokenizer's performance on:
- Unicode coverage
- Token distribution
- Tokenization complexity across different scripts
- Encoding and decoding capabilities
- Edge cases, e.g., special characters, numbers, etc.

Test set: 1k samples (500 Hindi, 500 English).

### 1. Edge Case Handling

| **Language** | **Test Type**      | **Token Count** | **Unique Tokens** |
|--------------|--------------------|-----------------|-------------------|
| **Hindi**    | Script Test        | 14              | 13                |
|              | Unicode Test       | 21              | 21                |
|              | Special Characters | 19              | 19                |
| **English**  | Script Test        | 16              | 15                |
|              | Unicode Test       | 14              | 14                |
|              | Special Characters | 18              | 18                |

### 2. Unicode Coverage

| **Language** | **Coverage Ratio** | **Token Count** | **Unique Tokens** |
|--------------|--------------------|-----------------|-------------------|
| **Hindi**    | 100%               | 21              | 21                |
| **English**  | 100%               | 14              | 14                |

### 3. Complexity

| **Language** | **Original Length** | **Token Count** | **Avg Token Length** | **Token Diversity** |
|--------------|---------------------|-----------------|----------------------|---------------------|
| **Hindi**    | 49                  | 14              | 9.07                 | 0.928               |
| **English**  | 65                  | 16              | 4.06                 | 0.937               |

### 4. Encoding-Decoding Capabilities

```
Hindi Analysis:
Original Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Token IDs Count: 14
Token Strings: ['नम', 'सà¥įतà¥ĩ', ',', 'Ġमà¥Īà¤Ĥ', 'Ġà¤Ńारत', 'Ġसà¥ĩ', 'Ġहà¥Ĥà¤ģ', '।', 'Ġदिलà¥įलà¥Ģ', 'Ġबहà¥ģत', 'Ġबड़ा', 'Ġशहर', 'Ġहà¥Ī', '।']
Decoded Text: नमस्ते, मैं भारत से हूँ। दिल्ली बहुत बड़ा शहर है।
Text Reconstruction: True

Hindi Analysis:
Original Text: हिंदी भाषा बहुत सुंदर है।
Token IDs Count: 7
Token Strings: ['ह', 'िà¤Ĥदà¥Ģ', 'Ġà¤Ńाषा', 'Ġबहà¥ģत', 'Ġसà¥ģà¤Ĥदर', 'Ġहà¥Ī', '।']
Decoded Text: हिंदी भाषा बहुत सुंदर है।
Text Reconstruction: True

Hindi Analysis:
Original Text: मुझे किताबें पढ़ना पसंद है।
Token IDs Count: 7
Token Strings: ['म', 'à¥ģà¤Ŀà¥ĩ', 'Ġà¤ķिताबà¥ĩà¤Ĥ', 'Ġपढ़ना', 'Ġपसà¤Ĥद', 'Ġहà¥Ī', '।']
Decoded Text: मुझे किताबें पढ़ना पसंद है।
Text Reconstruction: True

Hindi Analysis:
Original Text: यह एक उदाहरण वाक्य है।
Token IDs Count: 6
Token Strings: ['यह', 'Ġà¤ıà¤ķ', 'Ġà¤īदाहरण', 'Ġवाà¤ķà¥įय', 'Ġहà¥Ī', '।']
Decoded Text: यह एक उदाहरण वाक्य है।
Text Reconstruction: True

English Analysis:
Original Text: Hello, I am from India. Delhi is a big city.
Token IDs Count: 13
Token Strings: ['Hello', ',', 'ĠI', 'Ġam', 'Ġfrom', 'ĠIndia', '.', 'ĠDelhi', 'Ġis', 'Ġa', 'Ġbig', 'Ġcity', '.']
Decoded Text: Hello, I am from India. Delhi is a big city.
Text Reconstruction: True

English Analysis:
Original Text: The English language is widely spoken.
Token IDs Count: 7
Token Strings: ['The', 'ĠEnglish', 'Ġlanguage', 'Ġis', 'Ġwidely', 'Ġspoken', '.']
Decoded Text: The English language is widely spoken.
Text Reconstruction: True

English Analysis:
Original Text: I enjoy reading books.
Token IDs Count: 5
Token Strings: ['I', 'Ġenjoy', 'Ġreading', 'Ġbooks', '.']
Decoded Text: I enjoy reading books.
Text Reconstruction: True

English Analysis:
Original Text: This is an example sentence.
Token IDs Count: 6
Token Strings: ['This', 'Ġis', 'Ġan', 'Ġexample', 'Ġsentence', '.']
Decoded Text: This is an example sentence.
Text Reconstruction: True
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/650a93c23449d9a49c356aab/QDI1ZPXPzQNARatnQkLmU.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/650a93c23449d9a49c356aab/Ppn4fCMqhc9Oy5_zxgpkn.png)
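
The numbers in the Complexity table can probably be recomputed from a token list alone. The sketch below is not the script that produced this log; it rests on assumptions inferred from the reported values. Token diversity appears to be unique tokens divided by total tokens (13/14 and 15/16 give 0.928 and 0.937 when truncated to three decimals). Average token length appears to be UTF-8 bytes per token (the Hindi value 9.07 is consistent with 127 bytes over 14 tokens, and the English value 4.06 with 65 bytes over 16 tokens). The helper names `token_stats` and `ascii_round_trip` are hypothetical, and the round-trip check only handles ASCII text, where a leading space is displayed as `Ġ`.

```python
def token_stats(text, tokens):
    """Summarize one tokenization (assumed metric definitions, see lead-in)."""
    count = len(tokens)
    unique = len(set(tokens))
    n_bytes = len(text.encode("utf-8"))
    return {
        "original_length": len(text),            # length in Unicode code points
        "token_count": count,
        "unique_tokens": unique,
        # Assumption: diversity = unique tokens / total tokens
        "token_diversity": unique / count if count else 0.0,
        # Assumption: average length is measured in UTF-8 bytes per token
        "avg_token_length": n_bytes / count if count else 0.0,
    }


def ascii_round_trip(tokens):
    """Rebuild ASCII text from byte-level BPE token strings.

    Only valid for ASCII input: the "Ġ" marker stands for a leading space.
    Non-ASCII text needs the tokenizer's own decode step instead.
    """
    return "".join(tokens).replace("Ġ", " ")


# Example taken from the English section of the log above.
text = "Hello, I am from India. Delhi is a big city."
tokens = ['Hello', ',', 'ĠI', 'Ġam', 'Ġfrom', 'ĠIndia', '.',
          'ĠDelhi', 'Ġis', 'Ġa', 'Ġbig', 'Ġcity', '.']

stats = token_stats(text, tokens)
reconstructed = ascii_round_trip(tokens)

print(stats)
print("Text Reconstruction:", reconstructed == text)
```

For this sentence the final period repeats, so the unique-token count is one lower than the token count, the same pattern as the Script Test rows in the Edge Case Handling table.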