WWHO / EVALUATION.md
================================================================================
BATTERY 1: SINHALA LINGUISTIC COMPLEXITY
================================================================================
Category                         Total    Pass    Fail
------------------------------------------------------
aadhyaathmika                        1       1       0
aakhyaanaya                          1       1       0
aathmaya                             1       1       0
abhidhamma                           1       1       0
adhyaapanaya                         1       1       0
adhyaksha                            1       1       0
aitihaasika                          1       1       0
aniccataava                          1       1       0
antahpuraya                          1       1       0
antharjaathika                       1       1       0
ashvayaa                             1       1       0
aushadhaya                           1       1       0
bare_hal_zwj                         1       1       0
bare_virama                          1       1       0
bare_zwj                             1       1       0
braahmana                            1       1       0
brahmaya                             1       1       0
chandrikaa                           1       1       0
chhandas                             1       1       0
conjunct_anusvara                   28      28       0
conjunct_pili_anusvara              22      22       0
constructed_multisyllable          252     252       0
cricket                              1       1       0
dangling_zwj                         1       1       0
dhammachakka                         1       1       0
dhyaanaya                            1       1       0
double_conjunct                     29      29       0
dravyaya                             1       1       0
duhkhaya                             1       1       0
grahanaya                            1       1       0
granthaya                            1       1       0
indriya                              1       1       0
jyotishya                            1       1       0
kramaya                              1       1       0
kshatriya                            1       1       0
kshetraya                            1       1       0
kshitija                             1       1       0
mahaparinibbana                      1       1       0
manahkalpita                         1       1       0
mantraya                             1       1       0
mrutyuva                             1       1       0
multi_conjunct_sequence              1       1       0
nibbaanaya                           1       1       0
nirvachanaathmaka                    1       1       0
nishkriya                            1       1       0
paticcasamuppaada                    1       1       0
praadeshiiyakaranaya                 1       1       0
praatibhaasika                       1       1       0
prajaava                             1       1       0
prakaashaya                          1       1       0
prashast                             1       1       0
pratipattiya                         1       1       0
prativyuuhaathmaka                   1       1       0
pratyaksha                           1       1       0
pratyayaya                           1       1       0
pratyuthpanna                        1       1       0
praudha                              1       1       0
premaya                              1       1       0
quad_stack                           1       1       0
quad_virama_chain                    1       1       0
rakaransaya_form                     3       3       0
ritvija                              1       1       0
saammpradaayika                      1       1       0
samasth                              1       1       0
sammaasambuddha                      1       1       0
samskrutaya                          1       1       0
samudraya                            1       1       0
sankhaaraya                          1       1       0
sanskaaraya                          1       1       0
sansthaapanaya                       1       1       0
satyaya                              1       1       0
saundarya                            1       1       0
shaastraya                           1       1       0
shaastriya                           1       1       0
shraddhaava                          1       1       0
shreemath                            1       1       0
shreshtha                            1       1       0
svaamiyaa                            1       1       0
svabhaavaya                          1       1       0
svachchhand                          1       1       0
tantraya                             1       1       0
triple_conjunct                      1       1       0
triple_conjunct_gen                 64      64       0
trividha                             1       1       0
udghoshanaya                         1       1       0
upaadaanaya                          1       1       0
upanishad                            1       1       0
vaichitrya                           1       1       0
vaidya                               1       1       0
vastraya                             1       1       0
very_long_compound                   1       1       0
vipassanaava                         1       1       0
vishvaasaya                          1       1       0
vowel_prefix_conjunct                1       1       0
vyaakaranaya                         1       1       0
vyaapaaraya                          1       1       0
vyatirekaya                          1       1       0
vyavahaarika                         1       1       0
vyavasthaava                         1       1       0
yansaya_form                         7       7       0
yantraya                             1       1       0
zwnj_middle                          1       1       0
Result: PASS — Tested 500 complex words. Violations: 0, Leading-space violations: 0
================================================================================
BATTERY 2: GLITCHED TOKEN DETECTION
================================================================================
Total unified vocab size: 328,020 (SGPE component: 128,001)
Zero-usage SGPE tokens: 1,394
Near-zero (< 3) tokens: 3,163
Result: PASS — Zero: 1394, Near-Zero: 3163, Glitched: 0
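The zero and near-zero counts above come from tallying how often each vocabulary entry is actually emitted. A minimal sketch of that kind of tally follows. The function name `usage_report`, the toy vocabulary, and the toy corpus are all hypothetical, and the sketch assumes that "near-zero" means used at least once but fewer than 3 times; the report does not say whether its near-zero figure also includes the zero-usage tokens, nor how "glitched" is defined.

```python
from collections import Counter

def usage_report(vocab_ids, encoded_corpus, near_zero_threshold=3):
    """Count how often each vocabulary id appears in an encoded corpus."""
    counts = Counter()
    for token_ids in encoded_corpus:
        counts.update(token_ids)
    # Ids that never appear in the corpus at all.
    zero = [t for t in vocab_ids if counts[t] == 0]
    # Assumption: near-zero = seen at least once, but fewer than the threshold.
    near_zero = [t for t in vocab_ids if 0 < counts[t] < near_zero_threshold]
    return {"zero": len(zero), "near_zero": len(near_zero)}

# Toy data: a 6-entry vocabulary and three already-encoded sentences.
report = usage_report(range(6), [[0, 1, 1, 1], [1, 2, 2, 3], [3, 3, 0, 0]])
print(report)
```

In the toy run, ids 4 and 5 are never used and id 2 is used twice, so the tally reports two zero-usage entries and one near-zero entry.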
================================================================================
BATTERY 3: FRONTIER BENCHMARKING
================================================================================
1. Tokenization Anatomy (Visual Examples)
'ව්යාකරණය':
SGPE ['ව්යා', 'කරණය'] (2 tokens)
OpenAI (o200k_base) ['ව්', 'යා', 'ක', 'රණ', 'ය'] (5 tokens)
Llama 4 Scout ['ව්', 'යා', 'කර', 'ණය'] (4 tokens)
DeepSeek V3 ['ව', '්', 'ය', 'ා', 'ක', 'ර', '�', '�', 'ය'] (9 tokens)
'ශ්‍රී ලංකාව':
SGPE ['ශ්\u200dරී', ' ලංකාව'] (2 tokens)
OpenAI (o200k_base) ['ශ්', '\u200dරී', ' ලංක', 'ාව'] (4 tokens)
Llama 4 Scout ['ශ්', '\u200dර', 'ී', ' ල', 'ං', 'ක', 'ාව'] (7 tokens)
DeepSeek V3 ['�', '�', '්', '\u200d', 'ර', 'ී', ' �', '�', '�', '�', 'ක', 'ා', 'ව'] (13 tokens)
'अंतर्राष्ट्रीय':
SGPE ['अंतर्राष्ट्रीय'] (1 token)
OpenAI (o200k_base) ['अ', 'ंतर', '्र', 'ाष्ट्रीय'] (4 tokens)
Llama 4 Scout ['अ', 'ंतर', '्र', 'ाष्ट्रीय'] (4 tokens)
DeepSeek V3 ['अ', 'ंत', 'र', '्र', 'ाष', '्ट', '्री', 'य'] (8 tokens)
'कृत्रिम बुद्धिमत्ता':
SGPE ['कृत्रिम', ' बुद्धिमत्ता'] (2 tokens)
OpenAI (o200k_base) ['क', 'ृ', 'त्र', 'िम', ' बुद्ध', 'िम', 'त्ता'] (7 tokens)
Llama 4 Scout ['क', 'ृ', 'त्र', 'िम', ' ब', 'ुद्ध', 'िम', 'त्ता'] (8 tokens)
DeepSeek V3 ['क', 'ृ', 'त्र', 'िम', ' ब', 'ुद', '्ध', 'िम', 'त्त', 'ा'] (10 tokens)
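The '�' entries in the DeepSeek V3 rows are what a byte-level tokenizer produces when a token boundary falls inside a multi-byte UTF-8 character and each token is decoded on its own. The sketch below reproduces the effect with plain Python byte strings. The helper `decode_piecewise` is hypothetical, and the split offset is an assumption: the log shows only that the letter ණ came out as two undecodable pieces, not where the split fell.

```python
def decode_piecewise(text, split_at):
    """Split the UTF-8 bytes of `text` at one offset and decode each piece alone."""
    data = text.encode("utf-8")
    pieces = [data[:split_at], data[split_at:]]
    # Each piece is an incomplete UTF-8 sequence, so decoding it in isolation
    # yields the replacement character U+FFFD.
    shown = [p.decode("utf-8", errors="replace") for p in pieces]
    # Concatenating the raw bytes before decoding restores the original text.
    restored = b"".join(pieces).decode("utf-8")
    return shown, restored

shown, restored = decode_piecewise("\u0DAB", 2)  # U+0DAB is the Sinhala letter ණ (3 bytes)
print(shown)
print(restored)
```

The replacement characters are therefore a display artifact of per-token decoding. The token count is the real cost being measured.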
Evaluating 1,499,950 sentences...
====== Sinhala Results ======
Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE | 6,654,288 | 1.274 | 4.83 | -
OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7%
Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4%
DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2%
====== Hindi Results ======
Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE | 13,433,554 | 1.181 | 4.29 | -
OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0%
Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3%
DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6%
====== English Results ======
Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE | 7,240,147 | 1.330 | 4.46 | -
OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4%
Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6%
DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4%
========================= OVERALL Results =========================
Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE | 27,327,989 | 1.240 | 4.47 | -
OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7%
Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6%
DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2%
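The "% Reduction" column is consistent with 100 × (1 − SGPE tokens / baseline tokens). The report does not state this formula, so it is inferred here from the numbers. The short check below recomputes the OVERALL column from the token counts in the table and confirms that the three per-language SGPE counts sum to the overall SGPE count. The function name `percent_reduction` is hypothetical.

```python
# Token counts copied from the OVERALL table above.
OVERALL_TOKENS = {
    "SGPE": 27_327_989,
    "OpenAI (o200k_base)": 43_174_798,
    "Llama 4 Scout": 45_236_671,
    "DeepSeek V3": 68_739_586,
}

# SGPE token counts copied from the per-language tables above.
SGPE_PER_LANGUAGE = {"Sinhala": 6_654_288, "Hindi": 13_433_554, "English": 7_240_147}

def percent_reduction(sgpe_tokens, baseline_tokens):
    """Share of the baseline's tokens that SGPE avoids, in percent (inferred formula)."""
    return 100.0 * (1.0 - sgpe_tokens / baseline_tokens)

for name, count in OVERALL_TOKENS.items():
    if name != "SGPE":
        print(f"{name:<22} {percent_reduction(OVERALL_TOKENS['SGPE'], count):.1f}%")

print(sum(SGPE_PER_LANGUAGE.values()) == OVERALL_TOKENS["SGPE"])
```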
================================================================================
BATTERY 4: ROUND-TRIP CONSISTENCY
================================================================================
Sentences tested: 1,499,950
Total words: 22,190,730
Total characters tested: 122,274,117
Total tokens generated: 27,503,859
Mismatches (non-UNK): 0
Mismatches (with UNK loss): 19,320
Crashes: 0
Result: PASS — Tested 1,499,950 sentences (122,274,117 chars). Non-UNK mismatches: 0, UNK-caused losses: 19320, Crashes: 0
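A round-trip test of this kind encodes each sentence, decodes the result, and compares it with the input, counting UNK-caused losses separately from genuine mismatches. The sketch below shows one way to structure such a check. It is not the harness that produced this report: `round_trip_report`, the toy character-level tokenizer, and the three-sentence corpus are all hypothetical stand-ins.

```python
UNK = "[UNK]"

def round_trip_report(sentences, encode, decode):
    """Encode then decode every sentence and classify each outcome."""
    stats = {"tested": 0, "non_unk_mismatches": 0, "unk_losses": 0, "crashes": 0}
    for sentence in sentences:
        stats["tested"] += 1
        try:
            restored = decode(encode(sentence))
        except Exception:
            stats["crashes"] += 1
            continue
        if restored == sentence:
            continue
        # A mismatch that contains UNK is attributed to out-of-vocabulary input;
        # any other mismatch is a real tokenizer defect.
        if UNK in restored:
            stats["unk_losses"] += 1
        else:
            stats["non_unk_mismatches"] += 1
    return stats

# Toy character-level tokenizer with a tiny vocabulary (illustration only).
VOCAB = set("abc ")

def toy_encode(text):
    return [ch if ch in VOCAB else UNK for ch in text]

def toy_decode(tokens):
    return "".join(tokens)

stats = round_trip_report(["abc", "a b", "abz"], toy_encode, toy_decode)
print(stats)
```

Separating the two mismatch classes is what lets a run pass with zero non-UNK mismatches while still reporting how many sentences lost content to UNK.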
================================================================================
BATTERY 5: BOUNDARY & LEADING SPACE EDGE-CASES
================================================================================
[✓] [B01-Sinhala-leading-space ] ' සිංහල' -> '[UNK]හල'
[✓] [B02-Sinhala-no-leading-space] 'සිංහල' -> '[UNK]හල'
[✓] [B03-Sinhala-trailing-punct ] 'සිංහල.' -> '[UNK]හල.'
[✓] [B04-Sinhala-multi-word ] 'දරුවන් පාසලට' -> 'දරුවන් පාසලට'
[✓] [D01-Devanagari-leading-space] ' हिंदी' -> '[UNK]दी'
[✓] [D02-Devanagari-no-leading ] 'नमस्ते' -> 'नमस्ते'
[✓] [D03-Devanagari-trailing-danda] 'नमस्ते।' -> 'नमस्ते।'
[✓] [D04-Devanagari-multi-word ] 'भारत देश' -> 'भारत देश'
[✓] [D05-Devanagari-anusvara ] 'संस्कृत' -> 'संस्कृत'
[✓] [F01-SinhalaEng ] 'සිංහලදABC' -> '[UNK]හලදABC'
[✓] [F02-DevanagariEng ] 'हिंदीDEF' -> '[UNK]दीDEF'
[✓] [F03-Sinhala-Devanagari ] 'සිංහල हिंदी' -> '[UNK]හල[UNK]दी'
[✓] [G01-Mixed-3-scripts ] ' සිංහල123ABCहिंदी ' -> '[UNK]හල123ABC[UNK]दी '
Result: PASS — Violations: 0
================================================================================
BATTERY 6: ZERO-BREAKAGE GUARANTEE (Sinhala)
================================================================================
Testing all C + HAL + ZWJ + C pairs...
Testing C + HAL + C pairs (implicit conjuncts)...
Testing C + vowel_sign (all combinations)...
Testing C + HAL (terminal virama)...
Testing C + anusvara / visarga...
Testing C + pili + anusvara...
Testing triple stacks...
Testing conjuncts with leading space...
Result: PASS — Ran 1,703 exhaustive breakage tests. Violations: 0
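The first step listed above, "all C + HAL + ZWJ + C pairs", can be generated mechanically from the Sinhala Unicode block. The sketch below builds those strings and applies one possible breakage rule: a token must not begin with a combining mark or a zero-width joiner, because that would mean a boundary fell inside a cluster. That rule is an assumption made for illustration, since the report does not define what it counts as a violation, and the names `conjunct_cases` and `violations` are hypothetical.

```python
import unicodedata
from itertools import product

HAL = "\u0DCA"  # Sinhala virama (al-lakuna)
ZWJ = "\u200D"  # zero-width joiner

# Assigned letters in the Sinhala consonant range U+0D9A..U+0DC6.
CONSONANTS = [
    chr(cp) for cp in range(0x0D9A, 0x0DC7)
    if unicodedata.category(chr(cp)) == "Lo"
]

def conjunct_cases():
    """Yield every consonant + HAL + ZWJ + consonant string."""
    for first, second in product(CONSONANTS, repeat=2):
        yield first + HAL + ZWJ + second

def violations(tokens):
    """Return tokens that begin with a dependent character (assumed breakage rule)."""
    bad = []
    for token in tokens:
        if not token:
            continue
        head = token[0]
        # Mn/Mc cover the virama and vowel signs; ZWJ is checked explicitly.
        if head == ZWJ or unicodedata.category(head) in ("Mn", "Mc"):
            bad.append(token)
    return bad

cases = list(conjunct_cases())
print(len(cases))
# Token lists copied from the Battery 3 anatomy of 'ශ්‍රී ලංකාව'.
print(violations(["ශ්\u200dරී", " ලංකාව"]))
print(violations(["ශ්", "\u200dරී", " ලංක", "ාව"]))
```

Under this rule the SGPE split of 'ශ්‍රී ලංකාව' shown in Battery 3 is clean, while the o200k_base split yields two flagged tokens.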
================================================================================
BATTERY 6B: ZERO-BREAKAGE GUARANTEE (Devanagari)
================================================================================
Testing Devanagari C + HAL + C pairs (implicit conjuncts)...
Testing Devanagari C + vowel_sign...
Testing Devanagari C + HAL (terminal virama)...
Testing Devanagari C + anusvara / visarga / chandrabindu...
Testing Devanagari C + vowel_sign + modifier...
Result: PASS — Ran 1,078 exhaustive breakage tests. Violations: 0
================================================================================
BATTERY 7: DEVANAGARI LINGUISTIC COMPLEXITY
================================================================================
Category                       Total    Pass    Fail
----------------------------------------------------
anusvara                           1       1       0
anusvara_prefix                    5       5       0
complex                            2       2       0
conjunct                           3       3       0
conjunct_anusvara                  4       4       0
double_conjunct                    1       1       0
double_conjunct_gen              470     470       0
extreme_compound                   1       1       0
matra                              3       3       0
sanskrit                           4       4       0
simple                             4       4       0
super_compound                     1       1       0
very_complex                       1       1       0
Result: PASS — Tested 500 Devanagari words. Violations: 0
================================================================================
BATTERY 8: CODE-SWITCHING INTEGRITY
================================================================================
[simple_sinhala_english ] 5 tokens | ['Hello', ',', ' ශ්\u200dරී', ' ලංකාව', '!']
[code_sinhala ] 5 tokens | ['const', ' x', ' =', ' ප්\u200dරකාශය', ';']
[devanagari_english ] 7 tokens | ['मेरा', ' नाम', ' है', ' और', ' I', ' love', ' Python']
[code_sinhala_mixed ] 9 tokens | ['function', ' foo', '()', ' {', ' return', " '", 'ශ්\u200dරී', "';"]
[sinhala_english_mixed ] 8 tokens | ['ශ', '\u200d', '්', '\u200d', 'රී', ' ලංකාව', ' is', ' beautiful']
[python_devanagari_comment ] 7 tokens | ['print', "('", 'नमस्ते', "')", ' #', ' Say', ' Hello']
[sinhala_english_complex ] 8 tokens | ['ඒ', ' කියන්නේ', ',', ' G', 'PE', ' Token', 'izer', ' English']
[python_sinhala_comment ] 10 tokens | ['for', ' i', ' in', ' range', '(', '10', '):', ' #']
[sql_devanagari ] 9 tokens | ['SELECT', ' *', ' FROM', ' users', ' WHERE', ' नाम', "='", 'राम']
[arrow_fn_sinhala ] 22 tokens | ['const', ' create', '_func', ' =', ' (', 'p', '1', ',']
[math_sinhala ] 6 tokens | ['123', ' +', ' ', '456', ' =', ' ෆ']
Result: PASS — Tested 13 code-switching cases. Violations: 0, Crashes: 0
================================================================================
BATTERY 9: META-VOCAB ROUND-TRIP
================================================================================
Sentences: 1,499,950
Round-trip failures: 0 (100.00% lossless)
Avg tokens/sentence: 18.3
UNK rate: 0.08%
Result: PASS — Tested 1,499,950 sentences. Failures: 0, Crashes: 0, Lossless: 100.00%, UNK rate: 0.08%
████████████████████████████████████████████████████████████████████████████████
█ █
█ SGPE - BATTLE TEST REPORT █
█ █
████████████████████████████████████████████████████████████████████████████████
Test Battery                                      Status    Key Metric
────────────────────────────────────────────────────────────────────────────────
Linguistic Complexity (2K Sanskrit/Pali Words)    ✓ PASS    0 violations
Glitched Token Detection (v2)                     ✓ PASS
Frontier Benchmarking (Stratified)                ✓ PASS
Round-Trip Consistency (v2)                       ✓ PASS    0 mismatches
Boundary Edge-Cases (v2)                          ✓ PASS
Zero-Breakage Guarantee (Extended)                ✓ PASS    0 violations
Zero-Breakage Guarantee (v2 Devanagari)           ✓ PASS
Devanagari Linguistic Complexity                  ✓ PASS    0 violations
Code-Switching Integrity                          ✓ PASS    0 violations
Meta-Vocab Round-Trip (SGPEMetaEncoder)           ✓ PASS
────────────────────────────────────────────────────────────────────────────────
TOTAL                                             P:10 F:0 W:0