================================================================================
BATTERY 1: SINHALA LINGUISTIC COMPLEXITY
================================================================================
Category                         Total    Pass    Fail
------------------------------------------------------
aadhyaathmika                        1       1       0
aakhyaanaya                          1       1       0
aathmaya                             1       1       0
abhidhamma                           1       1       0
adhyaapanaya                         1       1       0
adhyaksha                            1       1       0
aitihaasika                          1       1       0
aniccataava                          1       1       0
antahpuraya                          1       1       0
antharjaathika                       1       1       0
ashvayaa                             1       1       0
aushadhaya                           1       1       0
bare_hal_zwj                         1       1       0
bare_virama                          1       1       0
bare_zwj                             1       1       0
braahmana                            1       1       0
brahmaya                             1       1       0
chandrikaa                           1       1       0
chhandas                             1       1       0
conjunct_anusvara                   28      28       0
conjunct_pili_anusvara              22      22       0
constructed_multisyllable          252     252       0
cricket                              1       1       0
dangling_zwj                         1       1       0
dhammachakka                         1       1       0
dhyaanaya                            1       1       0
double_conjunct                     29      29       0
dravyaya                             1       1       0
duhkhaya                             1       1       0
grahanaya                            1       1       0
granthaya                            1       1       0
indriya                              1       1       0
jyotishya                            1       1       0
kramaya                              1       1       0
kshatriya                            1       1       0
kshetraya                            1       1       0
kshitija                             1       1       0
mahaparinibbana                      1       1       0
manahkalpita                         1       1       0
mantraya                             1       1       0
mrutyuva                             1       1       0
multi_conjunct_sequence              1       1       0
nibbaanaya                           1       1       0
nirvachanaathmaka                    1       1       0
nishkriya                            1       1       0
paticcasamuppaada                    1       1       0
praadeshiiyakaranaya                 1       1       0
praatibhaasika                       1       1       0
prajaava                             1       1       0
prakaashaya                          1       1       0
prashast                             1       1       0
pratipattiya                         1       1       0
prativyuuhaathmaka                   1       1       0
pratyaksha                           1       1       0
pratyayaya                           1       1       0
pratyuthpanna                        1       1       0
praudha                              1       1       0
premaya                              1       1       0
quad_stack                           1       1       0
quad_virama_chain                    1       1       0
rakaransaya_form                     3       3       0
ritvija                              1       1       0
saammpradaayika                      1       1       0
samasth                              1       1       0
sammaasambuddha                      1       1       0
samskrutaya                          1       1       0
samudraya                            1       1       0
sankhaaraya                          1       1       0
sanskaaraya                          1       1       0
sansthaapanaya                       1       1       0
satyaya                              1       1       0
saundarya                            1       1       0
shaastraya                           1       1       0
shaastriya                           1       1       0
shraddhaava                          1       1       0
shreemath                            1       1       0
shreshtha                            1       1       0
svaamiyaa                            1       1       0
svabhaavaya                          1       1       0
svachchhand                          1       1       0
tantraya                             1       1       0
triple_conjunct                      1       1       0
triple_conjunct_gen                 64      64       0
trividha                             1       1       0
udghoshanaya                         1       1       0
upaadaanaya                          1       1       0
upanishad                            1       1       0
vaichitrya                           1       1       0
vaidya                               1       1       0
vastraya                             1       1       0
very_long_compound                   1       1       0
vipassanaava                         1       1       0
vishvaasaya                          1       1       0
vowel_prefix_conjunct                1       1       0
vyaakaranaya                         1       1       0
vyaapaaraya                          1       1       0
vyatirekaya                          1       1       0
vyavahaarika                         1       1       0
vyavasthaava                         1       1       0
yansaya_form                         7       7       0
yantraya                             1       1       0
zwnj_middle                          1       1       0

Result: PASS — Tested 500 complex words. Violations: 0, Leading-space violations: 0

================================================================================
BATTERY 2: GLITCHED TOKEN DETECTION
================================================================================
Total unified vocab size: 328,020 (SGPE component: 128,001)
Zero-usage SGPE tokens:   1,394
Near-zero (< 3) tokens:   3,163

Result: PASS — Zero: 1394, Near-Zero: 3163, Glitched: 0

================================================================================
BATTERY 3: FRONTIER BENCHMARKING
================================================================================
1. Tokenization Anatomy (Visual Examples)

'ව්යාකරණය':
  SGPE                 ['ව්යා', 'කරණය'] (2 tokens)
  OpenAI (o200k_base)  ['ව්', 'යා', 'ක', 'රණ', 'ය'] (5 tokens)
  Llama 4 Scout        ['ව්', 'යා', 'කර', 'ණය'] (4 tokens)
  DeepSeek V3          ['ව', '්', 'ය', 'ා', 'ක', 'ර', '�', '�', 'ය'] (9 tokens)

'ශ්‍රී ලංකාව':
  SGPE                 ['ශ්\u200dරී', ' ලංකාව'] (2 tokens)
  OpenAI (o200k_base)  ['ශ්', '\u200dරී', ' ලංක', 'ාව'] (4 tokens)
  Llama 4 Scout        ['ශ්', '\u200dර', 'ී', ' ල', 'ං', 'ක', 'ාව'] (7 tokens)
  DeepSeek V3          ['�', '�', '්', '\u200d', 'ර', 'ී', ' �', '�', '�', '�', 'ක', 'ා', 'ව'] (13 tokens)

'अंतर्राष्ट्रीय':
  SGPE                 ['अंतर्राष्ट्रीय'] (1 tokens)
  OpenAI (o200k_base)  ['अ', 'ंतर', '्र', 'ाष्ट्रीय'] (4 tokens)
  Llama 4 Scout        ['अ', 'ंतर', '्र', 'ाष्ट्रीय'] (4 tokens)
  DeepSeek V3          ['अ', 'ंत', 'र', '्र', 'ाष', '्ट', '्री', 'य'] (8 tokens)

'कृत्रिम बुद्धिमत्ता':
  SGPE                 ['कृत्रिम', ' बुद्धिमत्ता'] (2 tokens)
  OpenAI (o200k_base)  ['क', 'ृ', 'त्र', 'िम', ' बुद्ध', 'िम', 'त्ता'] (7 tokens)
  Llama 4 Scout        ['क', 'ृ', 'त्र', 'िम', ' ब', 'ुद्ध', 'िम', 'त्ता'] (8 tokens)
  DeepSeek V3          ['क', 'ृ', 'त्र', 'िम', ' ब', 'ुद', '्ध', 'िम', 'त्त', 'ा'] (10 tokens)

Evaluating 1,499,950 sentences...

====== Sinhala Results ======
Tokenizer            |       Tokens |    TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE                 |    6,654,288 |  1.274 |    4.83 |           -
OpenAI (o200k_base)  |   17,360,196 |  3.324 |    1.85 |       61.7%
Llama 4 Scout        |   18,157,707 |  3.476 |    1.77 |       63.4%
DeepSeek V3          |   29,152,698 |  5.581 |    1.10 |       77.2%

====== Hindi Results ======
Tokenizer            |       Tokens |    TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE                 |   13,433,554 |  1.181 |    4.29 |           -
OpenAI (o200k_base)  |   18,394,075 |  1.617 |    3.13 |       27.0%
Llama 4 Scout        |   19,566,121 |  1.720 |    2.94 |       31.3%
DeepSeek V3          |   31,682,218 |  2.786 |    1.82 |       57.6%

====== English Results ======
Tokenizer            |       Tokens |    TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE                 |    7,240,147 |  1.330 |    4.46 |           -
OpenAI (o200k_base)  |    7,420,527 |  1.364 |    4.35 |        2.4%
Llama 4 Scout        |    7,512,843 |  1.381 |    4.30 |        3.6%
DeepSeek V3          |    7,904,670 |  1.453 |    4.09 |        8.4%

========================= OVERALL Results =========================
Tokenizer            |       Tokens |    TWR | Chr/Tok | % Reduction
----------------------------------------------------------------------
SGPE                 |   27,327,989 |  1.240 |    4.47 |           -
OpenAI (o200k_base)  |   43,174,798 |  1.959 |    2.83 |       36.7%
Llama 4 Scout        |   45,236,671 |  2.053 |    2.70 |       39.6%
DeepSeek V3          |   68,739,586 |  3.119 |    1.78 |       60.2%

================================================================================
BATTERY 4: ROUND-TRIP CONSISTENCY
================================================================================
Sentences tested:            1,499,950
Total words:                 22,190,730
Total characters tested:     122,274,117
Total tokens generated:      27,503,859
Mismatches (non-UNK):        0
Mismatches (with UNK loss):  19,320
Crashes:                     0

Result: PASS — Tested 1,499,950 sentences (122,274,117 chars).
Non-UNK mismatches: 0, UNK-caused losses: 19320, Crashes: 0

================================================================================
BATTERY 5: BOUNDARY & LEADING SPACE EDGE-CASES
================================================================================
[✓] [B01-Sinhala-leading-space   ] ' සිංහල' -> '[UNK]හල'
[✓] [B02-Sinhala-no-leading-space] 'සිංහල' -> '[UNK]හල'
[✓] [B03-Sinhala-trailing-punct  ] 'සිංහල.' -> '[UNK]හල.'
[✓] [B04-Sinhala-multi-word      ] 'දරුවන් පාසලට' -> 'දරුවන් පාසලට'
[✓] [D01-Devanagari-leading-space] ' हिंदी' -> '[UNK]दी'
[✓] [D02-Devanagari-no-leading   ] 'नमस्ते' -> 'नमस्ते'
[✓] [D03-Devanagari-trailing-danda] 'नमस्ते।' -> 'नमस्ते।'
[✓] [D04-Devanagari-multi-word   ] 'भारत देश' -> 'भारत देश'
[✓] [D05-Devanagari-anusvara     ] 'संस्कृत' -> 'संस्कृत'
[✓] [F01-SinhalaEng              ] 'සිංහලදABC' -> '[UNK]හලදABC'
[✓] [F02-DevanagariEng           ] 'हिंदीDEF' -> '[UNK]दीDEF'
[✓] [F03-Sinhala-Devanagari      ] 'සිංහල हिंदी' -> '[UNK]හල[UNK]दी'
[✓] [G01-Mixed-3-scripts         ] ' සිංහල123ABCहिंदी ' -> '[UNK]හල123ABC[UNK]दी '

Result: PASS — Violations: 0

================================================================================
BATTERY 6: ZERO-BREAKAGE GUARANTEE (Sinhala)
================================================================================
Testing all C + HAL + ZWJ + C pairs...
Testing C + HAL + C pairs (implicit conjuncts)...
Testing C + vowel_sign (all combinations)...
Testing C + HAL (terminal virama)...
Testing C + anusvara / visarga...
Testing C + pili + anusvara...
Testing triple stacks...
Testing conjuncts with leading space...

Result: PASS — Ran 1,703 exhaustive breakage tests. Violations: 0

================================================================================
BATTERY 6B: ZERO-BREAKAGE GUARANTEE (Devanagari)
================================================================================
Testing Devanagari C + HAL + C pairs (implicit conjuncts)...
Testing Devanagari C + vowel_sign...
Testing Devanagari C + HAL (terminal virama)...
Testing Devanagari C + anusvara / visarga / chandrabindu...
Testing Devanagari C + vowel_sign + modifier...

Result: PASS — Ran 1,078 exhaustive breakage tests. Violations: 0

================================================================================
BATTERY 7: DEVANAGARI LINGUISTIC COMPLEXITY
================================================================================
Category                       Total    Pass    Fail
----------------------------------------------------
anusvara                           1       1       0
anusvara_prefix                    5       5       0
complex                            2       2       0
conjunct                           3       3       0
conjunct_anusvara                  4       4       0
double_conjunct                    1       1       0
double_conjunct_gen              470     470       0
extreme_compound                   1       1       0
matra                              3       3       0
sanskrit                           4       4       0
simple                             4       4       0
super_compound                     1       1       0
very_complex                       1       1       0

Result: PASS — Tested 500 Devanagari words. Violations: 0

================================================================================
BATTERY 8: CODE-SWITCHING INTEGRITY
================================================================================
[simple_sinhala_english    ]  5 tokens | ['Hello', ',', ' ශ්\u200dරී', ' ලංකාව', '!']
[code_sinhala              ]  5 tokens | ['const', ' x', ' =', ' ප්\u200dරකාශය', ';']
[devanagari_english        ]  7 tokens | ['मेरा', ' नाम', ' है', ' और', ' I', ' love', ' Python']
[code_sinhala_mixed        ]  9 tokens | ['function', ' foo', '()', ' {', ' return', " '", 'ශ්\u200dරී', "';"]
[sinhala_english_mixed     ]  8 tokens | ['ශ', '\u200d', '්', '\u200d', 'රී', ' ලංකාව', ' is', ' beautiful']
[python_devanagari_comment ]  7 tokens | ['print', "('", 'नमस्ते', "')", ' #', ' Say', ' Hello']
[sinhala_english_complex   ]  8 tokens | ['ඒ', ' කියන්නේ', ',', ' G', 'PE', ' Token', 'izer', ' English']
[python_sinhala_comment    ] 10 tokens | ['for', ' i', ' in', ' range', '(', '10', '):', ' #']
[sql_devanagari            ]  9 tokens | ['SELECT', ' *', ' FROM', ' users', ' WHERE', ' नाम', "='", 'राम']
[arrow_fn_sinhala          ] 22 tokens | ['const', ' create', '_func', ' =', ' (', 'p', '1', ',']
[math_sinhala              ]  6 tokens | ['123', ' +', ' ', '456', ' =', ' ෆ']

Result: PASS — Tested 13 code-switching cases.
Violations: 0, Crashes: 0

================================================================================
BATTERY 9: META-VOCAB ROUND-TRIP
================================================================================
Sentences:            1,499,950
Round-trip failures:  0 (100.00% lossless)
Avg tokens/sentence:  18.3
UNK rate:             0.08%

Result: PASS — Tested 1,499,950 sentences. Failures: 0, Crashes: 0, Lossless: 100.00%, UNK rate: 0.08%

████████████████████████████████████████████████████████████████████████████████
█                                                                              █
█                          SGPE - BATTLE TEST REPORT                           █
█                                                                              █
████████████████████████████████████████████████████████████████████████████████

Test Battery                                    Status    Key Metric
────────────────────────────────────────────────────────────────────────────────
Linguistic Complexity (2K Sanskrit/Pali Words)  ✓ PASS    0 violations
Glitched Token Detection (v2)                   ✓ PASS
Frontier Benchmarking (Stratified)              ✓ PASS
Round-Trip Consistency (v2)                     ✓ PASS    0 mismatches
Boundary Edge-Cases (v2)                        ✓ PASS
Zero-Breakage Guarantee (Extended)              ✓ PASS    0 violations
Zero-Breakage Guarantee (v2 Devanagari)         ✓ PASS
Devanagari Linguistic Complexity                ✓ PASS    0 violations
Code-Switching Integrity                        ✓ PASS    0 violations
Meta-Vocab Round-Trip (SGPEMetaEncoder)         ✓ PASS
────────────────────────────────────────────────────────────────────────────────
TOTAL                                           P:10 F:0 W:0
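
Note on the Battery 3 "% Reduction" column: the reported values are consistent with the relative token saving of SGPE against each baseline, that is, 100 * (1 - SGPE tokens / baseline tokens). The sketch below recomputes the overall row from the token counts in this report. It is an assumption about how the column was derived, not the benchmark's actual code, and `pct_reduction` is a hypothetical helper name.

```python
def pct_reduction(sgpe_tokens: int, baseline_tokens: int) -> float:
    """Percentage of baseline tokens saved by SGPE (assumed definition)."""
    return 100.0 * (1.0 - sgpe_tokens / baseline_tokens)


# Token counts copied from the OVERALL Results table in Battery 3.
SGPE_OVERALL = 27_327_989
BASELINES_OVERALL = {
    "OpenAI (o200k_base)": 43_174_798,
    "Llama 4 Scout": 45_236_671,
    "DeepSeek V3": 68_739_586,
}

# One-decimal rounding matches the precision shown in the report.
overall_reduction = {
    name: round(pct_reduction(SGPE_OVERALL, tokens), 1)
    for name, tokens in BASELINES_OVERALL.items()
}

for name, value in overall_reduction.items():
    print(f"{name:<20} {value:>5.1f}%")
```

The same helper applied to the Sinhala row for OpenAI (6,654,288 against 17,360,196 tokens) gives 61.7%, which also matches the table.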