---
license: apache-2.0
datasets:
- Remeinium/WWHO_30m
language:
- si
- hi
- en
pipeline_tag: feature-extraction
library_name: transformers
tags:
- tokenizer
- WWHO
- SGPE
- linguis_trie
- token
- tokenization
- Syllable
- remeinium
- transformer
- linguistics
- NLP
- sinhala
- hindi
- english
- BPE
- GPE
model-index:
- name: WWHO
results:
- task:
type: feature-extraction
dataset:
name: WWHO_30m
type: Remeinium/WWHO_30m
metrics:
- name: Token-to-Word Ratio (TWR) - Sinhala
type: twr
value: 1.274
verified: false
- name: Token-to-Word Ratio (TWR) - Hindi
type: twr
value: 1.181
verified: false
- name: Token-to-Word Ratio (TWR) - Overall
type: twr
value: 1.240
verified: false
---
# Separate before you Compress
<!-- **Remeinium Research**
[remeinium.com](https://remeinium.com) | [Paper](https://arxiv.org/abs/...) | [Tokenizer](https://huggingface.co/remeinium/WWHO) | [Dataset](https://huggingface.co/datasets/remeinium/WWHO_Cleaned_30m)
--- -->
## The Next Architectural Primitive in Tokenization
Large language models remain linguistically blind to Abugida scripts. Byte-Pair Encoding and its descendants routinely shatter complex conjuncts — atomic multi-codepoint grapheme clusters that constitute the fundamental phonetic units of Indic and Southeast Asian writing systems — into meaningless sub-character fragments. The result is degraded reasoning, inflated inference costs, and a systemic “Token Tax” that disproportionately burdens more than one billion speakers.
**WWHO (Where-What-How Often) introduces the clean separation of concerns the field has been missing.**
By decoupling linguistic structural constraints from statistical compression, WWHO builds a unified meta-vocabulary space:
1. **Layer 1 (Where): Code-Switching Router**
   A linear $O(N)$ block scanner that classifies each character in $O(1)$ time to identify script boundaries, routing Latin text to proven frontier tokenizers (such as `o200k_base`) while directing Abugida text to specialized processing.
2. **Layer 2 (What): LinguisTrie**
   Enforces linguistic integrity by construction: a DFA-based syllabifier segments raw Unicode into well-formed syllables with a formal zero-breakage guarantee.
3. **Layer 3 (How Often): SGPE & Meta-Vocabulary**
Performs statistical pair merging exclusively over this linguistically sound stream, safely projecting the resulting tokens into a unified, mathematically offset ID space.
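The Layer 1 routing step can be sketched with plain Unicode block checks. This is a minimal illustration, not the released implementation: the run-splitting logic and the treatment of neutral characters (spaces, punctuation) are assumptions.

```python
# Minimal sketch of a Layer 1 "Where" router: classify each codepoint by
# Unicode block in O(1) and split a mixed string into same-script runs.
# Block ranges are the standard Unicode Sinhala and Devanagari blocks.

def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x0D80 <= cp <= 0x0DFF:   # Sinhala block
        return "sinhala"
    if 0x0900 <= cp <= 0x097F:   # Devanagari block
        return "devanagari"
    return "latin"               # default route: frontier Latin tokenizer

def split_runs(text: str):
    """Split text into (script, substring) runs in a single O(N) pass."""
    runs, current, prev = [], [], None
    for ch in text:
        s = script_of(ch)
        if ch in " \t.,!?":      # neutral characters stay in the current run
            s = prev or s
        if s != prev and current:
            runs.append((prev, "".join(current)))
            current = []
        current.append(ch)
        prev = s
    if current:
        runs.append((prev, "".join(current)))
    return runs

print(split_runs("hello ආයුබෝවන් नमस्ते"))
```

Each run can then be handed to the appropriate downstream tokenizer: Latin runs to `o200k_base`, Abugida runs to Layers 2 and 3.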
Sinhala and Devanagari serve as the high-complexity proofs-of-concept. The same architecture generalizes directly to Tamil, Khmer, Myanmar, and the broader Abugida family.
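The Layer 3 merge-and-offset idea can be illustrated with a toy pair-merging pass over an already-syllabified stream. This is a hedged sketch: the syllable stand-ins, merge count, and `BASE_OFFSET` value are illustrative assumptions, not the released SGPE vocabulary.

```python
from collections import Counter

# Toy "How Often" pass: greedy pair merging over pre-syllabified streams.
# Because Layer 2 already emits whole syllables, no merge can ever split a
# grapheme cluster -- merges only ever glue valid syllables together.

BASE_OFFSET = 200_000  # assumed offset past the Latin tokenizer's ID range

def top_pair(corpus):
    """Most frequent adjacent syllable pair across all sentences."""
    pairs = Counter()
    for sent in corpus:
        pairs.update(zip(sent, sent[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(sent, pair, merged):
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

def sgpe_merges(corpus, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = top_pair(corpus)
        if pair is None:
            break
        merged = pair[0] + pair[1]
        merges.append(pair)
        corpus = [apply_merge(sent, pair, merged) for sent in corpus]
    return corpus, merges

# Latin-letter stand-ins for syllables already segmented by Layer 2.
corpus, merges = sgpe_merges([["ka", "ta", "ka", "ta"], ["ka", "ta", "ra"]], 1)

# Project the final token strings into an offset meta-vocabulary ID space,
# so Abugida IDs never collide with the routed Latin tokenizer's IDs.
vocab = {tok: BASE_OFFSET + i
         for i, tok in enumerate(sorted({t for sent in corpus for t in sent}))}
print(corpus, merges, vocab)
```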
---
## Multi-Script Stratified Benchmarks (122.2M Characters)
We evaluated WWHO against frontier tokenizers across a 1.5-million-sentence code-switched corpus containing Sinhala, Hindi (Devanagari), and English. TWR is tokens per word (lower is better), Chr/Tok is characters per token (higher is better), and % Reduction is the share of each baseline's tokens that WWHO eliminates.
### 1. Sinhala Efficiency
| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **6,654,288** | **1.274** | **4.83** | **-** |
| OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7% |
| Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4% |
| DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2% |
### 2. Hindi (Devanagari) Efficiency
| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **13,433,554** | **1.181** | **4.29** | **-** |
| OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0% |
| Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3% |
| DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6% |
### 3. English
| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **7,240,147** | **1.330** | **4.46** | **-** |
| OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4% |
| Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6% |
| DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4% |
*(Note: WWHO routes Latin text directly to the native tiktoken `o200k_base` sequence, so per-run English segmentation is identical. The small difference in total tokens arises solely from boundary-crossing mechanics in code-switched sentences.)*
### 4. Overall (Mixed-Script)
| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **27,327,989** | **1.240** | **4.47** | **-** |
| OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7% |
| Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6% |
| DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2% |
- **Zero-Breakage Guarantee**: Validated through exhaustive permutation testing across all supported Abugida scripts (0 violations).
- **Full-corpus reconstruction**: 1.5M code-switched sentences encoded and decoded with 0 non-UNK mismatches.
- **UNK rate**: 0.08% (restricted to rare compounds; no structural boundaries are violated).
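The table metrics above can be reproduced from raw counts. This is a minimal sketch of the formulas; how words are delimited in the evaluation protocol is an assumption not stated here.

```python
# Sketch of the three table metrics: Token-to-Word Ratio (TWR),
# characters per token (Chr/Tok), and WWHO's token reduction vs. a baseline.

def twr(num_tokens: int, num_words: int) -> float:
    """Tokens emitted per word; 1.0 would be one token per word."""
    return num_tokens / num_words

def chars_per_token(num_chars: int, num_tokens: int) -> float:
    return num_chars / num_tokens

def reduction(wwho_tokens: int, baseline_tokens: int) -> float:
    """Fraction of the baseline's tokens that WWHO eliminates."""
    return 1 - wwho_tokens / baseline_tokens

# Sinhala row from the table above: WWHO vs. o200k_base.
print(round(reduction(6_654_288, 17_360_196) * 100, 1))  # 61.7
```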
WWHO radically compresses the context window for Abugida text, effectively ending the Token Tax without penalizing existing state-of-the-art programming and reasoning capabilities.
---
## Quick Start with Hugging Face
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("remeinium/WWHO")

text = "ආයුබෝවන් ශ්රී ලංකා"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['ආයුබෝවන්', ' ශ්රී', ' ලංකා']

print(tokenizer.encode(text))
```
---
## Resources
<!--
- **Research Paper**: “The Syllable is the Token: Breaking the Token Tax with SGPE” (Remeinium Research, February 2026) -->
- **Pre-trained Tokenizer**: [Hugging Face](https://huggingface.co/remeinium/WWHO)
- **Cleaned Training Corpus**: [Hugging Face](https://huggingface.co/datasets/remeinium/WWHO_30m)
- **Full Code & Evaluation Harness**: [GitHub](https://github.com/remeinium/WWHO)
---
## License
Apache License 2.0 — see [LICENSE](LICENSE).
**Remeinium Research | Remeinium AI | Intelligence for a Greater Tomorrow**
---