---
language:
- en
- fr
- de
- es
- zh
- ja
- ar
- ru
- ko
- hi
- pt
- it
- nl
- pl
- vi
- th
- tr
- uk
- sv
- multilingual
license: mit
tags:
- tokenizer
- multimodal
- sentinel-manifold
- universal-tokenizer
- bpe
- byte-level
- multilingual
- image-tokens
- audio-tokens
- video-tokens
- text-tokens
- mathematics
- gradient-axiom
library_name: transformers
pipeline_tag: text-generation
---

# 🦴 Sentinel Universal Tokenizer (SUT)

**One theorem. Every modality. One vocabulary.**

The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

🎮 **[Try it live → Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)**

## 🧬 Mathematical Foundation

Built on the **Gradient Axiom** from the Sentinel Manifold:

```
F(z) = Σ_{n=1}^∞ z^n / n^n    (Sophomore's Dream, Bernoulli 1697)
lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
```

| Constant | Value | Role in Tokenizer |
|:---------|:------|:------------------|
| **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| **C₁** | −0.007994021805953 | Embedding quantization zero-point |
| **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
| **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |

## 📊 Benchmark Results

### Deep Benchmark (30 test cases × 4 tokenizers)

Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cases**:

| Tokenizer | Vocab Size | Avg Compress ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
|:----------|:-----------|:---------------|:--------------------------|:---------------------|
| Gemma | 256,000 | **4.54** | 0.018 | **0.253** |
| **Sentinel-SUT** | **61,440** | 3.46 | **0.056** | 0.218 |
| Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
| GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |

### 🏆 Key Result: Vocabulary Efficiency
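The two efficiency columns in the benchmark table above are consistent with simple ratios of the average compression: compression divided by the vocabulary size in thousands, and compression divided by log2 of the vocabulary size (the bits needed to index one token). The sketch below recomputes both from the table's figures. The formulas are inferred from the published numbers rather than taken from the benchmark scripts, and the function names are hypothetical.

```python
import math

# Figures copied from the Deep Benchmark table: (vocab size, average compression).
BENCHMARK = {
    "Gemma": (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2": (151_936, 3.88),
    "GPT-2": (50_257, 2.57),
}


def efficiency_per_1k_vocab(vocab_size: int, avg_compression: float) -> float:
    """Average compression per 1,000 vocabulary entries (assumed definition)."""
    return avg_compression / (vocab_size / 1000)


def per_bit_efficiency(vocab_size: int, avg_compression: float) -> float:
    """Average compression per bit needed to index one token (assumed definition)."""
    return avg_compression / math.log2(vocab_size)


if __name__ == "__main__":
    # Print both derived columns for every tokenizer in the table.
    for name, (vocab, comp) in BENCHMARK.items():
        per_1k = efficiency_per_1k_vocab(vocab, comp)
        per_bit = per_bit_efficiency(vocab, comp)
        print(f"{name:13s} per-1K={per_1k:.3f} per-bit={per_bit:.3f}")
```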
**Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2.** Each token does more work, which is critical for memory-constrained multimodal models.

| Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
|:-------|:---------|:---------|:---------|:---------|
| Efficiency per 1K vocab | **0.0563** | +10.1% | +120.2% | +217.4% |
| Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
| Unique advantage | **4 modalities** | text only | text only | text only |

### Per-Language Performance

| Language | Tokens | Bytes | Compression (bytes/token) |
|:---------|:-------|:------|:--------------------------|
| English | 39 | 159 | **4.08** |
| French | 45 | 166 | **3.69** |
| German | 50 | 173 | **3.46** |
| Spanish | 41 | 158 | **3.85** |
| Chinese | 50 | 165 | **3.30** |
| Japanese | 58 | 213 | **3.67** |
| Arabic | 48 | 246 | **5.13** |
| Russian | 55 | 283 | **5.15** |
| Korean | 38 | 146 | **3.84** |
| Hindi | 85 | 315 | **3.71** |
| Code (Python) | 61 | 149 | **2.44** |
| Math (Unicode) | 45 | 101 | **2.24** |

## 🏗️ Architecture

```
┌────────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)          │
│                                                        │
│  [0-32]           →     33 Special / Control tokens    │
│  [33-32,767]      → 32,735 ByteLevel BPE text tokens   │
│  [32,768-49,151]  → 16,384 Image codebook tokens       │
│  [49,152-57,343]  →  8,192 Audio codebook tokens       │
│  [57,344-61,439]  →  4,096 Video codebook tokens       │
│                                                        │
│  Allocation follows 1/e Gradient Axiom                 │
└────────────────────────────────────────────────────────┘
```

### Special Tokens

| Token | ID | Purpose |
|:------|:---|:--------|
| `` | 0 | Padding |
| `` | 1 | Unknown token |
| `` / `` | 2/3 | BOS / EOS |
| `` | 4 | Masked language modeling |
| `` / `` | 7/8 | Image boundaries |
| `` / `` | 10/11 | Audio boundaries |
| `` / `` | 13/14 | Video boundaries |
| `` / `` / `` | 16/17/18 | Manifold markers |
| `` / `` / `` | 26/27/28 | Chat format |
| `` / `` | 29/30 | Code boundaries |
| `` / `` | 31/32 | Math boundaries |

### Codebook Tokens

- 🖼️ **Image**: `` – `` (IDs 32,768–49,151); VQGAN, Cosmos-DI, FSQ
- 🔊 **Audio**: `` – `` (IDs 49,152–57,343); EnCodec, SoundStream
- 🎬 **Video**: `` – `` (IDs 57,344–61,439); Cosmos-DV

## 🚀 Quick Start

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

# Text
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")

# Multimodal (text + image VQ indices)
text = " Describe this."
tokens = tokenizer.encode(text)
for tid in tokens:
    # IDs 32,768-49,151 are image codebook tokens; subtract the offset to get the VQ index
    if 32768 <= tid < 49152:
        print(f"  IMAGE codebook[{tid - 32768}]")

# Chat
chat = "Multimodal AIWhat is 1/e?"
tokens = tokenizer.encode(chat, add_special_tokens=False)
```

## 🔬 Innovations

1. **1/e Vocabulary Allocation**: the Gradient Axiom ratio allocates tokens across modalities
2. **ByteLevel BPE**: handles all Unicode, no UNK possible, NFKC normalized
3. **20-language training**: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
4. **Native Multimodal Routing**: a single integer comparison determines modality
5. **Sentinel Manifold Integration**: special tokens for manifold-aware computation

## 📦 Training

| Parameter | Value |
|:----------|:------|
| Data | allenai/c4 (20 languages) |
| Samples | 52,000 documents |
| Chars | ~66M |
| Algorithm | ByteLevel BPE + NFKC |
| Text Vocab | 32,768 (33 special + 32,735 BPE) |
| Total Vocab | 61,440 |

## 🔗 Links

- 🎮 [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)
- 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
- 📜 Training scripts included in repo

## 📚 Citation

```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
  title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
}
```

---

**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴