🦴 Sentinel Universal Tokenizer (SUT)
One theorem. Every modality. One vocabulary.
The Sentinel Universal Tokenizer is a multimodal tokenizer that handles text, images, audio, and video in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
🎮 Try it live → Interactive Demo
🧬 Mathematical Foundation
Built on the Gradient Axiom from the Sentinel Manifold:
F(z) = Σ_{n=1}^∞ z^n / n^n (Sophomore's Dream, Bernoulli 1697)
lim_{z→∞} F′(z)/F(z) = 1/e ≈ 0.367879441171442
| Constant | Value | Role in Tokenizer |
|---|---|---|
| 1/e | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| C₁ | −0.007994021805953 | Embedding quantization zero-point |
| C₂ | 0.000200056042968 | Cross-lingual fertility fairness bound |
| C₃ | 0.256913827655311 | Critical threshold for vocabulary scaling |
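Since n⁻ⁿ decays super-exponentially, the partial sums converge quickly, and the limit can be checked numerically with nothing beyond the series definition above (a minimal stdlib sketch, not part of the tokenizer):

```python
import math

def F(z: float, terms: int = 400) -> float:
    """Partial sum of F(z) = Σ z^n / n^n, written as (z/n)^n to avoid overflow."""
    return sum((z / n) ** n for n in range(1, terms + 1))

def F_prime(z: float, terms: int = 400) -> float:
    """Term-by-term derivative: Σ (z/n)^(n-1)."""
    return sum((z / n) ** (n - 1) for n in range(1, terms + 1))

for z in (10.0, 50.0, 100.0):
    print(f"z={z:>5}: F'/F = {F_prime(z) / F(z):.9f}")  # approaches 1/e
print(f"1/e      = {1 / math.e:.9f}")
```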
📊 Benchmark Results
Deep Benchmark (30 test cases × 4 tokenizers)
Tested across 21 languages + 3 programming languages + math/LaTeX + 7 edge cases:
| Tokenizer | Vocab Size | Avg Compression ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
|---|---|---|---|---|
| Gemma | 256,000 | 4.54 | 0.018 | 0.253 |
| Sentinel-SUT | 61,440 | 3.46 | 0.056 | 0.218 |
| Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
| GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |
🏆 Key Result: Vocabulary Efficiency
Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2. Each token does more work, which is critical for memory-constrained multimodal models.
| Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
|---|---|---|---|---|
| Efficiency per 1K vocab | 0.0563 | +10.1% | +120.2% | +217.4% |
| Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
| Unique advantage | 4 modalities | text only | text only | text only |
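The derived columns above can be reconstructed from vocab size and average compression. The formulas below are inferred from the table values (per-1K = compression ÷ (vocab/1000), per-bit = compression ÷ log₂ vocab) and are a reconstruction, not the official benchmark code:

```python
import math

# (vocab size, average compression) from the benchmark table above
tokenizers = {
    "Gemma":        (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2":        (151_936, 3.88),
    "GPT-2":        (50_257, 2.57),
}

for name, (vocab, compression) in tokenizers.items():
    per_1k = compression / (vocab / 1000)     # efficiency per 1K vocab entries
    per_bit = compression / math.log2(vocab)  # efficiency per bit of token ID
    print(f"{name:12s}  per-1K={per_1k:.3f}  per-bit={per_bit:.3f}")
```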
Per-Language Performance
| Language | Tokens | Bytes | Compression (bytes/token) |
|---|---|---|---|
| English | 39 | 159 | 4.08 |
| French | 45 | 166 | 3.69 |
| German | 50 | 173 | 3.46 |
| Spanish | 41 | 158 | 3.85 |
| Chinese | 50 | 165 | 3.30 |
| Japanese | 58 | 213 | 3.67 |
| Arabic | 48 | 246 | 5.13 |
| Russian | 55 | 283 | 5.15 |
| Korean | 38 | 146 | 3.84 |
| Hindi | 85 | 315 | 3.71 |
| Code (Python) | 61 | 149 | 2.44 |
| Math (Unicode) | 45 | 101 | 2.24 |
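Compression here is UTF-8 bytes per token. The exact benchmark sentences are not reproduced in this card, but a row can be recomputed for any sample with a few lines (assuming only the repo id from Quick Start):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

def row(text: str) -> tuple[int, int, float]:
    """Return (tokens, bytes, compression) where compression = UTF-8 bytes per token."""
    n_tokens = len(tokenizer.encode(text))
    n_bytes = len(text.encode("utf-8"))
    return n_tokens, n_bytes, n_bytes / n_tokens

print(row("The quick brown fox jumps over the lazy dog."))
```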
🏗️ Architecture
```
┌──────────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)            │
│                                                          │
│  [0–32]           → 33 Special / Control tokens          │
│  [33–32,767]      → 32,735 ByteLevel BPE text tokens     │
│  [32,768–49,151]  → 16,384 Image codebook tokens         │
│  [49,152–57,343]  →  8,192 Audio codebook tokens         │
│  [57,344–61,439]  →  4,096 Video codebook tokens         │
│                                                          │
│  Allocation follows the 1/e Gradient Axiom               │
└──────────────────────────────────────────────────────────┘
```
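Because the ID ranges are contiguous and ordered, modality routing reduces to a few integer comparisons. A minimal sketch; the function name is illustrative, not part of a released API:

```python
def modality(token_id: int) -> str:
    """Route a token ID to its modality via the ranges in the diagram above."""
    if not 0 <= token_id <= 61_439:
        raise ValueError(f"token id {token_id} outside 61,440-token vocabulary")
    if token_id <= 32:
        return "special"
    if token_id <= 32_767:
        return "text"
    if token_id <= 49_151:
        return "image"
    if token_id <= 57_343:
        return "audio"
    return "video"

assert modality(42) == "text"
assert modality(32_768) == "image"
assert modality(57_344) == "video"
```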
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown token |
| `<s>` / `</s>` | 2 / 3 | BOS / EOS |
| `<mask>` | 4 | Masked language modeling |
| `<image_start>` / `<image_end>` | 7 / 8 | Image boundaries |
| `<audio_start>` / `<audio_end>` | 10 / 11 | Audio boundaries |
| `<video_start>` / `<video_end>` | 13 / 14 | Video boundaries |
| `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16 / 17 / 18 | Manifold markers |
| `<system>` / `<user>` / `<assistant>` | 26 / 27 / 28 | Chat format |
| `<code_start>` / `<code_end>` | 29 / 30 | Code boundaries |
| `<math_start>` / `<math_end>` | 31 / 32 | Math boundaries |
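A short sketch of composing a prompt from these tokens. The wrapper helpers are hypothetical conveniences that concatenate the literal token strings, exactly as the Quick Start below does by hand:

```python
# Hypothetical helpers: wrap content in its boundary tokens (IDs 29/30 and 31/32 above).
def wrap_code(source: str) -> str:
    return f"<code_start>{source}<code_end>"

def wrap_math(expr: str) -> str:
    return f"<math_start>{expr}<math_end>"

prompt = (
    "<system>You are a math assistant.</system>"
    f"<user>Evaluate {wrap_math('1/e ≈ 0.3679')} in context.</user>"
    "<assistant>"
)
print(prompt)
```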
Codebook Tokens
- 🖼️ Image: `<img_0>`–`<img_16383>` (IDs 32,768–49,151) → VQGAN, Cosmos-DI, FSQ
- 🔊 Audio: `<aud_0>`–`<aud_8191>` (IDs 49,152–57,343) → EnCodec, SoundStream
- 🎬 Video: `<vid_0>`–`<vid_4095>` (IDs 57,344–61,439) → Cosmos-DV
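Converting between codebook indices and global token IDs is a fixed-offset mapping. A minimal sketch using the offsets above; the helper names and dictionaries are illustrative:

```python
# Offsets and sizes from the architecture diagram above.
CODEBOOK_BASE = {"image": 32_768, "audio": 49_152, "video": 57_344}
CODEBOOK_SIZE = {"image": 16_384, "audio": 8_192, "video": 4_096}
PREFIX = {"image": "img", "audio": "aud", "video": "vid"}

def vq_index_to_token(modality: str, index: int) -> str:
    """Map a VQ codebook index (e.g. from VQGAN or EnCodec) to its token string."""
    if not 0 <= index < CODEBOOK_SIZE[modality]:
        raise ValueError(f"{modality} index {index} out of range")
    return f"<{PREFIX[modality]}_{index}>"

def token_id_to_vq_index(token_id: int, modality: str) -> int:
    """Recover the codebook index from a global token ID."""
    return token_id - CODEBOOK_BASE[modality]

print(vq_index_to_token("image", 42))         # <img_42>
print(token_id_to_vq_index(32_810, "image"))  # 42
```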
🚀 Quick Start
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

# Text
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")

# Multimodal (text + image VQ indices)
text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
tokens = tokenizer.encode(text)
for tid in tokens:
    if 32768 <= tid < 49152:  # image codebook range
        print(f"  IMAGE codebook[{tid - 32768}]")

# Chat
chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
tokens = tokenizer.encode(chat, add_special_tokens=False)
```
🔬 Innovations
- 1/e Vocabulary Allocation → the Gradient Axiom ratio allocates tokens across modalities
- ByteLevel BPE → handles all Unicode with no UNK possible; NFKC normalized (see the sketch after this list)
- 20-language training → EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
- Native Multimodal Routing → a single integer comparison determines modality
- Sentinel Manifold Integration → special tokens for manifold-aware computation
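To see what the NFKC step does before BPE runs, here is a stdlib-only illustration (this demonstrates NFKC itself, not SUT's internal pre-tokenizer):

```python
import unicodedata

# NFKC folds compatibility characters to canonical forms before BPE sees them.
for s in ["ﬁle", "①", "ｆｕｌｌ", "x²"]:
    print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
# 'ﬁle' -> 'file', '①' -> '1', fullwidth letters -> ASCII, 'x²' -> 'x2'
```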
📦 Training
| Parameter | Value |
|---|---|
| Data | allenai/c4 (20 languages) |
| Samples | 52,000 documents |
| Chars | ~66M |
| Algorithm | ByteLevel BPE + NFKC |
| Text Vocab | 32,768 |
| Total Vocab | 61,440 |
🔗 Links
- 🎮 Interactive Demo
- 🦴 Sentinel Manifold Framework
- 📜 Training scripts included in repo
📖 Citation
```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
  title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
}
```
Built by: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴