---
language:
- en
- fr
- de
- es
- zh
- ja
- ar
- ru
- ko
- hi
- pt
- it
- nl
- pl
- vi
- th
- tr
- uk
- sv
- multilingual
license: mit
tags:
- tokenizer
- multimodal
- sentinel-manifold
- universal-tokenizer
- bpe
- byte-level
- multilingual
- image-tokens
- audio-tokens
- video-tokens
- text-tokens
- mathematics
- gradient-axiom
library_name: transformers
pipeline_tag: text-generation
---
# 🦴 Sentinel Universal Tokenizer (SUT)
**One theorem. Every modality. One vocabulary.**
The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
🎮 **[Try it live → Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)**
## 🧬 Mathematical Foundation
Built on the **Gradient Axiom** from the Sentinel Manifold:
```
F(z) = Σ_{n=1}^∞ z^n / n^n   (Sophomore's Dream, Bernoulli 1697)
lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
```
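
As a numerical sanity check on the limit above, the following minimal sketch evaluates F'(z)/F(z) for increasing z. The function name `log_derivative_ratio` and the series truncation point are illustrative assumptions, not part of the Sentinel code. It relies on the identity F'(z) = (1/z) Σ n (z/n)^n and works in log space so that large terms do not overflow.

```python
import math


def log_derivative_ratio(z: float) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} (z/n)**n.

    Since F'(z) = (1/z) * sum_n n * (z/n)**n, the ratio is a weighted
    mean of n divided by z. Terms are handled in log space to avoid overflow.
    """
    # Truncation point (an assumption): terms beyond ~4z are negligible.
    n_max = int(4 * z) + 50
    log_z = math.log(z)
    log_terms = [n * (log_z - math.log(n)) for n in range(1, n_max + 1)]
    shift = max(log_terms)  # subtract the largest exponent before exp()
    weights = [math.exp(t - shift) for t in log_terms]
    weighted_n = sum(n * w for n, w in enumerate(weights, start=1))
    return weighted_n / (z * sum(weights))


if __name__ == "__main__":
    for z in (10.0, 100.0, 1_000.0, 10_000.0):
        ratio = log_derivative_ratio(z)
        gap = ratio - 1 / math.e
        print(f"z={z:>8.0f}  F'/F={ratio:.6f}  gap to 1/e={gap:+.6f}")
```

The printed gap should shrink as z grows, which is what the stated limit predicts.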
| Constant | Value | Role in Tokenizer |
|:---------|:------|:------------------|
| **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| **C₁** | −0.007994021805953 | Embedding quantization zero-point |
| **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
| **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |
## 📊 Benchmark Results
### Deep Benchmark (30 test cases × 4 tokenizers)
Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cases**:
| Tokenizer | Vocab Size | Avg Compression ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
|:----------|:-----------|:---------------|:--------------------------|:---------------------|
| Gemma | 256,000 | **4.54** | 0.018 | **0.253** |
| **Sentinel-SUT** | **61,440** | 3.46 | **0.056** | 0.218 |
| Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
| GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |
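
The two derived columns appear to be simple functions of vocabulary size and average compression. The sketch below recomputes them under that assumption: efficiency per 1K vocab as compression divided by (vocab size / 1000), and per-bit efficiency as compression divided by log2(vocab size). These definitions are inferred from the table values, not taken from the benchmark script, and the function names are hypothetical.

```python
import math

# (vocab size, average compression) copied from the table above.
BENCHMARK = {
    "Gemma": (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2": (151_936, 3.88),
    "GPT-2": (50_257, 2.57),
}


def efficiency_per_1k_vocab(vocab_size: int, compression: float) -> float:
    """Assumed definition: compression per thousand vocabulary entries."""
    return compression / (vocab_size / 1000)


def per_bit_efficiency(vocab_size: int, compression: float) -> float:
    """Assumed definition: compression per bit needed to index the vocabulary."""
    return compression / math.log2(vocab_size)


for name, (vocab, comp) in BENCHMARK.items():
    per_1k = efficiency_per_1k_vocab(vocab, comp)
    per_bit = per_bit_efficiency(vocab, comp)
    print(f"{name:<13} per-1K={per_1k:.3f}  per-bit={per_bit:.3f}")
```

Under these definitions the recomputed values agree with the table to three decimal places.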
### 🏆 Key Result: Vocabulary Efficiency
**Sentinel-SUT achieves 3.2× higher compression per 1K vocabulary entries than Gemma and 2.2× higher than Qwen2.** Its raw compression is lower than both (3.46 vs 4.54 and 3.88), but each vocabulary entry does more work, which matters for memory-constrained multimodal models.
| Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
|:-------|:---------|:---------|:---------|:---------|
| Efficiency per 1K vocab | **0.0563** | +10.1% | +120.2% | +217.4% |
| Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
| Unique advantage | **4 modalities** | text only | text only | text only |
### Per-Language Performance
| Language | Tokens | Bytes | Compression (bytes/token) |
|:---------|:-------|:------|:------------|
| English | 39 | 159 | **4.08** |
| French | 45 | 166 | **3.69** |
| German | 50 | 173 | **3.46** |
| Spanish | 41 | 158 | **3.85** |
| Chinese | 50 | 165 | **3.30** |
| Japanese | 58 | 213 | **3.67** |
| Arabic | 48 | 246 | **5.13** |
| Russian | 55 | 283 | **5.15** |
| Korean | 38 | 146 | **3.84** |
| Hindi | 85 | 315 | **3.71** |
| Code (Python) | 61 | 149 | **2.44** |
| Math (Unicode) | 45 | 101 | **2.24** |
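
The compression column matches UTF-8 bytes divided by token count for every row. A minimal sketch of that measurement follows; `compression_ratio` and the stand-in `byte_encoder` are hypothetical helpers for illustration, and any callable that maps a string to token IDs can be passed in.

```python
from typing import Callable, Sequence


def compression_ratio(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Return UTF-8 bytes per token for `text` under the given encoder."""
    token_ids = encode(text)
    if len(token_ids) == 0:
        raise ValueError("encoder produced no tokens")
    return len(text.encode("utf-8")) / len(token_ids)


def byte_encoder(text: str) -> list:
    """Stand-in encoder for illustration only: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# A byte-per-token encoder gives the baseline ratio of exactly 1 byte/token.
print(compression_ratio("tokenizer", byte_encoder))

# Recompute a few table rows from their (tokens, bytes) counts.
for language, tokens, n_bytes in [("English", 39, 159), ("German", 50, 173)]:
    print(f"{language}: {n_bytes / tokens:.2f}")
```

With the real tokenizer loaded, something like `compression_ratio(text, lambda s: tokenizer.encode(s, add_special_tokens=False))` would measure text tokens only.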
## 🏗️ Architecture
```
┌────────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)          │
│                                                        │
│  [0-32]           → 33 Special / Control tokens        │
│  [33-32,767]      → 32,735 ByteLevel BPE text tokens   │
│  [32,768-49,151]  → 16,384 Image codebook tokens       │
│  [49,152-57,343]  → 8,192 Audio codebook tokens        │
│  [57,344-61,439]  → 4,096 Video codebook tokens        │
│                                                        │
│  Allocation follows 1/e Gradient Axiom                 │
└────────────────────────────────────────────────────────┘
```
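
Because the ranges are contiguous, a token's modality can be recovered from its ID alone. The sketch below shows one way to do that with a sorted list of range starts; the names `RANGE_STARTS`, `RANGE_NAMES`, and `modality_of` are hypothetical and not part of the published tokenizer API.

```python
from bisect import bisect_right

# Lower bound of each ID range, taken from the layout diagram above.
RANGE_STARTS = [0, 33, 32_768, 49_152, 57_344]
RANGE_NAMES = ["special", "text", "image", "audio", "video"]
VOCAB_SIZE = 61_440


def modality_of(token_id: int) -> str:
    """Map a token ID to the name of the vocabulary range that contains it."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside the vocabulary")
    # bisect_right returns the count of range starts <= token_id.
    return RANGE_NAMES[bisect_right(RANGE_STARTS, token_id) - 1]


for tid in (0, 33, 32_768, 49_152, 57_344, 61_439):
    print(tid, modality_of(tid))
```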
### Special Tokens
| Token | ID | Purpose |
|:------|:---|:--------|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown token |
| `<s>` / `</s>` | 2/3 | BOS / EOS |
| `<mask>` | 4 | Masked language modeling |
| `<image_start>` / `<image_end>` | 7/8 | Image boundaries |
| `<audio_start>` / `<audio_end>` | 10/11 | Audio boundaries |
| `<video_start>` / `<video_end>` | 13/14 | Video boundaries |
| `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16/17/18 | Manifold markers |
| `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
| `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
| `<math_start>` / `<math_end>` | 31/32 | Math boundaries |
### Codebook Tokens
- 🖼️ **Image**: `<img_0>` – `<img_16383>` (IDs 32,768–49,151); VQGAN, Cosmos-DI, FSQ
- 🔊 **Audio**: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343); EnCodec, SoundStream
- 🎬 **Video**: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439); Cosmos-DV
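
Given these offsets, indices from an external image quantizer can be mapped into the shared vocabulary by simple addition. The following sketch assumes `<img_k>` has ID 32,768 + k (consistent with the ranges listed above) and uses the `<image_start>` / `<image_end>` IDs from the special-token table; the helper names are hypothetical.

```python
IMAGE_START_ID = 7       # <image_start>
IMAGE_END_ID = 8         # <image_end>
IMAGE_OFFSET = 32_768    # ID of <img_0>
IMAGE_CODEBOOK_SIZE = 16_384


def image_codes_to_ids(codes):
    """Wrap VQ codebook indices in image boundary tokens and shift them into ID space."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"codebook index {code} out of range")
    return [IMAGE_START_ID] + [IMAGE_OFFSET + c for c in codes] + [IMAGE_END_ID]


def ids_to_image_codes(token_ids):
    """Recover codebook indices from a token sequence, skipping non-image tokens."""
    upper = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE
    return [t - IMAGE_OFFSET for t in token_ids if IMAGE_OFFSET <= t < upper]


ids = image_codes_to_ids([42, 1337])
print(ids)
print(ids_to_image_codes(ids))
```

The same pattern would apply to audio and video with their own offsets and boundary IDs.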
## 🚀 Quick Start
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

# Text
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")

# Multimodal (text + image VQ indices)
text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
tokens = tokenizer.encode(text)
for tid in tokens:
    # Image codebook tokens occupy IDs 32,768-49,151
    if 32768 <= tid < 49152:
        print(f"  IMAGE codebook[{tid - 32768}]")

# Chat
chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
tokens = tokenizer.encode(chat, add_special_tokens=False)
```
## 🔬 Innovations
1. **1/e Vocabulary Allocation**: the Gradient Axiom ratio allocates tokens across modalities
2. **ByteLevel BPE**: handles all Unicode with NFKC normalization; the byte-level alphabet means no input maps to `<unk>`
3. **20-language training**: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
4. **Native Multimodal Routing**: an integer range comparison on the token ID determines its modality
5. **Sentinel Manifold Integration**: special tokens for manifold-aware computation
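
To illustrate why a byte-level alphabet with NFKC normalization leaves nothing out of vocabulary, here is a minimal standard-library sketch. It shows only the normalization step and the reduction to bytes, not the BPE merges; `normalize_to_bytes` is a hypothetical helper, not a function of the tokenizer.

```python
import unicodedata


def normalize_to_bytes(text: str) -> list:
    """Apply NFKC, then return the UTF-8 byte values (each in 0..255)."""
    normalized = unicodedata.normalize("NFKC", text)
    return list(normalized.encode("utf-8"))


# NFKC folds compatibility characters into their plain equivalents.
print(unicodedata.normalize("NFKC", "\ufb01le"))  # ligature fi followed by "le"
print(unicodedata.normalize("NFKC", "z\u207f"))   # z followed by superscript n

# Any string reduces to bytes 0..255, so a 256-symbol base alphabet covers it.
print(normalize_to_bytes("h\u00e9llo"))
```

One consequence worth keeping in mind: because NFKC rewrites characters such as superscripts, decoding a normalized string may not reproduce the original input exactly.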
## 📦 Training
| Parameter | Value |
|:----------|:------|
| Data | allenai/c4 (20 languages) |
| Samples | 52,000 documents |
| Chars | ~66M |
| Algorithm | ByteLevel BPE + NFKC |
| Text Vocab | 32,768 |
| Total Vocab | 61,440 |
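
For orientation, a ByteLevel BPE + NFKC setup like the one in the table can be sketched with the Hugging Face `tokenizers` library. This is not the repository's training script: the three-line corpus, the toy `vocab_size`, and the short special-token list are placeholders chosen so the example runs in seconds.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Tiny stand-in corpus; the table above reports 52,000 C4 documents.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Le renard brun rapide saute par-dessus le chien paresseux.",
    "def add(a, b):\n    return a + b",
]

tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=400,  # toy size; the table reports 32,768 for the text vocabulary
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
    # Seed the vocabulary with all 256 byte symbols so every input is encodable.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("The lazy fox.")
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```

Listing the special tokens first in the trainer gives them the lowest IDs, which is how a layout with `<pad>` at 0 could be obtained.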
## 🔗 Links
- 🎮 [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)
- 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
- 📜 Training scripts included in repo
## 📚 Citation
```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
author={Abdel-Aal, Romain},
year={2026},
url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
}
```
---
**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴