🦴 Sentinel Universal Tokenizer (SUT)

One theorem. Every modality. One vocabulary.

The Sentinel Universal Tokenizer is a multimodal tokenizer that handles text, images, audio, and video in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

🎮 Try it live → Interactive Demo

🧬 Mathematical Foundation

Built on the Gradient Axiom from the Sentinel Manifold:

F(z) = Σ_{n=1}^∞ z^n / n^n    (Sophomore's Dream, Bernoulli 1697)

lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442

| Constant | Value | Role in Tokenizer |
|---|---|---|
| 1/e | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| C₁ | −0.007994021805953 | Embedding quantization zero-point |
| C₂ | 0.000200056042968 | Cross-lingual fertility fairness bound |
| C₃ | 0.256913827655311 | Critical threshold for vocabulary scaling |
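
As a quick numerical sanity check (a minimal sketch, not part of the released package), the 1/e limit can be observed by truncating the series:

```python
import math

def ratio_F_prime_over_F(z: float, terms: int = 2000) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n."""
    F = Fp = 0.0
    for n in range(1, terms + 1):
        term = (z / n) ** n        # z^n / n^n, computed without overflow
        F += term
        Fp += term * n / z         # n * z^(n-1) / n^n
    return Fp / F

for z in (10.0, 100.0, 500.0):
    print(f"z = {z:6.1f}   F'/F = {ratio_F_prime_over_F(z):.9f}")
print(f"1/e      = {1 / math.e:.9f}")
```

The ratio approaches 1/e ≈ 0.3679 as z grows, matching the limit above.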

📊 Benchmark Results

Deep Benchmark (30 test cases × 4 tokenizers)

Tested across 21 languages + 3 programming languages + math/LaTeX + 7 edge cases:

| Tokenizer | Vocab Size | Avg Compression ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
|---|---|---|---|---|
| Gemma | 256,000 | 4.54 | 0.018 | 0.253 |
| Sentinel-SUT | 61,440 | 3.46 | 0.056 | 0.218 |
| Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
| GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |

๐Ÿ† Key Result: Vocabulary Efficiency

Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2. Each token does more work, which is critical for memory-constrained multimodal models.

| Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
|---|---|---|---|---|
| Efficiency per 1K vocab | 0.0563 | +10.1% | +120.2% | +217.4% |
| Avg Compression | 3.46 | +34.7% | −10.8% | −23.8% |
| Unique advantage | 4 modalities | text only | text only | text only |
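
Both derived columns follow from the raw numbers: efficiency per 1K vocab divides average compression by vocabulary size in thousands, and per-bit efficiency divides it by log₂(vocab size). A minimal sketch reproducing the table (metric definitions inferred from the reported values, not taken from released code):

```python
import math

# (vocab_size, avg_compression) as reported in the benchmark table
results = {
    "Gemma":        (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2":        (151_936, 3.88),
    "GPT-2":        (50_257, 2.57),
}

for name, (vocab, comp) in results.items():
    per_1k = comp / (vocab / 1000)      # compression per 1K vocabulary entries
    per_bit = comp / math.log2(vocab)   # compression per bit of token ID width
    print(f"{name:12s}  per-1K = {per_1k:.3f}   per-bit = {per_bit:.3f}")
```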

Per-Language Performance

| Language | Tokens | Bytes | Compression |
|---|---|---|---|
| English | 39 | 159 | 4.08 |
| French | 45 | 166 | 3.69 |
| German | 50 | 173 | 3.46 |
| Spanish | 41 | 158 | 3.85 |
| Chinese | 50 | 165 | 3.30 |
| Japanese | 58 | 213 | 3.67 |
| Arabic | 48 | 246 | 5.13 |
| Russian | 55 | 283 | 5.15 |
| Korean | 38 | 146 | 3.84 |
| Hindi | 85 | 315 | 3.71 |
| Code (Python) | 61 | 149 | 2.44 |
| Math (Unicode) | 45 | 101 | 2.24 |
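
Compression here is UTF-8 bytes per token. A row can be approximated as follows (a minimal sketch; the exact benchmark sentences are not published, so absolute numbers will differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

def compression(text: str) -> float:
    """UTF-8 bytes per token for a given string."""
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return len(text.encode("utf-8")) / n_tokens

print(f"{compression('The quick brown fox jumps over the lazy dog.'):.2f}")
```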

๐Ÿ—๏ธ Architecture

```
┌───────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)         │
│                                                       │
│  [0-32]          → 33 Special / Control tokens        │
│  [33-32,767]     → 32,735 ByteLevel BPE text tokens   │
│  [32,768-49,151] → 16,384 Image codebook tokens       │
│  [49,152-57,343] → 8,192 Audio codebook tokens        │
│  [57,344-61,439] → 4,096 Video codebook tokens        │
│                                                       │
│  Allocation follows 1/e Gradient Axiom                │
└───────────────────────────────────────────────────────┘
```
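
Because the five regions are contiguous, a token's modality is determined by a couple of integer comparisons. A minimal sketch (boundaries from the layout above; the function name is hypothetical):

```python
def modality(token_id: int) -> str:
    """Route a SUT token ID to its modality with range checks."""
    if not 0 <= token_id <= 61_439:
        raise ValueError(f"token ID out of range: {token_id}")
    if token_id < 33:
        return "special"
    if token_id < 32_768:
        return "text"
    if token_id < 49_152:
        return "image"
    if token_id < 57_344:
        return "audio"
    return "video"

assert modality(7) == "special"
assert modality(42) == "text"
assert modality(32_768) == "image"
assert modality(61_439) == "video"
```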

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown token |
| `<s>` / `</s>` | 2 / 3 | BOS / EOS |
| `<mask>` | 4 | Masked language modeling |
| `<image_start>` / `<image_end>` | 7 / 8 | Image boundaries |
| `<audio_start>` / `<audio_end>` | 10 / 11 | Audio boundaries |
| `<video_start>` / `<video_end>` | 13 / 14 | Video boundaries |
| `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16 / 17 / 18 | Manifold markers |
| `<system>` / `<user>` / `<assistant>` | 26 / 27 / 28 | Chat format |
| `<code_start>` / `<code_end>` | 29 / 30 | Code boundaries |
| `<math_start>` / `<math_end>` | 31 / 32 | Math boundaries |

Codebook Tokens

  • 🖼️ Image: `<img_0>` – `<img_16383>` (IDs 32,768–49,151), for VQGAN, Cosmos-DI, or FSQ codebooks
  • 🔊 Audio: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343), for EnCodec or SoundStream codebooks
  • 🎬 Video: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439), for Cosmos-DV codebooks
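
A VQ encoder's integer codebook indices map into this shared space by a fixed offset per modality. A minimal sketch (offsets from the ranges above; helper names are hypothetical):

```python
# Base token IDs for each codebook region (from the layout above)
IMG_BASE, AUD_BASE, VID_BASE = 32_768, 49_152, 57_344

def image_indices_to_token_ids(indices):
    """Map raw image VQ indices (0-16383) to SUT token IDs."""
    return [IMG_BASE + i for i in indices]

def token_id_to_image_index(tid: int) -> int:
    """Invert the mapping; reject IDs outside the image region."""
    if not IMG_BASE <= tid < AUD_BASE:
        raise ValueError(f"{tid} is not an image codebook token")
    return tid - IMG_BASE

print(image_indices_to_token_ids([42, 1337]))  # [32810, 34105]
```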

🚀 Quick Start

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

# Text
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")

# Multimodal (text + image VQ indices)
text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
tokens = tokenizer.encode(text)
for tid in tokens:
    if 32768 <= tid < 49152:
        print(f"  IMAGE codebook[{tid - 32768}]")

# Chat
chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
tokens = tokenizer.encode(chat, add_special_tokens=False)
```

🔬 Innovations

  1. 1/e Vocabulary Allocation: the Gradient Axiom ratio allocates tokens across modalities
  2. ByteLevel BPE: handles all Unicode, no UNK possible, NFKC-normalized
  3. 20-language training: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
  4. Native Multimodal Routing: a single integer comparison determines modality (see the routing sketch under Architecture)
  5. Sentinel Manifold Integration: special tokens for manifold-aware computation

📦 Training

| Parameter | Value |
|---|---|
| Data | allenai/c4 (20 languages) |
| Samples | 52,000 documents |
| Chars | ~66M |
| Algorithm | ByteLevel BPE + NFKC |
| Text Vocab | 32,768 |
| Total Vocab | 61,440 |
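
The card does not ship training code; below is a minimal sketch of an equivalent ByteLevel BPE + NFKC setup with the Hugging Face `tokenizers` library (the corpus file and the special-token subset are illustrative, not the actual training configuration):

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.NFKC()             # NFKC normalization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()  # byte fallback: no true UNKs
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    vocab_size=32_768,  # text region only; codebook tokens would sit on top
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
)
tokenizer.train(files=["c4_sample.txt"], trainer=trainer)  # hypothetical corpus dump
tokenizer.save("sentinel_text_tokenizer.json")
```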


📚 Citation

```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
  title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
}
```

Built by: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴
