---
language:
- en
- fr
- de
- es
- zh
- ja
- ar
- ru
- ko
- hi
- pt
- it
- nl
- pl
- vi
- th
- tr
- uk
- sv
- multilingual
license: mit
tags:
- tokenizer
- multimodal
- sentinel-manifold
- universal-tokenizer
- bpe
- byte-level
- multilingual
- image-tokens
- audio-tokens
- video-tokens
- text-tokens
- mathematics
- gradient-axiom
library_name: transformers
pipeline_tag: text-generation
---

# 🦴 Sentinel Universal Tokenizer (SUT)

**One theorem. Every modality. One vocabulary.**

The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

🎮 **[Try it live → Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)**

## 🧬 Mathematical Foundation

Built on the **Gradient Axiom** from the Sentinel Manifold:

```
F(z) = Σ_{n=1}^∞ z^n / n^n    (F(1) is the Sophomore's Dream series, Bernoulli 1697)

lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
```
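
The limit can be checked numerically. The sketch below is not part of the tokenizer; it assumes only the series definition above and uses the identity F'(z)/F(z) = (Σ n·tₙ) / (z·Σ tₙ) with tₙ = zⁿ/nⁿ, evaluated in log space to avoid overflow. The function name `log_derivative_ratio` and the truncation point `4z + 50` are illustrative choices.

```python
import math

def log_derivative_ratio(z: float) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n."""
    # Terms peak near n = z/e and decay quickly beyond it; 4z + 50 is a
    # generous (assumed) truncation point for the infinite sum.
    n_max = int(4 * z) + 50
    log_z = math.log(z)
    # log of t_n = z^n / n^n, kept in log space to avoid overflow.
    log_terms = [n * (log_z - math.log(n)) for n in range(1, n_max + 1)]
    peak = max(log_terms)
    weights = [math.exp(t - peak) for t in log_terms]
    # F'(z) = sum n * z^(n-1) / n^n, so F'(z)/F(z) = (sum n*t_n) / (z * sum t_n).
    weighted = sum(n * w for n, w in enumerate(weights, start=1))
    return weighted / (z * sum(weights))

for z in (10.0, 100.0, 1_000.0, 10_000.0):
    ratio = log_derivative_ratio(z)
    print(f"z = {z:>8.0f}  F'/F = {ratio:.6f}  (1/e = {1 / math.e:.6f})")
```

Under these assumptions the printed ratio should move toward 1/e as z grows.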

| Constant | Value | Role in Tokenizer |
|:---------|:------|:------------------|
| **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
| **C₁** | −0.007994021805953 | Embedding quantization zero-point |
| **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
| **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |

## 📊 Benchmark Results

### Deep Benchmark (30 test cases × 4 tokenizers)

Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cases**:

| Tokenizer | Vocab Size | Avg Compression (bytes/token) ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
|:----------|:-----------|:---------------|:--------------------------|:---------------------|
| Gemma | 256,000 | **4.54** | 0.018 | **0.253** |
| **Sentinel-SUT** | **61,440** | 3.46 | **0.056** | 0.218 |
| Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
| GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |
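
The two derived columns appear to follow from the first two: efficiency per 1K vocab as average compression divided by vocabulary size in thousands, and per-bit efficiency as average compression divided by log₂ of the vocabulary size. These formulas are inferred from the numbers in the table rather than stated by the benchmark, so treat the sketch below as a consistency check, not as the benchmark code.

```python
import math

# (vocab size, average compression) copied from the table above.
results = {
    "Gemma": (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2": (151_936, 3.88),
    "GPT-2": (50_257, 2.57),
}

def efficiency_per_1k(vocab_size: int, compression: float) -> float:
    """Average compression per 1,000 vocabulary entries (inferred formula)."""
    return compression / (vocab_size / 1000)

def per_bit_efficiency(vocab_size: int, compression: float) -> float:
    """Average compression per bit needed to index the vocabulary (inferred formula)."""
    return compression / math.log2(vocab_size)

for name, (vocab, comp) in results.items():
    print(f"{name:<13} {efficiency_per_1k(vocab, comp):.3f}  {per_bit_efficiency(vocab, comp):.3f}")
```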

### 🏆 Key Result: Vocabulary Efficiency

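**Sentinel-SUT achieves about 3.2× the compression per 1K vocabulary entries of Gemma (0.056 vs 0.018) and about 2.2× that of Qwen2 (0.056 vs 0.026).** In raw compression it trails both (3.46 vs 4.54 and 3.88), so the gain is in how much work each vocabulary entry does, which matters for memory-constrained multimodal models.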

| Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
|:-------|:---------|:---------|:---------|:---------|
| Efficiency per 1K vocab | **0.0563** | +10.1% | +120.2% | +217.4% |
| Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
| Unique advantage | **4 modalities** | text only | text only | text only |

### Per-Language Performance

| Language | Tokens | Bytes | Compression (bytes/token) |
|:---------|:-------|:------|:------------|
| English | 39 | 159 | **4.08** |
| French | 45 | 166 | **3.69** |
| German | 50 | 173 | **3.46** |
| Spanish | 41 | 158 | **3.85** |
| Chinese | 50 | 165 | **3.30** |
| Japanese | 58 | 213 | **3.67** |
| Arabic | 48 | 246 | **5.13** |
| Russian | 55 | 283 | **5.15** |
| Korean | 38 | 146 | **3.84** |
| Hindi | 85 | 315 | **3.71** |
| Code (Python) | 61 | 149 | **2.44** |
| Math (Unicode) | 45 | 101 | **2.24** |
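
Each compression value above is the UTF-8 byte count divided by the token count (for example, 159 / 39 ≈ 4.08 for English). A minimal sketch of that measurement follows. `compression_ratio` is a hypothetical helper, and `encode` stands for any callable that returns a sequence of tokens, such as `tokenizer.encode`; the whitespace splitter is only a stand-in so the example runs without downloading the tokenizer.

```python
from typing import Callable, Sequence

def compression_ratio(text: str, encode: Callable[[str], Sequence]) -> float:
    """Bytes per token: UTF-8 length of the text divided by its token count."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("text produced no tokens")
    return n_bytes / n_tokens

# Stand-in encoder for illustration only: one "token" per whitespace-separated word.
toy_encode = str.split
print(compression_ratio("héllo wörld", toy_encode))  # 13 bytes / 2 tokens = 6.5

# Recomputing a table row directly from its byte and token counts.
print(round(159 / 39, 2))  # English row: 4.08
```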

## 🏗️ Architecture

```
┌────────────────────────────────────────────────────────┐
│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)          │
│                                                         │
│  [0-32]          → 33 Special / Control tokens         │
│  [33-32,767]     → 32,735 ByteLevel BPE text tokens   │
│  [32,768-49,151] → 16,384 Image codebook tokens       │
│  [49,152-57,343] → 8,192 Audio codebook tokens        │
│  [57,344-61,439] → 4,096 Video codebook tokens        │
│                                                         │
│  Allocation follows 1/e Gradient Axiom                 │
└────────────────────────────────────────────────────────┘
```
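
Because each modality occupies one contiguous ID range, a token's modality follows from where its ID falls. The sketch below encodes the layout from the diagram and looks up the range with `bisect`. The names `RANGE_STARTS`, `MODALITIES`, and `modality_of` are illustrative and not part of the released tokenizer.

```python
from bisect import bisect_right

# Start ID of each block, taken from the layout diagram above.
RANGE_STARTS = [0, 33, 32_768, 49_152, 57_344]
MODALITIES = ["special", "text", "image", "audio", "video"]
VOCAB_SIZE = 61_440

def modality_of(token_id: int) -> str:
    """Return the name of the block a token ID belongs to."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID {token_id} is outside [0, {VOCAB_SIZE})")
    return MODALITIES[bisect_right(RANGE_STARTS, token_id) - 1]

# Block sizes implied by the start offsets; they should sum to the vocabulary size.
ends = RANGE_STARTS[1:] + [VOCAB_SIZE]
sizes = {m: end - start for m, start, end in zip(MODALITIES, RANGE_STARTS, ends)}
print(sizes)
print(modality_of(32_768), modality_of(57_343), modality_of(61_439))
```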

### Special Tokens

| Token | ID | Purpose |
|:------|:---|:--------|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown token |
| `<s>` / `</s>` | 2/3 | BOS / EOS |
| `<mask>` | 4 | Masked language modeling |
| `<image_start>` / `<image_end>` | 7/8 | Image boundaries |
| `<audio_start>` / `<audio_end>` | 10/11 | Audio boundaries |
| `<video_start>` / `<video_end>` | 13/14 | Video boundaries |
| `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16/17/18 | Manifold markers |
| `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
| `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
| `<math_start>` / `<math_end>` | 31/32 | Math boundaries |

### Codebook Tokens

- 🖼️ **Image**: `<img_0>` – `<img_16383>` (IDs 32,768–49,151) — VQGAN, Cosmos-DI, FSQ
- 🔊 **Audio**: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343) — EnCodec, SoundStream
- 🎬 **Video**: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439) — Cosmos-DV
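
Given the naming scheme above, a codebook index maps to a token ID by adding the block's start offset. A small sketch under that assumption follows. The helper names are hypothetical, and the space-separated layout between boundary tokens is borrowed from the Quick Start example rather than from a documented format.

```python
# (start ID, codebook size) for each prefix, from the list above.
CODEBOOKS = {
    "img": (32_768, 16_384),
    "aud": (49_152, 8_192),
    "vid": (57_344, 4_096),
}
# Boundary tokens from the special-token table.
BOUNDARIES = {
    "img": ("<image_start>", "<image_end>"),
    "aud": ("<audio_start>", "<audio_end>"),
    "vid": ("<video_start>", "<video_end>"),
}

def codebook_token_id(prefix: str, index: int) -> int:
    """Map a codebook index (e.g. a VQ code) to its vocabulary ID."""
    start, size = CODEBOOKS[prefix]
    if not 0 <= index < size:
        raise ValueError(f"{prefix} index {index} is outside [0, {size})")
    return start + index

def codebook_token(prefix: str, index: int) -> str:
    """Render a codebook index as its token string, e.g. <img_42>."""
    codebook_token_id(prefix, index)  # reuse the bounds check
    return f"<{prefix}_{index}>"

def wrap_codes(prefix: str, indices: list[int]) -> str:
    """Join codebook tokens between the modality's boundary tokens."""
    start_tok, end_tok = BOUNDARIES[prefix]
    return " ".join([start_tok, *(codebook_token(prefix, i) for i in indices), end_tok])

print(wrap_codes("img", [42, 1337]))
print(codebook_token_id("img", 42))
```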

## 🚀 Quick Start

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

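# Text. NFKC normalization may turn characters such as ⁿ into plain "n",
# so the decoded string is not guaranteed to match the input exactly.
text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
tokens = tokenizer.encode(text)
print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")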

# Multimodal (text + image VQ indices)
text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
tokens = tokenizer.encode(text)
for tid in tokens:
    # Image codebook IDs span 32,768–49,151; subtract the offset to get the VQ index.
    if 32768 <= tid < 49152:
        print(f"  IMAGE codebook[{tid - 32768}]")

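# Chat: <system>, <user>, and <assistant> are special tokens (IDs 26–28).
# add_special_tokens=False keeps the tokenizer from adding its own special tokens to the prompt.
chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
tokens = tokenizer.encode(chat, add_special_tokens=False)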
```

## 🔬 Innovations

1. **1/e Vocabulary Allocation**: the Gradient Axiom ratio guides how tokens are allocated across modalities
2. **ByteLevel BPE**: covers all Unicode through byte-level fallback, so text never maps to `<unk>`; input is NFKC-normalized
3. **20-language training**: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
4. **Native Multimodal Routing**: integer range checks on the token ID determine its modality, with no lookup table needed
5. **Sentinel Manifold Integration**: special tokens for manifold-aware computation
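
NFKC normalization (item 2) folds compatibility characters into their canonical forms before tokenization, so decoded text can differ from the input. The sketch below uses Python's standard `unicodedata` module to show the effect. It does not call the tokenizer, and that the released tokenizer applies this normalization is taken from the description above; `nfkc` is a hypothetical helper name.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)

# Superscripts, ligatures, and fullwidth letters all fold to plain characters.
samples = ["zⁿ/nⁿ", "ﬁle", "Ｈｅｌｌｏ", "x²"]
for s in samples:
    print(f"{s!r} -> {nfkc(s)!r}")
```

One consequence: mathematical text that relies on superscripts or other compatibility characters will not round-trip exactly through a tokenizer that normalizes this way.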

## 📦 Training

| Parameter | Value |
|:----------|:------|
| Data | allenai/c4 (20 languages) |
| Samples | 52,000 documents |
| Chars | ~66M |
| Algorithm | ByteLevel BPE + NFKC |
| Text Vocab | 32,768 (33 special + 32,735 BPE) |
| Total Vocab | 61,440 |
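
To make the "ByteLevel BPE + NFKC" row concrete, here is a toy pure-Python trainer that learns merges over the UTF-8 bytes of NFKC-normalized text. It is a simplified illustration, not the training script from this repo and not the Hugging Face `tokenizers` implementation: it skips pre-tokenization, special tokens, and the byte-to-printable-character mapping, and all function names are hypothetical.

```python
import unicodedata
from collections import Counter

def to_byte_symbols(text: str) -> list[bytes]:
    """NFKC-normalize, then split into single-byte symbols."""
    normalized = unicodedata.normalize("NFKC", text)
    return [bytes([b]) for b in normalized.encode("utf-8")]

def apply_merge(symbols: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_byte_bpe(corpus: list[str], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Greedily learn up to `num_merges` merges, most frequent adjacent pair first."""
    docs = [to_byte_symbols(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for doc in docs:
            pair_counts.update(zip(doc, doc[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        docs = [apply_merge(doc, best) for doc in docs]
    return merges

def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize by replaying the learned merges in training order."""
    symbols = to_byte_symbols(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

corpus = ["low lower lowest", "new newer newest"]
merges = train_byte_bpe(corpus, num_merges=10)
pieces = encode("lowest newest", merges)
print(len("lowest newest".encode("utf-8")), "bytes ->", len(pieces), "tokens")
```

Because every symbol is a byte string, any input can be encoded even with zero merges, which is the property behind the "no UNK" claim for byte-level vocabularies.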

## 🔗 Links

- 🎮 [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)
- 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
- 📜 Training scripts included in repo

## 📚 Citation

```bibtex
@misc{abdel-aal2026sentinel-tokenizer,
  title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
}
```

---

**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴