| language: | |
| - en | |
| - fr | |
| - de | |
| - es | |
| - zh | |
| - ja | |
| - ar | |
| - ru | |
| - ko | |
| - hi | |
| - pt | |
| - it | |
| - nl | |
| - pl | |
| - vi | |
| - th | |
| - tr | |
| - uk | |
| - sv | |
| - multilingual | |
| license: mit | |
| tags: | |
| - tokenizer | |
| - multimodal | |
| - sentinel-manifold | |
| - universal-tokenizer | |
| - bpe | |
| - byte-level | |
| - multilingual | |
| - image-tokens | |
| - audio-tokens | |
| - video-tokens | |
| - text-tokens | |
| - mathematics | |
| - gradient-axiom | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| # 🦴 Sentinel Universal Tokenizer (SUT) | |
| **One theorem. Every modality. One vocabulary.** | |
| The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics. | |
| **[Try it live: Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)** | |
| ## 🧬 Mathematical Foundation | |
| Built on the **Gradient Axiom** from the Sentinel Manifold: | |
| ``` | |
| F(z) = Σ_{n=1}^∞ z^n / n^n    (Sophomore's Dream, Bernoulli 1697) | |
| lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442 | |
| ``` | |
| | Constant | Value | Role in Tokenizer | | |
| |:---------|:------|:------------------| | |
| | **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities | | |
| | **C₁** | -0.007994021805953 | Embedding quantization zero-point | |
| | **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound | |
| | **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling | |
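The limit stated above can be checked numerically. The sketch below is not taken from the repository: it relies on the identity F'(z)/F(z) = (Σ n·tₙ) / (z·Σ tₙ) with tₙ = zⁿ/nⁿ, and sums in log space to avoid overflow. The function name `log_derivative_ratio` and the truncation point `4z + 50` are illustrative assumptions.

```python
import math

def log_derivative_ratio(z: float) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n."""
    n_max = int(4 * z) + 50  # assumed cutoff: terms beyond ~4z are negligible
    log_z = math.log(z)
    # log of each term t_n = z^n / n^n
    log_terms = [n * (log_z - math.log(n)) for n in range(1, n_max + 1)]
    peak = max(log_terms)
    # Rescale by the largest term so exp() never overflows
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    # F'(z) = sum n * z^(n-1) / n^n, hence F'(z)/F(z) = sum(n * t_n) / (z * sum(t_n))
    weighted = sum(n * w for n, w in enumerate(weights, start=1))
    return weighted / (z * total)

for z in (10.0, 100.0, 1000.0):
    print(f"z={z:7.1f}  F'/F={log_derivative_ratio(z):.6f}  1/e={1 / math.e:.6f}")
```

The ratio approaches 1/e slowly as z grows, so large z is needed before the leading digits agree.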
| ## Benchmark Results | |
| ### Deep Benchmark (30 test cases × 4 tokenizers) | |
| Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cases**: | |
| | Tokenizer | Vocab Size | Avg Compression ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ | |
| |:----------|:-----------|:---------------|:--------------------------|:---------------------| | |
| | Gemma | 256,000 | **4.54** | 0.018 | **0.253** | | |
| | **Sentinel-SUT** | **61,440** | 3.46 | **0.056** | 0.218 | | |
| | Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 | | |
| | GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 | | |
| ### Key Result: Vocabulary Efficiency | |
| **Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2** (0.056 vs. 0.018 and 0.026 efficiency per 1K vocabulary entries). Each token does more work, which is critical for memory-constrained multimodal models. | |
| | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma | | |
| |:-------|:---------|:---------|:---------|:---------| | |
| | Efficiency per 1K vocab | **0.0563** | +10.1% | +120.2% | +217.4% | | |
| | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% | | |
| | Unique advantage | **4 modalities** | text only | text only | text only | | |
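The derived columns are not defined in the text. The sketch below assumes that "Efficiency per 1K Vocab" is average compression divided by vocabulary size in thousands, and that "Per-Bit Efficiency" is average compression divided by log₂ of the vocabulary size. Both formulas are inferred, not taken from the benchmark scripts, but they reproduce the rounded values in the tables above.

```python
import math

# (vocabulary size, average compression) copied from the benchmark table
RESULTS = {
    "Gemma": (256_000, 4.54),
    "Sentinel-SUT": (61_440, 3.46),
    "Qwen2": (151_936, 3.88),
    "GPT-2": (50_257, 2.57),
}

def efficiency_per_1k(vocab: int, compression: float) -> float:
    """Assumed definition: compression per 1,000 vocabulary entries."""
    return compression / (vocab / 1000)

def per_bit_efficiency(vocab: int, compression: float) -> float:
    """Assumed definition: compression per bit needed to index the vocabulary."""
    return compression / math.log2(vocab)

for name, (vocab, comp) in RESULTS.items():
    print(f"{name:13s} {efficiency_per_1k(vocab, comp):.3f} "
          f"{per_bit_efficiency(vocab, comp):.3f}")

# Ratio behind the headline claim (Sentinel-SUT relative to Gemma)
sut = efficiency_per_1k(*RESULTS["Sentinel-SUT"])
gemma = efficiency_per_1k(*RESULTS["Gemma"])
print(f"Sentinel-SUT / Gemma = {sut / gemma:.1f}x")
```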
| ### Per-Language Performance | |
| | Language | Tokens | Bytes | Compression (bytes/token) | |
| |:---------|:-------|:------|:------------| | |
| | English | 39 | 159 | **4.08** | | |
| | French | 45 | 166 | **3.69** | | |
| | German | 50 | 173 | **3.46** | | |
| | Spanish | 41 | 158 | **3.85** | | |
| | Chinese | 50 | 165 | **3.30** | | |
| | Japanese | 58 | 213 | **3.67** | | |
| | Arabic | 48 | 246 | **5.13** | | |
| | Russian | 55 | 283 | **5.15** | | |
| | Korean | 38 | 146 | **3.84** | | |
| | Hindi | 85 | 315 | **3.71** | | |
| | Code (Python) | 61 | 149 | **2.44** | | |
| | Math (Unicode) | 45 | 101 | **2.24** | | |
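Each compression value in the table equals the byte count divided by the token count (for example, 159 / 39 ≈ 4.08 for English). A minimal sketch of that metric follows; `bytes_per_token` and the stand-in `toy_encode` are hypothetical names used only for illustration, and any callable that maps a string to a sequence of token IDs can be passed in.

```python
from typing import Callable, List, Sequence

def bytes_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """UTF-8 byte length of `text` divided by the number of tokens `encode` returns."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("encode() returned no tokens")
    return len(text.encode("utf-8")) / n_tokens

# Stand-in encoder for illustration only: one token per whitespace-separated word.
def toy_encode(text: str) -> List[int]:
    return list(range(len(text.split())))

# 17 UTF-8 bytes split into 3 toy tokens
print(bytes_per_token("byte level tokens", toy_encode))
```

With the real tokenizer, one would pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`; whether the benchmark counted special tokens is not stated, so that flag is an assumption.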
| ## Architecture | |
| ``` | |
| ┌──────────────────────────────────────────────────────────┐ | |
| │  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)            │ | |
| │                                                          │ | |
| │  [0-32]           →  33 Special / Control tokens         │ | |
| │  [33-32,767]      →  32,735 ByteLevel BPE text tokens    │ | |
| │  [32,768-49,151]  →  16,384 Image codebook tokens        │ | |
| │  [49,152-57,343]  →  8,192 Audio codebook tokens         │ | |
| │  [57,344-61,439]  →  4,096 Video codebook tokens         │ | |
| │                                                          │ | |
| │  Allocation follows 1/e Gradient Axiom                   │ | |
| └──────────────────────────────────────────────────────────┘ | |
| ``` | |
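As a sanity check on the layout, the sketch below copies the five ID ranges from the diagram, verifies that they tile IDs 0 through 61,439 without gaps, and prints each range's share of the vocabulary. The names `LAYOUT` and `range_sizes` are illustrative and do not come from the repository.

```python
# (name, first ID, last ID) copied from the layout diagram
LAYOUT = [
    ("special", 0, 32),
    ("text", 33, 32_767),
    ("image", 32_768, 49_151),
    ("audio", 49_152, 57_343),
    ("video", 57_344, 61_439),
]

def range_sizes(layout):
    """Return {name: token count}, raising if the ranges are not contiguous."""
    sizes = {}
    expected_start = 0
    for name, first, last in layout:
        if first != expected_start:
            raise ValueError(f"gap or overlap before {name!r}")
        sizes[name] = last - first + 1
        expected_start = last + 1
    return sizes

sizes = range_sizes(LAYOUT)
total = sum(sizes.values())
for name, size in sizes.items():
    print(f"{name:8s} {size:6d}  {size / total:.3f}")
print("total", total)
```

In the listed ranges, each codebook block is half the size of the one before it (16,384, then 8,192, then 4,096).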
| ### Special Tokens | |
| | Token | ID | Purpose | | |
| |:------|:---|:--------| | |
| | `<pad>` | 0 | Padding | | |
| | `<unk>` | 1 | Unknown token | | |
| | `<s>` / `</s>` | 2/3 | BOS / EOS | | |
| | `<mask>` | 4 | Masked language modeling | | |
| | `<image_start>` / `<image_end>` | 7/8 | Image boundaries | | |
| | `<audio_start>` / `<audio_end>` | 10/11 | Audio boundaries | | |
| | `<video_start>` / `<video_end>` | 13/14 | Video boundaries | | |
| | `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16/17/18 | Manifold markers | | |
| | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format | | |
| | `<code_start>` / `<code_end>` | 29/30 | Code boundaries | | |
| | `<math_start>` / `<math_end>` | 31/32 | Math boundaries | | |
| ### Codebook Tokens | |
| - **Image**: `<img_0>` – `<img_16383>` (IDs 32,768–49,151): VQGAN, Cosmos-DI, FSQ | |
| - **Audio**: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343): EnCodec, SoundStream | |
| - **Video**: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439): Cosmos-DV | |
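Because every modality occupies one contiguous ID block, a token ID can be routed to its modality and codebook index with a lookup over the block start positions. The sketch below assumes that `<img_k>`, `<aud_k>`, and `<vid_k>` map to `base + k` within their ranges, as the ID ranges above imply. `modality_of` and `codebook_index` are hypothetical helper names.

```python
from bisect import bisect_right

# Start ID of each block, in ascending order, with the matching block name
STARTS = [0, 33, 32_768, 49_152, 57_344]
NAMES = ["special", "text", "image", "audio", "video"]
VOCAB_SIZE = 61_440

def modality_of(token_id: int) -> str:
    """Name of the block that contains `token_id`."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside the vocabulary")
    return NAMES[bisect_right(STARTS, token_id) - 1]

def codebook_index(token_id: int):
    """(modality, index inside its codebook) for image, audio, and video IDs."""
    block = bisect_right(STARTS, token_id) - 1
    modality = modality_of(token_id)
    if modality in ("special", "text"):
        raise ValueError(f"token id {token_id} is not a codebook token")
    return modality, token_id - STARTS[block]

print(modality_of(4))                  # a special token ID
print(codebook_index(32_768 + 42))     # the ID assumed for <img_42>
print(codebook_index(61_439))          # last video codebook entry
```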
| ## Quick Start | |
| ```python | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer") | |
| # Text: encode, then decode to inspect the round trip | |
| text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ" | |
| tokens = tokenizer.encode(text) | |
| print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}") | |
| # Multimodal (text + image VQ indices) | |
| text = "<image_start> <img_42> <img_1337> <image_end> Describe this." | |
| tokens = tokenizer.encode(text) | |
| for tid in tokens: | |
|     # Image codebook tokens occupy IDs 32,768-49,151 | |
|     if 32768 <= tid < 49152: | |
|         print(f"  IMAGE codebook[{tid - 32768}]") | |
| # Chat: role markers are already in the string, so skip automatic special tokens | |
| chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>" | |
| tokens = tokenizer.encode(chat, add_special_tokens=False) | |
| ``` | |
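When image codes come from a VQ encoder as integers, the multimodal prompt string can be assembled programmatically. The helper below is a sketch under the token naming shown above (`<img_k>` with k from 0 to 16,383); `image_prompt` is a hypothetical name and not part of the tokenizer's API.

```python
from typing import Iterable

IMAGE_CODEBOOK_SIZE = 16_384  # <img_0> .. <img_16383>

def image_prompt(vq_indices: Iterable[int], instruction: str) -> str:
    """Wrap VQ indices in image boundary tokens and append a text instruction."""
    parts = []
    for idx in vq_indices:
        if not 0 <= idx < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"VQ index {idx} is outside the image codebook")
        parts.append(f"<img_{idx}>")
    return f"<image_start> {' '.join(parts)} <image_end> {instruction}"

# Rebuilds the multimodal example string used in the Quick Start
print(image_prompt([42, 1337], "Describe this."))
```

The resulting string can be passed to `tokenizer.encode` exactly like the hand-written example.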
| ## Innovations | |
| 1. **1/e Vocabulary Allocation**: the Gradient Axiom ratio allocates tokens across modalities | |
| 2. **ByteLevel BPE**: handles all Unicode input at the byte level, so no `<unk>` is needed; text is NFKC-normalized | |
| 3. **20-language training**: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math | |
| 4. **Native Multimodal Routing**: a single integer comparison determines modality | |
| 5. **Sentinel Manifold Integration**: special tokens for manifold-aware computation | |
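The byte-level and NFKC points in the list above can be illustrated with the standard library alone. The sketch below is not the tokenizer's own pipeline: it only shows what NFKC normalization followed by UTF-8 encoding does to a few inputs, which is the byte alphabet a byte-level BPE starts from. `preprocess` is a hypothetical name.

```python
import unicodedata

def preprocess(text: str) -> bytes:
    """NFKC-normalize `text`, then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFKC", text).encode("utf-8")

samples = [
    "\ufb01le",           # "ﬁle" with the fi ligature
    "z\u207f/n\u207f",    # "zⁿ/nⁿ" with superscript n
    "\uff21\uff11",       # full-width "Ａ１"
]
for sample in samples:
    print(repr(sample), "->", preprocess(sample))
```

Every output byte lies in the range 0 to 255, so any input maps onto a fixed 256-symbol base alphabet. NFKC also folds compatibility characters: superscript `ⁿ` becomes plain `n`. If the tokenizer applies NFKC as stated, decoding will therefore not restore such characters exactly.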
| ## Training | |
| | Parameter | Value | | |
| |:----------|:------| | |
| | Data | allenai/c4 (20 languages) | | |
| | Samples | 52,000 documents | | |
| | Chars | ~66M | | |
| | Algorithm | ByteLevel BPE + NFKC | | |
| | Text Vocab | 32,768 | | |
| | Total Vocab | 61,440 | | |
| ## Links | |
| - [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space) | |
| - 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries) | |
| - Training scripts included in repo | |
| ## Citation | |
| ```bibtex | |
| @misc{abdel-aal2026sentinel-tokenizer, | |
| title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom}, | |
| author={Abdel-Aal, Romain}, | |
| year={2026}, | |
| url={https://huggingface.co/5dimension/sentinel-universal-tokenizer} | |
| } | |
| ``` | |
| --- | |
| **Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴 | |