5dimension committed 30 days ago

Commit bb85012 (verified) · 1 parent: 2cfa685

Add interactive demo Space link

Files changed (1)

README.md +42 -114

@@ -45,6 +45,8 @@ pipeline_tag: text-generation
 The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
 ## 🧬 Mathematical Foundation
 Built on the **Gradient Axiom** from the Sentinel Manifold:
@@ -77,7 +79,7 @@ Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cas
 ### 🏆 Key Result: Vocabulary Efficiency
-**Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2.** This means each token in the Sentinel vocabulary is doing more "work" — a critical advantage for memory-constrained multimodal models.
 | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
 |:-------|:---------|:---------|:---------|:---------|
@@ -85,14 +87,10 @@ Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cas
 | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
 | Unique advantage | **4 modalities** | text only | text only | text only |
-### Why This Matters
-No other tokenizer in this comparison handles image, audio, and video natively. When you account for the 28,672 modality tokens (image: 16K, audio: 8K, video: 4K), the **text-only compression** of Sentinel's 32K text vocabulary is remarkably competitive with Qwen2's 152K text-only vocabulary.
 ### Per-Language Performance
-| Language | Tokens | Bytes | Compression Ratio |
-|:---------|:-------|:------|:------------------|
 | English | 39 | 159 | **4.08** |
 | French | 45 | 166 | **3.69** |
 | German | 50 | 173 | **3.46** |
@@ -118,8 +116,7 @@ No other tokenizer in this comparison handles image, audio, and video natively.
 │  [49,152-57,343] → 8,192 Audio codebook tokens        │
 │  [57,344-61,439] → 4,096 Video codebook tokens        │
 │                                                         │
-│  Allocation follows 1/e Gradient Axiom:                │
-│  text: 53.3% | image: 26.7% | audio: 13.3% | video: 6.7% │
 └────────────────────────────────────────────────────────┘
 ```
@@ -129,151 +126,82 @@ No other tokenizer in this comparison handles image, audio, and video natively.
 |:------|:---|:--------|
 | `<pad>` | 0 | Padding |
 | `<unk>` | 1 | Unknown token |
-| `<s>` | 2 | Begin of sequence |
-| `</s>` | 3 | End of sequence |
 | `<mask>` | 4 | Masked language modeling |
-| `<image_start>` / `<image_end>` | 7/8 | Image boundary markers |
-| `<audio_start>` / `<audio_end>` | 10/11 | Audio boundary markers |
-| `<video_start>` / `<video_end>` | 13/14 | Video boundary markers |
-| `<sentinel>` | 16 | Sentinel manifold marker |
-| `<sentinel_c1>` / `<sentinel_c2>` | 17/18 | Mathematical constants |
 | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
 | `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
 | `<math_start>` / `<math_end>` | 31/32 | Math boundaries |
-### Multimodal Codebook Tokens
-- **Image**: `<img_0>` through `<img_16383>` (IDs 32,768-49,151) — Compatible with VQGAN, Cosmos-DI, FSQ
-- **Audio**: `<aud_0>` through `<aud_8191>` (IDs 49,152-57,343) — Compatible with EnCodec, SoundStream
-- **Video**: `<vid_0>` through `<vid_4095>` (IDs 57,344-61,439) — Compatible with Cosmos-DV
 ## 🚀 Quick Start
-### Basic Text Usage
 ```python
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")
-# Encode text
 text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
 tokens = tokenizer.encode(text)
-decoded = tokenizer.decode(tokens)
-print(f"Tokens: {len(tokens)}")
-print(f"Decoded: {decoded}")
-```
-### Multimodal Encoding
-```python
-# Text with image placeholder
-text = "Look at this image: <image_start> <img_42> <img_1337> <img_256> <image_end> What do you see?"
 tokens = tokenizer.encode(text)
-print(f"Multimodal sequence: {len(tokens)} tokens")
-# Check modality of each token
-for tid in tokens[:10]:
     if 32768 <= tid < 49152:
-        print(f"  Token {tid}: IMAGE codebook index {tid - 32768}")
-    elif 49152 <= tid < 57344:
-        print(f"  Token {tid}: AUDIO codebook index {tid - 49152}")
-    elif 57344 <= tid < 61440:
-        print(f"  Token {tid}: VIDEO codebook index {tid - 57344}")
-```
-### Integration with VQ-GAN / Cosmos Tokenizer
-```python
-# After encoding an image with a VQ-GAN:
-# image_indices = vqgan.encode(image)  # e.g., [42, 1337, 256, ...]
-# Convert to universal tokens
-image_tokens = [tokenizer.convert_tokens_to_ids(f"<img_{i}>") for i in image_indices]
-full_sequence = (
-    [tokenizer.convert_tokens_to_ids("<image_start>")] +
-    image_tokens +
-    [tokenizer.convert_tokens_to_ids("<image_end>")]
-)
-```
-### Chat Format
-```python
-chat = "<s><system>You are a helpful multimodal assistant.</system><user>Describe this image: <image_start><img_0><img_1><image_end></user><assistant>"
 tokens = tokenizer.encode(chat, add_special_tokens=False)
 ```
-## 🔬 Technical Innovations
-### 1. 1/e Vocabulary Allocation (Gradient Axiom)
-Instead of arbitrary vocabulary splits, we use the Gradient Axiom ratio (1/e ≈ 0.368) to allocate tokens across modalities. Text gets the largest share, and each subsequent modality receives 1/e of the previous:
-```
-text:  32,768 tokens (2^15)
-image: 16,384 tokens (2^14 ≈ text × 1/2)
-audio:  8,192 tokens (2^13 ≈ text × 1/4)
-video:  4,096 tokens (2^12 ≈ text × 1/8)
-```
-This follows from the Gradient Axiom: successive modalities contribute exponentially less unique information to a unified representation, with the natural decay rate being 1/e.
-### 2. ByteLevel BPE with NFKC Normalization
-- **ByteLevel pre-tokenization**: Handles ALL Unicode scripts natively — no UNK tokens possible
-- **NFKC normalization**: Canonical Unicode decomposition for consistent encoding
-- **20-language training**: English, French, German, Spanish, Chinese, Japanese, Arabic, Russian, Korean, Hindi, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Turkish, Ukrainian, Swedish
-- **Code + Math support**: Trained on Python, JavaScript, C++, LaTeX, Unicode math
-### 3. Native Multimodal Routing
-Zero-overhead modality switching via contiguous ID ranges:
-- Any model can determine the modality of a token with a single integer comparison
-- No separate embedding tables needed — one unified embedding matrix
-- Compatible with all HuggingFace transformers architectures
-### 4. Sentinel Manifold Integration
-Special tokens `<sentinel>`, `<sentinel_c1>`, `<sentinel_c2>`, `<scale_1e>` enable:
-- Manifold-aware attention (sech attention mechanism)
-- Theorem-grounded weight initialization (Xavier with gain=1/e)
-- C₁-centered embedding quantization
-## 📦 Training Details
 | Parameter | Value |
 |:----------|:------|
-| **Training Data** | allenai/c4 multilingual (20 languages) |
-| **Training Samples** | 52,000 documents |
-| **Training Characters** | ~66M characters |
-| **Algorithm** | ByteLevel BPE with NFKC normalization |
-| **Text Vocab Size** | 32,768 |
-| **Min Merge Frequency** | 2 |
-| **Max Token Length** | 16 bytes |
-| **Total Vocab** | 61,440 (text + image + audio + video) |
 ## 🔗 Links
-- **Parent Framework**: [Sentinel Manifold Discoveries](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
-- **Training Script**: Included in repo (`train_production_tokenizer.py`)
-- **Custom Tokenizer Module**: Included in repo (`sentinel_universal_tokenizer.py`)
 ## 📚 Citation
 ```bibtex
 @misc{abdel-aal2026sentinel-tokenizer,
-  title={Sentinel Universal Tokenizer: A Multimodal Tokenizer Grounded in the Gradient Axiom},
   author={Abdel-Aal, Romain},
   year={2026},
-  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer},
-  note={Part of the Sentinel Manifold framework: F(z) = Σ z^n/n^n, lim F'/F = 1/e}
 }
 ```
 ---
-**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)
-**License**: MIT
-**One theorem. Every modality. Better tokenization.** 🦴

 The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
+🎮 **[Try it live → Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)**
 ## 🧬 Mathematical Foundation
 Built on the **Gradient Axiom** from the Sentinel Manifold:
 ### 🏆 Key Result: Vocabulary Efficiency
+**Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2.** Each token does more work — critical for memory-constrained multimodal models.
 | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
 |:-------|:---------|:---------|:---------|:---------|
 | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
 | Unique advantage | **4 modalities** | text only | text only | text only |
 ### Per-Language Performance
+| Language | Tokens | Bytes | Compression |
+|:---------|:-------|:------|:------------|
 | English | 39 | 159 | **4.08** |
 | French | 45 | 166 | **3.69** |
 | German | 50 | 173 | **3.46** |
 │  [49,152-57,343] → 8,192 Audio codebook tokens        │
 │  [57,344-61,439] → 4,096 Video codebook tokens        │
 │                                                         │
+│  Allocation follows 1/e Gradient Axiom                 │
 └────────────────────────────────────────────────────────┘
 ```
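As a quick arithmetic check on the layout above, the sketch below recomputes each modality's share of the 61,440-token vocabulary from the block sizes listed in this README (32,768 text, 16,384 image, 8,192 audio, 4,096 video). The variable names are illustrative and not part of the repo. It also prints the ratio between successive blocks next to 1/e for comparison; with these sizes each block is exactly half the previous one.

```python
import math

# Block sizes as listed in the README; the dictionary name is illustrative.
BLOCK_SIZES = {"text": 32_768, "image": 16_384, "audio": 8_192, "video": 4_096}

total = sum(BLOCK_SIZES.values())
shares = {name: size / total for name, size in BLOCK_SIZES.items()}

# Ratio of each block to the one before it (image/text, audio/image, video/audio).
sizes = list(BLOCK_SIZES.values())
step_ratios = [later / earlier for earlier, later in zip(sizes, sizes[1:])]

print(f"total vocabulary: {total}")
for name, share in shares.items():
    print(f"{name:>5}: {share:.1%}")
print(f"successive block ratios: {step_ratios}")
print(f"1/e for comparison: {1 / math.e:.3f}")
```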
 |:------|:---|:--------|
 | `<pad>` | 0 | Padding |
 | `<unk>` | 1 | Unknown token |
+| `<s>` / `</s>` | 2/3 | BOS / EOS |
 | `<mask>` | 4 | Masked language modeling |
+| `<image_start>` / `<image_end>` | 7/8 | Image boundaries |
+| `<audio_start>` / `<audio_end>` | 10/11 | Audio boundaries |
+| `<video_start>` / `<video_end>` | 13/14 | Video boundaries |
+| `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16/17/18 | Manifold markers |
 | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
 | `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
 | `<math_start>` / `<math_end>` | 31/32 | Math boundaries |
+### Codebook Tokens
+- 🖼️ **Image**: `<img_0>` – `<img_16383>` (IDs 32,768–49,151) — VQGAN, Cosmos-DI, FSQ
+- 🔊 **Audio**: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343) — EnCodec, SoundStream
+- 🎬 **Video**: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439) — Cosmos-DV
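Because each modality occupies one contiguous ID range, a token's modality can be recovered with plain integer comparisons. The following is a minimal sketch using only the ranges listed above; `MODALITY_RANGES` and `modality_of` are hypothetical names, and treating IDs 0 to 32,767 (special tokens included) as the text block is an assumption based on the stated text vocabulary size.

```python
from typing import Tuple

# (name, first ID, one past the last ID), taken from the ranges listed in the README.
MODALITY_RANGES = [
    ("text", 0, 32_768),
    ("image", 32_768, 49_152),
    ("audio", 49_152, 57_344),
    ("video", 57_344, 61_440),
]


def modality_of(token_id: int) -> Tuple[str, int]:
    """Return (modality, index inside that modality's block) for a token ID."""
    for name, start, stop in MODALITY_RANGES:
        if start <= token_id < stop:
            return name, token_id - start
    raise ValueError(f"token ID {token_id} is outside the 61,440-token vocabulary")


for tid in (42, 32_768, 49_151, 57_344):
    print(tid, modality_of(tid))
```

A model that needs per-modality handling (for example a different loss weight for image tokens) could branch on the first element of the returned tuple.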
 ## 🚀 Quick Start
 ```python
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")
+# Text
 text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
 tokens = tokenizer.encode(text)
+print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")
+# Multimodal (text + image VQ indices)
+text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
 tokens = tokenizer.encode(text)
+for tid in tokens:
     if 32768 <= tid < 49152:
+        print(f"  IMAGE codebook[{tid - 32768}]")
+# Chat
+chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
 tokens = tokenizer.encode(chat, add_special_tokens=False)
 ```
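When an image encoder's output is already available as integer codebook indices, the token IDs can also be built by offset arithmetic instead of string lookups. The sketch below assumes that `<img_i>` maps to ID 32,768 + i, which the listed ID range implies but the README does not state outright, and it takes the boundary-marker IDs 7 and 8 from the special-token table. `wrap_image_indices` is a hypothetical helper, not a function in the repo.

```python
from typing import List, Sequence

IMAGE_START_ID, IMAGE_END_ID = 7, 8   # <image_start> / <image_end> per the token table
IMAGE_OFFSET = 32_768                 # assumed ID of <img_0>
IMAGE_CODEBOOK_SIZE = 16_384          # <img_0> through <img_16383>


def wrap_image_indices(indices: Sequence[int]) -> List[int]:
    """Map VQ codebook indices to token IDs and add the image boundary markers."""
    for i in indices:
        if not 0 <= i < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"codebook index {i} is outside 0..{IMAGE_CODEBOOK_SIZE - 1}")
    return [IMAGE_START_ID] + [IMAGE_OFFSET + i for i in indices] + [IMAGE_END_ID]


print(wrap_image_indices([42, 1337, 256]))  # [7, 32810, 34105, 33024, 8]
```

If the assumption about ID order does not hold for the published vocabulary, looking each `<img_i>` string up with `tokenizer.convert_tokens_to_ids` is the safer route.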
+## 🔬 Innovations
+1. **1/e Vocabulary Allocation** — Gradient Axiom ratio allocates tokens across modalities
+2. **ByteLevel BPE** — Handles all Unicode, no UNK possible, NFKC normalized
+3. **20-language training** — EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
+4. **Native Multimodal Routing** — Single integer comparison determines modality
+5. **Sentinel Manifold Integration** — Special tokens for manifold-aware computation
+## 📦 Training
 | Parameter | Value |
 |:----------|:------|
+| Data | allenai/c4 (20 languages) |
+| Samples | 52,000 documents |
+| Chars | ~66M |
+| Algorithm | ByteLevel BPE + NFKC |
+| Text Vocab | 32,768 |
+| Total Vocab | 61,440 |
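The "compression per vocabulary token" multipliers in the Key Result section can be reproduced from figures on this page under a few assumptions: that the metric is average compression divided by total vocabulary size, that an entry such as -23.8% means Sentinel's ratio is 23.8% below the other tokenizer's, and that the vocabulary sizes are 152,000 for Qwen2 (the README's "152K") and 256,000 for Gemma (a figure not given in the README). This is one reading that happens to match the stated numbers, not a confirmed description of the author's benchmark.

```python
SENTINEL_COMPRESSION = 3.46   # "Avg Compression" row of the comparison table
SENTINEL_VOCAB = 61_440       # "Total Vocab" row of the training table

# name -> (Sentinel's relative difference vs. that tokenizer, assumed vocab size)
OTHERS = {
    "Qwen2": (-0.108, 152_000),   # "152K" as quoted in the README
    "Gemma": (-0.238, 256_000),   # assumption: not stated in the README
}


def per_token_advantage(rel_diff: float, other_vocab: int) -> float:
    """Ratio of Sentinel's compression-per-vocab-token to the other tokenizer's."""
    other_compression = SENTINEL_COMPRESSION / (1 + rel_diff)
    sentinel_density = SENTINEL_COMPRESSION / SENTINEL_VOCAB
    other_density = other_compression / other_vocab
    return sentinel_density / other_density


advantages = {name: per_token_advantage(*spec) for name, spec in OTHERS.items()}
for name, value in advantages.items():
    print(f"Sentinel vs {name}: {value:.3f}x per vocabulary token")
```

Under these assumptions the results land close to the 2.2× and 3.2× quoted for Qwen2 and Gemma, which also shows that the advantage comes from the smaller vocabulary rather than from a higher raw compression ratio.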
 ## 🔗 Links
+- 🎮 [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)
+- 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
+- 📜 Training scripts included in repo
 ## 📚 Citation
 ```bibtex
 @misc{abdel-aal2026sentinel-tokenizer,
+  title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
   author={Abdel-Aal, Romain},
   year={2026},
+  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
 }
 ```
 ---
+**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴