5dimension committed on
Commit bb85012 · verified · 1 Parent(s): 2cfa685

Add interactive demo Space link

Files changed (1)
  1. README.md +42 -114
README.md CHANGED
@@ -45,6 +45,8 @@ pipeline_tag: text-generation

  The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

  ## 🧬 Mathematical Foundation

  Built on the **Gradient Axiom** from the Sentinel Manifold:
@@ -77,7 +79,7 @@ Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cas

  ### 🏆 Key Result: Vocabulary Efficiency

- **Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2.** This means each token in the Sentinel vocabulary is doing more "work", a critical advantage for memory-constrained multimodal models.

  | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
  |:-------|:---------|:---------|:---------|:---------|
@@ -85,14 +87,10 @@ Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cas
  | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
  | Unique advantage | **4 modalities** | text only | text only | text only |

- ### Why This Matters
-
- No other tokenizer in this comparison handles image, audio, and video natively. When you account for the 28,672 modality tokens (image: 16K, audio: 8K, video: 4K), the **text-only compression** of Sentinel's 32K text vocabulary is remarkably competitive with Qwen2's 152K text-only vocabulary.
-
  ### Per-Language Performance

- | Language | Tokens | Bytes | Compression Ratio |
- |:---------|:-------|:------|:------------------|
  | English | 39 | 159 | **4.08** |
  | French | 45 | 166 | **3.69** |
  | German | 50 | 173 | **3.46** |
@@ -118,8 +116,7 @@ No other tokenizer in this comparison handles image, audio, and video natively.
  │ [49,152-57,343] → 8,192 Audio codebook tokens │
  │ [57,344-61,439] → 4,096 Video codebook tokens │
  │ │
- │ Allocation follows 1/e Gradient Axiom: │
- │ text: 53.3% | image: 26.7% | audio: 13.3% | video: 6.7% │
  └────────────────────────────────────────────────────────┘
  ```

@@ -129,151 +126,82 @@ No other tokenizer in this comparison handles image, audio, and video natively.
  |:------|:---|:--------|
  | `<pad>` | 0 | Padding |
  | `<unk>` | 1 | Unknown token |
- | `<s>` | 2 | Begin of sequence |
- | `</s>` | 3 | End of sequence |
  | `<mask>` | 4 | Masked language modeling |
- | `<image_start>` / `<image_end>` | 7/8 | Image boundary markers |
- | `<audio_start>` / `<audio_end>` | 10/11 | Audio boundary markers |
- | `<video_start>` / `<video_end>` | 13/14 | Video boundary markers |
- | `<sentinel>` | 16 | Sentinel manifold marker |
- | `<sentinel_c1>` / `<sentinel_c2>` | 17/18 | Mathematical constants |
  | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
  | `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
  | `<math_start>` / `<math_end>` | 31/32 | Math boundaries |

- ### Multimodal Codebook Tokens

- - **Image**: `<img_0>` through `<img_16383>` (IDs 32,768-49,151); compatible with VQGAN, Cosmos-DI, FSQ
- - **Audio**: `<aud_0>` through `<aud_8191>` (IDs 49,152-57,343); compatible with EnCodec, SoundStream
- - **Video**: `<vid_0>` through `<vid_4095>` (IDs 57,344-61,439); compatible with Cosmos-DV

  ## 🚀 Quick Start

- ### Basic Text Usage
-
  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

- # Encode text
  text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
  tokens = tokenizer.encode(text)
- decoded = tokenizer.decode(tokens)
-
- print(f"Tokens: {len(tokens)}")
- print(f"Decoded: {decoded}")
- ```
-
- ### Multimodal Encoding

- ```python
- # Text with image placeholder
- text = "Look at this image: <image_start> <img_42> <img_1337> <img_256> <image_end> What do you see?"
  tokens = tokenizer.encode(text)
- print(f"Multimodal sequence: {len(tokens)} tokens")
-
- # Check modality of each token
- for tid in tokens[:10]:
      if 32768 <= tid < 49152:
-         print(f" Token {tid}: IMAGE codebook index {tid - 32768}")
-     elif 49152 <= tid < 57344:
-         print(f" Token {tid}: AUDIO codebook index {tid - 49152}")
-     elif 57344 <= tid < 61440:
-         print(f" Token {tid}: VIDEO codebook index {tid - 57344}")
- ```
-
- ### Integration with VQ-GAN / Cosmos Tokenizer
-
- ```python
- # After encoding an image with a VQ-GAN:
- # image_indices = vqgan.encode(image) # e.g., [42, 1337, 256, ...]
-
- # Convert to universal tokens
- image_tokens = [tokenizer.convert_tokens_to_ids(f"<img_{i}>") for i in image_indices]
- full_sequence = (
-     [tokenizer.convert_tokens_to_ids("<image_start>")] +
-     image_tokens +
-     [tokenizer.convert_tokens_to_ids("<image_end>")]
- )
- ```

- ### Chat Format
-
- ```python
- chat = "<s><system>You are a helpful multimodal assistant.</system><user>Describe this image: <image_start><img_0><img_1><image_end></user><assistant>"
  tokens = tokenizer.encode(chat, add_special_tokens=False)
  ```

- ## 🔬 Technical Innovations
-
- ### 1. 1/e Vocabulary Allocation (Gradient Axiom)
-
- Instead of arbitrary vocabulary splits, we use the Gradient Axiom ratio (1/e ≈ 0.368) to allocate tokens across modalities. Text gets the largest share, and each subsequent modality receives 1/e of the previous:
-
- ```
- text: 32,768 tokens (2^15)
- image: 16,384 tokens (2^14 ≈ text × 1/2)
- audio: 8,192 tokens (2^13 ≈ text × 1/4)
- video: 4,096 tokens (2^12 ≈ text × 1/8)
- ```
-
- This follows from the Gradient Axiom: successive modalities contribute exponentially less unique information to a unified representation, with the natural decay rate being 1/e.
-
- ### 2. ByteLevel BPE with NFKC Normalization
-
- - **ByteLevel pre-tokenization**: Handles ALL Unicode scripts natively, so no UNK tokens are possible
- - **NFKC normalization**: Canonical Unicode decomposition for consistent encoding
- - **20-language training**: English, French, German, Spanish, Chinese, Japanese, Arabic, Russian, Korean, Hindi, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Turkish, Ukrainian, Swedish
- - **Code + Math support**: Trained on Python, JavaScript, C++, LaTeX, Unicode math
-
- ### 3. Native Multimodal Routing
-
- Zero-overhead modality switching via contiguous ID ranges:
- - Any model can determine the modality of a token with a single integer comparison
- - No separate embedding tables needed; one unified embedding matrix
- - Compatible with all HuggingFace transformers architectures
-
- ### 4. Sentinel Manifold Integration

- Special tokens `<sentinel>`, `<sentinel_c1>`, `<sentinel_c2>`, `<scale_1e>` enable:
- - Manifold-aware attention (sech attention mechanism)
- - Theorem-grounded weight initialization (Xavier with gain=1/e)
- - C₁-centered embedding quantization

- ## 📦 Training Details

  | Parameter | Value |
  |:----------|:------|
- | **Training Data** | allenai/c4 multilingual (20 languages) |
- | **Training Samples** | 52,000 documents |
- | **Training Characters** | ~66M characters |
- | **Algorithm** | ByteLevel BPE with NFKC normalization |
- | **Text Vocab Size** | 32,768 |
- | **Min Merge Frequency** | 2 |
- | **Max Token Length** | 16 bytes |
- | **Total Vocab** | 61,440 (text + image + audio + video) |

  ## 🔗 Links

- - **Parent Framework**: [Sentinel Manifold Discoveries](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
- - **Training Script**: Included in repo (`train_production_tokenizer.py`)
- - **Custom Tokenizer Module**: Included in repo (`sentinel_universal_tokenizer.py`)

  ## 📚 Citation

  ```bibtex
  @misc{abdel-aal2026sentinel-tokenizer,
- title={Sentinel Universal Tokenizer: A Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
- url={https://huggingface.co/5dimension/sentinel-universal-tokenizer},
- note={Part of the Sentinel Manifold framework: F(z) = Σ z^n/n^n, lim F'/F = 1/e}
  }
  ```

  ---

- **Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)
- **License**: MIT
- **One theorem. Every modality. Better tokenization.** 🦴
 

  The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.

+ 🎮 **[Try it live → Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)**
+
  ## 🧬 Mathematical Foundation

  Built on the **Gradient Axiom** from the Sentinel Manifold:
 

  ### 🏆 Key Result: Vocabulary Efficiency

+ **Sentinel-SUT achieves 3.2× better compression per vocabulary token than Gemma and 2.2× better than Qwen2.** Each token does more work, which is critical for memory-constrained multimodal models.

  | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
  |:-------|:---------|:---------|:---------|:---------|
 
  | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
  | Unique advantage | **4 modalities** | text only | text only | text only |

  ### Per-Language Performance

+ | Language | Tokens | Bytes | Compression |
+ |:---------|:-------|:------|:------------|
  | English | 39 | 159 | **4.08** |
  | French | 45 | 166 | **3.69** |
  | German | 50 | 173 | **3.46** |
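
The Compression column in the rows shown is consistent with input bytes divided by token count (for example, 159 / 39 ≈ 4.08). A minimal sketch that recomputes it under that assumption; the helper name `compression_ratio` is illustrative and not part of the repository:

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes of input text per emitted token (higher means better compression)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_bytes / num_tokens


# (tokens, bytes) pairs copied from the table rows shown in this hunk.
rows = {"English": (39, 159), "French": (45, 166), "German": (50, 173)}

# Round to two decimals, matching the precision used in the table.
ratios = {lang: round(compression_ratio(b, t), 2) for lang, (t, b) in rows.items()}
print(ratios)
```

The rounded values match the table, which suggests (but does not prove) that the README uses this bytes-per-token definition for every language.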
 
  │ [49,152-57,343] → 8,192 Audio codebook tokens │
  │ [57,344-61,439] → 4,096 Video codebook tokens │
  │ │
+ │ Allocation follows 1/e Gradient Axiom │
  └────────────────────────────────────────────────────────┘
  ```
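
The per-modality shares of the vocabulary can be recomputed from the ID ranges. The sketch below is illustrative only: the `RANGES` table is a hypothetical name, and the text range `[0, 32,767]` is an assumption inferred from the 32,768-token text vocabulary, since that row is not visible in this hunk.

```python
# Inclusive ID ranges per modality.
# The text range is inferred from the 32,768-token text vocabulary (assumption).
RANGES = {
    "text": (0, 32_767),
    "image": (32_768, 49_151),
    "audio": (49_152, 57_343),
    "video": (57_344, 61_439),
}

# Size of each block and its share of the full vocabulary, in percent.
sizes = {name: hi - lo + 1 for name, (lo, hi) in RANGES.items()}
total = sum(sizes.values())
shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}

print(total)   # 61440
print(shares)  # {'text': 53.3, 'image': 26.7, 'audio': 13.3, 'video': 6.7}
```

With these sizes, each block is exactly half the previous one, so the step between consecutive modalities is 0.5 rather than 1/e ≈ 0.368.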


  |:------|:---|:--------|
  | `<pad>` | 0 | Padding |
  | `<unk>` | 1 | Unknown token |
+ | `<s>` / `</s>` | 2/3 | BOS / EOS |
  | `<mask>` | 4 | Masked language modeling |
+ | `<image_start>` / `<image_end>` | 7/8 | Image boundaries |
+ | `<audio_start>` / `<audio_end>` | 10/11 | Audio boundaries |
+ | `<video_start>` / `<video_end>` | 13/14 | Video boundaries |
+ | `<sentinel>` / `<sentinel_c1>` / `<sentinel_c2>` | 16/17/18 | Manifold markers |
  | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
  | `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
  | `<math_start>` / `<math_end>` | 31/32 | Math boundaries |

+ ### Codebook Tokens

+ - 🖼️ **Image**: `<img_0>` – `<img_16383>` (IDs 32,768–49,151): VQGAN, Cosmos-DI, FSQ
+ - 🔊 **Audio**: `<aud_0>` – `<aud_8191>` (IDs 49,152–57,343): EnCodec, SoundStream
+ - 🎬 **Video**: `<vid_0>` – `<vid_4095>` (IDs 57,344–61,439): Cosmos-DV
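
Because each codebook occupies a contiguous ID block, indices produced by an external codec can be shifted into the shared ID space with plain offset arithmetic. The sketch below assumes that `<img_i>` maps to ID `32768 + i` (and likewise for audio and video), which matches the ranges listed but is not verified against the published tokenizer files. `CODEBOOKS` and `to_universal_ids` are hypothetical names, not part of the repository.

```python
from typing import Iterable, List

# (first ID, codebook size, start-marker ID, end-marker ID) per modality,
# taken from the ID ranges and special-token IDs listed in the README.
CODEBOOKS = {
    "image": (32_768, 16_384, 7, 8),
    "audio": (49_152, 8_192, 10, 11),
    "video": (57_344, 4_096, 13, 14),
}


def to_universal_ids(indices: Iterable[int], modality: str) -> List[int]:
    """Wrap codec indices in boundary markers and shift them into the shared ID space."""
    offset, size, start_id, end_id = CODEBOOKS[modality]
    ids = [start_id]
    for i in indices:
        # Reject indices that would spill into a neighbouring modality's block.
        if not 0 <= i < size:
            raise ValueError(f"{modality} index {i} outside codebook of size {size}")
        ids.append(offset + i)
    ids.append(end_id)
    return ids


print(to_universal_ids([42, 1337, 256], "image"))  # [7, 32810, 34105, 33024, 8]
```

When the tokenizer is loaded, `tokenizer.convert_tokens_to_ids("<img_42>")` is the authoritative lookup; the arithmetic version is only a shortcut that holds if the contiguous-block assumption is true.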

  ## 🚀 Quick Start

  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")

+ # Text
  text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
  tokens = tokenizer.encode(text)
+ print(f"{len(tokens)} tokens, decoded: {tokenizer.decode(tokens)}")

+ # Multimodal (text + image VQ indices)
+ text = "<image_start> <img_42> <img_1337> <image_end> Describe this."
  tokens = tokenizer.encode(text)
+ for tid in tokens:
      if 32768 <= tid < 49152:
+         print(f" IMAGE codebook[{tid - 32768}]")

+ # Chat
+ chat = "<system>Multimodal AI</system><user>What is 1/e?</user><assistant>"
  tokens = tokenizer.encode(chat, add_special_tokens=False)
  ```

+ ## 🔬 Innovations

+ 1. **1/e Vocabulary Allocation**: Gradient Axiom ratio allocates tokens across modalities
+ 2. **ByteLevel BPE**: Handles all Unicode, no UNK possible, NFKC normalized
+ 3. **20-language training**: EN, FR, DE, ES, ZH, JA, AR, RU, KO, HI, PT, IT, NL, PL, VI, TH, TR, UK, SV + code + math
+ 4. **Native Multimodal Routing**: Single integer comparison determines modality
+ 5. **Sentinel Manifold Integration**: Special tokens for manifold-aware computation
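
The routing idea in item 4 can be sketched as a lookup over the block boundaries. This is a minimal illustration, not the repository's implementation: `modality_of` and the constants are hypothetical names, and the text block is assumed to start at ID 0, so it also contains the special tokens.

```python
from bisect import bisect_right
from typing import Tuple

# Lower bound of each contiguous block, in ascending order.
# The text block starting at 0 is an assumption; the other bounds come from the README.
_STARTS = [0, 32_768, 49_152, 57_344]
_NAMES = ["text", "image", "audio", "video"]
VOCAB_SIZE = 61_440


def modality_of(token_id: int) -> Tuple[str, int]:
    """Return (modality, index within that modality's block) for a token ID."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} outside vocabulary of size {VOCAB_SIZE}")
    # bisect_right counts the block starts that are <= token_id.
    block = bisect_right(_STARTS, token_id) - 1
    return _NAMES[block], token_id - _STARTS[block]


for tid in (5, 32_810, 49_152, 61_439):
    print(tid, modality_of(tid))
```

A chain of `if`/`elif` range checks, as in the Quick Start snippet, is equivalent; the bisect form just keeps the boundaries in one place.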

+ ## 📦 Training

  | Parameter | Value |
  |:----------|:------|
+ | Data | allenai/c4 (20 languages) |
+ | Samples | 52,000 documents |
+ | Chars | ~66M |
+ | Algorithm | ByteLevel BPE + NFKC |
+ | Text Vocab | 32,768 |
+ | Total Vocab | 61,440 |

  ## 🔗 Links

+ - 🎮 [Interactive Demo](https://huggingface.co/spaces/5dimension/sentinel-tokenizer-space)
+ - 🦴 [Sentinel Manifold Framework](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
+ - 📜 Training scripts included in repo

  ## 📚 Citation

  ```bibtex
  @misc{abdel-aal2026sentinel-tokenizer,
+ title={Sentinel Universal Tokenizer: Multimodal Tokenizer Grounded in the Gradient Axiom},
  author={Abdel-Aal, Romain},
  year={2026},
+ url={https://huggingface.co/5dimension/sentinel-universal-tokenizer}
  }
  ```

  ---

+ **Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core) · MIT License · 🦴