5dimension commited on
Commit
2cfa685
Β·
verified Β·
1 Parent(s): d1551aa

Update README with deep benchmark efficiency analysis

Browse files
Files changed (1) hide show
  1. README.md +21 -12
README.md CHANGED
@@ -64,21 +64,30 @@ lim_{zβ†’βˆž} F'(z)/F(z) = 1/e β‰ˆ 0.367879441171442
64
 
65
  ## πŸ“Š Benchmark Results
66
 
67
- Tested across **21 languages + code + math**, compared against leading tokenizers:
68
 
69
- | Tokenizer | Vocab Size | Avg Fertility ↓ | Fertility Οƒ ↓ | Compression ↑ | Fairness ↑ |
70
- |:----------|:-----------|:----------------|:-------------|:--------------|:-----------|
71
- | **Gemma** | 256,000 | 6.69 | 11.71 | **4.66** | **0.079** |
72
- | **Qwen2** | 151,936 | 8.03 | 13.75 | 3.82 | 0.068 |
73
- | **Sentinel-SUT** | **61,440** | 9.13 | 16.35 | 3.55 | 0.058 |
74
- | GPT-2 | 50,257 | 20.86 | 40.76 | 2.41 | 0.024 |
75
 
76
- ### Key Findings
 
 
 
 
 
77
 
78
- - **47% better compression than GPT-2** with comparable vocab size (61K vs 50K)
79
- - **Competitive with Qwen2 (152K vocab)** despite using **2.5Γ— fewer tokens**
80
- - **Native multimodal support** β€” no other tokenizer in this comparison handles image/audio/video natively
81
- - **20-language multilingual training** on C4 corpus
 
 
 
 
 
 
 
 
 
82
 
83
  ### Per-Language Performance
84
 
 
64
 
65
  ## πŸ“Š Benchmark Results
66
 
67
+ ### Deep Benchmark (30 test cases Γ— 4 tokenizers)
68
 
69
+ Tested across **21 languages + 3 programming languages + math/LaTeX + 7 edge cases**:
 
 
 
 
 
70
 
71
+ | Tokenizer | Vocab Size | Avg Compress ↑ | Efficiency per 1K Vocab ↑ | Per-Bit Efficiency ↑ |
72
+ |:----------|:-----------|:---------------|:--------------------------|:---------------------|
73
+ | Gemma | 256,000 | **4.54** | 0.018 | **0.253** |
74
+ | **Sentinel-SUT** | **61,440** | 3.46 | **0.056** | 0.218 |
75
+ | Qwen2 | 151,936 | 3.88 | 0.026 | 0.225 |
76
+ | GPT-2 | 50,257 | 2.57 | 0.051 | 0.165 |
77
 
78
+ ### πŸ† Key Result: Vocabulary Efficiency
79
+
80
+ **Sentinel-SUT achieves 3.2Γ— better compression per vocabulary token than Gemma and 2.2Γ— better than Qwen2.** This means each token in the Sentinel vocabulary is doing more "work" β€” a critical advantage for memory-constrained multimodal models.
81
+
82
+ | Metric | Sentinel | vs GPT-2 | vs Qwen2 | vs Gemma |
83
+ |:-------|:---------|:---------|:---------|:---------|
84
+ | Efficiency per 1K vocab | **0.0563** | +10.1% | +120.2% | +217.4% |
85
+ | Avg Compression | 3.46 | +34.7% | -10.8% | -23.8% |
86
+ | Unique advantage | **4 modalities** | text only | text only | text only |
87
+
88
+ ### Why This Matters
89
+
90
+ No other tokenizer in this comparison handles image, audio, and video natively. When you account for the 28,672 modality tokens (image: 16K, audio: 8K, video: 4K), the **text-only compression** of Sentinel's 32K text vocabulary is remarkably competitive with Qwen2's 152K text-only vocabulary.
91
 
92
  ### Per-Language Performance
93