---

language:
- en
- multilingual
- code
license: apache-2.0
tags:
- tokenizer
- bpe
- omni
- binary-analysis
---


# Omni-Tokenizer (Experimental 539k Fusion)

This is an experimental **539,306-token byte-level BPE tokenizer** intended for very large "Omni" models. It is designed to tokenize natural language, source code, and raw binary or machine-executable content with a single vocabulary.

## Composition
This tokenizer was created by extracting the vocabularies of the tokenizers listed below and fusing them into a single vocabulary built on the LLaMA-3 base:
1. `meta-llama/Meta-Llama-3-8B` (Base Byte-Level Foundation)
2. `google/gemma-7b`
3. `Qwen/Qwen1.5-7B`
4. `CohereForAI/c4ai-command-r-v01`
5. `microsoft/Phi-3-mini-4k-instruct`
6. `mjbommar/binary-tokenizer-001-64k` (Binary Analysis / Malware)
7. `mjbommar/binary-tokenizer-001-32k`
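
The fusion idea can be illustrated with a minimal sketch. This is not the procedure actually used to build this tokenizer, which the card does not describe; the function name `fuse_vocabs` and the toy vocabularies are hypothetical. It assumes the simplest possible scheme: keep every base token at its original ID and append each unseen donor token at the next free ID. A real fusion would also have to reconcile BPE merge rules and differing whitespace conventions (for example, SentencePiece `▁` versus byte-level `Ġ`), which this sketch ignores.

```python
def fuse_vocabs(base: dict[str, int], *others: dict[str, int]) -> dict[str, int]:
    """Append tokens from donor vocabularies that the base does not already have.

    Hypothetical helper for illustration only: base IDs are preserved,
    and new tokens receive consecutive IDs after the largest base ID.
    """
    fused = dict(base)
    next_id = max(fused.values(), default=-1) + 1
    for vocab in others:
        # Walk each donor in its own ID order so the result is deterministic.
        for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
            if token not in fused:
                fused[token] = next_id
                next_id += 1
    return fused


# Toy vocabularies standing in for the real tokenizers (assumed data).
base_vocab = {"a": 0, "b": 1}
donor_one = {"b": 0, "c": 1}
donor_two = {"c": 0, "d": 1}

fused_vocab = fuse_vocabs(base_vocab, donor_one, donor_two)
print(fused_vocab)
```

Duplicates across donors are added only once, which is why a fused vocabulary is smaller than the sum of its source vocabularies.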

## Technical Details
- **Vocabulary Size:** 539,306
- **Base Architecture:** Byte-Level BPE (No Unknown Tokens)
- **Use Cases:** Multilingual NLP, Code Generation, Binary/Malware Analysis, Reverse Engineering.
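
The "No Unknown Tokens" property follows from the byte-level base alphabet: any input, text or binary, decomposes into byte values 0 to 255, and every one of those has a token. The sketch below shows only that decomposition, using plain UTF-8 bytes; it does not call this tokenizer, and the helper name `to_byte_symbols` is hypothetical. Real byte-level BPE implementations additionally map each byte to a printable character and then apply merges.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`, the base symbols of byte-level BPE."""
    return list(text.encode("utf-8"))


# Mixed samples: ASCII, accented Latin, CJK, and an ELF-style binary header prefix.
samples = ["hello", "naïve", "日本語", "\x7fELF"]

for sample in samples:
    symbols = to_byte_symbols(sample)
    # Every symbol falls inside the 256-entry base alphabet, so nothing is "unknown".
    assert all(0 <= value <= 255 for value in symbols)
    # The mapping is lossless: the bytes decode back to the original string.
    assert bytes(symbols).decode("utf-8") == sample
    print(repr(sample), "->", symbols)
```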

**Note on Usage:** With 539,306 tokens, the embedding matrix alone holds roughly 2.2 billion parameters at a hidden size of 4096. The tokenizer is intended for large-scale experimental models where high compression and cross-domain tokenization are required.
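
The parameter estimate in the note above can be checked with a few lines of arithmetic. The hidden size of 4096 comes from the note; the bytes-per-weight figures for memory are standard sizes for those dtypes, and the helper name `embedding_params` is hypothetical.

```python
VOCAB_SIZE = 539_306  # vocabulary size stated in this card
HIDDEN_DIM = 4096     # embedding width used in the note above


def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Number of weights in a vocab_size x hidden_dim embedding table."""
    return vocab_size * hidden_dim


params = embedding_params(VOCAB_SIZE, HIDDEN_DIM)
print(f"{params:,} parameters (~{params / 1e9:.2f}B)")

# Memory for the embedding table alone at two common precisions.
for dtype_name, bytes_per_weight in [("fp32", 4), ("bf16", 2)]:
    gib = params * bytes_per_weight / 2**30
    print(f"{dtype_name}: {gib:.1f} GiB")
```

If the model does not tie its input embeddings to its output projection, the output layer adds a second matrix of the same size.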

## How to use
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("SurendraVB/omni-tokenizer")
```