| ---
|
| language:
|
| - en
|
| - multilingual
|
| - code
|
| license: apache-2.0
|
| tags:
|
| - tokenizer
|
| - bpe
|
| - omni
|
| - binary-analysis
|
| ---
|
|
|
| # Omni-Tokenizer (Experimental 539k Fusion)
|
|
|
| This is a highly experimental, massive **539,306-token Byte-Level BPE tokenizer** designed for extreme-scale "Omni" models. It is built to seamlessly process natural languages, highly dense code, and raw binary/machine executables natively.
|
|
|
| ## Composition
|
| This tokenizer was created by extracting and perfectly fusing the vocabularies of several state-of-the-art tokenizers into a single LLaMA-3 base:
|
| 1. `meta-llama/Meta-Llama-3-8B` (Base Byte-Level Foundation)
|
| 2. `google/gemma-7b`
|
| 3. `Qwen/Qwen1.5-7B`
|
| 4. `CohereForAI/c4ai-command-r-v01`
|
| 5. `microsoft/Phi-3-mini-4k-instruct`
|
| 6. `mjbommar/binary-tokenizer-001-64k` (Binary Analysis / Malware)
|
| 7. `mjbommar/binary-tokenizer-001-32k`
|
|
|
| ## Technical Details
|
| - **Vocabulary Size:** 539,306
|
| - **Base Architecture:** Byte-Level BPE (No Unknown Tokens)
|
| - **Use Cases:** Multilingual NLP, Code Generation, Binary/Malware Analysis, Reverse Engineering.
|
|
|
| **Note on Usage:** Due to the massive 540k vocabulary size, this tokenizer will create an embedding matrix of roughly ~2.2 Billion parameters (at 4096 dimensions). It is intended for large-scale experimental models where extreme compression and cross-domain tokenization is required.
|
|
|
| ## How to use
|
| ```python
|
| from transformers import AutoTokenizer
|
|
|
| tokenizer = AutoTokenizer.from_pretrained("SurendraVB/omni-tokenizer")
|
| ```
|
|
|