Configuration Parsing Warning:Config file tokenizer_config.json cannot be fetched (too big)
Omni-Tokenizer (Experimental 539k Fusion)
This is a highly experimental, massive 539,306-token Byte-Level BPE tokenizer designed for extreme-scale "Omni" models. It is built to seamlessly process natural languages, highly dense code, and raw binary/machine executables natively.
Composition
This tokenizer was created by extracting and perfectly fusing the vocabularies of several state-of-the-art tokenizers into a single LLaMA-3 base:
meta-llama/Meta-Llama-3-8B(Base Byte-Level Foundation)google/gemma-7bQwen/Qwen1.5-7BCohereForAI/c4ai-command-r-v01microsoft/Phi-3-mini-4k-instructmjbommar/binary-tokenizer-001-64k(Binary Analysis / Malware)mjbommar/binary-tokenizer-001-32k
Technical Details
- Vocabulary Size: 539,306
- Base Architecture: Byte-Level BPE (No Unknown Tokens)
- Use Cases: Multilingual NLP, Code Generation, Binary/Malware Analysis, Reverse Engineering.
Note on Usage: Due to the massive 540k vocabulary size, this tokenizer will create an embedding matrix of roughly ~2.2 Billion parameters (at 4096 dimensions). It is intended for large-scale experimental models where extreme compression and cross-domain tokenization is required.
How to use
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("SurendraVB/omni-tokenizer")