omni-tokenizer / README.md
SurendraVB's picture
Initial release of the 539k Omni-Tokenizer
a987355 verified
|
Raw
History Blame Contribute Delete
1.52 kB
metadata
language:
  - en
  - multilingual
  - code
license: apache-2.0
tags:
  - tokenizer
  - bpe
  - omni
  - binary-analysis

Omni-Tokenizer (Experimental 539k Fusion)

This is a highly experimental, massive 539,306-token Byte-Level BPE tokenizer designed for extreme-scale "Omni" models. It is built to seamlessly process natural languages, highly dense code, and raw binary/machine executables natively.

Composition

This tokenizer was created by extracting and perfectly fusing the vocabularies of several state-of-the-art tokenizers into a single LLaMA-3 base:

  1. meta-llama/Meta-Llama-3-8B (Base Byte-Level Foundation)
  2. google/gemma-7b
  3. Qwen/Qwen1.5-7B
  4. CohereForAI/c4ai-command-r-v01
  5. microsoft/Phi-3-mini-4k-instruct
  6. mjbommar/binary-tokenizer-001-64k (Binary Analysis / Malware)
  7. mjbommar/binary-tokenizer-001-32k

Technical Details

  • Vocabulary Size: 539,306
  • Base Architecture: Byte-Level BPE (No Unknown Tokens)
  • Use Cases: Multilingual NLP, Code Generation, Binary/Malware Analysis, Reverse Engineering.

Note on Usage: Due to the massive 540k vocabulary size, this tokenizer will create an embedding matrix of roughly ~2.2 Billion parameters (at 4096 dimensions). It is intended for large-scale experimental models where extreme compression and cross-domain tokenization is required.

How to use

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SurendraVB/omni-tokenizer")