
	---
	language:
	- en
	- multilingual
	- code
	license: apache-2.0
	tags:
	- tokenizer
	- bpe
	- omni
	- binary-analysis
	---

	# Omni-Tokenizer (Experimental 539k Fusion)

	This is an experimental Byte-Level BPE tokenizer with a 539,306-token vocabulary, designed for very large "Omni" models. It is intended to handle natural language, dense source code, and raw binary data (machine executables) with a single vocabulary.

	## Composition
	This tokenizer was created by extracting the vocabularies of the following tokenizers and fusing them into a single vocabulary on top of the LLaMA-3 base:
	1. `meta-llama/Meta-Llama-3-8B` (Base Byte-Level Foundation)
	2. `google/gemma-7b`
	3. `Qwen/Qwen1.5-7B`
	4. `CohereForAI/c4ai-command-r-v01`
	5. `microsoft/Phi-3-mini-4k-instruct`
	6. `mjbommar/binary-tokenizer-001-64k` (Binary Analysis / Malware)
	7. `mjbommar/binary-tokenizer-001-32k`
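
To make the idea of fusing vocabularies concrete, here is a minimal sketch on toy data. It assumes that fusion means taking the union of token strings, keeping the base tokenizer's IDs unchanged and appending unseen tokens at the end. The function name `fuse_vocabs` and the toy vocabularies are hypothetical, and the sketch ignores BPE merge rules and special tokens, so it is not the script used to build this tokenizer.

```python
def fuse_vocabs(base, extras):
    """Union several token->id vocabularies onto a base vocabulary.

    Assumption: base IDs are contiguous (0..len(base)-1) and must stay fixed;
    tokens not yet present are appended with the next free ID.
    """
    fused = dict(base)
    for vocab in extras:
        # Walk each extra vocabulary in its own ID order for a stable result.
        for token in sorted(vocab, key=vocab.get):
            if token not in fused:
                fused[token] = len(fused)
    return fused


# Toy vocabularies (hypothetical, for illustration only).
base = {"a": 0, "b": 1, "ab": 2}
extra_text = {"a": 0, "c": 1, "ab": 2}
extra_binary = {"c": 0, "\x90\x90": 1}

fused = fuse_vocabs(base, [extra_text, extra_binary])
print(len(fused))  # 5: duplicates such as "a", "ab", and "c" are counted once
```

Because overlapping tokens are counted once, a fused vocabulary of this kind is smaller than the sum of its source vocabularies.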

	## Technical Details
	- Vocabulary Size: 539,306
	- Base Architecture: Byte-Level BPE (No Unknown Tokens)
	- Use Cases: Multilingual NLP, Code Generation, Binary/Malware Analysis, Reverse Engineering.

	Note on Usage: With a vocabulary of 539,306 tokens, the embedding matrix alone holds roughly 2.2 billion parameters at a hidden size of 4096 (539,306 × 4096 ≈ 2.21 billion). The tokenizer is intended for large-scale experimental models where high compression and cross-domain tokenization are required.
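
The parameter estimate above can be checked with a few lines of arithmetic. The helper names, the `tied` flag (whether input and output embeddings share one matrix), and the 2-bytes-per-parameter default are assumptions added for illustration; only the vocabulary size and the 4096 hidden size come from this card.

```python
VOCAB_SIZE = 539_306   # from the Technical Details section
HIDDEN_DIM = 4096      # hidden size used in the note above


def embedding_params(vocab_size, hidden_dim, tied=True):
    """Parameters in the embedding table(s).

    Assumption: an untied model carries a second matrix of the same shape
    for the output projection (LM head).
    """
    matrices = 1 if tied else 2
    return vocab_size * hidden_dim * matrices


def size_gib(num_params, bytes_per_param=2):
    """Memory footprint in GiB, defaulting to 16-bit weights."""
    return num_params * bytes_per_param / 2**30


params = embedding_params(VOCAB_SIZE, HIDDEN_DIM)
print(f"{params:,} parameters")  # 2,208,997,376 parameters
print(f"{size_gib(params):.2f} GiB at 16-bit precision")
```

If the model does not tie its input and output embeddings, the count doubles, which is worth budgeting for before choosing a model size.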

	## How to use
	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("SurendraVB/omni-tokenizer")
	```
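
As a follow-up to the loading snippet, the sketch below measures how many UTF-8 bytes each token covers on a given string, which is one rough way to compare compression across tokenizers. The helper names `load_tokenizer` and `token_report` are hypothetical; the only library calls used are `AutoTokenizer.from_pretrained` and `tokenizer.encode(..., add_special_tokens=False)`.

```python
def load_tokenizer(repo_id="SurendraVB/omni-tokenizer"):
    """Download the tokenizer from the Hub (needs `transformers` and network access)."""
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def token_report(tokenizer, text):
    """Return token count, UTF-8 byte count, and bytes per token for `text`."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    n_bytes = len(text.encode("utf-8"))
    return {
        "n_tokens": len(ids),
        "n_bytes": n_bytes,
        # Guard against empty input, which yields no tokens.
        "bytes_per_token": n_bytes / len(ids) if ids else 0.0,
    }
```

For example, `token_report(load_tokenizer(), "def add(a, b): return a + b")` returns the counts for a short code snippet; a higher `bytes_per_token` value means fewer tokens for the same input.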