# Binary Tokenizer 005

A BPE tokenizer trained on binary executable files for security research and binary analysis tasks.

## Overview

This tokenizer uses Byte Pair Encoding (BPE) trained on latin-1 encoded binary data from system executables across multiple operating systems and architectures (Alpine, Ubuntu, Debian, Rocky Linux, ARM64, x86-64).

- **Vocabulary Size**: 65,536 tokens
- **Training Data**: System binaries from various OS distributions
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)

## Installation

```bash
pip install tokenizers transformers
```

## Usage

### Method 1: Using the tokenizers library (Recommended)

```python
from tokenizers import Tokenizer

# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")

# Process binary data - MUST use latin-1 encoding
with open("binary_file", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode("latin-1")  # Convert bytes to a latin-1 string

encoded = tokenizer.encode(text)
tokens = encoded.ids

# Decode back to text
decoded = tokenizer.decode(tokens)
```

### Method 2: Using the transformers library

```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")

# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("binary_file", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode("latin-1")

# Tokenize (returns a dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```

## Important: Data Format

The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:

```python
# CORRECT - use latin-1 encoded bytes
raw_bytes = b"\x7fELF\x01\x01"
text = raw_bytes.decode("latin-1")  # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# WRONG - do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # ❌ Will not tokenize correctly
```

## Example: Tokenizing an ELF Header

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")

# ELF header bytes
elf_header = b"\x7fELF\x01\x01\x01\x00"
text = elf_header.decode("latin-1")

# Tokenize
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.ids}")
# Output: [0, 45689, 205, 22648, 1]
# Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'

# Special tokens <|start|> (id=0) and <|end|> (id=1) mark file boundaries,
# which helps models identify file headers and distinguish between files.
# The content tokens are [45689, 205, 22648].

# Note: decoding adds spaces between tokens (BPE tokenizer behavior)
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {repr(decoded)}")  # '\x7fEL F \x01\x01\x01\x00'
```

## Related Projects

- [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
- [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers

## License

For research and educational purposes.
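## Appendix: Why Latin-1?

Latin-1 is required because it maps each byte value 0-255 to the Unicode code point with the same number: decoding arbitrary binary data never fails, and the original bytes can always be recovered exactly. A stdlib-only sketch of this round-trip property (no tokenizer download required):

```python
# Latin-1 losslessly round-trips arbitrary bytes, which is why the
# tokenizer uses it to represent binary data as text.
elf_header = b"\x7fELF\x01\x01\x01\x00"

text = elf_header.decode("latin-1")   # bytes -> str, one char per byte
assert len(text) == len(elf_header)   # one-to-one: no length change

recovered = text.encode("latin-1")    # str -> bytes
assert recovered == elf_header        # exact round trip

# Every byte value 0-255 maps to exactly one code point and back:
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

By contrast, UTF-8 would raise a `UnicodeDecodeError` on most executables (e.g. on the lone `\x7f`-adjacent high bytes common in machine code), which is why latin-1, not UTF-8, is the correct bridge between raw bytes and the string input the tokenizer expects.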