# Binary Tokenizer 005
A BPE tokenizer trained on binary executable files for security research and binary analysis tasks.
## Overview
This tokenizer uses Byte Pair Encoding (BPE) trained on binary data from system executables, decoded as Latin-1, spanning multiple operating systems (Alpine, Ubuntu, Debian, Rocky Linux) and architectures (ARM64, x86-64).
- **Vocabulary Size**: 65,536 tokens
- **Training Data**: System binaries from various OS distributions
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
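Latin-1 is used because it is the one single-byte encoding that maps every byte value 0-255 to exactly one character, so the bytes-to-string conversion is lossless in both directions. A quick sanity check of that round-trip property:

```python
# Latin-1 maps each byte 0-255 to a single character, losslessly.
data = bytes(range(256))          # all 256 possible byte values
text = data.decode("latin-1")     # bytes -> str, one char per byte
assert len(text) == 256           # no bytes merged or dropped
assert text.encode("latin-1") == data  # str -> bytes recovers the original
```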
## Installation
```bash
pip install tokenizers transformers
```
## Usage
### Method 1: Using the tokenizers library (Recommended)
```python
from tokenizers import Tokenizer
# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
# Process binary data - MUST use latin-1 encoding
with open("binary_file", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string
encoded = tokenizer.encode(text)
tokens = encoded.ids
# Decode back to text
decoded = tokenizer.decode(tokens)
```
### Method 2: Using transformers library
```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
# Process binary data
with open("binary_file", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')
# Tokenize (returns dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```
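Large executables can easily exceed a model's context window. One approach (a sketch, not part of this tokenizer's API) is to split the raw bytes into fixed-size windows and decode each window to Latin-1 before tokenizing:

```python
# Hypothetical helper: split raw binary data into fixed-size Latin-1
# text windows so each piece can be tokenized separately.
def chunk_latin1(raw: bytes, size: int = 4096):
    """Yield latin-1 text windows of at most `size` bytes each."""
    for i in range(0, len(raw), size):
        yield raw[i:i + size].decode("latin-1")

# Example: a 10,000-byte blob yields windows of 4096, 4096, and 1808 chars.
chunks = list(chunk_latin1(b"\x00" * 10000, size=4096))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

Each chunk can then be passed to `tokenizer.encode()` as in the examples above; note that chunk boundaries chosen this way are arbitrary and may split a multi-byte pattern across two windows.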
## Important: Data Format
The tokenizer expects binary data decoded to Latin-1 strings, NOT hex strings:
```python
# CORRECT - Use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1') # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)
# WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01" # ❌ Will not work correctly
```
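If your data does arrive as a hex dump, convert it to raw bytes first and then decode as Latin-1. A small helper (hypothetical, not part of the tokenizer) might look like:

```python
# Hypothetical helper: turn a space-separated hex dump into the
# latin-1 text this tokenizer expects.
def hex_dump_to_latin1(hex_str: str) -> str:
    raw = bytes.fromhex(hex_str.replace(" ", ""))  # "7f 45" -> b'\x7fE'
    return raw.decode("latin-1")

text = hex_dump_to_latin1("7f 45 4c 46 01 01")
print(repr(text))  # '\x7fELF\x01\x01'
```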
## Example: Tokenizing an ELF Header
```python
from tokenizers import Tokenizer
# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
# ELF header bytes
elf_header = b'\x7fELF\x01\x01\x01\x00'
text = elf_header.decode('latin-1')
# Tokenize
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.ids}")
# Output: [0, 45689, 205, 22648, 1]
# Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
# Special tokens: <|start|> (id=0) and <|end|> (id=1) mark file boundaries
# This helps models identify file headers and distinguish between files
# Content tokens are: [45689, 205, 22648]
# Note: Decoding adds spaces between tokens (BPE tokenizer behavior)
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {repr(decoded)}") # '\x7fEL F \x01\x01\x01\x00'
```
## Related Projects
- [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
- [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
## License
For research and educational purposes.