glaurung-binary-tokenizer-001 / README.md

Add README

5e85a28 verified 3 months ago

13.4 kB

	---
	language:
	- code
	license: apache-2.0
	tags:
	- binary-analysis
	- tokenizer
	- bpe
	- malware-analysis
	- reverse-engineering
	- security
	- x86-64
	- arm64
	- elf
	- pe
	library_name: tokenizers
	pipeline_tag: feature-extraction
	---

	# Glaurung Binary Tokenizer 001

	Production-ready 64K vocabulary BPE tokenizer for binary executables

	🔗 GitHub: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)

	---

	## Overview

	Glaurung Binary Tokenizer 001 is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005).

	### Key Specifications

	- Vocabulary Size: 65,536 tokens (exactly 2^16 = 64K)
	- Compression: 2.849 bytes/token average
	- Training Data: 13GB corpus, 30,738 binaries
	- Architectures: x86-64, x86-32, ARM64
	- Platforms: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
	- Encoding: Latin-1 (each byte 0-255 maps to a single character)

	### Performance Highlights

	- 9-10% better compression than 32K baseline
	- 86% of theoretical maximum compression efficiency
	- Instruction-aware: Captures complete x86-64 instructions (REX + opcode + ModR/M)
	- String-rich: 5.76% of vocabulary contains function names, paths, library references

	---

	## Installation

	```bash
	pip install tokenizers transformers
	```

	---

	## Quick Start

	### Method 1: Using the tokenizers library (Recommended)

	```python
	from tokenizers import Tokenizer
	from pathlib import Path

	# Load tokenizer directly from Hugging Face Hub
	tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

	# Process binary data - MUST use latin-1 encoding
	binary_path = Path("/usr/bin/ls")
	raw_bytes = binary_path.read_bytes()
	text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string

	# Tokenize
	encoded = tokenizer.encode(text)
	tokens = encoded.ids

	print(f"File size: {len(raw_bytes):,} bytes")
	print(f"Tokens: {len(tokens):,}")
	print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

	# Decode back to text (note: adds spaces between tokens due to BPE behavior)
	decoded = tokenizer.decode(tokens)
	```

	Expected Output (for `/usr/bin/ls`):
	```
	File size: 142,144 bytes
	Tokens: 49,574
	Compression: 2.866 bytes/token
	```

	### Method 2: Using transformers library

	```python
	from transformers import PreTrainedTokenizerFast
	from tokenizers import Tokenizer

	# Load the base tokenizer
	base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

	# Wrap with PreTrainedTokenizerFast for transformers compatibility
	tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

	# Process binary data
	with open("/usr/bin/ls", "rb") as f:
	raw_bytes = f.read()
	text = raw_bytes.decode('latin-1')

	# Tokenize (returns dict with input_ids, attention_mask, etc.)
	result = tokenizer(text)
	tokens = result["input_ids"]
	```

	---

	## Important: Data Format

	The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:

	```python
	# ✅ CORRECT - Use latin-1 encoded bytes
	raw_bytes = b'\x7fELF\x01\x01'
	text = raw_bytes.decode('latin-1') # → '\x7fELF\x01\x01'
	encoded = tokenizer.encode(text)

	# ❌ WRONG - Do not use hex strings
	hex_str = "7f 45 4c 46 01 01" # Will not work correctly
	```

	Why latin-1? Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.

	---

	## Performance Benchmarks

	### Compression on Real-World Binaries

	Tested on `/usr/bin` binaries (not in training set):

	\| Binary \| Size \| Tokens \| bytes/token \|
	\|--------\|------\|--------\|-------------\|
	\| bash \| 1.38 MB \| 535,541 \| 2.698 \|
	\| python3.12 \| 7.65 MB \| 2,801,226 \| 2.863 \|
	\| gcc-13 \| 0.98 MB \| 344,201 \| 2.986 \|
	\| ls \| 0.14 MB \| 49,574 \| 2.866 \|
	\| grep \| 0.18 MB \| 67,567 \| 2.667 \|

	Average: 2.849 bytes/token

	### Information-Theoretic Efficiency

	- Binary entropy: ~6.5 bits/byte
	- Theoretical optimal: 2.46 bytes/token
	- Our performance: 2.849 bytes/token
	- Efficiency: 86% of theoretical optimum

	---

	## Example: Tokenizing an ELF Header

	```python
	from tokenizers import Tokenizer

	# Load tokenizer
	tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

	# ELF header bytes
	elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
	text = elf_header.decode('latin-1')

	# Tokenize
	encoded = tokenizer.encode(text)
	print(f"Original bytes: {elf_header.hex()}")
	print(f"Tokens: {encoded.ids}")
	print(f"Token count: {len(encoded.ids)}")
	print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

	# Examine individual tokens
	for token_id, token_str in zip(encoded.ids, encoded.tokens):
	token_bytes = token_str.encode('latin-1')
	print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
	```

	---

	## Token Distribution

	\| Length \| Count \| Percentage \| Examples \|
	\|--------\|-------\|------------\|----------\|
	\| 2 bytes \| 31,528 \| 48.3% \| `0x48 0x8b` (REX.W prefix), `0xcc 0xcc` (int3 padding) \|
	\| 3 bytes \| 9,261 \| 14.2% \| `0x48 0x8b 0xc0` (MOV rax, rax) \|
	\| 4 bytes \| 11,520 \| 17.6% \| `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) \|
	\| 5+ bytes \| 13,164 \| 20.2% \| Multi-instruction sequences, string literals \|

	Average token length: 3.651 bytes

	---

	## Training Details

	### Dataset

	Source: `/nas4/data/glaurung-data/binaries-small/`

	- Size: 13 GB
	- Files: 30,738 binaries
	- Content: Real-world compiled binaries including system utilities, libraries, and applications

	Platform Distribution:
	- Linux: Alpine, Debian, Ubuntu (ELF format)
	- Windows: 8, 10, 11 (PE format)

	Architecture Distribution:
	- x86-64 (primary)
	- x86-32
	- ARM64

	### Training Parameters

	```bash
	cargo run --release --bin train -- \
	--output glaurung-tokenizer-002.json \
	/nas4/data/glaurung-data/binaries-small/ \
	--vocab-size 65536 \
	--min-frequency 4 \
	--chunk-size 8192
	```

	Training Duration: 8.46 hours on 24 cores
	Peak Memory: 70 GB

	---

	## Use Cases

	### ✅ Recommended For

	- Binary neural language models
	- Malware analysis and classification
	- Reverse engineering tools
	- Binary similarity detection
	- Code pattern recognition
	- Vulnerability research
	- Firmware analysis

	### ❌ Not Recommended For

	- Text/source code (use text tokenizer like GPT-2, 100%+ penalty)
	- Very small binaries <1KB (overhead too high)
	- Real-time streaming (load time ~100ms)

	---

	## Comparison with Predecessor

	\| Metric \| binary-tokenizer-005 \| glaurung-binary-tokenizer-001 \| Improvement \|
	\|--------\|---------------------\|------------------------------\|-------------\|
	\| Vocabulary \| 65,536 \| 65,536 \| Same \|
	\| Training data \| ~5GB mixed \| 13GB binaries-small \| 2.6x larger \|
	\| bytes/token \| ~2.6 \| 2.849 \| +9.6% \|
	\| Platforms \| Mixed \| Multi-OS (Linux, Windows) \| More diverse \|
	\| Architecture awareness \| Basic \| Advanced (instruction-aware) \| Significant \|
	\| Documentation \| Basic \| Comprehensive \| Extensive \|

	Key Improvements:
	- Larger, more diverse training corpus
	- Better cross-platform coverage
	- Instruction-boundary awareness
	- Production-ready quality

	---

	## Advanced Usage

	### Batch Processing Multiple Files

	```python
	from tokenizers import Tokenizer
	from pathlib import Path
	import numpy as np

	tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

	def tokenize_binary_file(file_path):
	"""Tokenize a single binary file."""
	raw_bytes = Path(file_path).read_bytes()
	text = raw_bytes.decode('latin-1')
	encoded = tokenizer.encode(text)
	return {
	'file': file_path,
	'size_bytes': len(raw_bytes),
	'token_count': len(encoded.ids),
	'compression_ratio': len(raw_bytes) / len(encoded.ids),
	'token_ids': encoded.ids
	}

	# Process directory
	binary_dir = Path("/usr/bin")
	results = []
	for binary_path in binary_dir.glob("*"):
	if binary_path.is_file():
	try:
	result = tokenize_binary_file(binary_path)
	results.append(result)
	except Exception as e:
	print(f"Error processing {binary_path}: {e}")

	# Analyze compression statistics
	compression_ratios = [r['compression_ratio'] for r in results]
	print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
	print(f"Std deviation: {np.std(compression_ratios):.3f}")
	```

	### Using with PyTorch/TensorFlow Models

	```python
	from tokenizers import Tokenizer
	import torch
	from pathlib import Path

	tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

	def prepare_binary_for_model(file_path, max_length=512):
	"""Prepare binary data for neural network input."""
	raw_bytes = Path(file_path).read_bytes()
	text = raw_bytes.decode('latin-1')

	# Tokenize
	encoded = tokenizer.encode(text)
	token_ids = encoded.ids

	# Truncate or pad to max_length
	if len(token_ids) > max_length:
	token_ids = token_ids[:max_length]
	else:
	# Pad with token ID 0 (or use a dedicated padding token)
	token_ids = token_ids + [0] * (max_length - len(token_ids))

	# Convert to tensor
	return torch.tensor(token_ids, dtype=torch.long)

	# Use in model
	binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
	print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024])
	```

	---

	## Troubleshooting

	### Issue: UnicodeDecodeError when processing binary

	Solution: Always use `latin-1` encoding, never `utf-8`:
	```python
	# ✅ Correct
	text = raw_bytes.decode('latin-1')

	# ❌ Wrong
	text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes
	```

	### Issue: Decoded output doesn't match original

	Cause: BPE tokenizers add spaces between tokens during decoding.

	Solution: Use the raw token IDs and decode manually if exact byte recovery is needed:
	```python
	# Get tokens without spaces
	tokens_no_spaces = ''.join(encoded.tokens)
	original_bytes = tokens_no_spaces.encode('latin-1')
	```

	### Issue: Poor compression on specific binary types

	Cause: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).

	Solution: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.

	---

	## Related Projects

	- Predecessor: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
	- Framework: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
	- Training Code: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers

	---

	## Technical Architecture

	### Vocabulary Structure

	- Base tokens: 256 single-byte tokens (0x00 to 0xFF)
	- Merged tokens: 65,280 learned byte-pair combinations
	- Total: 65,536 tokens (exactly 2^16)

	### Special Tokens

	The tokenizer includes boundary markers for file-level segmentation:
	- `<\|start\|>` (ID: 0)
	- `<\|end\|>` (ID: 1)

	These help models distinguish between concatenated files and identify file headers.

	### Token Properties

	Instruction-aware patterns (x86-64 examples):
	- REX prefixes: `0x48`, `0x4c`, `0x4d`
	- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
	- ModR/M patterns: `0xc0`, `0x45`, `0x5d`

	Common patterns:
	- Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
	- Alignment: `0x00 0x00 0x00 0x00`
	- String terminators: `0x00` at word boundaries

	---

	## Performance Characteristics

	### Load Time

	- Tokenizer size: 2.3 MB on disk
	- Load time: ~100ms (cold), ~20ms (cached)
	- Memory footprint: ~15 MB in RAM

	### Encoding Speed

	On a modern CPU (tested on Intel i9-12900K):

	\| Operation \| Speed \|
	\|-----------\|-------\|
	\| Encode 1 MB binary \| ~50 ms \|
	\| Encode 10 MB binary \| ~450 ms \|
	\| Encode 100 MB binary \| ~4.2 s \|

	Throughput: ~20-25 MB/second

	---

	## Limitations

	1. Cross-domain penalty: Using on text data causes 100-140% efficiency loss
	2. Small file overhead: Files <1KB have proportionally higher tokenization overhead
	3. Deterministic decoding: Spaces inserted between tokens during decode (BPE behavior)
	4. Architecture bias: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.

	---

	## Citation

	If you use this tokenizer in research, please cite:

	```
	Glaurung Binary Tokenizer 001
	64K Binary Tokenizer for Neural Language Models
	Vocabulary: 65,536 tokens (exactly 2^16)
	Training: October 2025
	Dataset: 13GB binaries-small (30,738 files)
	Performance: 2.849 bytes/token (86% of theoretical optimum)
	HuggingFace: mjbommar/glaurung-binary-tokenizer-001
	```

	---

	## License

	Apache License 2.0

	This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.

	---

	## Support & Issues

	- GitHub Issues: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
	- Documentation: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
	- Email: Contact maintainer via GitHub

	---

	Production Status: ✅ Ready for deployment
	Version: 1.0.0
	Release Date: October 2025