--- language: - code license: apache-2.0 tags: - binary-analysis - tokenizer - bpe - malware-analysis - reverse-engineering - security - x86-64 - arm64 - elf - pe library_name: tokenizers pipeline_tag: feature-extraction --- # Glaurung Binary Tokenizer 001 **Production-ready 64K vocabulary BPE tokenizer for binary executables** 🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) --- ## Overview **Glaurung Binary Tokenizer 001** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005). ### Key Specifications - **Vocabulary Size**: 65,536 tokens (exactly 2^16 = 64K) - **Compression**: 2.849 bytes/token average - **Training Data**: 13GB corpus, 30,738 binaries - **Architectures**: x86-64, x86-32, ARM64 - **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11) - **Encoding**: Latin-1 (each byte 0-255 maps to a single character) ### Performance Highlights - **9-10% better compression** than 32K baseline - **86% of theoretical maximum** compression efficiency - **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M) - **String-rich**: 5.76% of vocabulary contains function names, paths, library references --- ## Installation ```bash pip install tokenizers transformers ``` --- ## Quick Start ### Method 1: Using the tokenizers library (Recommended) ```python from tokenizers import Tokenizer from pathlib import Path # Load tokenizer directly from Hugging Face Hub tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001") # Process binary data - MUST use latin-1 encoding binary_path = Path("/usr/bin/ls") raw_bytes = binary_path.read_bytes() text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string # Tokenize encoded = tokenizer.encode(text) tokens = encoded.ids print(f"File size: {len(raw_bytes):,} bytes") print(f"Tokens: {len(tokens):,}") print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token") # Decode back to text (note: adds spaces between tokens due to BPE behavior) decoded = tokenizer.decode(tokens) ``` **Expected Output** (for `/usr/bin/ls`): ``` File size: 142,144 bytes Tokens: 49,574 Compression: 2.866 bytes/token ``` ### Method 2: Using transformers library ```python from transformers import PreTrainedTokenizerFast from tokenizers import Tokenizer # Load the base tokenizer base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001") # Wrap with PreTrainedTokenizerFast for transformers compatibility tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer) # Process binary data with open("/usr/bin/ls", "rb") as f: raw_bytes = f.read() text = raw_bytes.decode('latin-1') # Tokenize (returns dict with input_ids, attention_mask, etc.) result = tokenizer(text) tokens = result["input_ids"] ``` --- ## Important: Data Format The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings: ```python # ✅ CORRECT - Use latin-1 encoded bytes raw_bytes = b'\x7fELF\x01\x01' text = raw_bytes.decode('latin-1') # → '\x7fELF\x01\x01' encoded = tokenizer.encode(text) # ❌ WRONG - Do not use hex strings hex_str = "7f 45 4c 46 01 01" # Will not work correctly ``` **Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text. --- ## Performance Benchmarks ### Compression on Real-World Binaries Tested on `/usr/bin` binaries (not in training set): | Binary | Size | Tokens | bytes/token | |--------|------|--------|-------------| | bash | 1.38 MB | 535,541 | 2.698 | | python3.12 | 7.65 MB | 2,801,226 | 2.863 | | gcc-13 | 0.98 MB | 344,201 | 2.986 | | ls | 0.14 MB | 49,574 | 2.866 | | grep | 0.18 MB | 67,567 | 2.667 | **Average**: 2.849 bytes/token ### Information-Theoretic Efficiency - Binary entropy: ~6.5 bits/byte - Theoretical optimal: 2.46 bytes/token - Our performance: 2.849 bytes/token - **Efficiency: 86%** of theoretical optimum --- ## Example: Tokenizing an ELF Header ```python from tokenizers import Tokenizer # Load tokenizer tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001") # ELF header bytes elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' text = elf_header.decode('latin-1') # Tokenize encoded = tokenizer.encode(text) print(f"Original bytes: {elf_header.hex()}") print(f"Tokens: {encoded.ids}") print(f"Token count: {len(encoded.ids)}") print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token") # Examine individual tokens for token_id, token_str in zip(encoded.ids, encoded.tokens): token_bytes = token_str.encode('latin-1') print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)") ``` --- ## Token Distribution | Length | Count | Percentage | Examples | |--------|-------|------------|----------| | 2 bytes | 31,528 | 48.3% | `0x48 0x8b` (REX.W prefix), `0xcc 0xcc` (int3 padding) | | 3 bytes | 9,261 | 14.2% | `0x48 0x8b 0xc0` (MOV rax, rax) | | 4 bytes | 11,520 | 17.6% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) | | 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals | **Average token length**: 3.651 bytes --- ## Training Details ### Dataset **Source**: `/nas4/data/glaurung-data/binaries-small/` - **Size**: 13 GB - **Files**: 30,738 binaries - **Content**: Real-world compiled binaries including system utilities, libraries, and applications **Platform Distribution**: - Linux: Alpine, Debian, Ubuntu (ELF format) - Windows: 8, 10, 11 (PE format) **Architecture Distribution**: - x86-64 (primary) - x86-32 - ARM64 ### Training Parameters ```bash cargo run --release --bin train -- \ --output glaurung-tokenizer-002.json \ /nas4/data/glaurung-data/binaries-small/ \ --vocab-size 65536 \ --min-frequency 4 \ --chunk-size 8192 ``` **Training Duration**: 8.46 hours on 24 cores **Peak Memory**: 70 GB --- ## Use Cases ### ✅ Recommended For - Binary neural language models - Malware analysis and classification - Reverse engineering tools - Binary similarity detection - Code pattern recognition - Vulnerability research - Firmware analysis ### ❌ Not Recommended For - Text/source code (use text tokenizer like GPT-2, 100%+ penalty) - Very small binaries <1KB (overhead too high) - Real-time streaming (load time ~100ms) --- ## Comparison with Predecessor | Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement | |--------|---------------------|------------------------------|-------------| | Vocabulary | 65,536 | 65,536 | Same | | Training data | ~5GB mixed | 13GB binaries-small | 2.6x larger | | bytes/token | ~2.6 | 2.849 | +9.6% | | Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse | | Architecture awareness | Basic | Advanced (instruction-aware) | Significant | | Documentation | Basic | Comprehensive | Extensive | **Key Improvements**: - Larger, more diverse training corpus - Better cross-platform coverage - Instruction-boundary awareness - Production-ready quality --- ## Advanced Usage ### Batch Processing Multiple Files ```python from tokenizers import Tokenizer from pathlib import Path import numpy as np tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001") def tokenize_binary_file(file_path): """Tokenize a single binary file.""" raw_bytes = Path(file_path).read_bytes() text = raw_bytes.decode('latin-1') encoded = tokenizer.encode(text) return { 'file': file_path, 'size_bytes': len(raw_bytes), 'token_count': len(encoded.ids), 'compression_ratio': len(raw_bytes) / len(encoded.ids), 'token_ids': encoded.ids } # Process directory binary_dir = Path("/usr/bin") results = [] for binary_path in binary_dir.glob("*"): if binary_path.is_file(): try: result = tokenize_binary_file(binary_path) results.append(result) except Exception as e: print(f"Error processing {binary_path}: {e}") # Analyze compression statistics compression_ratios = [r['compression_ratio'] for r in results] print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token") print(f"Std deviation: {np.std(compression_ratios):.3f}") ``` ### Using with PyTorch/TensorFlow Models ```python from tokenizers import Tokenizer import torch from pathlib import Path tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001") def prepare_binary_for_model(file_path, max_length=512): """Prepare binary data for neural network input.""" raw_bytes = Path(file_path).read_bytes() text = raw_bytes.decode('latin-1') # Tokenize encoded = tokenizer.encode(text) token_ids = encoded.ids # Truncate or pad to max_length if len(token_ids) > max_length: token_ids = token_ids[:max_length] else: # Pad with token ID 0 (or use a dedicated padding token) token_ids = token_ids + [0] * (max_length - len(token_ids)) # Convert to tensor return torch.tensor(token_ids, dtype=torch.long) # Use in model binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024) print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024]) ``` --- ## Troubleshooting ### Issue: UnicodeDecodeError when processing binary **Solution**: Always use `latin-1` encoding, never `utf-8`: ```python # ✅ Correct text = raw_bytes.decode('latin-1') # ❌ Wrong text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes ``` ### Issue: Decoded output doesn't match original **Cause**: BPE tokenizers add spaces between tokens during decoding. **Solution**: Use the raw token IDs and decode manually if exact byte recovery is needed: ```python # Get tokens without spaces tokens_no_spaces = ''.join(encoded.tokens) original_bytes = tokens_no_spaces.encode('latin-1') ``` ### Issue: Poor compression on specific binary types **Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class). **Solution**: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline. --- ## Related Projects - **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer - **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework - **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers --- ## Technical Architecture ### Vocabulary Structure - **Base tokens**: 256 single-byte tokens (0x00 to 0xFF) - **Merged tokens**: 65,280 learned byte-pair combinations - **Total**: 65,536 tokens (exactly 2^16) ### Special Tokens The tokenizer includes boundary markers for file-level segmentation: - `<|start|>` (ID: 0) - `<|end|>` (ID: 1) These help models distinguish between concatenated files and identify file headers. ### Token Properties **Instruction-aware patterns** (x86-64 examples): - REX prefixes: `0x48`, `0x4c`, `0x4d` - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL) - ModR/M patterns: `0xc0`, `0x45`, `0x5d` **Common patterns**: - Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop) - Alignment: `0x00 0x00 0x00 0x00` - String terminators: `0x00` at word boundaries --- ## Performance Characteristics ### Load Time - **Tokenizer size**: 2.3 MB on disk - **Load time**: ~100ms (cold), ~20ms (cached) - **Memory footprint**: ~15 MB in RAM ### Encoding Speed On a modern CPU (tested on Intel i9-12900K): | Operation | Speed | |-----------|-------| | Encode 1 MB binary | ~50 ms | | Encode 10 MB binary | ~450 ms | | Encode 100 MB binary | ~4.2 s | **Throughput**: ~20-25 MB/second --- ## Limitations 1. **Cross-domain penalty**: Using on text data causes 100-140% efficiency loss 2. **Small file overhead**: Files <1KB have proportionally higher tokenization overhead 3. **Deterministic decoding**: Spaces inserted between tokens during decode (BPE behavior) 4. **Architecture bias**: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc. --- ## Citation If you use this tokenizer in research, please cite: ``` Glaurung Binary Tokenizer 001 64K Binary Tokenizer for Neural Language Models Vocabulary: 65,536 tokens (exactly 2^16) Training: October 2025 Dataset: 13GB binaries-small (30,738 files) Performance: 2.849 bytes/token (86% of theoretical optimum) HuggingFace: mjbommar/glaurung-binary-tokenizer-001 ``` --- ## License Apache License 2.0 This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details. --- ## Support & Issues - **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues) - **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002) - **Email**: Contact maintainer via GitHub --- **Production Status**: ✅ Ready for deployment **Version**: 1.0.0 **Release Date**: October 2025