---
language:
- code
license: apache-2.0
tags:
- tokenizer
- binary-analysis
- binary-tokenization
- bpe
- byte-pair-encoding
- malware-analysis
- reverse-engineering
- security
- x86-64
- arm64
- elf
- pe
library_name: tokenizers
pipeline_tag: feature-extraction
---

# glaurung-binary-tokenizer-002

A cross-platform BPE tokenizer for binary executables and machine code, trained with chunked, deduplication-aware BPE on 23 GB of diverse binaries spanning Linux and Windows platforms.

**🔗 Model**: [`mjbommar/glaurung-binary-tokenizer-002`](https://huggingface.co/mjbommar/glaurung-binary-tokenizer-002)

**📊 Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)

**📄 Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)

## Overview

- **Vocabulary Size**: 65,536 tokens (2^16)
- **Token Composition**: 256 base bytes + 65,273 learned merges + 7 special tokens
- **Average Token Length**: 3.749 bytes
- **3-byte Instructions**: 16.5% of vocabulary (10,800 tokens)
- **Compression Ratio**: ~2.6 bytes/token on typical binaries

---

## Training Configuration

**Training Corpus**:

- Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
- Size: ~23 GB (24.7 billion bytes)
- Processed Chunks: 40,574 total (37,083 unique + 8,454 duplicates reused)
- Platforms: Linux (Alpine, Debian, Ubuntu - ELF), Windows (8, 10, 11 - PE)
- Architectures: x86-64, x86-32, ARM64

**Training Parameters** (a rough sketch of this setup appears below):

- Vocabulary size: 65,536 (including 7 special tokens)
- Min frequency: 4
- Chunk size: 4,194,304 bytes (4 MB)
- Training method: Chunked BPE with deduplication and support-based merge combination
- Allowed token lengths: DEFAULT (1-16 bytes)
- Training duration: ~8-9 hours
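The released tokenizer was trained with the `bbpe` tool (v0.3.2). As a rough illustration of the configuration above, here is a minimal sketch using the Hugging Face `tokenizers` library instead; the `binary_paths` list and the SHA-256 chunk deduplication are illustrative assumptions, and bbpe's support-based merge combination has no direct equivalent in this API.

```python
import hashlib
from tokenizers import Tokenizer, models, trainers

# Hypothetical input corpus; the real training set is the 23 GB dataset above.
binary_paths = ["/usr/bin/ls", "/usr/bin/grep"]

SPECIAL_TOKENS = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
                  "<|cls|>", "<|sep|>", "<|mask|>"]

def dedup_chunks(paths, chunk_size=4 * 1024 * 1024):
    """Yield 4 MB file chunks as latin-1 strings, skipping exact duplicates."""
    seen = set()
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                digest = hashlib.sha256(chunk).digest()
                if digest in seen:  # deduplicate repeated chunks
                    continue
                seen.add(digest)
                # latin-1 maps bytes 0-255 to code points 0-255 losslessly
                yield chunk.decode("latin-1")

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    vocab_size=65536,                               # 2^16 target vocabulary
    min_frequency=4,                                # matches the setting above
    special_tokens=SPECIAL_TOKENS,                  # the 7 special tokens
    initial_alphabet=[chr(i) for i in range(256)],  # all 256 base bytes
    max_token_length=16,                            # allowed lengths 1-16 bytes
)
tokenizer.train_from_iterator(dedup_chunks(binary_paths), trainer=trainer)
tokenizer.save("tokenizer-65536.json")
```

Note that this sketch feeds each chunk to the trainer as one long string; at 23 GB scale that is far slower and more memory-hungry than bbpe's chunked pipeline.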
---

## Vocabulary Statistics

**Composition**:

- Base bytes (0-255): 256 tokens
- Learned merges: 65,273 tokens
- Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Total**: 65,536 tokens

**Quality Metrics**:

- All tokens reachable: ✓ Yes
- Valid merges: 65,273 / 65,273
- Power-of-2 size: ✓ Yes (2^16)

---

## Token Length Distribution

| Length | Count | Percentage | Description |
|--------|-------|------------|-------------|
| 1 byte | 256 | 0.4% | Base bytes |
| 2 bytes | 28,561 | 43.6% | Byte pairs (most common patterns) |
| 3 bytes | 10,800 | 16.5% | Complete x86-64 instructions |
| 4 bytes | 14,376 | 21.9% | Instructions with operands |
| 5 bytes | 2,780 | 4.2% | Complex patterns |
| 6 bytes | 2,213 | 3.4% | Complex patterns |
| 7 bytes | 1,167 | 1.8% | Complex patterns |
| 8 bytes | 2,329 | 3.6% | Multi-byte sequences |
| 9+ bytes | 3,045 | 4.6% | Long patterns |

**Average Token Length**: 3.749 bytes

---

## Byte Content Analysis

**Content Categories**:

- Contains NULL byte (0x00): 17,418 tokens (26.6%)
- ASCII printable (0x20-0x7E): 9,478 tokens (14.5%)
- All ASCII (<0x80): 20,816 tokens (31.8%)
- High bytes (≥0x80): 44,711 tokens (68.2%)

**Most Common Bytes in Tokens**:

- `0x00` (NULL): 34,482 occurrences - padding and alignment
- `0xFF`: 6,545 occurrences - sentinel values
- `0x40` (@): 4,538 occurrences - ASCII and instruction patterns
- `0x48` (REX.W): 3,419 occurrences - x86-64 REX prefix
- `0x8B` (MOV): 2,486 occurrences - x86-64 MOV opcode

---

## Sequence Coverage

**N-byte Sequence Diversity**:

| Length | Learned Tokens | Possible Sequences | Coverage |
|--------|----------------|--------------------|----------|
| 1-byte | 256 | 256 | 100.00% |
| 2-byte | 28,561 | 65,536 | 43.58% |
| 3-byte | 10,800 | 16,777,216 | 0.064% |
| 4-byte | 14,376 | 4,294,967,296 | 0.00034% |

**Notable**: the vocabulary covers 43.6% of all possible 2-byte sequences, so nearly half of all byte pairs map to a single token. These statistics can be reproduced directly from the vocabulary, as in the sketch below.
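As a cross-check, the length and byte-content statistics above can be approximated directly from the released vocabulary. This is a minimal sketch, not the repository's `analyze_tokenizer.py`; it assumes, as in the usage examples below, that token strings map back to raw bytes via latin-1.

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
SPECIAL = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
           "<|cls|>", "<|sep|>", "<|mask|>"}

lengths = Counter()
contains_null = ascii_printable = 0
for token in tokenizer.get_vocab():
    if token in SPECIAL:
        continue
    raw = token.encode("latin-1")    # token string -> original bytes
    lengths[min(len(raw), 9)] += 1   # bucket lengths 9 and above together
    contains_null += 0x00 in raw
    ascii_printable += all(0x20 <= b <= 0x7E for b in raw)

total = sum(lengths.values())
for n in sorted(lengths):
    label = "9+" if n == 9 else str(n)
    print(f"{label:>2} bytes: {lengths[n]:6d} ({lengths[n] / total:5.1%})")
print(f"contains NULL: {contains_null}  ASCII printable: {ascii_printable}")
```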
---

## Files

- `tokenizer-65536.json` - Trained tokenizer model (2.4 MB)
- `analysis_results.json` - Detailed analysis statistics
- `original_README.md` - Original README from HuggingFace

---

## Usage

**Load from HuggingFace Hub**:

```python
from tokenizers import Tokenizer

# Load directly from HuggingFace
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
```

**Command-line usage (bbpe CLI)**:

```bash
# With bbpe CLI
bbpe encode --tokenizer tokenizer-65536.json /path/to/binary
bbpe info tokenizer-65536.json
```

**Complete Python Example**:

```python
from tokenizers import Tokenizer

# Load from HuggingFace or a local file
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
# OR: tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# Read a binary file and decode as latin-1 (preserves all byte values 0-255)
with open("/usr/bin/ls", "rb") as f:
    data = f.read()
data_str = data.decode("latin-1")

# Encode the binary data
encoding = tokenizer.encode(data_str)
print(f"File size: {len(data)} bytes")
print(f"Total tokens: {len(encoding.ids)}")
print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")

# First 10 tokens
for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
    token_bytes = token.encode("latin-1")
    print(f"  Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")

# Decode tokens back to bytes
decoded_str = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_str.encode("latin-1")
assert decoded_bytes == data  # Perfect reconstruction
```

**Example output for `/usr/bin/ls` (142,312 bytes)**:

```
File size: 142312 bytes
Total tokens: 54537
Compression: 2.609 bytes/token

First 10 tokens:
  Token 0: ID=  127 hex=7f                   (1 bytes)
  Token 1: ID= 2382 hex=454c                 (2 bytes)
  Token 2: ID= 5923 hex=4602                 (2 bytes)
  Token 3: ID=  394 hex=0101                 (2 bytes)
  Token 4: ID=  268 hex=000000000000         (6 bytes)
  Token 5: ID=  259 hex=000000               (3 bytes)
  Token 6: ID=  295 hex=0300                 (2 bytes)
  Token 7: ID= 2124 hex=3e00                 (2 bytes)
  Token 8: ID=  271 hex=01000000             (4 bytes)
  Token 9: ID=59106 hex=306d                 (2 bytes)

Decoded: 7f454c4602010100000000000000000003003e0001000000306d...
(ELF header: 7f 45 4c 46 = ELF magic bytes)
```

---

## Performance Characteristics

**Compression on Real-World Binaries**:

| Binary | Size | Tokens | bytes/token |
|--------|------|--------|-------------|
| bash | 1.38 MB | 602,719 | 2.399 |
| python3.12 | 7.65 MB | 2,997,303 | 2.676 |
| gcc-13 | 0.98 MB | 375,331 | 2.726 |
| ls | 0.14 MB | 54,537 | 2.609 |
| grep | 0.18 MB | 73,500 | 2.542 |

**Average**: 2.590 bytes/token

**Information-Theoretic Efficiency**:

- Binary entropy: ~6.5 bits/byte
- Theoretical optimum: 2.46 bytes/token (each token carries log2(65,536) = 16 bits, and 16 / 6.5 ≈ 2.46)
- Actual performance: 2.590 bytes/token
- **Efficiency: 95.0%** of the theoretical optimum (2.46 / 2.590 ≈ 0.95)

---

## Key Features

**Instruction-Aware Patterns**:

- REX prefixes: `0x48`, `0x4c`, `0x4d` (x86-64 64-bit operands)
- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
- ModR/M patterns: `0xc0`, `0x45`, `0x5d`

**Common Binary Patterns**:

- Padding: `0xcc 0xcc` (INT3 debug breakpoints), `0x90 0x90` (NOP sleds)
- Alignment: `0x00 0x00 0x00 0x00` (NULL padding)
- String terminators: `0x00` at word boundaries

**String-Rich Vocabulary**:

- 11.81% of the vocabulary contains function names, paths, and library references
- These string tokens give models direct semantic anchors that a byte-only vocabulary would have to assemble one byte at a time

---

## Comparison with Other Tokenizers

**vs. binary-tokenizer-001 Series** (this repository):

| Metric | 4K | 8K | 16K | 64K (this) | Improvement |
|--------|------|------|------|------------|-------------|
| Vocab size | 4,096 | 8,192 | 16,384 | 65,536 | 4-16x larger |
| Avg token length | 3.000 | 3.312 | 3.498 | 3.749 | +25% vs 4K |
| 3-byte tokens % | 20.6% | 21.7% | 20.5% | 16.5% | Different focus |
| 2-byte coverage | 3.0% | 5.6% | 10.9% | 43.6% | 14x better |
| Compression (ls) | 2.00 | 2.17 | 2.39 | 2.61 | +30% vs 4K |
| Training method | Standard | Standard | Standard | Chunked+dedup | Advanced |

**Key Advantages of the 64K Vocabulary**:

- **43.6% 2-byte coverage**: Captures nearly half of all possible byte pairs
- **Chunked training**: Deduplication-aware training improves merge quality
- **Better compression**: 2.609 bytes/token vs 2.0 bytes/token (4K)
- **Longer patterns**: 3.749-byte average vs 3.0 bytes (4K)
- **String-rich**: 11.81% of the vocabulary contains semantic strings

---

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@article{bommarito2025binarybpe,
  title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint},
  year={2025},
  note={Preprint coming soon}
}
```

**Also cite the original Glaurung tokenizer**:

```
Glaurung Binary Tokenizer 002
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (256 base + 65,273 merges + 7 special)
Training: October-November 2025
Training Method: Chunked BPE with deduplication (bbpe v0.3.2)
Dataset: 23GB binaries-small (40,574 chunks, 8,454 duplicates)
Performance: 2.590 bytes/token (95% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-002
```

**Author**: Michael J. Bommarito II ([michael.bommarito@gmail.com](mailto:michael.bommarito@gmail.com))

---

## License

**Apache License 2.0**

This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project.

---

**Generated**: November 13, 2025

**Original Model**: `mjbommar/glaurung-binary-tokenizer-002`

**Training Tool**: bbpe v0.3.2

**Analysis Script**: `analyze_tokenizer.py`