---
language:
- code
license: apache-2.0
tags:
- tokenizer
- binary-analysis
- binary-tokenization
- bpe
- byte-pair-encoding
- malware-analysis
- reverse-engineering
- security
- x86-64
- arm64
- elf
- pe
library_name: tokenizers
pipeline_tag: feature-extraction
---
# glaurung-binary-tokenizer-002
A cross-platform BPE tokenizer for binary executables and machine code, trained with chunked, deduplication-aware BPE on 23 GB of diverse Linux and Windows binaries.
**πŸ”— Model**: [`mjbommar/glaurung-binary-tokenizer-002`](https://huggingface.co/mjbommar/glaurung-binary-tokenizer-002)
**πŸ“Š Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
**πŸ“„ Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)
## Overview
- **Vocabulary Size**: 65,536 tokens (2^16)
- **Token Composition**: 256 base bytes + 65,273 learned merges + 7 special tokens
- **Average Token Length**: 3.749 bytes
- **3-byte Instructions**: 16.5% of vocabulary (10,800 tokens)
- **Compression Ratio**: ~2.6 bytes/token on typical binaries
---
## Training Configuration
**Training Corpus**:
- Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
- Size: ~23 GB (24.7 billion bytes)
- Processed Chunks: 40,574 total (37,083 unique + 8,454 duplicates reused)
- Platforms: Linux (Alpine, Debian, Ubuntu - ELF), Windows (8, 10, 11 - PE)
- Architectures: x86-64, x86-32, ARM64
**Training Parameters**:
- Vocabulary size: 65,536 (including 7 special tokens)
- Min frequency: 4
- Chunk size: 4,194,304 bytes (4 MB)
- Training method: Chunked BPE with deduplication and support-based merge combination (sketched after this list)
- Allowed lengths: DEFAULT (1-16 bytes)
- Training duration: ~8-9 hours
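The chunk-and-deduplicate preprocessing step can be sketched in a few lines. This is a hypothetical illustration, not the actual bbpe implementation; only the 4 MB chunk size comes from the configuration above:
```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # 4,194,304 bytes, matching the training configuration above

def iter_unique_chunks(paths):
    """Yield 4 MB chunks, skipping chunks whose content hash was already seen."""
    seen = set()
    for path in paths:
        data = Path(path).read_bytes()
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in seen:
                # Duplicate chunk: its merge statistics can be reused rather than recounted.
                continue
            seen.add(digest)
            yield chunk
```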
---
## Vocabulary Statistics
**Composition**:
- Base bytes (0-255): 256 tokens
- Learned merges: 65,273 tokens
- Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Total**: 65,536 tokens
**Quality Metrics**:
- All tokens reachable: βœ“ Yes
- Valid merges: 65,273 / 65,273
- Power-of-2 size: βœ“ Yes (2^16)
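These counts can be checked directly against the published tokenizer using the standard `tokenizers` API (only the expected values come from this card):
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")

# Total vocabulary size should be exactly 2^16.
assert tokenizer.get_vocab_size() == 65_536

# Each special token should resolve to a valid ID.
for special in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(special, tokenizer.token_to_id(special))
```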
---
## Token Length Distribution
| Length | Count | Percentage | Description |
|--------|-------|------------|-------------|
| 1 byte | 256 | 0.4% | Base bytes |
| 2 bytes | 28,561 | 43.6% | Byte pairs (most common patterns) |
| 3 bytes | 10,800 | 16.5% | Complete x86-64 instructions |
| 4 bytes | 14,376 | 21.9% | Instructions with operands |
| 5 bytes | 2,780 | 4.2% | Complex patterns |
| 6 bytes | 2,213 | 3.4% | Complex patterns |
| 7 bytes | 1,167 | 1.8% | Complex patterns |
| 8 bytes | 2,329 | 3.6% | Multi-byte sequences |
| 9+ bytes | 3,045 | 4.6% | Long patterns |
**Average Token Length**: 3.749 bytes
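The distribution above can be recomputed from the vocabulary itself, assuming the latin-1 byte convention used in the Usage section below:
```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")

lengths = Counter()
for token in tokenizer.get_vocab():
    if token.startswith("<|") and token.endswith("|>"):
        continue  # skip the 7 special tokens
    lengths[len(token.encode("latin-1"))] += 1  # byte length of each vocabulary entry

for length, count in sorted(lengths.items()):
    print(f"{length:2d} bytes: {count:6d} tokens")
```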
---
## Byte Content Analysis
**Content Categories**:
- Contains NULL byte (0x00): 17,418 tokens (26.6%)
- ASCII printable (0x20-0x7E): 9,478 tokens (14.5%)
- All ASCII (<0x80): 20,816 tokens (31.8%)
- High bytes (β‰₯0x80): 44,711 tokens (68.2%)
**Most Common Bytes in Tokens**:
- `0x00` (NULL): 34,482 occurrences - Padding and alignment
- `0xFF`: 6,545 occurrences - Sentinel values
- `0x40` (@): 4,538 occurrences - ASCII and instruction patterns
- `0x48` (REX.W): 3,419 occurrences - x86-64 REX prefix
- `0x8B` (MOV): 2,486 occurrences - x86-64 MOV opcode
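The byte-content figures can be reproduced by the same kind of vocabulary walk, for example counting NULL-containing tokens and the most frequent byte values (same latin-1 assumption as above):
```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")

null_tokens = 0
byte_counts = Counter()
for token in tokenizer.get_vocab():
    if token.startswith("<|") and token.endswith("|>"):
        continue  # ignore special tokens
    token_bytes = token.encode("latin-1")
    if 0x00 in token_bytes:
        null_tokens += 1
    byte_counts.update(token_bytes)

print(f"Tokens containing 0x00: {null_tokens}")
for value, count in byte_counts.most_common(5):
    print(f"0x{value:02x}: {count} occurrences")
```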
---
## Sequence Coverage
**N-byte Sequence Diversity**:
| Length | Vocabulary Tokens | Possible Sequences | Coverage |
|--------|----------------|-------------------|----------|
| 1-byte | 256 | 256 | 100.00% |
| 2-byte | 28,561 | 65,536 | 43.58% |
| 3-byte | 10,800 | 16,777,216 | 0.064% |
| 4-byte | 14,376 | 4,294,967,296 | 0.00034% |
**Notable**: the vocabulary covers 43.6% of all possible 2-byte sequences, so the most frequent byte pairs are each represented by a single token.
---
## Files
- `tokenizer-65536.json` - Trained tokenizer model (2.4 MB)
- `analysis_results.json` - Detailed analysis statistics
- `original_README.md` - Original README from HuggingFace
---
## Usage
**Load from HuggingFace Hub**:
```python
from tokenizers import Tokenizer
# Load directly from HuggingFace
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
```
**Use the local file with the bbpe CLI**:
```bash
# With bbpe CLI
bbpe encode --tokenizer tokenizer-65536.json /path/to/binary
bbpe info tokenizer-65536.json
```
**Complete Python Example**:
```python
from tokenizers import Tokenizer
# Load from HuggingFace or local file
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
# OR: tokenizer = Tokenizer.from_file("tokenizer-65536.json")
# Read binary file and decode as latin-1 (preserves all byte values 0-255)
with open("/usr/bin/ls", "rb") as f:
data = f.read()
data_str = data.decode("latin-1")
# Encode the binary data
encoding = tokenizer.encode(data_str)
print(f"File size: {len(data)} bytes")
print(f"Total tokens: {len(encoding.ids)}")
print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")
# First 10 tokens
for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
token_bytes = token.encode("latin-1")
print(f" Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")
# Decode tokens back to bytes
decoded_str = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_str.encode("latin-1")
assert decoded_bytes == data # Perfect reconstruction
```
**Example output for `/usr/bin/ls` (142,312 bytes)**:
```
File size: 142312 bytes
Total tokens: 54537
Compression: 2.609 bytes/token
First 10 tokens:
Token 0: ID= 127 hex=7f (1 bytes)
Token 1: ID= 2382 hex=454c (2 bytes)
Token 2: ID= 5923 hex=4602 (2 bytes)
Token 3: ID= 394 hex=0101 (2 bytes)
Token 4: ID= 268 hex=000000000000 (6 bytes)
Token 5: ID= 259 hex=000000 (3 bytes)
Token 6: ID= 295 hex=0300 (2 bytes)
Token 7: ID= 2124 hex=3e00 (2 bytes)
Token 8: ID= 271 hex=01000000 (4 bytes)
Token 9: ID=59106 hex=306d (2 bytes)
Decoded: 7f454c4602010100000000000000000003003e0001000000306d...
(ELF header: 7f 45 4c 46 = ELF magic bytes)
```
---
## Performance Characteristics
**Compression on Real-World Binaries**:
| Binary | Size | Tokens | bytes/token |
|--------|------|--------|-------------|
| bash | 1.38 MB | 602,719 | 2.399 |
| python3.12 | 7.65 MB | 2,997,303 | 2.676 |
| gcc-13 | 0.98 MB | 375,331 | 2.726 |
| ls | 0.14 MB | 54,537 | 2.609 |
| grep | 0.18 MB | 73,500 | 2.542 |
**Average**: 2.590 bytes/token
**Information-Theoretic Efficiency**:
- Binary entropy: ~6.5 bits/byte
- Theoretical optimal: 2.46 bytes/token (a 16-bit token ID divided by ~6.5 bits/byte; see the worked check below)
- Actual performance: 2.590 bytes/token
- **Efficiency: 95.0%** of theoretical optimum
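As a worked check of this arithmetic (the ~6.5 bits/byte entropy figure is the estimate quoted above):
```python
import math

bits_per_token = math.log2(65_536)   # 16 bits of information per token ID
entropy_bits_per_byte = 6.5          # estimated entropy of typical binaries (from this card)

theoretical_bytes_per_token = bits_per_token / entropy_bits_per_byte  # ~2.46
actual_bytes_per_token = 2.590                                        # measured average above

print(f"Theoretical optimum: {theoretical_bytes_per_token:.2f} bytes/token")
print(f"Efficiency: {theoretical_bytes_per_token / actual_bytes_per_token:.1%}")  # ~95%
```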
---
## Key Features
**Instruction-Aware Patterns**:
- REX prefixes: `0x48`, `0x4c`, `0x4d` (x86-64 64-bit operands)
- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
- ModR/M patterns: `0xc0`, `0x45`, `0x5d`
**Common Binary Patterns**:
- Padding: `0xcc 0xcc` (INT3 debug breakpoints), `0x90 0x90` (NOP sleds) - probed in the sketch after this list
- Alignment: `0x00 0x00 0x00 0x00` (NULL padding)
- String terminators: `0x00` at word boundaries
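One way to probe these pattern claims is to tokenize a synthetic padding run and inspect how the vocabulary segments it (exploratory; the exact token boundaries depend on the learned merges):
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")

# 16 bytes of NOP padding (0x90) followed by 16 bytes of INT3 padding (0xcc)
padding = bytes([0x90] * 16 + [0xCC] * 16)
encoding = tokenizer.encode(padding.decode("latin-1"))

for token_id, token in zip(encoding.ids, encoding.tokens):
    print(f"ID={token_id:5d} hex={token.encode('latin-1').hex()}")
```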
**String-Rich Vocabulary**:
- 11.81% of the vocabulary contains function names, paths, and library references
- Gives downstream models more semantic context than a standard text-trained BPE vocabulary
---
## Comparison with Other Tokenizers
**vs. the binary-tokenizer-001 series**:
| Metric | 4K | 8K | 16K | 64K (this) | Improvement |
|--------|----|----|-----|------------|-------------|
| Vocab size | 4,096 | 8,192 | 16,384 | 65,536 | 4-16x larger |
| Avg token length | 3.000 | 3.312 | 3.498 | 3.749 | +25% vs 4K |
| 3-byte tokens % | 20.6% | 21.7% | 20.5% | 16.5% | Different focus |
| 2-byte coverage | 3.0% | 5.6% | 10.9% | 43.6% | 14x better |
| Compression (ls) | 2.00 | 2.17 | 2.39 | 2.61 | +30% vs 4K |
| Training method | Standard | Standard | Standard | Chunked+dedup | Advanced |
**Key Advantages of 64K Vocabulary**:
- **43.6% 2-byte coverage**: Captures nearly half of all possible byte pairs
- **Chunked training**: Deduplication-aware training improves merge quality
- **Better compression**: 2.609 bytes/token vs 2.0 bytes/token (4K); a reproduction sketch follows this list
- **Longer patterns**: 3.749 byte average vs 3.0 bytes (4K)
- **String-rich**: 11.81% of the vocabulary contains semantic strings
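To reproduce this kind of comparison on your own binaries, encode the same file with both tokenizers and compare bytes/token. The 4K tokenizer path below is a placeholder, since the exact binary-tokenizer-001 artifact name is not specified here:
```python
from pathlib import Path
from tokenizers import Tokenizer

def bytes_per_token(tokenizer: Tokenizer, path: str) -> float:
    """Encode one file and return its compression in bytes per token."""
    data = Path(path).read_bytes()
    encoding = tokenizer.encode(data.decode("latin-1"))
    return len(data) / len(encoding.ids)

tok_64k = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
tok_4k = Tokenizer.from_file("path/to/binary-tokenizer-001-4k.json")  # placeholder path

for path in ["/usr/bin/ls", "/usr/bin/grep"]:
    print(path, f"64K: {bytes_per_token(tok_64k, path):.3f}", f"4K: {bytes_per_token(tok_4k, path):.3f}")
```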
---
## Citation
If you use this tokenizer in your research, please cite:
```bibtex
@article{bommarito2025binarybpe,
title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
author={Bommarito II, Michael J.},
journal={arXiv preprint},
year={2025},
note={Preprint coming soon}
}
```
**Also cite the original Glaurung tokenizer**:
```
Glaurung Binary Tokenizer 002
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (256 base + 65,273 merges + 7 special)
Training: October-November 2025
Training Method: Chunked BPE with deduplication (bbpe v0.3.2)
Dataset: 23GB binaries-small (40,574 chunks, 8,454 duplicates)
Performance: 2.590 bytes/token (95% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-002
```
**Author**: Michael J. Bommarito II ([michael.bommarito@gmail.com](mailto:michael.bommarito@gmail.com))
---
## License
**Apache License 2.0**
This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project.
---
**Generated**: November 13, 2025
**Original Model**: `mjbommar/glaurung-binary-tokenizer-002`
**Training Tool**: bbpe v0.3.2
**Analysis Script**: `analyze_tokenizer.py`