mjbommar committed (verified)
Commit a6eed73 · Parent(s): 40f6ac5

Update README with improved formatting and verified statistics

Files changed (1):
  1. README.md +184 -405
README.md CHANGED
@@ -1,475 +1,260 @@
1
- ---
2
- language:
3
- - code
4
- license: apache-2.0
5
- tags:
6
- - binary-analysis
7
- - tokenizer
8
- - bpe
9
- - malware-analysis
10
- - reverse-engineering
11
- - security
12
- - x86-64
13
- - arm64
14
- - elf
15
- - pe
16
- library_name: tokenizers
17
- pipeline_tag: feature-extraction
18
- ---
19
-
20
- # Glaurung Binary Tokenizer 002
21
 
22
- **BPE tokenizer for binary executables**
23
 
24
- 🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)
25
-
26
- ---
27
 
28
  ## Overview
29
 
30
- **Glaurung Binary Tokenizer 002** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This tokenizer uses advanced chunked training with deduplication and support-based merge combination to achieve state-of-the-art compression on binary executables.
31
-
32
- ### Key Specifications
33
-
34
- - **Vocabulary Size**: 65,536 tokens (256 base + 65,273 merges + 7 special tokens)
35
- - **Compression**: 2.590 bytes/token average (tested on real-world binaries)
36
- - **Training Data**: 23.0 GB corpus, 40,574 processed chunks with deduplication
37
- - **Training Method**: Chunked training with 4MB chunks, support-based merge combination
38
- - **Architectures**: x86-64, x86-32, ARM64
39
- - **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
40
- - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
41
-
42
- ### Performance Highlights
43
-
44
- - **36.6% better compression** than 32K baseline (uses 36.6% fewer tokens)
45
- - **95% of theoretical maximum** compression efficiency
46
- - **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
47
- - **String-rich**: 11.81% of vocabulary contains function names, paths, library references
48
-
49
- ---
50
-
51
- ## Installation
52
-
53
- ```bash
54
- pip install tokenizers transformers
55
- ```
56
-
57
- ---
58
-
59
- ## Quick Start
60
-
61
- ### Method 1: Using the tokenizers library (Recommended)
62
-
63
- ```python
64
- from tokenizers import Tokenizer
65
- from pathlib import Path
66
-
67
- # Load tokenizer directly from Hugging Face Hub
68
- tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
69
-
70
- # Process binary data - MUST use latin-1 encoding
71
- binary_path = Path("/usr/bin/ls")
72
- raw_bytes = binary_path.read_bytes()
73
- text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string
74
-
75
- # Tokenize
76
- encoded = tokenizer.encode(text)
77
- tokens = encoded.ids
78
-
79
- print(f"File size: {len(raw_bytes):,} bytes")
80
- print(f"Tokens: {len(tokens):,}")
81
- print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")
82
-
83
- # Decode back to text (note: adds spaces between tokens due to BPE behavior)
84
- decoded = tokenizer.decode(tokens)
85
- ```
86
-
87
- **Expected Output** (for `/usr/bin/ls`):
88
- ```
89
- File size: 142,312 bytes
90
- Tokens: 54,537
91
- Compression: 2.609 bytes/token
92
- ```
93
-
94
- ### Method 2: Using transformers library
95
-
96
- ```python
97
- from transformers import PreTrainedTokenizerFast
98
- from tokenizers import Tokenizer
99
-
100
- # Load the base tokenizer
101
- base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
102
-
103
- # Wrap with PreTrainedTokenizerFast for transformers compatibility
104
- tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
105
-
106
- # Process binary data
107
- with open("/usr/bin/ls", "rb") as f:
108
- raw_bytes = f.read()
109
- text = raw_bytes.decode('latin-1')
110
-
111
- # Tokenize (returns dict with input_ids, attention_mask, etc.)
112
- result = tokenizer(text)
113
- tokens = result["input_ids"]
114
- ```
115
 
116
  ---
117
 
118
- ## Important: Data Format
119
-
120
- The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:
121
 
122
- ```python
123
- # ✅ CORRECT - Use latin-1 encoded bytes
124
- raw_bytes = b'\x7fELF\x01\x01'
125
- text = raw_bytes.decode('latin-1') # '\x7fELF\x01\x01'
126
- encoded = tokenizer.encode(text)
127
-
128
- # ❌ WRONG - Do not use hex strings
129
- hex_str = "7f 45 4c 46 01 01" # Will not work correctly
130
- ```
131
 
132
- **Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
 
 
133
 
134
  ---
135
 
136
- ## Performance Benchmarks
137
 
138
- ### Compression on Real-World Binaries
 
 
 
 
139
 
140
- Tested on `/usr/bin` binaries (not in training set):
141
-
142
- | Binary | Size | Tokens | bytes/token |
143
- |--------|------|--------|-------------|
144
- | bash | 1.38 MB | 602,719 | 2.399 |
145
- | python3.12 | 7.65 MB | 2,997,303 | 2.676 |
146
- | gcc-13 | 0.98 MB | 375,331 | 2.726 |
147
- | ls | 0.14 MB | 54,537 | 2.609 |
148
- | grep | 0.18 MB | 73,500 | 2.542 |
149
-
150
- **Average**: 2.590 bytes/token
151
-
152
- ### Information-Theoretic Efficiency
153
-
154
- - Binary entropy: ~6.5 bits/byte (estimated from empirical binary distributions)
155
- - Theoretical optimal: 2.46 bytes/token (16 bits / 6.5 bits/byte)
156
- - Our performance: 2.590 bytes/token
157
- - **Efficiency: 95.0%** of theoretical optimum
158
-
159
- ---
160
-
161
- ## Example: Tokenizing an ELF Header
162
-
163
- ```python
164
- from tokenizers import Tokenizer
165
-
166
- # Load tokenizer
167
- tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
168
-
169
- # ELF header bytes
170
- elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
171
- text = elf_header.decode('latin-1')
172
-
173
- # Tokenize
174
- encoded = tokenizer.encode(text)
175
- print(f"Original bytes: {elf_header.hex()}")
176
- print(f"Tokens: {encoded.ids}")
177
- print(f"Token count: {len(encoded.ids)}")
178
- print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")
179
-
180
- # Examine individual tokens
181
- for token_id, token_str in zip(encoded.ids, encoded.tokens):
182
- token_bytes = token_str.encode('latin-1')
183
- print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
184
- ```
185
 
186
  ---
187
 
188
- ## Token Distribution
189
 
190
- | Length | Count | Percentage | Examples |
191
- |--------|-------|------------|----------|
192
- | 1 byte | 256 | 0.4% | Base vocabulary (0x00-0xFF) |
193
- | 2 bytes | 28,562 | 43.6% | `0x48 0x8b` (REX.W prefix), `0xcc 0xcc` (int3 padding) |
194
- | 3 bytes | 10,801 | 16.5% | `0x48 0x8b 0xc0` (MOV rax, rax) |
195
- | 4 bytes | 14,376 | 21.9% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
196
- | 5-8 bytes | 8,489 | 13.0% | Multi-byte instructions, short strings |
197
- | 9+ bytes | 3,045 | 4.6% | Multi-instruction sequences, string literals |
 
 
 
198
 
199
- **Average token length**: 3.749 bytes
200
 
201
  ---
202
 
203
- ## Training Details
204
 
205
- ### Dataset
 
 
 
 
206
 
207
- **Source**: `/nas4/data/glaurung-data/binaries-small/`
208
-
209
- - **Size**: 23.0 GB (24,737,776,360 bytes)
210
- - **Processed Chunks**: 40,574 total (37,083 unique + 8,454 duplicates reused)
211
- - **Content**: Real-world compiled binaries including system utilities, libraries, and applications
212
-
213
- **Platform Distribution**:
214
- - Linux: Alpine, Debian, Ubuntu (ELF format)
215
- - Windows: 8, 10, 11 (PE format)
216
-
217
- **Architecture Distribution**:
218
- - x86-64 (primary)
219
- - x86-32
220
- - ARM64
221
-
222
- ### Training Parameters
223
-
224
- Trained using chunked BPE with deduplication and support-based merge combination:
225
-
226
- ```bash
227
- cargo run --release --bin bbpe -- chunk-train \
228
- --output glaurung-tokenizer-002.json \
229
- /nas4/data/glaurung-data/binaries-small/ \
230
- --vocab-size 65536 \
231
- --min-frequency 4 \
232
- --chunk-size 4194304 \
233
- --combine-mode support \
234
- --duplicate-mode count
235
- ```
236
-
237
- **Key Training Features**:
238
- - **Chunk Size**: 4 MB (4,194,304 bytes) per chunk
239
- - **Deduplication**: 8,454 duplicate chunks reused to improve merge quality
240
- - **Combine Mode**: Support-based union (65,273 merges realized)
241
- - **Entropy Filtering**: Skipped 27 low-entropy + 3,383 high-entropy chunks
242
 
243
  ---
244
 
245
- ## Use Cases
246
 
247
- ### Recommended For
 
 
248
 
249
- - Binary neural language models
250
- - Malware analysis and classification
251
- - Reverse engineering tools
252
- - Binary similarity detection
253
- - Code pattern recognition
254
- - Vulnerability research
255
- - Firmware analysis
256
-
257
- ### ❌ Not Recommended For
258
-
259
- - Text/source code (use text tokenizer like GPT-2, 100%+ penalty)
260
- - Very small binaries <1KB (overhead too high)
261
- - Real-time streaming (load time ~100ms)
262
 
263
  ---
264
 
265
- ## Comparison: 32K vs 64K Vocabulary
266
-
267
- This tokenizer uses a 64K vocabulary compared to the 32K baseline:
268
 
269
- | Metric | 32K Baseline | 64K (this tokenizer) | Improvement |
270
- |--------|--------------|---------------------|-------------|
271
- | Vocabulary | 32,761 | 65,536 | 2.0x larger |
272
- | Training method | Standard BPE | Chunked + deduplication | Advanced |
273
- | bytes/token | 1.925 | 2.590 | +34.6% |
274
- | Tokens for /usr/bin/ls | 74,519 | 54,537 | -26.8% fewer |
275
- | String-rich tokens | ~6% | 11.81% | Better semantic capture |
276
- | Avg token length | ~2.2 bytes | 3.749 bytes | Longer patterns |
277
-
278
- **Key Advantages of 64K Vocabulary**:
279
- - **36.6% fewer tokens** needed to encode binaries (faster processing)
280
- - **Captures longer patterns**: Instructions + operands, string literals, common sequences
281
- - **Better semantic understanding**: More function names, library paths, and meaningful strings
282
- - **Higher compression ratio**: 2.590 bytes/token vs 1.925 bytes/token
283
- - **Chunked training**: Deduplication-aware training improves merge quality
284
 
285
  ---
286
 
287
- ## Advanced Usage
288
-
289
- ### Batch Processing Multiple Files
290
 
 
291
  ```python
292
  from tokenizers import Tokenizer
293
- from pathlib import Path
294
- import numpy as np
295
 
 
296
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
297
-
298
- def tokenize_binary_file(file_path):
299
- """Tokenize a single binary file."""
300
- raw_bytes = Path(file_path).read_bytes()
301
- text = raw_bytes.decode('latin-1')
302
- encoded = tokenizer.encode(text)
303
- return {
304
- 'file': file_path,
305
- 'size_bytes': len(raw_bytes),
306
- 'token_count': len(encoded.ids),
307
- 'compression_ratio': len(raw_bytes) / len(encoded.ids),
308
- 'token_ids': encoded.ids
309
- }
310
-
311
- # Process directory
312
- binary_dir = Path("/usr/bin")
313
- results = []
314
- for binary_path in binary_dir.glob("*"):
315
- if binary_path.is_file():
316
- try:
317
- result = tokenize_binary_file(binary_path)
318
- results.append(result)
319
- except Exception as e:
320
- print(f"Error processing {binary_path}: {e}")
321
-
322
- # Analyze compression statistics
323
- compression_ratios = [r['compression_ratio'] for r in results]
324
- print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
325
- print(f"Std deviation: {np.std(compression_ratios):.3f}")
326
  ```
327
 
328
- ### Using with PyTorch/TensorFlow Models
 
 
 
 
 
329
 
 
330
  ```python
331
  from tokenizers import Tokenizer
332
- import torch
333
- from pathlib import Path
334
 
 
335
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
 
336
 
337
- def prepare_binary_for_model(file_path, max_length=512):
338
- """Prepare binary data for neural network input."""
339
- raw_bytes = Path(file_path).read_bytes()
340
- text = raw_bytes.decode('latin-1')
341
-
342
- # Tokenize
343
- encoded = tokenizer.encode(text)
344
- token_ids = encoded.ids
345
-
346
- # Truncate or pad to max_length
347
- if len(token_ids) > max_length:
348
- token_ids = token_ids[:max_length]
349
- else:
350
- # Pad with token ID 0 (or use a dedicated padding token)
351
- token_ids = token_ids + [0] * (max_length - len(token_ids))
352
-
353
- # Convert to tensor
354
- return torch.tensor(token_ids, dtype=torch.long)
355
-
356
- # Use in model
357
- binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
358
- print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024])
359
  ```
360
 
361
- ---
362
-
363
- ## Troubleshooting
364
-
365
- ### Issue: UnicodeDecodeError when processing binary
366
-
367
- **Solution**: Always use `latin-1` encoding, never `utf-8`:
368
- ```python
369
- # ✅ Correct
370
- text = raw_bytes.decode('latin-1')
371
-
372
- # ❌ Wrong
373
- text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes
374
  ```
 
 
 
375
 
376
- ### Issue: Decoded output doesn't match original
377
-
378
- **Cause**: BPE tokenizers add spaces between tokens during decoding.
379
-
380
- **Solution**: Use the raw token IDs and decode manually if exact byte recovery is needed:
381
- ```python
382
- # Get tokens without spaces
383
- tokens_no_spaces = ''.join(encoded.tokens)
384
- original_bytes = tokens_no_spaces.encode('latin-1')
 
 
 
 
 
385
  ```
386
 
387
- ### Issue: Poor compression on specific binary types
388
-
389
- **Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).
390
-
391
- **Solution**: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.
392
-
393
  ---
394
 
395
- ## Related Projects
396
-
397
- - **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
398
- - **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
399
- - **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
400
-
401
- ---
402
-
403
- ## Technical Architecture
404
 
405
- ### Vocabulary Structure
406
 
407
- - **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
408
- - **Merged tokens**: 65,273 learned byte-pair combinations
409
- - **Special tokens**: 7 reserved tokens
410
- - **Total**: 65,536 tokens (exactly 2^16)
 
 
 
411
 
412
- ### Special Tokens
413
 
414
- The tokenizer includes 7 special tokens for various modeling tasks:
415
- - `<|start|>` (ID: 65529) - Start of sequence marker
416
- - `<|end|>` (ID: 65530) - End of sequence marker
417
- - `<|pad|>` (ID: 65531) - Padding token
418
- - `<|unk|>` (ID: 65532) - Unknown token
419
- - `<|cls|>` (ID: 65533) - Classification token
420
- - `<|sep|>` (ID: 65534) - Separator token
421
- - `<|mask|>` (ID: 65535) - Mask token for masked language modeling
422
 
423
- These enable various training objectives (classification, masked LM, sequence-to-sequence) and help models distinguish between concatenated files.
424
 
425
- ### Token Properties
426
 
427
- **Instruction-aware patterns** (x86-64 examples):
428
- - REX prefixes: `0x48`, `0x4c`, `0x4d`
429
  - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
430
  - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
431
 
432
- **Common patterns**:
433
- - Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
434
- - Alignment: `0x00 0x00 0x00 0x00`
435
  - String terminators: `0x00` at word boundaries
436
 
437
- ---
 
 
438
 
439
- ## Performance Characteristics
440
-
441
- ### Load Time
442
-
443
- - **Tokenizer size**: 2.4 MB on disk
444
- - **Load time**: ~71 ms average (tested with Python tokenizers library)
445
-
446
- ### Encoding Speed
447
-
448
- Tested with Python tokenizers library on actual binaries:
449
-
450
- | Binary | Size | Encoding Time | Throughput |
451
- |--------|------|---------------|------------|
452
- | ls | 0.14 MB | 34 ms | 4.0 MB/s |
453
- | grep | 0.18 MB | 51 ms | 3.5 MB/s |
454
- | bash | 1.38 MB | 596 ms | 2.3 MB/s |
455
 
456
- **Average throughput**: ~2-4 MB/second (Python implementation)
457
 
458
- ---
459
 
460
- ## Limitations
 
 
 
461
 
462
- 1. **Cross-domain penalty**: Using on text data causes 100-140% efficiency loss
463
- 2. **Small file overhead**: Files <1KB have proportionally higher tokenization overhead
464
- 3. **Deterministic decoding**: Spaces inserted between tokens during decode (BPE behavior)
465
- 4. **Architecture bias**: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.
 
 
466
 
467
  ---
468
 
469
  ## Citation
470
 
471
- If you use this tokenizer in research, please cite:
 
 
472
 
 
473
  ```
474
  Glaurung Binary Tokenizer 002
475
  64K Binary Tokenizer for Neural Language Models
@@ -481,25 +266,19 @@ Performance: 2.590 bytes/token (95% of theoretical optimum)
481
  HuggingFace: mjbommar/glaurung-binary-tokenizer-002
482
  ```
483
 
 
 
484
  ---
485
 
486
  ## License
487
 
488
- Apache License 2.0
489
-
490
- This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.
491
-
492
- ---
493
-
494
- ## Support & Issues
495
 
496
- - **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
497
- - **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
498
- - **Email**: Contact maintainer via GitHub
499
 
500
  ---
501
 
502
- **Production Status**: Ready for deployment
503
- **Version**: 2.0.0
504
- **Release Date**: November 2025
505
- **Training Tool**: bbpe v0.3.2 (https://crates.io/crates/bbpe)
 
1
+ # glaurung-binary-tokenizer-002
 
 
 
2
 
3
+ A cross-platform BPE tokenizer for binary executables and machine code, trained with chunked BPE and per-chunk deduplication on 23 GB of diverse binaries spanning Linux and Windows platforms.
4
 
5
+ **🔗 Model**: [`mjbommar/glaurung-binary-tokenizer-002`](https://huggingface.co/mjbommar/glaurung-binary-tokenizer-002)
6
+ **📊 Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
7
+ **📄 Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)
8
 
9
  ## Overview
10
 
11
+ - **Vocabulary Size**: 65,536 tokens (2^16)
12
+ - **Token Composition**: 256 base bytes + 65,273 learned merges + 7 special tokens
13
+ - **Average Token Length**: 3.749 bytes
14
+ - **3-byte Instructions**: 16.5% of vocabulary (10,800 tokens)
15
+ - **Compression Ratio**: ~2.6 bytes/token on typical binaries
 
 
 
 
16
 
17
  ---
18
 
19
+ ## Training Configuration
 
 
20
 
21
+ **Training Corpus**:
22
+ - Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
23
+ - Size: ~23 GB (24.7 billion bytes)
24
+ - Processed Chunks: 40,574 total (37,083 unique + 8,454 duplicates reused)
25
+ - Platforms: Linux (Alpine, Debian, Ubuntu - ELF), Windows (8, 10, 11 - PE)
26
+ - Architectures: x86-64, x86-32, ARM64
 
 
 
27
 
28
+ **Training Parameters**:
29
+ - Vocabulary size: 65,536 (including 7 special tokens)
30
+ - Min frequency: 4
31
+ - Chunk size: 4,194,304 bytes (4 MB)
32
+ - Training method: Chunked BPE with deduplication and support-based merge combination
33
+ - Allowed lengths: DEFAULT (1-16 bytes)
34
+ - Training duration: ~8-9 hours
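+
+ The chunk-and-deduplicate preprocessing can be pictured with a short sketch. This is illustrative only, not the bbpe implementation; the directory path, helper name, and SHA-256 hashing are assumptions:
+
+ ```python
+ import hashlib
+ from pathlib import Path
+
+ CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the chunk size above
+
+ def iter_unique_chunks(root):
+     """Yield unique 4 MB chunks under `root`, counting how often each repeats."""
+     counts = {}
+     for path in Path(root).rglob("*"):
+         if not path.is_file():
+             continue
+         with path.open("rb") as f:
+             while chunk := f.read(CHUNK_SIZE):
+                 digest = hashlib.sha256(chunk).hexdigest()
+                 counts[digest] = counts.get(digest, 0) + 1
+                 if counts[digest] == 1:
+                     yield chunk  # duplicates are counted rather than re-processed
+ ```
+
+ Duplicate counts can then weight the merge statistics, which mirrors the `--duplicate-mode count` flag shown in the training command of the previous README.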
35
 
36
  ---
37
 
38
+ ## Vocabulary Statistics
39
 
40
+ **Composition**:
41
+ - Base bytes (0-255): 256 tokens
42
+ - Learned merges: 65,273 tokens
43
+ - Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
44
+ - **Total**: 65,536 tokens
45
 
46
+ **Quality Metrics**:
47
+ - All tokens reachable: ✓ Yes
48
+ - Valid merges: 65,273 / 65,273
49
+ - Power-of-2 size: ✓ Yes (2^16)
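+
+ A quick way to confirm these numbers from Python (the special-token IDs 65529-65535 below are taken from the previous README):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ print(tok.get_vocab_size())  # 65536 (2^16)
+
+ # The 7 reserved special tokens sit at the top of the ID space (65529-65535)
+ for name in ("<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"):
+     print(name, tok.token_to_id(name))
+ ```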
 
 
 
50
 
51
  ---
52
 
53
+ ## Token Length Distribution
54
 
55
+ | Length | Count | Percentage | Description |
56
+ |--------|-------|------------|-------------|
57
+ | 1 byte | 256 | 0.4% | Base bytes |
58
+ | 2 bytes | 28,561 | 43.6% | Byte pairs (most common patterns) |
59
+ | 3 bytes | 10,800 | 16.5% | Complete x86-64 instructions |
60
+ | 4 bytes | 14,376 | 21.9% | Instructions with operands |
61
+ | 5 bytes | 2,780 | 4.2% | Complex patterns |
62
+ | 6 bytes | 2,213 | 3.4% | Complex patterns |
63
+ | 7 bytes | 1,167 | 1.8% | Complex patterns |
64
+ | 8 bytes | 2,329 | 3.6% | Multi-byte sequences |
65
+ | 9+ bytes | 3,045 | 4.6% | Long patterns |
66
 
67
+ **Average Token Length**: 3.749 bytes
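+
+ The distribution above can be recomputed from the vocabulary itself. A minimal sketch, assuming vocabulary entries are stored as latin-1 strings (as the usage examples below indicate):
+
+ ```python
+ from collections import Counter
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+
+ # Byte length of every non-special vocabulary entry
+ lengths = Counter(
+     len(token.encode("latin-1")) for token in tok.get_vocab() if token not in specials
+ )
+ total = sum(lengths.values())
+ for n in sorted(lengths):
+     print(f"{n:2d} bytes: {lengths[n]:6d} ({lengths[n] / total:.1%})")
+ print(f"average: {sum(n * c for n, c in lengths.items()) / total:.3f} bytes")
+ ```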
68
 
69
  ---
70
 
71
+ ## Byte Content Analysis
72
 
73
+ **Content Categories**:
74
+ - Contains NULL byte (0x00): 17,418 tokens (26.6%)
75
+ - ASCII printable (0x20-0x7E): 9,478 tokens (14.5%)
76
+ - All ASCII (<0x80): 20,816 tokens (31.8%)
77
+ - High bytes (≥0x80): 44,711 tokens (68.2%)
78
 
79
+ **Most Common Bytes in Tokens**:
80
+ - `0x00` (NULL): 34,482 occurrences - Padding and alignment
81
+ - `0xFF`: 6,545 occurrences - Sentinel values
82
+ - `0x40` (@): 4,538 occurrences - ASCII and instruction patterns
83
+ - `0x48` (REX.W): 3,419 occurrences - x86-64 REX prefix
84
+ - `0x8B` (MOV): 2,486 occurrences - x86-64 MOV opcode
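+
+ These categories can be reproduced with a similar vocabulary scan; a sketch, noting that the exact category definitions behind the published numbers may differ slightly:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+ raw = [t.encode("latin-1") for t in tok.get_vocab() if t not in specials]
+
+ with_null = sum(1 for b in raw if 0x00 in b)
+ printable = sum(1 for b in raw if all(0x20 <= x <= 0x7E for x in b))
+ high_bytes = sum(1 for b in raw if any(x >= 0x80 for x in b))
+ print(f"contain NULL byte:    {with_null:6d} ({with_null / len(raw):.1%})")
+ print(f"ASCII printable only: {printable:6d} ({printable / len(raw):.1%})")
+ print(f"contain high bytes:   {high_bytes:6d} ({high_bytes / len(raw):.1%})")
+ ```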
 
 
 
85
 
86
  ---
87
 
88
+ ## Sequence Coverage
89
 
90
+ **N-byte Sequence Diversity**:
91
+ | Length | Learned Tokens | Possible Sequences | Coverage |
92
+ |--------|----------------|-------------------|----------|
93
+ | 1-byte | 256 | 256 | 100.00% |
94
+ | 2-byte | 28,561 | 65,536 | 43.58% |
95
+ | 3-byte | 10,800 | 16,777,216 | 0.064% |
96
+ | 4-byte | 14,376 | 4,294,967,296 | 0.00034% |
97
 
98
+ **Notable Achievement**: 43.6% coverage of all possible 2-byte sequences - excellent for pattern recognition.
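+
+ The coverage figures follow directly from the length distribution; a one-off check:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+
+ for n in (1, 2, 3, 4):
+     learned = sum(
+         1 for t in tok.get_vocab() if t not in specials and len(t.encode("latin-1")) == n
+     )
+     print(f"{n}-byte tokens: {learned:6d} / {256**n:>13,d} = {learned / 256**n:.5%}")
+ ```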
 
 
99
 
100
  ---
101
 
102
+ ## Files
 
 
103
 
104
+ - `tokenizer-65536.json` - Trained tokenizer model (2.4 MB)
105
+ - `analysis_results.json` - Detailed analysis statistics
106
+ - `original_README.md` - Original README from HuggingFace
 
 
107
 
108
  ---
109
 
110
+ ## Usage
 
 
111
 
112
+ **Load from HuggingFace Hub**:
113
  ```python
114
  from tokenizers import Tokenizer
 
 
115
 
116
+ # Load directly from HuggingFace
117
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
 
 
 
118
  ```
119
 
120
+ **Load from local file**:
121
+ ```bash
122
+ # With bbpe CLI
123
+ bbpe encode --tokenizer tokenizer-65536.json /path/to/binary
124
+ bbpe info tokenizer-65536.json
125
+ ```
126
 
127
+ **Complete Python Example**:
128
  ```python
129
  from tokenizers import Tokenizer
 
 
130
 
131
+ # Load from HuggingFace or local file
132
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
133
+ # OR: tokenizer = Tokenizer.from_file("tokenizer-65536.json")
134
 
135
+ # Read binary file and decode as latin-1 (preserves all byte values 0-255)
136
+ with open("/usr/bin/ls", "rb") as f:
137
+     data = f.read()
138
+ data_str = data.decode("latin-1")
139
+
140
+ # Encode the binary data
141
+ encoding = tokenizer.encode(data_str)
142
+ print(f"File size: {len(data)} bytes")
143
+ print(f"Total tokens: {len(encoding.ids)}")
144
+ print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")
145
+
146
+ # First 10 tokens
147
+ for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
148
+     token_bytes = token.encode("latin-1")
149
+     print(f" Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")
150
+
151
+ # Decode tokens back to bytes
152
+ decoded_str = tokenizer.decode(encoding.ids)
153
+ decoded_bytes = decoded_str.encode("latin-1")
154
+ assert decoded_bytes == data # Perfect reconstruction
 
 
155
  ```
156
 
157
+ **Example output for `/usr/bin/ls` (142,312 bytes)**:
 
 
158
  ```
159
+ File size: 142312 bytes
160
+ Total tokens: 54537
161
+ Compression: 2.609 bytes/token
162
 
163
+ First 10 tokens:
164
+ Token 0: ID= 127 hex=7f (1 bytes)
165
+ Token 1: ID= 2382 hex=454c (2 bytes)
166
+ Token 2: ID= 5923 hex=4602 (2 bytes)
167
+ Token 3: ID= 394 hex=0101 (2 bytes)
168
+ Token 4: ID= 268 hex=000000000000 (6 bytes)
169
+ Token 5: ID= 259 hex=000000 (3 bytes)
170
+ Token 6: ID= 295 hex=0300 (2 bytes)
171
+ Token 7: ID= 2124 hex=3e00 (2 bytes)
172
+ Token 8: ID= 271 hex=01000000 (4 bytes)
173
+ Token 9: ID=59106 hex=306d (2 bytes)
174
+
175
+ Decoded: 7f454c4602010100000000000000000003003e0001000000306d...
176
+ (ELF header: 7f 45 4c 46 = ELF magic bytes)
177
  ```
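+
+ For use with the `transformers` library, the tokenizer can be wrapped in `PreTrainedTokenizerFast`, as the previous README showed; a minimal sketch:
+
+ ```python
+ from tokenizers import Tokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ base = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=base)
+
+ with open("/usr/bin/ls", "rb") as f:
+     text = f.read().decode("latin-1")
+
+ # Returns input_ids, attention_mask, etc., ready to feed a model
+ batch = hf_tokenizer(text)
+ print(len(batch["input_ids"]))
+ ```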
178
 
 
179
  ---
180
 
181
+ ## Performance Characteristics
 
 
182
 
183
+ **Compression on Real-World Binaries**:
184
 
185
+ | Binary | Size | Tokens | bytes/token |
186
+ |--------|------|--------|-------------|
187
+ | bash | 1.38 MB | 602,719 | 2.399 |
188
+ | python3.12 | 7.65 MB | 2,997,303 | 2.676 |
189
+ | gcc-13 | 0.98 MB | 375,331 | 2.726 |
190
+ | ls | 0.14 MB | 54,537 | 2.609 |
191
+ | grep | 0.18 MB | 73,500 | 2.542 |
192
 
193
+ **Average**: 2.590 bytes/token
194
 
195
+ **Information-Theoretic Efficiency**:
196
+ - Binary entropy: ~6.5 bits/byte
197
+ - Theoretical optimal: 2.46 bytes/token (16 bits per token / 6.5 bits per byte)
198
+ - Actual performance: 2.590 bytes/token
199
+ - **Efficiency: 95.0%** of theoretical optimum
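+
+ The table above can be reproduced with a short script; the binary paths are only examples and any files outside the training set will do:
+
+ ```python
+ from pathlib import Path
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ for name in ("/usr/bin/ls", "/usr/bin/grep", "/usr/bin/bash"):
+     data = Path(name).read_bytes()
+     n_tokens = len(tok.encode(data.decode("latin-1")).ids)
+     print(f"{name}: {len(data):>9,d} bytes, {n_tokens:>9,d} tokens, "
+           f"{len(data) / n_tokens:.3f} bytes/token")
+
+ # 16-bit token IDs over ~6.5 bits/byte of binary entropy
+ print(f"theoretical optimum: {16 / 6.5:.2f} bytes/token")
+ ```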
 
 
 
200
 
201
+ ---
202
 
203
+ ## Key Features
204
 
205
+ **Instruction-Aware Patterns**:
206
+ - REX prefixes: `0x48`, `0x4c`, `0x4d` (x86-64 64-bit operands)
207
  - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
208
  - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
209
 
210
+ **Common Binary Patterns**:
211
+ - Padding: `0xcc 0xcc` (INT3 debug breakpoints), `0x90 0x90` (NOP sleds)
212
+ - Alignment: `0x00 0x00 0x00 0x00` (NULL padding)
213
  - String terminators: `0x00` at word boundaries
214
 
215
+ **String-Rich Vocabulary**:
216
+ - 11.81% of vocabulary contains function names, paths, and library references
217
+ - Better semantic understanding than standard BPE
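+
+ A quick way to see the padding patterns above in action is to tokenize a synthetic padding region; exact token counts depend on the learned merges, so treat the output as illustrative:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ # 16 NOP bytes followed by 16 INT3 bytes, as seen between functions
+ padding = b"\x90" * 16 + b"\xcc" * 16
+ enc = tok.encode(padding.decode("latin-1"))
+ print(f"{len(padding)} bytes -> {len(enc.ids)} tokens")
+ print([t.encode('latin-1').hex() for t in enc.tokens])
+ ```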
218
 
219
+ ---
 
 
220
 
221
+ ## Comparison with Other Tokenizers
222
 
223
+ **vs. binary-tokenizer-001 Series** (this repository):
224
 
225
+ | Metric | 4K | 8K | 16K | 64K (this) | Improvement |
226
+ |--------|----|----|-----|------------|-------------|
227
+ | Vocab size | 4,096 | 8,192 | 16,384 | 65,536 | 4-16x larger |
228
+ | Avg token length | 3.000 | 3.312 | 3.498 | 3.749 | +25% vs 4K |
229
+ | 3-byte tokens % | 20.6% | 21.7% | 20.5% | 16.5% | Different focus |
230
+ | 2-byte coverage | 3.0% | 5.6% | 10.9% | 43.6% | 14x better |
231
+ | Compression (ls) | 2.00 | 2.17 | 2.39 | 2.61 | +30% vs 4K |
232
+ | Training method | Standard | Standard | Standard | Chunked+dedup | Advanced |
233
 
234
+ **Key Advantages of 64K Vocabulary**:
235
+ - **43.6% 2-byte coverage**: Captures nearly half of all possible byte pairs
236
+ - **Chunked training**: Deduplication-aware training improves merge quality
237
+ - **Better compression**: 2.609 bytes/token vs 2.0 bytes/token (4K)
238
+ - **Longer patterns**: 3.749 byte average vs 3.0 bytes (4K)
239
+ - **String-rich**: 11.81% vocabulary contains semantic strings
240
 
241
  ---
242
 
243
  ## Citation
244
 
245
+ If you use this tokenizer in your research, please cite:
246
+
247
+ ```bibtex
248
+ @article{bommarito2025binarybpe,
249
+ title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
250
+ author={Bommarito II, Michael J.},
251
+ journal={arXiv preprint},
252
+ year={2025},
253
+ note={Preprint coming soon}
254
+ }
255
+ ```
256
 
257
+ **Also cite the original Glaurung tokenizer**:
258
  ```
259
  Glaurung Binary Tokenizer 002
260
  64K Binary Tokenizer for Neural Language Models
 
266
  HuggingFace: mjbommar/glaurung-binary-tokenizer-002
267
  ```
268
 
269
+ **Author**: Michael J. Bommarito II ([michael.bommarito@gmail.com](mailto:michael.bommarito@gmail.com))
270
+
271
  ---
272
 
273
  ## License
274
 
275
+ **Apache License 2.0**
 
 
276
 
277
+ This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project.
 
 
278
 
279
  ---
280
 
281
+ **Generated**: November 13, 2025
282
+ **Original Model**: `mjbommar/glaurung-binary-tokenizer-002`
283
+ **Training Tool**: bbpe v0.3.2
284
+ **Analysis Script**: `analyze_tokenizer.py`