mjbommar committed (verified)
Commit a6eed73 · Parent(s): 40f6ac5

Update README with improved formatting and verified statistics

Files changed (1):
  1. README.md +184 -405
README.md CHANGED
@@ -1,475 +1,260 @@
1
- ---
2
- language:
3
- - code
4
- license: apache-2.0
5
- tags:
6
- - binary-analysis
7
- - tokenizer
8
- - bpe
9
- - malware-analysis
10
- - reverse-engineering
11
- - security
12
- - x86-64
13
- - arm64
14
- - elf
15
- - pe
16
- library_name: tokenizers
17
- pipeline_tag: feature-extraction
18
- ---
19
-
20
- # Glaurung Binary Tokenizer 002
21
 
22
- **BPE tokenizer for binary executables**
23
 
24
- 🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)
25
-
26
- ---
27
 
28
  ## Overview
29
 
30
- **Glaurung Binary Tokenizer 002** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This tokenizer uses advanced chunked training with deduplication and support-based merge combination to achieve state-of-the-art compression on binary executables.
31
-
32
- ### Key Specifications
33
-
34
- - **Vocabulary Size**: 65,536 tokens (256 base + 65,273 merges + 7 special tokens)
35
- - **Compression**: 2.590 bytes/token average (tested on real-world binaries)
36
- - **Training Data**: 23.0 GB corpus, 40,574 processed chunks with deduplication
37
- - **Training Method**: Chunked training with 4MB chunks, support-based merge combination
38
- - **Architectures**: x86-64, x86-32, ARM64
39
- - **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
40
- - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
41
-
42
- ### Performance Highlights
43
-
44
- - **36.6% better compression** than 32K baseline (uses 36.6% fewer tokens)
45
- - **95% of theoretical maximum** compression efficiency
46
- - **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
47
- - **String-rich**: 11.81% of vocabulary contains function names, paths, library references
48
-
49
- ---
50
-
51
- ## Installation
52
-
53
- ```bash
54
- pip install tokenizers transformers
55
- ```
56
-
57
- ---
58
-
59
- ## Quick Start
60
-
61
- ### Method 1: Using the tokenizers library (Recommended)
62
-
63
- ```python
64
- from tokenizers import Tokenizer
65
- from pathlib import Path
66
-
67
- # Load tokenizer directly from Hugging Face Hub
68
- tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
69
-
70
- # Process binary data - MUST use latin-1 encoding
71
- binary_path = Path("/usr/bin/ls")
72
- raw_bytes = binary_path.read_bytes()
73
- text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string
74
-
75
- # Tokenize
76
- encoded = tokenizer.encode(text)
77
- tokens = encoded.ids
78
-
79
- print(f"File size: {len(raw_bytes):,} bytes")
80
- print(f"Tokens: {len(tokens):,}")
81
- print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")
82
-
83
- # Decode back to text (note: adds spaces between tokens due to BPE behavior)
84
- decoded = tokenizer.decode(tokens)
85
- ```
86
-
87
- **Expected Output** (for `/usr/bin/ls`):
88
- ```
89
- File size: 142,312 bytes
90
- Tokens: 54,537
91
- Compression: 2.609 bytes/token
92
- ```
93
-
94
- ### Method 2: Using transformers library
95
-
96
- ```python
97
- from transformers import PreTrainedTokenizerFast
98
- from tokenizers import Tokenizer
99
-
100
- # Load the base tokenizer
101
- base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
102
-
103
- # Wrap with PreTrainedTokenizerFast for transformers compatibility
104
- tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
105
-
106
- # Process binary data
107
- with open("/usr/bin/ls", "rb") as f:
108
- raw_bytes = f.read()
109
- text = raw_bytes.decode('latin-1')
110
-
111
- # Tokenize (returns dict with input_ids, attention_mask, etc.)
112
- result = tokenizer(text)
113
- tokens = result["input_ids"]
114
- ```
115
 
116
  ---
117
 
118
- ## Important: Data Format
119
-
120
- The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:
121
 
122
- ```python
123
- # ✅ CORRECT - Use latin-1 encoded bytes
124
- raw_bytes = b'\x7fELF\x01\x01'
125
- text = raw_bytes.decode('latin-1') # '\x7fELF\x01\x01'
126
- encoded = tokenizer.encode(text)
127
-
128
- # ❌ WRONG - Do not use hex strings
129
- hex_str = "7f 45 4c 46 01 01" # Will not work correctly
130
- ```
131
 
132
- **Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
 
 
133
 
134
  ---
135
 
136
- ## Performance Benchmarks
137
 
138
- ### Compression on Real-World Binaries
 
 
 
 
139
 
140
- Tested on `/usr/bin` binaries (not in training set):
141
-
142
- | Binary | Size | Tokens | bytes/token |
143
- |--------|------|--------|-------------|
144
- | bash | 1.38 MB | 602,719 | 2.399 |
145
- | python3.12 | 7.65 MB | 2,997,303 | 2.676 |
146
- | gcc-13 | 0.98 MB | 375,331 | 2.726 |
147
- | ls | 0.14 MB | 54,537 | 2.609 |
148
- | grep | 0.18 MB | 73,500 | 2.542 |
149
-
150
- **Average**: 2.590 bytes/token
151
-
152
- ### Information-Theoretic Efficiency
153
-
154
- - Binary entropy: ~6.5 bits/byte (estimated from empirical binary distributions)
155
- - Theoretical optimal: 2.46 bytes/token (16 bits / 6.5 bits/byte)
156
- - Our performance: 2.590 bytes/token
157
- - **Efficiency: 95.0%** of theoretical optimum
158
-
159
- ---
160
-
161
- ## Example: Tokenizing an ELF Header
162
-
163
- ```python
164
- from tokenizers import Tokenizer
165
-
166
- # Load tokenizer
167
- tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
168
-
169
- # ELF header bytes
170
- elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
171
- text = elf_header.decode('latin-1')
172
-
173
- # Tokenize
174
- encoded = tokenizer.encode(text)
175
- print(f"Original bytes: {elf_header.hex()}")
176
- print(f"Tokens: {encoded.ids}")
177
- print(f"Token count: {len(encoded.ids)}")
178
- print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")
179
-
180
- # Examine individual tokens
181
- for token_id, token_str in zip(encoded.ids, encoded.tokens):
182
- token_bytes = token_str.encode('latin-1')
183
- print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
184
- ```
185
 
186
  ---
187
 
188
- ## Token Distribution
189
 
190
- | Length | Count | Percentage | Examples |
191
- |--------|-------|------------|----------|
192
- | 1 byte | 256 | 0.4% | Base vocabulary (0x00-0xFF) |
193
- | 2 bytes | 28,562 | 43.6% | `0x48 0x8b` (REX.W prefix), `0xcc 0xcc` (int3 padding) |
194
- | 3 bytes | 10,801 | 16.5% | `0x48 0x8b 0xc0` (MOV rax, rax) |
195
- | 4 bytes | 14,376 | 21.9% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
196
- | 5-8 bytes | 8,489 | 13.0% | Multi-byte instructions, short strings |
197
- | 9+ bytes | 3,045 | 4.6% | Multi-instruction sequences, string literals |
 
 
 
198
 
199
- **Average token length**: 3.749 bytes
200
 
201
  ---
202
 
203
- ## Training Details
204
 
205
- ### Dataset
 
 
 
 
206
 
207
- **Source**: `/nas4/data/glaurung-data/binaries-small/`
208
-
209
- - **Size**: 23.0 GB (24,737,776,360 bytes)
210
- - **Processed Chunks**: 40,574 total (37,083 unique + 8,454 duplicates reused)
211
- - **Content**: Real-world compiled binaries including system utilities, libraries, and applications
212
-
213
- **Platform Distribution**:
214
- - Linux: Alpine, Debian, Ubuntu (ELF format)
215
- - Windows: 8, 10, 11 (PE format)
216
-
217
- **Architecture Distribution**:
218
- - x86-64 (primary)
219
- - x86-32
220
- - ARM64
221
-
222
- ### Training Parameters
223
-
224
- Trained using chunked BPE with deduplication and support-based merge combination:
225
-
226
- ```bash
227
- cargo run --release --bin bbpe -- chunk-train \
228
- --output glaurung-tokenizer-002.json \
229
- /nas4/data/glaurung-data/binaries-small/ \
230
- --vocab-size 65536 \
231
- --min-frequency 4 \
232
- --chunk-size 4194304 \
233
- --combine-mode support \
234
- --duplicate-mode count
235
- ```
236
-
237
- **Key Training Features**:
238
- - **Chunk Size**: 4 MB (4,194,304 bytes) per chunk
239
- - **Deduplication**: 8,454 duplicate chunks reused to improve merge quality
240
- - **Combine Mode**: Support-based union (65,273 merges realized)
241
- - **Entropy Filtering**: Skipped 27 low-entropy + 3,383 high-entropy chunks
242
 
243
  ---
244
 
245
- ## Use Cases
246
 
247
- ### Recommended For
 
 
248
 
249
- - Binary neural language models
250
- - Malware analysis and classification
251
- - Reverse engineering tools
252
- - Binary similarity detection
253
- - Code pattern recognition
254
- - Vulnerability research
255
- - Firmware analysis
256
-
257
- ### ❌ Not Recommended For
258
-
259
- - Text/source code (use text tokenizer like GPT-2, 100%+ penalty)
260
- - Very small binaries <1KB (overhead too high)
261
- - Real-time streaming (load time ~100ms)
262
 
263
  ---
264
 
265
- ## Comparison: 32K vs 64K Vocabulary
266
-
267
- This tokenizer uses a 64K vocabulary compared to the 32K baseline:
268
 
269
- | Metric | 32K Baseline | 64K (this tokenizer) | Improvement |
270
- |--------|--------------|---------------------|-------------|
271
- | Vocabulary | 32,761 | 65,536 | 2.0x larger |
272
- | Training method | Standard BPE | Chunked + deduplication | Advanced |
273
- | bytes/token | 1.925 | 2.590 | +34.6% |
274
- | Tokens for /usr/bin/ls | 74,519 | 54,537 | -26.8% fewer |
275
- | String-rich tokens | ~6% | 11.81% | Better semantic capture |
276
- | Avg token length | ~2.2 bytes | 3.749 bytes | Longer patterns |
277
-
278
- **Key Advantages of 64K Vocabulary**:
279
- - **36.6% fewer tokens** needed to encode binaries (faster processing)
280
- - **Captures longer patterns**: Instructions + operands, string literals, common sequences
281
- - **Better semantic understanding**: More function names, library paths, and meaningful strings
282
- - **Higher compression ratio**: 2.590 bytes/token vs 1.925 bytes/token
283
- - **Chunked training**: Deduplication-aware training improves merge quality
284
 
285
  ---
286
 
287
- ## Advanced Usage
288
-
289
- ### Batch Processing Multiple Files
290
 
 
291
  ```python
292
  from tokenizers import Tokenizer
293
- from pathlib import Path
294
- import numpy as np
295
 
 
296
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
297
-
298
- def tokenize_binary_file(file_path):
299
- """Tokenize a single binary file."""
300
- raw_bytes = Path(file_path).read_bytes()
301
- text = raw_bytes.decode('latin-1')
302
- encoded = tokenizer.encode(text)
303
- return {
304
- 'file': file_path,
305
- 'size_bytes': len(raw_bytes),
306
- 'token_count': len(encoded.ids),
307
- 'compression_ratio': len(raw_bytes) / len(encoded.ids),
308
- 'token_ids': encoded.ids
309
- }
310
-
311
- # Process directory
312
- binary_dir = Path("/usr/bin")
313
- results = []
314
- for binary_path in binary_dir.glob("*"):
315
- if binary_path.is_file():
316
- try:
317
- result = tokenize_binary_file(binary_path)
318
- results.append(result)
319
- except Exception as e:
320
- print(f"Error processing {binary_path}: {e}")
321
-
322
- # Analyze compression statistics
323
- compression_ratios = [r['compression_ratio'] for r in results]
324
- print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
325
- print(f"Std deviation: {np.std(compression_ratios):.3f}")
326
  ```
327
 
328
- ### Using with PyTorch/TensorFlow Models
 
 
 
 
 
329
 
 
330
  ```python
331
  from tokenizers import Tokenizer
332
- import torch
333
- from pathlib import Path
334
 
 
335
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
 
336
 
337
- def prepare_binary_for_model(file_path, max_length=512):
338
- """Prepare binary data for neural network input."""
339
- raw_bytes = Path(file_path).read_bytes()
340
- text = raw_bytes.decode('latin-1')
341
-
342
- # Tokenize
343
- encoded = tokenizer.encode(text)
344
- token_ids = encoded.ids
345
-
346
- # Truncate or pad to max_length
347
- if len(token_ids) > max_length:
348
- token_ids = token_ids[:max_length]
349
- else:
350
- # Pad with token ID 0 (or use a dedicated padding token)
351
- token_ids = token_ids + [0] * (max_length - len(token_ids))
352
-
353
- # Convert to tensor
354
- return torch.tensor(token_ids, dtype=torch.long)
355
-
356
- # Use in model
357
- binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
358
- print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024])
359
  ```
360
 
361
- ---
362
-
363
- ## Troubleshooting
364
-
365
- ### Issue: UnicodeDecodeError when processing binary
366
-
367
- **Solution**: Always use `latin-1` encoding, never `utf-8`:
368
- ```python
369
- # ✅ Correct
370
- text = raw_bytes.decode('latin-1')
371
-
372
- # ❌ Wrong
373
- text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes
374
  ```
 
 
 
375
 
376
- ### Issue: Decoded output doesn't match original
377
-
378
- **Cause**: BPE tokenizers add spaces between tokens during decoding.
379
-
380
- **Solution**: Use the raw token IDs and decode manually if exact byte recovery is needed:
381
- ```python
382
- # Get tokens without spaces
383
- tokens_no_spaces = ''.join(encoded.tokens)
384
- original_bytes = tokens_no_spaces.encode('latin-1')
 
 
 
 
 
385
  ```
386
 
387
- ### Issue: Poor compression on specific binary types
388
-
389
- **Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).
390
-
391
- **Solution**: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.
392
-
393
  ---
394
 
395
- ## Related Projects
396
-
397
- - **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
398
- - **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
399
- - **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
400
-
401
- ---
402
-
403
- ## Technical Architecture
404
 
405
- ### Vocabulary Structure
406
 
407
- - **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
408
- - **Merged tokens**: 65,273 learned byte-pair combinations
409
- - **Special tokens**: 7 reserved tokens
410
- - **Total**: 65,536 tokens (exactly 2^16)
 
 
 
411
 
412
- ### Special Tokens
413
 
414
- The tokenizer includes 7 special tokens for various modeling tasks:
415
- - `<|start|>` (ID: 65529) - Start of sequence marker
416
- - `<|end|>` (ID: 65530) - End of sequence marker
417
- - `<|pad|>` (ID: 65531) - Padding token
418
- - `<|unk|>` (ID: 65532) - Unknown token
419
- - `<|cls|>` (ID: 65533) - Classification token
420
- - `<|sep|>` (ID: 65534) - Separator token
421
- - `<|mask|>` (ID: 65535) - Mask token for masked language modeling
422
 
423
- These enable various training objectives (classification, masked LM, sequence-to-sequence) and help models distinguish between concatenated files.
424
 
425
- ### Token Properties
426
 
427
- **Instruction-aware patterns** (x86-64 examples):
428
- - REX prefixes: `0x48`, `0x4c`, `0x4d`
429
  - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
430
  - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
431
 
432
- **Common patterns**:
433
- - Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
434
- - Alignment: `0x00 0x00 0x00 0x00`
435
  - String terminators: `0x00` at word boundaries
436
 
437
- ---
 
 
438
 
439
- ## Performance Characteristics
440
-
441
- ### Load Time
442
-
443
- - **Tokenizer size**: 2.4 MB on disk
444
- - **Load time**: ~71 ms average (tested with Python tokenizers library)
445
-
446
- ### Encoding Speed
447
-
448
- Tested with Python tokenizers library on actual binaries:
449
-
450
- | Binary | Size | Encoding Time | Throughput |
451
- |--------|------|---------------|------------|
452
- | ls | 0.14 MB | 34 ms | 4.0 MB/s |
453
- | grep | 0.18 MB | 51 ms | 3.5 MB/s |
454
- | bash | 1.38 MB | 596 ms | 2.3 MB/s |
455
 
456
- **Average throughput**: ~2-4 MB/second (Python implementation)
457
 
458
- ---
459
 
460
- ## Limitations
 
 
 
461
 
462
- 1. **Cross-domain penalty**: Using on text data causes 100-140% efficiency loss
463
- 2. **Small file overhead**: Files <1KB have proportionally higher tokenization overhead
464
- 3. **Deterministic decoding**: Spaces inserted between tokens during decode (BPE behavior)
465
- 4. **Architecture bias**: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.
 
 
466
 
467
  ---
468
 
469
  ## Citation
470
 
471
- If you use this tokenizer in research, please cite:
 
 
472
 
 
473
  ```
474
  Glaurung Binary Tokenizer 002
475
  64K Binary Tokenizer for Neural Language Models
@@ -481,25 +266,19 @@ Performance: 2.590 bytes/token (95% of theoretical optimum)
481
  HuggingFace: mjbommar/glaurung-binary-tokenizer-002
482
  ```
483
 
 
 
484
  ---
485
 
486
  ## License
487
 
488
- Apache License 2.0
489
-
490
- This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.
491
-
492
- ---
493
-
494
- ## Support & Issues
495
 
496
- - **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
497
- - **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
498
- - **Email**: Contact maintainer via GitHub
499
 
500
  ---
501
 
502
- **Production Status**: Ready for deployment
503
- **Version**: 2.0.0
504
- **Release Date**: November 2025
505
- **Training Tool**: bbpe v0.3.2 (https://crates.io/crates/bbpe)
 
1
+ # glaurung-binary-tokenizer-002
 
 
 
2
 
3
+ A cross-platform BPE tokenizer for binary executables and machine code, trained with chunked BPE and per-chunk deduplication on 23 GB of diverse binaries spanning Linux and Windows platforms.
4
 
5
+ **🔗 Model**: [`mjbommar/glaurung-binary-tokenizer-002`](https://huggingface.co/mjbommar/glaurung-binary-tokenizer-002)
6
+ **📊 Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
7
+ **📄 Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)
8
 
9
  ## Overview
10
 
11
+ - **Vocabulary Size**: 65,536 tokens (2^16)
12
+ - **Token Composition**: 256 base bytes + 65,273 learned merges + 7 special tokens
13
+ - **Average Token Length**: 3.749 bytes
14
+ - **3-byte Instructions**: 16.5% of vocabulary (10,800 tokens)
15
+ - **Compression Ratio**: ~2.6 bytes/token on typical binaries
 
 
 
 
16
 
17
  ---
18
 
19
+ ## Training Configuration
 
 
20
 
21
+ **Training Corpus**:
22
+ - Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
23
+ - Size: ~23 GB (24.7 billion bytes)
24
+ - Processed Chunks: 40,574 total (37,083 unique + 8,454 duplicates reused)
25
+ - Platforms: Linux (Alpine, Debian, Ubuntu - ELF), Windows (8, 10, 11 - PE)
26
+ - Architectures: x86-64, x86-32, ARM64
 
 
 
27
 
28
+ **Training Parameters**:
29
+ - Vocabulary size: 65,536 (including 7 special tokens)
30
+ - Min frequency: 4
31
+ - Chunk size: 4,194,304 bytes (4 MB)
32
+ - Training method: Chunked BPE with deduplication and support-based merge combination
33
+ - Allowed lengths: DEFAULT (1-16 bytes)
34
+ - Training duration: ~8-9 hours
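+
+ The chunk-and-deduplicate preprocessing can be pictured with a short sketch. This is illustrative only, not the bbpe implementation; the directory path, helper name, and SHA-256 hashing are assumptions:
+
+ ```python
+ import hashlib
+ from pathlib import Path
+
+ CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the chunk size above
+
+ def iter_unique_chunks(root):
+     """Yield unique 4 MB chunks under `root`, counting how often each repeats."""
+     counts = {}
+     for path in Path(root).rglob("*"):
+         if not path.is_file():
+             continue
+         with path.open("rb") as f:
+             while chunk := f.read(CHUNK_SIZE):
+                 digest = hashlib.sha256(chunk).hexdigest()
+                 counts[digest] = counts.get(digest, 0) + 1
+                 if counts[digest] == 1:
+                     yield chunk  # duplicates are counted rather than re-processed
+ ```
+
+ Duplicate counts can then weight the merge statistics, which mirrors the `--duplicate-mode count` flag shown in the training command of the previous README.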
35
 
36
  ---
37
 
38
+ ## Vocabulary Statistics
39
 
40
+ **Composition**:
41
+ - Base bytes (0-255): 256 tokens
42
+ - Learned merges: 65,273 tokens
43
+ - Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
44
+ - **Total**: 65,536 tokens
45
 
46
+ **Quality Metrics**:
47
+ - All tokens reachable: ✓ Yes
48
+ - Valid merges: 65,273 / 65,273
49
+ - Power-of-2 size: ✓ Yes (2^16)
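+
+ A quick way to confirm these numbers from Python (the special-token IDs 65529-65535 below are taken from the previous README):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ print(tok.get_vocab_size())  # 65536 (2^16)
+
+ # The 7 reserved special tokens sit at the top of the ID space (65529-65535)
+ for name in ("<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"):
+     print(name, tok.token_to_id(name))
+ ```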
 
 
 
50
 
51
  ---
52
 
53
+ ## Token Length Distribution
54
 
55
+ | Length | Count | Percentage | Description |
56
+ |--------|-------|------------|-------------|
57
+ | 1 byte | 256 | 0.4% | Base bytes |
58
+ | 2 bytes | 28,561 | 43.6% | Byte pairs (most common patterns) |
59
+ | 3 bytes | 10,800 | 16.5% | Complete x86-64 instructions |
60
+ | 4 bytes | 14,376 | 21.9% | Instructions with operands |
61
+ | 5 bytes | 2,780 | 4.2% | Complex patterns |
62
+ | 6 bytes | 2,213 | 3.4% | Complex patterns |
63
+ | 7 bytes | 1,167 | 1.8% | Complex patterns |
64
+ | 8 bytes | 2,329 | 3.6% | Multi-byte sequences |
65
+ | 9+ bytes | 3,045 | 4.6% | Long patterns |
66
 
67
+ **Average Token Length**: 3.749 bytes
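+
+ The distribution above can be recomputed from the vocabulary itself. A minimal sketch, assuming vocabulary entries are stored as latin-1 strings (as the usage examples below indicate):
+
+ ```python
+ from collections import Counter
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+
+ # Byte length of every non-special vocabulary entry
+ lengths = Counter(
+     len(token.encode("latin-1")) for token in tok.get_vocab() if token not in specials
+ )
+ total = sum(lengths.values())
+ for n in sorted(lengths):
+     print(f"{n:2d} bytes: {lengths[n]:6d} ({lengths[n] / total:.1%})")
+ print(f"average: {sum(n * c for n, c in lengths.items()) / total:.3f} bytes")
+ ```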
68
 
69
  ---
70
 
71
+ ## Byte Content Analysis
72
 
73
+ **Content Categories**:
74
+ - Contains NULL byte (0x00): 17,418 tokens (26.6%)
75
+ - ASCII printable (0x20-0x7E): 9,478 tokens (14.5%)
76
+ - All ASCII (<0x80): 20,816 tokens (31.8%)
77
+ - High bytes (≥0x80): 44,711 tokens (68.2%)
78
 
79
+ **Most Common Bytes in Tokens**:
80
+ - `0x00` (NULL): 34,482 occurrences - Padding and alignment
81
+ - `0xFF`: 6,545 occurrences - Sentinel values
82
+ - `0x40` (@): 4,538 occurrences - ASCII and instruction patterns
83
+ - `0x48` (REX.W): 3,419 occurrences - x86-64 REX prefix
84
+ - `0x8B` (MOV): 2,486 occurrences - x86-64 MOV opcode
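+
+ These categories can be reproduced with a similar vocabulary scan; a sketch, noting that the exact category definitions behind the published numbers may differ slightly:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+ raw = [t.encode("latin-1") for t in tok.get_vocab() if t not in specials]
+
+ with_null = sum(1 for b in raw if 0x00 in b)
+ printable = sum(1 for b in raw if all(0x20 <= x <= 0x7E for x in b))
+ high_bytes = sum(1 for b in raw if any(x >= 0x80 for x in b))
+ print(f"contain NULL byte:    {with_null:6d} ({with_null / len(raw):.1%})")
+ print(f"ASCII printable only: {printable:6d} ({printable / len(raw):.1%})")
+ print(f"contain high bytes:   {high_bytes:6d} ({high_bytes / len(raw):.1%})")
+ ```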
 
 
 
85
 
86
  ---
87
 
88
+ ## Sequence Coverage
89
 
90
+ **N-byte Sequence Diversity**:
91
+ | Length | Learned Tokens | Possible Sequences | Coverage |
92
+ |--------|----------------|-------------------|----------|
93
+ | 1-byte | 256 | 256 | 100.00% |
94
+ | 2-byte | 28,561 | 65,536 | 43.58% |
95
+ | 3-byte | 10,800 | 16,777,216 | 0.064% |
96
+ | 4-byte | 14,376 | 4,294,967,296 | 0.00034% |
97
 
98
+ **Notable Achievement**: 43.6% coverage of all possible 2-byte sequences - excellent for pattern recognition.
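+
+ The coverage figures follow directly from the length distribution; a one-off check:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ specials = {"<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"}
+
+ for n in (1, 2, 3, 4):
+     learned = sum(
+         1 for t in tok.get_vocab() if t not in specials and len(t.encode("latin-1")) == n
+     )
+     print(f"{n}-byte tokens: {learned:6d} / {256**n:>13,d} = {learned / 256**n:.5%}")
+ ```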
 
 
99
 
100
  ---
101
 
102
+ ## Files
 
 
103
 
104
+ - `tokenizer-65536.json` - Trained tokenizer model (2.4 MB)
105
+ - `analysis_results.json` - Detailed analysis statistics
106
+ - `original_README.md` - Original README from HuggingFace
 
 
107
 
108
  ---
109
 
110
+ ## Usage
 
 
111
 
112
+ **Load from HuggingFace Hub**:
113
  ```python
114
  from tokenizers import Tokenizer
 
 
115
 
116
+ # Load directly from HuggingFace
117
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
 
 
 
118
  ```
119
 
120
+ **Load from local file**:
121
+ ```bash
122
+ # With bbpe CLI
123
+ bbpe encode --tokenizer tokenizer-65536.json /path/to/binary
124
+ bbpe info tokenizer-65536.json
125
+ ```
126
 
127
+ **Complete Python Example**:
128
  ```python
129
  from tokenizers import Tokenizer
 
 
130
 
131
+ # Load from HuggingFace or local file
132
  tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
133
+ # OR: tokenizer = Tokenizer.from_file("tokenizer-65536.json")
134
 
135
+ # Read binary file and decode as latin-1 (preserves all byte values 0-255)
136
+ with open("/usr/bin/ls", "rb") as f:
137
+     data = f.read()
138
+ data_str = data.decode("latin-1")
139
+
140
+ # Encode the binary data
141
+ encoding = tokenizer.encode(data_str)
142
+ print(f"File size: {len(data)} bytes")
143
+ print(f"Total tokens: {len(encoding.ids)}")
144
+ print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")
145
+
146
+ # First 10 tokens
147
+ for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
148
+     token_bytes = token.encode("latin-1")
149
+     print(f" Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")
150
+
151
+ # Decode tokens back to bytes
152
+ decoded_str = tokenizer.decode(encoding.ids)
153
+ decoded_bytes = decoded_str.encode("latin-1")
154
+ assert decoded_bytes == data # Perfect reconstruction
 
 
155
  ```
156
 
157
+ **Example output for `/usr/bin/ls` (142,312 bytes)**:
 
 
158
  ```
159
+ File size: 142312 bytes
160
+ Total tokens: 54537
161
+ Compression: 2.609 bytes/token
162
 
163
+ First 10 tokens:
164
+ Token 0: ID= 127 hex=7f (1 bytes)
165
+ Token 1: ID= 2382 hex=454c (2 bytes)
166
+ Token 2: ID= 5923 hex=4602 (2 bytes)
167
+ Token 3: ID= 394 hex=0101 (2 bytes)
168
+ Token 4: ID= 268 hex=000000000000 (6 bytes)
169
+ Token 5: ID= 259 hex=000000 (3 bytes)
170
+ Token 6: ID= 295 hex=0300 (2 bytes)
171
+ Token 7: ID= 2124 hex=3e00 (2 bytes)
172
+ Token 8: ID= 271 hex=01000000 (4 bytes)
173
+ Token 9: ID=59106 hex=306d (2 bytes)
174
+
175
+ Decoded: 7f454c4602010100000000000000000003003e0001000000306d...
176
+ (ELF header: 7f 45 4c 46 = ELF magic bytes)
177
  ```
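+
+ For use with the `transformers` library, the tokenizer can be wrapped in `PreTrainedTokenizerFast`, as the previous README showed; a minimal sketch:
+
+ ```python
+ from tokenizers import Tokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ base = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=base)
+
+ with open("/usr/bin/ls", "rb") as f:
+     text = f.read().decode("latin-1")
+
+ # Returns input_ids, attention_mask, etc., ready to feed a model
+ batch = hf_tokenizer(text)
+ print(len(batch["input_ids"]))
+ ```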
178
 
 
179
  ---
180
 
181
+ ## Performance Characteristics
 
 
182
 
183
+ **Compression on Real-World Binaries**:
184
 
185
+ | Binary | Size | Tokens | bytes/token |
186
+ |--------|------|--------|-------------|
187
+ | bash | 1.38 MB | 602,719 | 2.399 |
188
+ | python3.12 | 7.65 MB | 2,997,303 | 2.676 |
189
+ | gcc-13 | 0.98 MB | 375,331 | 2.726 |
190
+ | ls | 0.14 MB | 54,537 | 2.609 |
191
+ | grep | 0.18 MB | 73,500 | 2.542 |
192
 
193
+ **Average**: 2.590 bytes/token
194
 
195
+ **Information-Theoretic Efficiency**:
196
+ - Binary entropy: ~6.5 bits/byte
197
+ - Theoretical optimal: 2.46 bytes/token (16 bits per token / 6.5 bits per byte)
198
+ - Actual performance: 2.590 bytes/token
199
+ - **Efficiency: 95.0%** of theoretical optimum
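+
+ The table above can be reproduced with a short script; the binary paths are only examples and any files outside the training set will do:
+
+ ```python
+ from pathlib import Path
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ for name in ("/usr/bin/ls", "/usr/bin/grep", "/usr/bin/bash"):
+     data = Path(name).read_bytes()
+     n_tokens = len(tok.encode(data.decode("latin-1")).ids)
+     print(f"{name}: {len(data):>9,d} bytes, {n_tokens:>9,d} tokens, "
+           f"{len(data) / n_tokens:.3f} bytes/token")
+
+ # 16-bit token IDs over ~6.5 bits/byte of binary entropy
+ print(f"theoretical optimum: {16 / 6.5:.2f} bytes/token")
+ ```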
 
 
 
200
 
201
+ ---
202
 
203
+ ## Key Features
204
 
205
+ **Instruction-Aware Patterns**:
206
+ - REX prefixes: `0x48`, `0x4c`, `0x4d` (x86-64 64-bit operands)
207
  - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
208
  - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
209
 
210
+ **Common Binary Patterns**:
211
+ - Padding: `0xcc 0xcc` (INT3 debug breakpoints), `0x90 0x90` (NOP sleds)
212
+ - Alignment: `0x00 0x00 0x00 0x00` (NULL padding)
213
  - String terminators: `0x00` at word boundaries
214
 
215
+ **String-Rich Vocabulary**:
216
+ - 11.81% of vocabulary contains function names, paths, and library references
217
+ - Better semantic understanding than standard BPE
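+
+ A quick way to see the padding patterns above in action is to tokenize a synthetic padding region; exact token counts depend on the learned merges, so treat the output as illustrative:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tok = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ # 16 NOP bytes followed by 16 INT3 bytes, as seen between functions
+ padding = b"\x90" * 16 + b"\xcc" * 16
+ enc = tok.encode(padding.decode("latin-1"))
+ print(f"{len(padding)} bytes -> {len(enc.ids)} tokens")
+ print([t.encode('latin-1').hex() for t in enc.tokens])
+ ```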
218
 
219
+ ---
 
 
220
 
221
+ ## Comparison with Other Tokenizers
222
 
223
+ **vs. binary-tokenizer-001 Series** (this repository):
224
 
225
+ | Metric | 4K | 8K | 16K | 64K (this) | Improvement |
226
+ |--------|----|----|-----|------------|-------------|
227
+ | Vocab size | 4,096 | 8,192 | 16,384 | 65,536 | 4-16x larger |
228
+ | Avg token length | 3.000 | 3.312 | 3.498 | 3.749 | +25% vs 4K |
229
+ | 3-byte tokens % | 20.6% | 21.7% | 20.5% | 16.5% | Different focus |
230
+ | 2-byte coverage | 3.0% | 5.6% | 10.9% | 43.6% | 14x better |
231
+ | Compression (ls) | 2.00 | 2.17 | 2.39 | 2.61 | +30% vs 4K |
232
+ | Training method | Standard | Standard | Standard | Chunked+dedup | Advanced |
233
 
234
+ **Key Advantages of 64K Vocabulary**:
235
+ - **43.6% 2-byte coverage**: Captures nearly half of all possible byte pairs
236
+ - **Chunked training**: Deduplication-aware training improves merge quality
237
+ - **Better compression**: 2.609 bytes/token vs 2.0 bytes/token (4K)
238
+ - **Longer patterns**: 3.749 byte average vs 3.0 bytes (4K)
239
+ - **String-rich**: 11.81% vocabulary contains semantic strings
240
 
241
  ---
242
 
243
  ## Citation
244
 
245
+ If you use this tokenizer in your research, please cite:
246
+
247
+ ```bibtex
248
+ @article{bommarito2025binarybpe,
249
+ title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
250
+ author={Bommarito II, Michael J.},
251
+ journal={arXiv preprint},
252
+ year={2025},
253
+ note={Preprint coming soon}
254
+ }
255
+ ```
256
 
257
+ **Also cite the original Glaurung tokenizer**:
258
  ```
259
  Glaurung Binary Tokenizer 002
260
  64K Binary Tokenizer for Neural Language Models
 
266
  HuggingFace: mjbommar/glaurung-binary-tokenizer-002
267
  ```
268
 
269
+ **Author**: Michael J. Bommarito II ([michael.bommarito@gmail.com](mailto:michael.bommarito@gmail.com))
270
+
271
  ---
272
 
273
  ## License
274
 
275
+ **Apache License 2.0**
 
 
276
 
277
+ This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project.
 
 
278
 
279
  ---
280
 
281
+ **Generated**: November 13, 2025
282
+ **Original Model**: `mjbommar/glaurung-binary-tokenizer-002`
283
+ **Training Tool**: bbpe v0.3.2
284
+ **Analysis Script**: `analyze_tokenizer.py`