mjbommar committed on
Commit
503bcf9
·
verified ·
1 Parent(s): 630967b

Upload Glaurung Binary Tokenizer 002 artifacts

Files changed (3)
  1. README.md +505 -0
  2. README.md~ +485 -0
  3. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,505 @@
1
+ ---
2
+ language:
3
+ - code
4
+ license: apache-2.0
5
+ tags:
6
+ - binary-analysis
7
+ - tokenizer
8
+ - bpe
9
+ - malware-analysis
10
+ - reverse-engineering
11
+ - security
12
+ - x86-64
13
+ - arm64
14
+ - elf
15
+ - pe
16
+ library_name: tokenizers
17
+ pipeline_tag: feature-extraction
18
+ ---
19
+
20
+ # Glaurung Binary Tokenizer 002
21
+
22
+ **BPE tokenizer for binary executables**
23
+
24
+ 🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)
25
+
26
+ ---
27
+
28
+ ## Overview
29
+
30
+ **Glaurung Binary Tokenizer 002** is a specialized Byte Pair Encoding (BPE) tokenizer for compiled binary data across multiple architectures and formats (x86-64, ARM64; Windows PE, Linux ELF). It is trained with chunked BPE, chunk deduplication, and support-based merge combination, which improves compression on binary executables over the earlier 32K baseline (see the comparison below).
31
+
32
+ ### Key Specifications
33
+
34
+ - **Vocabulary Size**: 65,536 tokens (256 base + 65,273 merges + 7 special tokens)
35
+ - **Compression**: 2.590 bytes/token average (tested on real-world binaries)
36
+ - **Training Data**: 23.0 GB corpus, 40,574 processed chunks with deduplication
37
+ - **Training Method**: Chunked training with 4MB chunks, support-based merge combination
38
+ - **Architectures**: x86-64, x86-32, ARM64
39
+ - **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
40
+ - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
41
+
42
+ ### Performance Highlights
43
+
44
+ - **34.6% better compression** (bytes/token) than the 32K baseline, which needs roughly 37% more tokens for the same binaries
45
+ - **95% of theoretical maximum** compression efficiency
46
+ - **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
47
+ - **String-rich**: 11.81% of vocabulary contains function names, paths, library references
48
+
49
+ ---
50
+
51
+ ## Installation
52
+
53
+ ```bash
54
+ pip install tokenizers transformers
55
+ ```
56
+
57
+ ---
58
+
59
+ ## Quick Start
60
+
61
+ ### Method 1: Using the tokenizers library (Recommended)
62
+
63
+ ```python
64
+ from tokenizers import Tokenizer
65
+ from pathlib import Path
66
+
67
+ # Load tokenizer directly from Hugging Face Hub
68
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
69
+
70
+ # Process binary data - MUST use latin-1 encoding
71
+ binary_path = Path("/usr/bin/ls")
72
+ raw_bytes = binary_path.read_bytes()
73
+ text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string
74
+
75
+ # Tokenize
76
+ encoded = tokenizer.encode(text)
77
+ tokens = encoded.ids
78
+
79
+ print(f"File size: {len(raw_bytes):,} bytes")
80
+ print(f"Tokens: {len(tokens):,}")
81
+ print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")
82
+
83
+ # Decode back to text (note: adds spaces between tokens due to BPE behavior)
84
+ decoded = tokenizer.decode(tokens)
85
+ ```
86
+
87
+ **Expected Output** (for `/usr/bin/ls`):
88
+ ```
89
+ File size: 142,312 bytes
90
+ Tokens: 54,537
91
+ Compression: 2.609 bytes/token
92
+ ```
93
+
94
+ ### Method 2: Using transformers library
95
+
96
+ ```python
97
+ from transformers import PreTrainedTokenizerFast
98
+ from tokenizers import Tokenizer
99
+
100
+ # Load the base tokenizer
101
+ base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
102
+
103
+ # Wrap with PreTrainedTokenizerFast for transformers compatibility
104
+ tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
105
+
106
+ # Process binary data
107
+ with open("/usr/bin/ls", "rb") as f:
108
+ raw_bytes = f.read()
109
+ text = raw_bytes.decode('latin-1')
110
+
111
+ # Tokenize (returns dict with input_ids, attention_mask, etc.)
112
+ result = tokenizer(text)
113
+ tokens = result["input_ids"]
114
+ ```
115
+
116
+ ---
117
+
118
+ ## Important: Data Format
119
+
120
+ The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:
121
+
122
+ ```python
123
+ # ✅ CORRECT - Use latin-1 encoded bytes
124
+ raw_bytes = b'\x7fELF\x01\x01'
125
+ text = raw_bytes.decode('latin-1') # → '\x7fELF\x01\x01'
126
+ encoded = tokenizer.encode(text)
127
+
128
+ # ❌ WRONG - Do not use hex strings
129
+ hex_str = "7f 45 4c 46 01 01" # Will not work correctly
130
+ ```
131
+
132
+ **Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
133
+
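+ A minimal round-trip sketch (standard library only) showing why latin-1 is lossless for arbitrary bytes:
+
+ ```python
+ import os
+
+ # Any byte sequence, including values that are invalid UTF-8
+ raw_bytes = os.urandom(64) + b'\x7fELF\x00\xff'
+
+ # bytes -> str -> bytes is exact with latin-1
+ text = raw_bytes.decode('latin-1')
+ assert text.encode('latin-1') == raw_bytes
+ assert len(text) == len(raw_bytes)  # one character per byte
+ print("latin-1 round-trip is lossless")
+ ```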
134
+ ---
135
+
136
+ ## Performance Benchmarks
137
+
138
+ ### Compression on Real-World Binaries
139
+
140
+ Tested on `/usr/bin` binaries (not in training set):
141
+
142
+ | Binary | Size | Tokens | bytes/token |
143
+ |--------|------|--------|-------------|
144
+ | bash | 1.38 MB | 602,719 | 2.399 |
145
+ | python3.12 | 7.65 MB | 2,997,303 | 2.676 |
146
+ | gcc-13 | 0.98 MB | 375,331 | 2.726 |
147
+ | ls | 0.14 MB | 54,537 | 2.609 |
148
+ | grep | 0.18 MB | 73,500 | 2.542 |
149
+
150
+ **Average**: 2.590 bytes/token
151
+
152
+ ### Information-Theoretic Efficiency
153
+
154
+ - Binary entropy: ~6.5 bits/byte (estimated from empirical binary distributions)
155
+ - Theoretical optimal: 2.46 bytes/token (16 bits / 6.5 bits/byte)
156
+ - Our performance: 2.590 bytes/token
157
+ - **Efficiency: 95.0%** of theoretical optimum
158
+
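+ A quick arithmetic check of these figures (a sketch; the entropy estimate and the efficiency definition above are taken as given):
+
+ ```python
+ # Reproduce the efficiency numbers quoted above.
+ bits_per_token = 16          # 65,536-token vocabulary = 2^16
+ entropy_bits_per_byte = 6.5  # empirical estimate quoted above
+
+ theoretical_bytes_per_token = bits_per_token / entropy_bits_per_byte  # ~2.46
+ measured_bytes_per_token = 2.590                                      # benchmark average above
+
+ efficiency = theoretical_bytes_per_token / measured_bytes_per_token
+ print(f"theoretical optimum: {theoretical_bytes_per_token:.2f} bytes/token")
+ print(f"efficiency: {efficiency:.1%}")  # ~95%
+ ```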
159
+ ---
160
+
161
+ ## Example: Tokenizing an ELF Header
162
+
163
+ ```python
164
+ from tokenizers import Tokenizer
165
+
166
+ # Load tokenizer
167
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
168
+
169
+ # ELF header bytes
170
+ elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
171
+ text = elf_header.decode('latin-1')
172
+
173
+ # Tokenize
174
+ encoded = tokenizer.encode(text)
175
+ print(f"Original bytes: {elf_header.hex()}")
176
+ print(f"Tokens: {encoded.ids}")
177
+ print(f"Token count: {len(encoded.ids)}")
178
+ print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")
179
+
180
+ # Examine individual tokens
181
+ for token_id, token_str in zip(encoded.ids, encoded.tokens):
182
+ token_bytes = token_str.encode('latin-1')
183
+ print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
184
+ ```
185
+
186
+ ---
187
+
188
+ ## Token Distribution
189
+
190
+ | Length | Count | Percentage | Examples |
191
+ |--------|-------|------------|----------|
192
+ | 1 byte | 256 | 0.4% | Base vocabulary (0x00-0xFF) |
193
+ | 2 bytes | 28,562 | 43.6% | `0x48 0x8b` (REX.W prefix + MOV opcode), `0xcc 0xcc` (int3 padding) |
194
+ | 3 bytes | 10,801 | 16.5% | `0x48 0x8b 0xc0` (MOV rax, rax) |
195
+ | 4 bytes | 14,376 | 21.9% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
196
+ | 5-8 bytes | 8,489 | 13.0% | Multi-byte instructions, short strings |
197
+ | 9+ bytes | 3,045 | 4.6% | Multi-instruction sequences, string literals |
198
+
199
+ **Average token length**: 3.749 bytes
200
+
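+ The distribution above can be recomputed from the released vocabulary; a minimal sketch (assuming the latin-1 byte-to-character convention and that special tokens are the `<|...|>` entries listed later in this card):
+
+ ```python
+ from collections import Counter
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ lengths = Counter()
+ total_bytes = 0
+ count = 0
+ for token, _id in tokenizer.get_vocab().items():
+     if token.startswith("<|") and token.endswith("|>"):
+         continue  # skip the 7 special tokens
+     n = len(token.encode('latin-1'))  # token length in bytes
+     lengths[n] += 1
+     total_bytes += n
+     count += 1
+
+ print(f"average token length: {total_bytes / count:.3f} bytes")
+ for n in sorted(lengths):
+     print(f"{n:2d} bytes: {lengths[n]:6d} tokens")
+ ```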
201
+ ---
202
+
203
+ ## Training Details
204
+
205
+ ### Dataset
206
+
207
+ **Source**: `/nas4/data/glaurung-data/binaries-small/`
208
+
209
+ - **Size**: 23.0 GB (24,737,776,360 bytes)
210
+ - **Processed Chunks**: 40,574 total (37,083 unique + 8,454 duplicates reused)
211
+ - **Content**: Real-world compiled binaries including system utilities, libraries, and applications
212
+
213
+ **Platform Distribution**:
214
+ - Linux: Alpine, Debian, Ubuntu (ELF format)
215
+ - Windows: 8, 10, 11 (PE format)
216
+
217
+ **Architecture Distribution**:
218
+ - x86-64 (primary)
219
+ - x86-32
220
+ - ARM64
221
+
222
+ ### Training Parameters
223
+
224
+ Trained using chunked BPE with deduplication and support-based merge combination:
225
+
226
+ ```bash
227
+ cargo run --release --bin bbpe -- chunk-train \
228
+ --output glaurung-tokenizer-002.json \
229
+ /nas4/data/glaurung-data/binaries-small/ \
230
+ --vocab-size 65536 \
231
+ --min-frequency 4 \
232
+ --chunk-size 4194304 \
233
+ --combine-mode support \
234
+ --duplicate-mode count
235
+ ```
236
+
237
+ **Key Training Features**:
238
+ - **Chunk Size**: 4 MB (4,194,304 bytes) per chunk
239
+ - **Deduplication**: 8,454 duplicate chunks reused to improve merge quality
240
+ - **Combine Mode**: Support-based union (65,273 merges realized)
241
+ - **Entropy Filtering**: Skipped 27 low-entropy + 3,383 high-entropy chunks
242
+
243
+ ---
244
+
245
+ ## Use Cases
246
+
247
+ ### ✅ Recommended For
248
+
249
+ - Binary neural language models
250
+ - Malware analysis and classification
251
+ - Reverse engineering tools
252
+ - Binary similarity detection
253
+ - Code pattern recognition
254
+ - Vulnerability research
255
+ - Firmware analysis
256
+
257
+ ### ❌ Not Recommended For
258
+
259
+ - Text/source code (use a text tokenizer such as GPT-2; expect a 100%+ efficiency penalty here)
260
+ - Very small binaries <1KB (overhead too high)
261
+ - Real-time streaming (load time ~100ms)
262
+
263
+ ---
264
+
265
+ ## Comparison: 32K vs 64K Vocabulary
266
+
267
+ This tokenizer uses a 64K vocabulary compared to the 32K baseline:
268
+
269
+ | Metric | 32K Baseline | 64K (this tokenizer) | Improvement |
270
+ |--------|--------------|---------------------|-------------|
271
+ | Vocabulary | 32,761 | 65,536 | 2.0x larger |
272
+ | Training method | Standard BPE | Chunked + deduplication | Advanced |
273
+ | bytes/token | 1.925 | 2.590 | +34.6% |
274
+ | Tokens for /usr/bin/ls | 74,519 | 54,537 | 26.8% fewer |
275
+ | String-rich tokens | ~6% | 11.81% | Better semantic capture |
276
+ | Avg token length | ~2.2 bytes | 3.749 bytes | Longer patterns |
277
+
278
+ **Key Advantages of 64K Vocabulary**:
279
+ - **~27% fewer tokens** needed to encode the same binaries (the 32K baseline needs ~37% more), speeding up downstream processing
280
+ - **Captures longer patterns**: Instructions + operands, string literals, common sequences
281
+ - **Better semantic understanding**: More function names, library paths, and meaningful strings
282
+ - **Higher compression ratio**: 2.590 bytes/token vs 1.925 bytes/token
283
+ - **Chunked training**: Deduplication-aware training improves merge quality
284
+
285
+ ---
286
+
287
+ ## Advanced Usage
288
+
289
+ ### Batch Processing Multiple Files
290
+
291
+ ```python
292
+ from tokenizers import Tokenizer
293
+ from pathlib import Path
294
+ import numpy as np
295
+
296
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
297
+
298
+ def tokenize_binary_file(file_path):
299
+ """Tokenize a single binary file."""
300
+ raw_bytes = Path(file_path).read_bytes()
301
+ text = raw_bytes.decode('latin-1')
302
+ encoded = tokenizer.encode(text)
303
+ return {
304
+ 'file': file_path,
305
+ 'size_bytes': len(raw_bytes),
306
+ 'token_count': len(encoded.ids),
307
+ 'compression_ratio': len(raw_bytes) / len(encoded.ids),
308
+ 'token_ids': encoded.ids
309
+ }
310
+
311
+ # Process directory
312
+ binary_dir = Path("/usr/bin")
313
+ results = []
314
+ for binary_path in binary_dir.glob("*"):
315
+ if binary_path.is_file():
316
+ try:
317
+ result = tokenize_binary_file(binary_path)
318
+ results.append(result)
319
+ except Exception as e:
320
+ print(f"Error processing {binary_path}: {e}")
321
+
322
+ # Analyze compression statistics
323
+ compression_ratios = [r['compression_ratio'] for r in results]
324
+ print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
325
+ print(f"Std deviation: {np.std(compression_ratios):.3f}")
326
+ ```
327
+
328
+ ### Using with PyTorch/TensorFlow Models
329
+
330
+ ```python
331
+ from tokenizers import Tokenizer
332
+ import torch
333
+ from pathlib import Path
334
+
335
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
336
+
337
+ def prepare_binary_for_model(file_path, max_length=512):
338
+ """Prepare binary data for neural network input."""
339
+ raw_bytes = Path(file_path).read_bytes()
340
+ text = raw_bytes.decode('latin-1')
341
+
342
+ # Tokenize
343
+ encoded = tokenizer.encode(text)
344
+ token_ids = encoded.ids
345
+
346
+ # Truncate or pad to max_length
347
+ if len(token_ids) > max_length:
348
+ token_ids = token_ids[:max_length]
349
+ else:
350
+ # Pad with token ID 0 (or use a dedicated padding token)
351
+ token_ids = token_ids + [0] * (max_length - len(token_ids))
352
+
353
+ # Convert to tensor
354
+ return torch.tensor(token_ids, dtype=torch.long)
355
+
356
+ # Use in model
357
+ binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
358
+ print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024])
359
+ ```
360
+
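+ Instead of padding with ID 0, you can pad with the dedicated `<|pad|>` token; a sketch, assuming the special-token IDs listed under Technical Architecture below:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ pad_id = tokenizer.token_to_id("<|pad|>")  # 65531 per the special-token table
+ print(f"pad token id: {pad_id}")
+ # ...then use pad_id in place of 0 when padding token_ids up to max_length
+ ```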
361
+ ---
362
+
363
+ ## Troubleshooting
364
+
365
+ ### Issue: UnicodeDecodeError when processing binary
366
+
367
+ **Solution**: Always use `latin-1` encoding, never `utf-8`:
368
+ ```python
369
+ # ✅ Correct
370
+ text = raw_bytes.decode('latin-1')
371
+
372
+ # ❌ Wrong
373
+ text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes
374
+ ```
375
+
376
+ ### Issue: Decoded output doesn't match original
377
+
378
+ **Cause**: BPE tokenizers add spaces between tokens during decoding.
379
+
380
+ **Solution**: Use the raw token IDs and decode manually if exact byte recovery is needed:
381
+ ```python
382
+ # Get tokens without spaces
383
+ tokens_no_spaces = ''.join(encoded.tokens)
384
+ original_bytes = tokens_no_spaces.encode('latin-1')
385
+ ```
386
+
387
+ ### Issue: Poor compression on specific binary types
388
+
389
+ **Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).
390
+
391
+ **Solution**: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.
392
+
393
+ ---
394
+
395
+ ## Related Projects
396
+
397
+ - **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
398
+ - **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
399
+ - **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
400
+
401
+ ---
402
+
403
+ ## Technical Architecture
404
+
405
+ ### Vocabulary Structure
406
+
407
+ - **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
408
+ - **Merged tokens**: 65,273 learned byte-pair combinations
409
+ - **Special tokens**: 7 reserved tokens
410
+ - **Total**: 65,536 tokens (exactly 2^16)
411
+
412
+ ### Special Tokens
413
+
414
+ The tokenizer includes 7 special tokens for various modeling tasks:
415
+ - `<|start|>` (ID: 65529) - Start of sequence marker
416
+ - `<|end|>` (ID: 65530) - End of sequence marker
417
+ - `<|pad|>` (ID: 65531) - Padding token
418
+ - `<|unk|>` (ID: 65532) - Unknown token
419
+ - `<|cls|>` (ID: 65533) - Classification token
420
+ - `<|sep|>` (ID: 65534) - Separator token
421
+ - `<|mask|>` (ID: 65535) - Mask token for masked language modeling
422
+
423
+ These enable various training objectives (classification, masked LM, sequence-to-sequence) and help models distinguish between concatenated files.
424
+
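+ When wrapping the tokenizer for `transformers`, these names can be registered explicitly so that padding, masking, and truncation use the reserved IDs; a sketch (the keyword arguments below are standard `PreTrainedTokenizerFast` options, not part of this repository):
+
+ ```python
+ from tokenizers import Tokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ base = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+ print(base.token_to_id("<|pad|>"), base.token_to_id("<|mask|>"))  # 65531 65535
+
+ tokenizer = PreTrainedTokenizerFast(
+     tokenizer_object=base,
+     bos_token="<|start|>",
+     eos_token="<|end|>",
+     pad_token="<|pad|>",
+     unk_token="<|unk|>",
+     cls_token="<|cls|>",
+     sep_token="<|sep|>",
+     mask_token="<|mask|>",
+ )
+ ```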
425
+ ### Token Properties
426
+
427
+ **Instruction-aware patterns** (x86-64 examples):
428
+ - REX prefixes: `0x48`, `0x4c`, `0x4d`
429
+ - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
430
+ - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
431
+
432
+ **Common patterns**:
433
+ - Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
434
+ - Alignment: `0x00 0x00 0x00 0x00`
435
+ - String terminators: `0x00` at word boundaries
436
+
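+ One simple way to see such patterns is to encode a synthetic padding run and inspect the resulting tokens (a sketch; the exact token boundaries depend on the learned merges):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ padding_run = (b'\xcc' * 16).decode('latin-1')  # a run of int3 padding bytes
+ encoded = tokenizer.encode(padding_run)
+ for token_id, token in zip(encoded.ids, encoded.tokens):
+     print(token_id, token.encode('latin-1').hex())
+ ```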
437
+ ---
438
+
439
+ ## Performance Characteristics
440
+
441
+ ### Load Time
442
+
443
+ - **Tokenizer size**: 2.4 MB on disk
444
+ - **Load time**: ~71 ms average (tested with Python tokenizers library)
445
+
446
+ ### Encoding Speed
447
+
448
+ Tested with Python tokenizers library on actual binaries:
449
+
450
+ | Binary | Size | Encoding Time | Throughput |
451
+ |--------|------|---------------|------------|
452
+ | ls | 0.14 MB | 34 ms | 4.0 MB/s |
453
+ | grep | 0.18 MB | 51 ms | 3.5 MB/s |
454
+ | bash | 1.38 MB | 596 ms | 2.3 MB/s |
455
+
456
+ **Average throughput**: ~2-4 MB/second (Python implementation)
457
+
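+ These numbers are easy to reproduce on your own binaries; a minimal timing sketch using the standard library (the file path is an example and measured throughput will vary by machine):
+
+ ```python
+ import time
+ from pathlib import Path
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-002")
+
+ raw_bytes = Path("/usr/bin/ls").read_bytes()
+ text = raw_bytes.decode('latin-1')
+
+ start = time.perf_counter()
+ encoded = tokenizer.encode(text)
+ elapsed = time.perf_counter() - start
+
+ mb = len(raw_bytes) / 1e6
+ print(f"{mb:.2f} MB in {elapsed * 1000:.0f} ms -> {mb / elapsed:.1f} MB/s")
+ print(f"{len(encoded.ids):,} tokens")
+ ```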
458
+ ---
459
+
460
+ ## Limitations
461
+
462
+ 1. **Cross-domain penalty**: applying the tokenizer to plain text incurs a 100-140% efficiency loss versus a text tokenizer
463
+ 2. **Small file overhead**: Files <1KB have proportionally higher tokenization overhead
464
+ 3. **Decoding artifacts**: spaces are inserted between tokens during decode (BPE decoder behavior; see Troubleshooting for lossless byte recovery)
465
+ 4. **Architecture bias**: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.
466
+
467
+ ---
468
+
469
+ ## Citation
470
+
471
+ If you use this tokenizer in research, please cite:
472
+
473
+ ```
474
+ Glaurung Binary Tokenizer 002
475
+ 64K Binary Tokenizer for Neural Language Models
476
+ Vocabulary: 65,536 tokens (256 base + 65,273 merges + 7 special)
477
+ Training: October-November 2025
478
+ Training Method: Chunked BPE with deduplication (bbpe v0.3.2)
479
+ Dataset: 23GB binaries-small (40,574 chunks, 8,454 duplicates)
480
+ Performance: 2.590 bytes/token (95% of theoretical optimum)
481
+ HuggingFace: mjbommar/glaurung-binary-tokenizer-002
482
+ ```
483
+
484
+ ---
485
+
486
+ ## License
487
+
488
+ Apache License 2.0
489
+
490
+ This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.
491
+
492
+ ---
493
+
494
+ ## Support & Issues
495
+
496
+ - **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
497
+ - **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
498
+ - **Email**: Contact maintainer via GitHub
499
+
500
+ ---
501
+
502
+ **Production Status**: ✅ Ready for deployment
503
+ **Version**: 2.0.0
504
+ **Release Date**: November 2025
505
+ **Training Tool**: bbpe v0.3.2 (https://crates.io/crates/bbpe)
README.md~ ADDED
@@ -0,0 +1,485 @@
1
+ ---
2
+ language:
3
+ - code
4
+ license: apache-2.0
5
+ tags:
6
+ - binary-analysis
7
+ - tokenizer
8
+ - bpe
9
+ - malware-analysis
10
+ - reverse-engineering
11
+ - security
12
+ - x86-64
13
+ - arm64
14
+ - elf
15
+ - pe
16
+ library_name: tokenizers
17
+ pipeline_tag: feature-extraction
18
+ ---
19
+
20
+ # Glaurung Binary Tokenizer 001
21
+
22
+ **Production-ready 64K vocabulary BPE tokenizer for binary executables**
23
+
24
+ 🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)
25
+
26
+ ---
27
+
28
+ ## Overview
29
+
30
+ **Glaurung Binary Tokenizer 001** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005).
31
+
32
+ ### Key Specifications
33
+
34
+ - **Vocabulary Size**: 65,536 tokens (exactly 2^16 = 64K)
35
+ - **Compression**: 2.849 bytes/token average
36
+ - **Training Data**: 13GB corpus, 30,738 binaries
37
+ - **Architectures**: x86-64, x86-32, ARM64
38
+ - **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
39
+ - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
40
+
41
+ ### Performance Highlights
42
+
43
+ - **9-10% better compression** than 32K baseline
44
+ - **86% of theoretical maximum** compression efficiency
45
+ - **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
46
+ - **String-rich**: 5.76% of vocabulary contains function names, paths, library references
47
+
48
+ ---
49
+
50
+ ## Installation
51
+
52
+ ```bash
53
+ pip install tokenizers transformers
54
+ ```
55
+
56
+ ---
57
+
58
+ ## Quick Start
59
+
60
+ ### Method 1: Using the tokenizers library (Recommended)
61
+
62
+ ```python
63
+ from tokenizers import Tokenizer
64
+ from pathlib import Path
65
+
66
+ # Load tokenizer directly from Hugging Face Hub
67
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
68
+
69
+ # Process binary data - MUST use latin-1 encoding
70
+ binary_path = Path("/usr/bin/ls")
71
+ raw_bytes = binary_path.read_bytes()
72
+ text = raw_bytes.decode('latin-1') # Convert bytes to latin-1 string
73
+
74
+ # Tokenize
75
+ encoded = tokenizer.encode(text)
76
+ tokens = encoded.ids
77
+
78
+ print(f"File size: {len(raw_bytes):,} bytes")
79
+ print(f"Tokens: {len(tokens):,}")
80
+ print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")
81
+
82
+ # Decode back to text (note: adds spaces between tokens due to BPE behavior)
83
+ decoded = tokenizer.decode(tokens)
84
+ ```
85
+
86
+ **Expected Output** (for `/usr/bin/ls`):
87
+ ```
88
+ File size: 142,144 bytes
89
+ Tokens: 49,574
90
+ Compression: 2.866 bytes/token
91
+ ```
92
+
93
+ ### Method 2: Using transformers library
94
+
95
+ ```python
96
+ from transformers import PreTrainedTokenizerFast
97
+ from tokenizers import Tokenizer
98
+
99
+ # Load the base tokenizer
100
+ base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
101
+
102
+ # Wrap with PreTrainedTokenizerFast for transformers compatibility
103
+ tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
104
+
105
+ # Process binary data
106
+ with open("/usr/bin/ls", "rb") as f:
107
+ raw_bytes = f.read()
108
+ text = raw_bytes.decode('latin-1')
109
+
110
+ # Tokenize (returns dict with input_ids, attention_mask, etc.)
111
+ result = tokenizer(text)
112
+ tokens = result["input_ids"]
113
+ ```
114
+
115
+ ---
116
+
117
+ ## Important: Data Format
118
+
119
+ The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:
120
+
121
+ ```python
122
+ # ✅ CORRECT - Use latin-1 encoded bytes
123
+ raw_bytes = b'\x7fELF\x01\x01'
124
+ text = raw_bytes.decode('latin-1') # → '\x7fELF\x01\x01'
125
+ encoded = tokenizer.encode(text)
126
+
127
+ # ❌ WRONG - Do not use hex strings
128
+ hex_str = "7f 45 4c 46 01 01" # Will not work correctly
129
+ ```
130
+
131
+ **Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
132
+
133
+ ---
134
+
135
+ ## Performance Benchmarks
136
+
137
+ ### Compression on Real-World Binaries
138
+
139
+ Tested on `/usr/bin` binaries (not in training set):
140
+
141
+ | Binary | Size | Tokens | bytes/token |
142
+ |--------|------|--------|-------------|
143
+ | bash | 1.38 MB | 535,541 | 2.698 |
144
+ | python3.12 | 7.65 MB | 2,801,226 | 2.863 |
145
+ | gcc-13 | 0.98 MB | 344,201 | 2.986 |
146
+ | ls | 0.14 MB | 49,574 | 2.866 |
147
+ | grep | 0.18 MB | 67,567 | 2.667 |
148
+
149
+ **Average**: 2.849 bytes/token
150
+
151
+ ### Information-Theoretic Efficiency
152
+
153
+ - Binary entropy: ~6.5 bits/byte
154
+ - Theoretical optimal: 2.46 bytes/token
155
+ - Our performance: 2.849 bytes/token
156
+ - **Efficiency: 86%** of theoretical optimum
157
+
158
+ ---
159
+
160
+ ## Example: Tokenizing an ELF Header
161
+
162
+ ```python
163
+ from tokenizers import Tokenizer
164
+
165
+ # Load tokenizer
166
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
167
+
168
+ # ELF header bytes
169
+ elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
170
+ text = elf_header.decode('latin-1')
171
+
172
+ # Tokenize
173
+ encoded = tokenizer.encode(text)
174
+ print(f"Original bytes: {elf_header.hex()}")
175
+ print(f"Tokens: {encoded.ids}")
176
+ print(f"Token count: {len(encoded.ids)}")
177
+ print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")
178
+
179
+ # Examine individual tokens
180
+ for token_id, token_str in zip(encoded.ids, encoded.tokens):
181
+ token_bytes = token_str.encode('latin-1')
182
+ print(f" Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
183
+ ```
184
+
185
+ ---
186
+
187
+ ## Token Distribution
188
+
189
+ | Length | Count | Percentage | Examples |
190
+ |--------|-------|------------|----------|
191
+ | 2 bytes | 31,528 | 48.3% | `0x48 0x8b` (REX.W prefix), `0xcc 0xcc` (int3 padding) |
192
+ | 3 bytes | 9,261 | 14.2% | `0x48 0x8b 0xc0` (MOV rax, rax) |
193
+ | 4 bytes | 11,520 | 17.6% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
194
+ | 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals |
195
+
196
+ **Average token length**: 3.651 bytes
197
+
198
+ ---
199
+
200
+ ## Training Details
201
+
202
+ ### Dataset
203
+
204
+ **Source**: `/nas4/data/glaurung-data/binaries-small/`
205
+
206
+ - **Size**: 13 GB
207
+ - **Files**: 30,738 binaries
208
+ - **Content**: Real-world compiled binaries including system utilities, libraries, and applications
209
+
210
+ **Platform Distribution**:
211
+ - Linux: Alpine, Debian, Ubuntu (ELF format)
212
+ - Windows: 8, 10, 11 (PE format)
213
+
214
+ **Architecture Distribution**:
215
+ - x86-64 (primary)
216
+ - x86-32
217
+ - ARM64
218
+
219
+ ### Training Parameters
220
+
221
+ ```bash
222
+ cargo run --release --bin train -- \
223
+ --output glaurung-tokenizer-002.json \
224
+ /nas4/data/glaurung-data/binaries-small/ \
225
+ --vocab-size 65536 \
226
+ --min-frequency 4 \
227
+ --chunk-size 8192
228
+ ```
229
+
230
+ **Training Duration**: 8.46 hours on 24 cores
231
+ **Peak Memory**: 70 GB
232
+
233
+ ---
234
+
235
+ ## Use Cases
236
+
237
+ ### ✅ Recommended For
238
+
239
+ - Binary neural language models
240
+ - Malware analysis and classification
241
+ - Reverse engineering tools
242
+ - Binary similarity detection
243
+ - Code pattern recognition
244
+ - Vulnerability research
245
+ - Firmware analysis
246
+
247
+ ### ❌ Not Recommended For
248
+
249
+ - Text/source code (use text tokenizer like GPT-2, 100%+ penalty)
250
+ - Very small binaries <1KB (overhead too high)
251
+ - Real-time streaming (load time ~100ms)
252
+
253
+ ---
254
+
255
+ ## Comparison with Predecessor
256
+
257
+ | Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement |
258
+ |--------|---------------------|------------------------------|-------------|
259
+ | Vocabulary | 65,536 | 65,536 | Same |
260
+ | Training data | ~5GB mixed | 13GB binaries-small | 2.6x larger |
261
+ | bytes/token | ~2.6 | 2.849 | +9.6% |
262
+ | Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse |
263
+ | Architecture awareness | Basic | Advanced (instruction-aware) | Significant |
264
+ | Documentation | Basic | Comprehensive | Extensive |
265
+
266
+ **Key Improvements**:
267
+ - Larger, more diverse training corpus
268
+ - Better cross-platform coverage
269
+ - Instruction-boundary awareness
270
+ - Production-ready quality
271
+
272
+ ---
273
+
274
+ ## Advanced Usage
275
+
276
+ ### Batch Processing Multiple Files
277
+
278
+ ```python
279
+ from tokenizers import Tokenizer
280
+ from pathlib import Path
281
+ import numpy as np
282
+
283
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
284
+
285
+ def tokenize_binary_file(file_path):
286
+ """Tokenize a single binary file."""
287
+ raw_bytes = Path(file_path).read_bytes()
288
+ text = raw_bytes.decode('latin-1')
289
+ encoded = tokenizer.encode(text)
290
+ return {
291
+ 'file': file_path,
292
+ 'size_bytes': len(raw_bytes),
293
+ 'token_count': len(encoded.ids),
294
+ 'compression_ratio': len(raw_bytes) / len(encoded.ids),
295
+ 'token_ids': encoded.ids
296
+ }
297
+
298
+ # Process directory
299
+ binary_dir = Path("/usr/bin")
300
+ results = []
301
+ for binary_path in binary_dir.glob("*"):
302
+ if binary_path.is_file():
303
+ try:
304
+ result = tokenize_binary_file(binary_path)
305
+ results.append(result)
306
+ except Exception as e:
307
+ print(f"Error processing {binary_path}: {e}")
308
+
309
+ # Analyze compression statistics
310
+ compression_ratios = [r['compression_ratio'] for r in results]
311
+ print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
312
+ print(f"Std deviation: {np.std(compression_ratios):.3f}")
313
+ ```
314
+
315
+ ### Using with PyTorch/TensorFlow Models
316
+
317
+ ```python
318
+ from tokenizers import Tokenizer
319
+ import torch
320
+ from pathlib import Path
321
+
322
+ tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
323
+
324
+ def prepare_binary_for_model(file_path, max_length=512):
325
+ """Prepare binary data for neural network input."""
326
+ raw_bytes = Path(file_path).read_bytes()
327
+ text = raw_bytes.decode('latin-1')
328
+
329
+ # Tokenize
330
+ encoded = tokenizer.encode(text)
331
+ token_ids = encoded.ids
332
+
333
+ # Truncate or pad to max_length
334
+ if len(token_ids) > max_length:
335
+ token_ids = token_ids[:max_length]
336
+ else:
337
+ # Pad with token ID 0 (or use a dedicated padding token)
338
+ token_ids = token_ids + [0] * (max_length - len(token_ids))
339
+
340
+ # Convert to tensor
341
+ return torch.tensor(token_ids, dtype=torch.long)
342
+
343
+ # Use in model
344
+ binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
345
+ print(f"Tensor shape: {binary_tensor.shape}") # torch.Size([1024])
346
+ ```
347
+
348
+ ---
349
+
350
+ ## Troubleshooting
351
+
352
+ ### Issue: UnicodeDecodeError when processing binary
353
+
354
+ **Solution**: Always use `latin-1` encoding, never `utf-8`:
355
+ ```python
356
+ # ✅ Correct
357
+ text = raw_bytes.decode('latin-1')
358
+
359
+ # ❌ Wrong
360
+ text = raw_bytes.decode('utf-8') # Will fail on non-UTF-8 bytes
361
+ ```
362
+
363
+ ### Issue: Decoded output doesn't match original
364
+
365
+ **Cause**: BPE tokenizers add spaces between tokens during decoding.
366
+
367
+ **Solution**: Use the raw token IDs and decode manually if exact byte recovery is needed:
368
+ ```python
369
+ # Get tokens without spaces
370
+ tokens_no_spaces = ''.join(encoded.tokens)
371
+ original_bytes = tokens_no_spaces.encode('latin-1')
372
+ ```
373
+
374
+ ### Issue: Poor compression on specific binary types
375
+
376
+ **Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).
377
+
378
+ **Solution**: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.
379
+
380
+ ---
381
+
382
+ ## Related Projects
383
+
384
+ - **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
385
+ - **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
386
+ - **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
387
+
388
+ ---
389
+
390
+ ## Technical Architecture
391
+
392
+ ### Vocabulary Structure
393
+
394
+ - **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
395
+ - **Merged tokens**: 65,280 learned byte-pair combinations
396
+ - **Total**: 65,536 tokens (exactly 2^16)
397
+
398
+ ### Special Tokens
399
+
400
+ The tokenizer includes boundary markers for file-level segmentation:
401
+ - `<|start|>` (ID: 0)
402
+ - `<|end|>` (ID: 1)
403
+
404
+ These help models distinguish between concatenated files and identify file headers.
405
+
406
+ ### Token Properties
407
+
408
+ **Instruction-aware patterns** (x86-64 examples):
409
+ - REX prefixes: `0x48`, `0x4c`, `0x4d`
410
+ - Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
411
+ - ModR/M patterns: `0xc0`, `0x45`, `0x5d`
412
+
413
+ **Common patterns**:
414
+ - Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
415
+ - Alignment: `0x00 0x00 0x00 0x00`
416
+ - String terminators: `0x00` at word boundaries
417
+
418
+ ---
419
+
420
+ ## Performance Characteristics
421
+
422
+ ### Load Time
423
+
424
+ - **Tokenizer size**: 2.3 MB on disk
425
+ - **Load time**: ~100ms (cold), ~20ms (cached)
426
+ - **Memory footprint**: ~15 MB in RAM
427
+
428
+ ### Encoding Speed
429
+
430
+ On a modern CPU (tested on Intel i9-12900K):
431
+
432
+ | Operation | Speed |
433
+ |-----------|-------|
434
+ | Encode 1 MB binary | ~50 ms |
435
+ | Encode 10 MB binary | ~450 ms |
436
+ | Encode 100 MB binary | ~4.2 s |
437
+
438
+ **Throughput**: ~20-25 MB/second
439
+
440
+ ---
441
+
442
+ ## Limitations
443
+
444
+ 1. **Cross-domain penalty**: Using on text data causes 100-140% efficiency loss
445
+ 2. **Small file overhead**: Files <1KB have proportionally higher tokenization overhead
446
+ 3. **Deterministic decoding**: Spaces inserted between tokens during decode (BPE behavior)
447
+ 4. **Architecture bias**: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.
448
+
449
+ ---
450
+
451
+ ## Citation
452
+
453
+ If you use this tokenizer in research, please cite:
454
+
455
+ ```
456
+ Glaurung Binary Tokenizer 001
457
+ 64K Binary Tokenizer for Neural Language Models
458
+ Vocabulary: 65,536 tokens (exactly 2^16)
459
+ Training: October 2025
460
+ Dataset: 13GB binaries-small (30,738 files)
461
+ Performance: 2.849 bytes/token (86% of theoretical optimum)
462
+ HuggingFace: mjbommar/glaurung-binary-tokenizer-001
463
+ ```
464
+
465
+ ---
466
+
467
+ ## License
468
+
469
+ Apache License 2.0
470
+
471
+ This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.
472
+
473
+ ---
474
+
475
+ ## Support & Issues
476
+
477
+ - **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
478
+ - **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
479
+ - **Email**: Contact maintainer via GitHub
480
+
481
+ ---
482
+
483
+ **Production Status**: ✅ Ready for deployment
484
+ **Version**: 1.0.0
485
+ **Release Date**: October 2025
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff