mjbommar committed · Commit c3ffebd · verified · 1 Parent(s): 9dda4b4

Add README

Files changed (1): README.md +462 -0

README.md ADDED

# Glaurung Binary Tokenizer 001

**Production-ready 64K-vocabulary BPE tokenizer for binary executables**

---

## Overview

**Glaurung Binary Tokenizer 001** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, x86-32, ARM64) and binary formats (Linux ELF, Windows PE). It is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005).

### Key Specifications

- **Vocabulary Size**: 65,536 tokens (exactly 2^16 = 64K)
- **Compression**: 2.849 bytes/token on average
- **Training Data**: 13 GB corpus, 30,738 binaries
- **Architectures**: x86-64, x86-32, ARM64
- **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)

### Performance Highlights

- **9-10% better compression** than a 32K-vocabulary baseline
- **86% of the theoretical maximum** compression efficiency
- **Instruction-aware**: captures complete x86-64 instructions (REX + opcode + ModR/M)
- **String-rich**: 5.76% of the vocabulary contains function names, paths, and library references

---

## Installation

```bash
pip install tokenizers transformers
```

---

## Quick Start

### Method 1: Using the tokenizers library (Recommended)

```python
from pathlib import Path

from tokenizers import Tokenizer

# Load the tokenizer directly from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Process binary data - MUST use latin-1 encoding
binary_path = Path("/usr/bin/ls")
raw_bytes = binary_path.read_bytes()
text = raw_bytes.decode('latin-1')  # Convert bytes to a latin-1 string

# Tokenize
encoded = tokenizer.encode(text)
tokens = encoded.ids

print(f"File size: {len(raw_bytes):,} bytes")
print(f"Tokens: {len(tokens):,}")
print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

# Decode back to text (note: spaces are inserted between tokens due to BPE decoding behavior)
decoded = tokenizer.decode(tokens)
```

**Expected Output** (for `/usr/bin/ls`):
```
File size: 142,144 bytes
Tokens: 49,574
Compression: 2.866 bytes/token
```

### Method 2: Using the transformers library

```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Wrap it in PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("/usr/bin/ls", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')

# Tokenize (returns a dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```

---

## Important: Data Format

The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:

```python
# ✅ CORRECT - use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# ❌ WRONG - do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # Will not tokenize correctly
```

**Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
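
A one-line sanity check confirms the round trip is lossless for all 256 byte values:

```python
# Every possible byte value survives bytes -> latin-1 str -> bytes unchanged
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes
```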

---

## Performance Benchmarks

### Compression on Real-World Binaries

Tested on `/usr/bin` binaries (not in the training set):

| Binary | Size | Tokens | bytes/token |
|--------|------|--------|-------------|
| bash | 1.38 MB | 535,541 | 2.698 |
| python3.12 | 7.65 MB | 2,801,226 | 2.863 |
| gcc-13 | 0.98 MB | 344,201 | 2.986 |
| ls | 0.14 MB | 49,574 | 2.866 |
| grep | 0.18 MB | 67,567 | 2.667 |

**Average**: 2.849 bytes/token

### Information-Theoretic Efficiency

- Binary entropy: ~6.5 bits/byte
- Theoretical optimum: 2.46 bytes/token
- Measured performance: 2.849 bytes/token
- **Efficiency: 86%** of the theoretical optimum

---

## Example: Tokenizing an ELF Header

```python
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# ELF header bytes
elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
text = elf_header.decode('latin-1')

# Tokenize
encoded = tokenizer.encode(text)
print(f"Original bytes: {elf_header.hex()}")
print(f"Tokens: {encoded.ids}")
print(f"Token count: {len(encoded.ids)}")
print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

# Examine individual tokens
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    token_bytes = token_str.encode('latin-1')
    print(f"  Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
```

---

## Token Distribution

| Length | Count | Percentage | Examples |
|--------|-------|------------|----------|
| 2 bytes | 31,528 | 48.3% | `0x48 0x8b` (REX.W + MOV), `0xcc 0xcc` (int3 padding) |
| 3 bytes | 9,261 | 14.2% | `0x48 0x8b 0xc0` (MOV rax, rax) |
| 4 bytes | 11,520 | 17.6% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
| 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals |

**Average token length**: 3.651 bytes
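
The distribution above can be reproduced from the released vocabulary itself. A minimal sketch, assuming token strings are stored as raw latin-1 characters (so string length equals byte length) and skipping the two special tokens:

```python
from collections import Counter

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Token length in latin-1 characters == token length in bytes
special = {"<|start|>", "<|end|>"}
lengths = Counter(len(t) for t in tokenizer.get_vocab() if t not in special)
for n in sorted(lengths):
    print(f"{n:2d} bytes: {lengths[n]:6,d} tokens")
```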

---

## Training Details

### Dataset

**Source**: `/nas4/data/glaurung-data/binaries-small/`

- **Size**: 13 GB
- **Files**: 30,738 binaries
- **Content**: Real-world compiled binaries, including system utilities, libraries, and applications

**Platform Distribution**:
- Linux: Alpine, Debian, Ubuntu (ELF format)
- Windows: 8, 10, 11 (PE format)

**Architecture Distribution**:
- x86-64 (primary)
- x86-32
- ARM64

### Training Parameters

```bash
cargo run --release --bin train -- \
    --output glaurung-tokenizer-002.json \
    /nas4/data/glaurung-data/binaries-small/ \
    --vocab-size 65536 \
    --min-frequency 4 \
    --chunk-size 8192
```

**Training Duration**: 8.46 hours on 24 cores
**Peak Memory**: 70 GB

---

## Use Cases

### ✅ Recommended For

- Binary neural language models
- Malware analysis and classification
- Reverse engineering tools
- Binary similarity detection
- Code pattern recognition
- Vulnerability research
- Firmware analysis

### ❌ Not Recommended For

- Text or source code (use a text tokenizer such as GPT-2's; expect a 100%+ efficiency penalty here)
- Very small binaries (<1 KB), where tokenization overhead is proportionally high
- Real-time streaming (load time is ~100 ms)

---

## Comparison with Predecessor

| Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement |
|--------|---------------------|------------------------------|-------------|
| Vocabulary | 65,536 | 65,536 | Same |
| Training data | ~5 GB mixed | 13 GB binaries-small | 2.6x larger |
| bytes/token | ~2.6 | 2.849 | +9.6% |
| Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse |
| Architecture awareness | Basic | Advanced (instruction-aware) | Significant |
| Documentation | Basic | Comprehensive | Extensive |

**Key Improvements**:
- Larger, more diverse training corpus
- Better cross-platform coverage
- Instruction-boundary awareness
- Production-ready quality

---

## Advanced Usage

### Batch Processing Multiple Files

```python
from pathlib import Path

import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def tokenize_binary_file(file_path):
    """Tokenize a single binary file."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    encoded = tokenizer.encode(text)
    return {
        'file': file_path,
        'size_bytes': len(raw_bytes),
        'token_count': len(encoded.ids),
        'compression_ratio': len(raw_bytes) / len(encoded.ids),
        'token_ids': encoded.ids,
    }

# Process a directory
binary_dir = Path("/usr/bin")
results = []
for binary_path in binary_dir.glob("*"):
    if binary_path.is_file():
        try:
            results.append(tokenize_binary_file(binary_path))
        except Exception as e:
            print(f"Error processing {binary_path}: {e}")

# Analyze compression statistics
compression_ratios = [r['compression_ratio'] for r in results]
print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
print(f"Std deviation: {np.std(compression_ratios):.3f}")
```
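
For large directories, the library's `encode_batch` tokenizes many inputs in a single call and parallelizes internally; a short sketch (the 16-file cap is arbitrary):

```python
from pathlib import Path

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# encode_batch amortizes per-call overhead across many inputs
paths = [p for p in Path("/usr/bin").glob("*") if p.is_file()][:16]
texts = [p.read_bytes().decode("latin-1") for p in paths]
for path, enc in zip(paths, tokenizer.encode_batch(texts)):
    print(f"{path.name}: {len(enc.ids):,} tokens")
```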

### Using with PyTorch/TensorFlow Models

```python
from pathlib import Path

import torch
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def prepare_binary_for_model(file_path, max_length=512):
    """Prepare binary data for neural network input."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')

    # Tokenize
    encoded = tokenizer.encode(text)
    token_ids = encoded.ids

    # Truncate or pad to max_length
    if len(token_ids) > max_length:
        token_ids = token_ids[:max_length]
    else:
        # Pad with token ID 0 (note: ID 0 is <|start|> here; a dedicated padding token may be preferable)
        token_ids = token_ids + [0] * (max_length - len(token_ids))

    # Convert to a tensor
    return torch.tensor(token_ids, dtype=torch.long)

# Use in a model
binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
print(f"Tensor shape: {binary_tensor.shape}")  # torch.Size([1024])
```

---

## Troubleshooting

### Issue: UnicodeDecodeError when processing a binary

**Solution**: Always decode with `latin-1`, never `utf-8`:
```python
# ✅ Correct
text = raw_bytes.decode('latin-1')

# ❌ Wrong
text = raw_bytes.decode('utf-8')  # Fails on non-UTF-8 byte sequences
```

### Issue: Decoded output doesn't match the original

**Cause**: BPE tokenizers insert spaces between tokens during decoding.

**Solution**: If exact byte recovery is needed, join the token strings yourself instead of calling `decode`:
```python
# Reassemble the tokens without the inserted spaces
tokens_no_spaces = ''.join(encoded.tokens)
original_bytes = tokens_no_spaces.encode('latin-1')
```
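
If only the token IDs were kept (not the `Encoding` object), the same reconstruction works via `id_to_token`; a small sketch, assuming no special tokens appear in the ID list:

```python
def ids_to_bytes(tokenizer, token_ids):
    """Reconstruct the original bytes from token IDs alone."""
    # id_to_token returns the latin-1 token string for each ID
    return ''.join(tokenizer.id_to_token(i) for i in token_ids).encode('latin-1')
```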

### Issue: Poor compression on specific binary types

**Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., Python .pyc or Java .class bytecode).

**Solution**: Consider a domain-specific tokenizer for such formats, or use this one as a general-purpose baseline.

---

## Related Projects

- **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - earlier 64K binary tokenizer
- **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - binary analysis framework
- **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - binary embedding models and tokenizers

---

## Technical Architecture

### Vocabulary Structure

- **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
- **Merged tokens**: 65,280 learned byte-pair combinations
- **Total**: 65,536 tokens (exactly 2^16)

### Special Tokens

The tokenizer includes boundary markers for file-level segmentation:
- `<|start|>` (ID: 0)
- `<|end|>` (ID: 1)

These help models distinguish between concatenated files and identify file headers.
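
When concatenating multiple binaries into a single training stream, the markers can be attached by ID. A minimal sketch, assuming the tokenizer's post-processor does not already insert them:

```python
from pathlib import Path

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Look the IDs up rather than hard-coding 0 and 1
start_id = tokenizer.token_to_id("<|start|>")
end_id = tokenizer.token_to_id("<|end|>")

def encode_file(path):
    """Encode one file as <|start|> ... <|end|> for a concatenated stream."""
    text = Path(path).read_bytes().decode("latin-1")
    return [start_id] + tokenizer.encode(text).ids + [end_id]

stream = encode_file("/usr/bin/ls") + encode_file("/usr/bin/grep")
```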

### Token Properties

**Instruction-aware patterns** (x86-64 examples):
- REX prefixes: `0x48`, `0x4c`, `0x4d`
- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
- ModR/M patterns: `0xc0`, `0x45`, `0x5d`

**Common patterns**:
- Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
- Alignment: `0x00 0x00 0x00 0x00`
- String terminators: `0x00` at word boundaries

---

## Performance Characteristics

### Load Time

- **Tokenizer size**: 2.3 MB on disk
- **Load time**: ~100 ms (cold), ~20 ms (cached)
- **Memory footprint**: ~15 MB in RAM

### Encoding Speed

On a modern CPU (tested on an Intel i9-12900K):

| Operation | Speed |
|-----------|-------|
| Encode 1 MB binary | ~50 ms |
| Encode 10 MB binary | ~450 ms |
| Encode 100 MB binary | ~4.2 s |

**Throughput**: ~20-25 MB/second
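
These figures can be sanity-checked on local hardware with a quick timing loop; a sketch using only the standard library (the sample path is arbitrary):

```python
import time
from pathlib import Path

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Time one encode; results vary with CPU, file contents, and cache state
raw_bytes = Path("/usr/bin/bash").read_bytes()
text = raw_bytes.decode("latin-1")

start = time.perf_counter()
tokenizer.encode(text)
elapsed = time.perf_counter() - start
print(f"{len(raw_bytes) / elapsed / 1e6:.1f} MB/s")
```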

---

## Limitations

1. **Cross-domain penalty**: applying it to text data costs 100-140% in efficiency
2. **Small-file overhead**: files <1 KB have proportionally higher tokenization overhead
3. **Decoding artifacts**: the default decoder inserts spaces between tokens (standard BPE behavior; see Troubleshooting)
4. **Architecture bias**: trained primarily on x86-64; may be less effective for RISC-V, MIPS, etc.

---

## Citation

If you use this tokenizer in research, please cite:

```
Glaurung Binary Tokenizer 001
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (exactly 2^16)
Training: October 2025
Dataset: 13GB binaries-small (30,738 files)
Performance: 2.849 bytes/token (86% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-001
```

---

## License

For research and educational purposes. See [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) for license details.

---

## Support & Issues

- **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
- **Documentation**: full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
- **Email**: contact the maintainer via GitHub

---

**Production Status**: ✅ Ready for deployment
**Version**: 1.0.0
**Release Date**: October 2025