mjbommar committed
Commit 40f6ac5 · verified · 1 Parent(s): 9654e3e

Remove backup file

Files changed (1)
  1. README.md~ +0 -485
README.md~ DELETED
@@ -1,485 +0,0 @@
---
language:
- code
license: apache-2.0
tags:
- binary-analysis
- tokenizer
- bpe
- malware-analysis
- reverse-engineering
- security
- x86-64
- arm64
- elf
- pe
library_name: tokenizers
pipeline_tag: feature-extraction
---

# Glaurung Binary Tokenizer 001

**Production-ready 64K vocabulary BPE tokenizer for binary executables**

🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)

---

## Overview

**Glaurung Binary Tokenizer 001** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005).

### Key Specifications

- **Vocabulary Size**: 65,536 tokens (exactly 2^16 = 64K)
- **Compression**: 2.849 bytes/token average
- **Training Data**: 13 GB corpus, 30,738 binaries
- **Architectures**: x86-64, x86-32, ARM64
- **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)

### Performance Highlights

- **9-10% better compression** than a 32K-vocabulary baseline
- **86% of theoretical maximum** compression efficiency
- **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
- **String-rich**: 5.76% of vocabulary contains function names, paths, library references

---

## Installation

```bash
pip install tokenizers transformers
```

---

## Quick Start

### Method 1: Using the tokenizers library (Recommended)

```python
from tokenizers import Tokenizer
from pathlib import Path

# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Process binary data - MUST use latin-1 encoding
binary_path = Path("/usr/bin/ls")
raw_bytes = binary_path.read_bytes()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string

# Tokenize
encoded = tokenizer.encode(text)
tokens = encoded.ids

print(f"File size: {len(raw_bytes):,} bytes")
print(f"Tokens: {len(tokens):,}")
print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

# Decode back to text (note: adds spaces between tokens due to BPE behavior)
decoded = tokenizer.decode(tokens)
```

**Expected Output** (for `/usr/bin/ls`):
```
File size: 142,144 bytes
Tokens: 49,574
Compression: 2.866 bytes/token
```

### Method 2: Using the transformers library

```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("/usr/bin/ls", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')

# Tokenize (returns dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```

---

## Important: Data Format

The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:

```python
# ✅ CORRECT - Use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# ❌ WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # Will not work correctly
```

**Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
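
A quick way to convince yourself of this property:

```python
# Every byte value maps to exactly one latin-1 character, and back again
data = bytes(range(256))
text = data.decode('latin-1')
assert len(text) == 256
assert text.encode('latin-1') == data  # lossless round trip
```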

---

## Performance Benchmarks

### Compression on Real-World Binaries

Tested on `/usr/bin` binaries (not in the training set):

| Binary | Size | Tokens | bytes/token |
|--------|------|--------|-------------|
| bash | 1.38 MB | 535,541 | 2.698 |
| python3.12 | 7.65 MB | 2,801,226 | 2.863 |
| gcc-13 | 0.98 MB | 344,201 | 2.986 |
| ls | 0.14 MB | 49,574 | 2.866 |
| grep | 0.18 MB | 67,567 | 2.667 |

**Average**: 2.849 bytes/token

### Information-Theoretic Efficiency

- Binary entropy: ~6.5 bits/byte
- Theoretical optimum: 16 bits/token ÷ 6.5 bits/byte ≈ 2.46 bytes/token
- Measured performance: 2.849 bytes/token
- **Efficiency: 86%** of the theoretical optimum (2.46 / 2.849 ≈ 0.86)

---

## Example: Tokenizing an ELF Header

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# ELF header bytes
elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
text = elf_header.decode('latin-1')

# Tokenize
encoded = tokenizer.encode(text)
print(f"Original bytes: {elf_header.hex()}")
print(f"Tokens: {encoded.ids}")
print(f"Token count: {len(encoded.ids)}")
print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

# Examine individual tokens
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    token_bytes = token_str.encode('latin-1')
    print(f"  Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
```

---

## Token Distribution

| Length | Count | Percentage | Examples |
|--------|-------|------------|----------|
| 2 bytes | 31,528 | 48.3% | `0x48 0x8b` (REX.W + MOV opcode), `0xcc 0xcc` (int3 padding) |
| 3 bytes | 9,261 | 14.2% | `0x48 0x8b 0xc0` (MOV rax, rax) |
| 4 bytes | 11,520 | 17.6% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
| 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals |

**Average token length**: 3.651 bytes

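The distribution above can be recomputed from the published vocabulary. A minimal sketch, assuming every non-special token is a latin-1 string as described earlier in this README:

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Map each vocabulary entry back to bytes via latin-1 and tally lengths
lengths = Counter()
for token in tokenizer.get_vocab():
    if token in ("<|start|>", "<|end|>"):
        continue  # special tokens are markers, not byte sequences
    lengths[len(token.encode('latin-1'))] += 1

for n, count in sorted(lengths.items()):
    print(f"{n}-byte tokens: {count:,}")
```
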
---

## Training Details

### Dataset

**Source**: `/nas4/data/glaurung-data/binaries-small/`

- **Size**: 13 GB
- **Files**: 30,738 binaries
- **Content**: Real-world compiled binaries including system utilities, libraries, and applications

**Platform Distribution**:
- Linux: Alpine, Debian, Ubuntu (ELF format)
- Windows: 8, 10, 11 (PE format)

**Architecture Distribution**:
- x86-64 (primary)
- x86-32
- ARM64

### Training Parameters

```bash
cargo run --release --bin train -- \
    --output glaurung-tokenizer-002.json \
    /nas4/data/glaurung-data/binaries-small/ \
    --vocab-size 65536 \
    --min-frequency 4 \
    --chunk-size 8192
```

**Training Duration**: 8.46 hours on 24 cores
**Peak Memory**: 70 GB

---

## Use Cases

### ✅ Recommended For

- Binary neural language models
- Malware analysis and classification
- Reverse engineering tools
- Binary similarity detection (see the sketch below)
- Code pattern recognition
- Vulnerability research
- Firmware analysis

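As an example of the similarity use case, here is a crude token-set baseline (a sketch; `/usr/bin/dir` is just a stand-in for any second binary on your system):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def token_set(path):
    """Return the set of token IDs appearing in a binary."""
    text = Path(path).read_bytes().decode('latin-1')
    return set(tokenizer.encode(text).ids)

a, b = token_set("/usr/bin/ls"), token_set("/usr/bin/dir")
print(f"Token-set Jaccard similarity: {len(a & b) / len(a | b):.3f}")
```
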
### ❌ Not Recommended For

- Text or source code: use a text tokenizer such as GPT-2's; this tokenizer pays a 100%+ efficiency penalty on text
- Very small binaries (<1 KB): fixed tokenization overhead is proportionally high
- Real-time streaming: tokenizer load time is ~100 ms

---

## Comparison with Predecessor

| Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement |
|--------|---------------------|------------------------------|-------------|
| Vocabulary | 65,536 | 65,536 | Same |
| Training data | ~5 GB mixed | 13 GB binaries-small | 2.6x larger |
| bytes/token | ~2.6 | 2.849 | +9.6% |
| Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse |
| Architecture awareness | Basic | Advanced (instruction-aware) | Significant |
| Documentation | Basic | Comprehensive | Extensive |

**Key Improvements**:
- Larger, more diverse training corpus
- Better cross-platform coverage
- Instruction-boundary awareness
- Production-ready quality

---

## Advanced Usage

### Batch Processing Multiple Files

```python
from tokenizers import Tokenizer
from pathlib import Path
import numpy as np

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def tokenize_binary_file(file_path):
    """Tokenize a single binary file."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    encoded = tokenizer.encode(text)
    return {
        'file': file_path,
        'size_bytes': len(raw_bytes),
        'token_count': len(encoded.ids),
        'compression_ratio': len(raw_bytes) / len(encoded.ids),
        'token_ids': encoded.ids
    }

# Process directory
binary_dir = Path("/usr/bin")
results = []
for binary_path in binary_dir.glob("*"):
    if binary_path.is_file():
        try:
            result = tokenize_binary_file(binary_path)
            results.append(result)
        except Exception as e:
            print(f"Error processing {binary_path}: {e}")

# Analyze compression statistics
compression_ratios = [r['compression_ratio'] for r in results]
print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
print(f"Std deviation: {np.std(compression_ratios):.3f}")
```

### Using with PyTorch/TensorFlow Models

```python
from tokenizers import Tokenizer
import torch
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def prepare_binary_for_model(file_path, max_length=512):
    """Prepare binary data for neural network input."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')

    # Tokenize
    encoded = tokenizer.encode(text)
    token_ids = encoded.ids

    # Truncate or pad to max_length
    if len(token_ids) > max_length:
        token_ids = token_ids[:max_length]
    else:
        # Pad with token ID 0 (or use a dedicated padding token)
        token_ids = token_ids + [0] * (max_length - len(token_ids))

    # Convert to tensor
    return torch.tensor(token_ids, dtype=torch.long)

# Use in a model
binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
print(f"Tensor shape: {binary_tensor.shape}")  # torch.Size([1024])
```

---

## Troubleshooting

### Issue: UnicodeDecodeError when processing a binary

**Solution**: Always use `latin-1` encoding, never `utf-8`:
```python
# ✅ Correct
text = raw_bytes.decode('latin-1')

# ❌ Wrong
text = raw_bytes.decode('utf-8')  # Will fail on non-UTF-8 bytes
```

### Issue: Decoded output doesn't match the original

**Cause**: BPE tokenizers insert spaces between tokens during decoding.

**Solution**: Join the token strings yourself if exact byte recovery is needed:
```python
# Get tokens without spaces
tokens_no_spaces = ''.join(encoded.tokens)
original_bytes = tokens_no_spaces.encode('latin-1')
```
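
To verify exact recovery end to end (this should hold as long as no special tokens are added during encoding):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

raw = Path("/usr/bin/ls").read_bytes()
encoded = tokenizer.encode(raw.decode('latin-1'))
recovered = ''.join(encoded.tokens).encode('latin-1')
assert recovered == raw  # byte-for-byte identical
```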

### Issue: Poor compression on specific binary types

**Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., Python `.pyc` or Java `.class` bytecode).

**Solution**: Consider domain-specific tokenizers for specialized formats, or use this one as a general-purpose baseline.

---

## Related Projects

- **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
- **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
- **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers

---

## Technical Architecture

### Vocabulary Structure

- **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
- **Merged tokens**: 65,280 learned byte-pair combinations
- **Total**: 65,536 tokens (exactly 2^16)
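
You can check the total directly:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
print(tokenizer.get_vocab_size())  # 65536
```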

### Special Tokens

The tokenizer includes boundary markers for file-level segmentation:
- `<|start|>` (ID: 0)
- `<|end|>` (ID: 1)

These help models distinguish between concatenated files and identify file headers.
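
A minimal sketch of framing one file with these markers, e.g. when concatenating files for pretraining (looking the IDs up by name is more robust than hard-coding 0 and 1):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

start_id = tokenizer.token_to_id("<|start|>")  # 0
end_id = tokenizer.token_to_id("<|end|>")      # 1

text = Path("/usr/bin/ls").read_bytes().decode('latin-1')
ids = [start_id] + tokenizer.encode(text).ids + [end_id]
```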

### Token Properties

**Instruction-aware patterns** (x86-64 examples):
- REX prefixes: `0x48`, `0x4c`, `0x4d`
- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
- ModR/M patterns: `0xc0`, `0x45`, `0x5d`

**Common patterns**:
- Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
- Alignment: `0x00 0x00 0x00 0x00`
- String terminators: `0x00` at word boundaries
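
To see such patterns in practice, encode a short instruction sequence; the exact token boundaries depend on the learned merges, so treat the output as illustrative:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# MOV rax, rax (48 8b c0) followed by int3 padding (cc cc)
snippet = b'\x48\x8b\xc0\xcc\xcc'
encoded = tokenizer.encode(snippet.decode('latin-1'))
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    print(f"{token_id:5d}  {token_str.encode('latin-1').hex()}")
```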

---

## Performance Characteristics

### Load Time

- **Tokenizer size**: 2.3 MB on disk
- **Load time**: ~100 ms (cold), ~20 ms (cached)
- **Memory footprint**: ~15 MB in RAM

### Encoding Speed

On a modern CPU (tested on an Intel i9-12900K):

| Operation | Speed |
|-----------|-------|
| Encode 1 MB binary | ~50 ms |
| Encode 10 MB binary | ~450 ms |
| Encode 100 MB binary | ~4.2 s |

**Throughput**: ~20-25 MB/second
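
A simple way to measure throughput on your own hardware (numbers will vary with CPU and file content):

```python
import time
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

data = Path("/usr/bin/ls").read_bytes()  # substitute any large binary
text = data.decode('latin-1')

start = time.perf_counter()
encoded = tokenizer.encode(text)
elapsed = time.perf_counter() - start
print(f"{len(data) / elapsed / 1e6:.1f} MB/s, {len(encoded.ids):,} tokens")
```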

---

## Limitations

1. **Cross-domain penalty**: Using it on text data costs 100-140% in efficiency
2. **Small-file overhead**: Files under 1 KB have proportionally higher tokenization overhead
3. **Decode spacing**: Spaces are inserted between tokens during decode (standard BPE behavior); join `encoded.tokens` for exact byte recovery
4. **Architecture bias**: Trained primarily on x86-64; may be less effective on RISC-V, MIPS, etc.

---

## Citation

If you use this tokenizer in research, please cite:

```
Glaurung Binary Tokenizer 001
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (exactly 2^16)
Training: October 2025
Dataset: 13GB binaries-small (30,738 files)
Performance: 2.849 bytes/token (86% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-001
```

---

## License

Apache License 2.0

This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.

---

## Support & Issues

- **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
- **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
- **Email**: Contact the maintainer via GitHub

---

**Production Status**: ✅ Ready for deployment
**Version**: 1.0.0
**Release Date**: October 2025