mjbommar committed · verified
Commit f709006 · Parent: 124c8c2

Upload binary-tokenizer-001-32k tokenizer

Files changed (4)
  1. .gitattributes +2 -35
  2. README.md +217 -0
  3. analysis_results.json +141 -0
  4. tokenizer.json +0 -0
.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,217 @@
---
language:
- code
tags:
- tokenizer
- binary-analysis
- binary-tokenization
- bpe
- byte-pair-encoding
- reverse-engineering
- malware-analysis
- cybersecurity
- executable-analysis
license: mit
pipeline_tag: feature-extraction
library_name: tokenizers
---

# binary-tokenizer-001-32k

A cross-platform BPE tokenizer for binary executables and machine code. Trained on ~13 GB of diverse binaries spanning Linux, Windows, macOS, and Android platforms.

**🔗 Model**: [`mjbommar/binary-tokenizer-001-32k`](https://huggingface.co/mjbommar/binary-tokenizer-001-32k)
**📊 Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
**📄 Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)

## Overview

- **Vocabulary Size**: 32,768 tokens (2^15)
- **Token Composition**: 256 base bytes + 32,505 learned merges + 7 special tokens
- **Average Token Length**: 3.812 bytes
- **3-byte Instructions**: 19.5% of vocabulary (6,380 tokens)
- **Compression Ratio**: ~2.7 bytes/token on typical binaries

---
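As a quick sanity check, the composition figures above add up exactly to the power-of-two vocabulary size (plain Python, using only the numbers quoted in this section):

```python
# Token composition as quoted in the Overview above.
base_bytes = 256         # one base token per possible byte value
learned_merges = 32_505  # BPE merges learned during training
special_tokens = 7       # <|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>

total = base_bytes + learned_merges + special_tokens
assert total == 32_768 == 2 ** 15  # power-of-2 vocabulary (2^15)
print(total)  # → 32768
```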

## Training Configuration

**Training Corpus**:
- Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
- Size: ~13 GB
- Files: 30,738 binary files
- Platforms: Linux (ELF), Windows (PE), macOS (Mach-O), Android (APK)
- Architectures: x86-64, x86, ARM64, ARM, MIPS, RISC-V

**Training Parameters**:
- Vocabulary size: 32,768 (including 7 special tokens)
- Min frequency: 10
- Chunk size: 8,192 bytes
- Allowed lengths: DEFAULT (1-16 bytes)
- Training duration: ~6-7 hours

---
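The actual training used the tooling described above; purely as an illustration of what byte-level BPE does to machine code, here is a minimal, stdlib-only sketch of a single merge round (a toy, not the real training loop):

```python
from collections import Counter

def bpe_merge_round(sequences):
    """One round of byte-level BPE: find the most frequent adjacent
    token pair across all sequences and merge it everywhere.
    Tokens are represented as bytes objects."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    (a, b), _ = pairs.most_common(1)[0]
    merged = a + b
    out = []
    for seq in sequences:
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)  # replace the pair with one token
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        out.append(new_seq)
    return out, merged

# Start from single-byte tokens of a tiny "binary": three x86-64
# function prologues (55 = push rbp, 48 89 e5 = mov rbp, rsp).
blob = bytes.fromhex("554889e5554889e5554889e5")
seqs = [[bytes([b]) for b in blob]]
seqs, merged = bpe_merge_round(seqs)
print(merged.hex(), len(seqs[0]))  # one frequent pair becomes a 2-byte token
```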

## Vocabulary Statistics

**Composition**:
- Base bytes (0-255): 256 tokens
- Learned merges: 32,505 tokens
- Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- **Total**: 32,768 tokens

**Quality Metrics**:
- All tokens reachable: ✓ Yes
- Valid merges: 32,505 / 32,505
- Power-of-2 size: ✓ Yes (2^15)

---

## Token Length Distribution

| Length | Count | Percentage | Description |
|--------|-------|------------|-------------|
| 1 byte | 256 | 0.8% | Base bytes |
| 2 bytes | 13,428 | 41.0% | Byte pairs |
| 3 bytes | 6,380 | 19.5% | Complete x86-64 instructions |
| 4 bytes | 6,236 | 19.0% | Instructions with operands |
| 5 bytes | 1,763 | 5.4% | Complex patterns |
| 6 bytes | 1,395 | 4.3% | Complex patterns |
| 7 bytes | 676 | 2.1% | Complex patterns |
| 8 bytes | 963 | 2.9% | Complex patterns |
| 9+ bytes | 1,663 | 5.1% | Long patterns |

**Average Token Length**: 3.812 bytes

---
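The average above can be reproduced from the per-length counts in `analysis_results.json` (values copied from that file):

```python
# Token length distribution copied from analysis_results.json.
dist = {1: 256, 2: 13428, 3: 6380, 4: 6236, 5: 1763, 6: 1395, 7: 676,
        8: 963, 9: 191, 10: 220, 11: 109, 12: 318, 13: 86, 14: 102,
        15: 69, 16: 233, 17: 26, 18: 31, 19: 23, 20: 58, 21: 16,
        22: 16, 23: 19, 24: 44, 25: 6, 26: 7, 27: 8, 28: 13, 29: 7,
        30: 4, 31: 3, 32: 54}

# Length-weighted mean over all non-special tokens.
avg = sum(n * c for n, c in dist.items()) / sum(dist.values())
print(f"{avg:.3f}")  # → 3.812, matching the average reported above
```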

## Byte Content Analysis

**Content Categories**:
- Contains NULL byte (0x00): 8,350 tokens (25.5%)
- ASCII printable (0x20-0x7E): 6,460 tokens (19.7%)
- All ASCII (<0x80): 13,796 tokens (42.1%)
- High bytes (≥0x80): 18,964 tokens (57.9%)

**Most Common Bytes in Tokens**:
- `0x00` (NULL): 20,462 occurrences - Padding and alignment
- `0xFF`: 3,502 occurrences - Sentinel values
- `0x48` (REX.W): 2,883 occurrences - x86-64 REX prefix
- `0x8B` (MOV): 1,934 occurrences - x86-64 MOV opcode
- `0xCC` (INT3): 1,366 occurrences - Debug breakpoint padding

---
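The categories above are predicates over a token's raw bytes. A small sketch of how such a classification could be computed (a hypothetical helper for illustration, not the repo's `analyze_tokenizer.py`):

```python
def classify(token: bytes) -> set:
    """Assign a token's bytes to content categories like those above."""
    cats = set()
    if 0x00 in token:
        cats.add("contains-null")          # NULL byte present
    if all(0x20 <= b <= 0x7E for b in token):
        cats.add("ascii-printable")        # every byte is printable ASCII
    if all(b < 0x80 for b in token):
        cats.add("all-ascii")              # no byte has the high bit set
    if any(b >= 0x80 for b in token):
        cats.add("has-high-byte")          # at least one byte >= 0x80
    return cats

print(classify(b"\x48\x8b"))    # REX.W prefix + MOV opcode bytes
print(classify(b"/lib64/ld-"))  # an embedded path fragment
```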

## Sequence Coverage

**N-byte Sequence Diversity**:

| Length | Learned Tokens | Possible Sequences | Coverage |
|--------|----------------|--------------------|----------|
| 1-byte | 256 | 256 | 100.00% |
| 2-byte | 13,428 | 65,536 | 20.49% |
| 3-byte | 6,380 | 16,777,216 | 0.038% |
| 4-byte | 6,236 | 4,294,967,296 | 0.00015% |

---
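Coverage in the table is simply the learned token count divided by the 256^n possible n-byte sequences:

```python
# Learned token counts per length, from the table above.
learned = {1: 256, 2: 13_428, 3: 6_380, 4: 6_236}

for n, count in learned.items():
    coverage = 100 * count / 256 ** n  # percent of all possible n-byte strings
    print(f"{n}-byte: {coverage:.5f}% of {256 ** n:,} possible sequences")
```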

## Files

- `tokenizer-32768.json` - Trained tokenizer model (2.5 MB)
- `analysis_results.json` - Detailed analysis statistics
- `training.log` - Training output log (if available)
- `training_stats.txt` - Training summary (if available)

---

## Usage

**Load from HuggingFace Hub**:
```python
from tokenizers import Tokenizer

# Load directly from HuggingFace
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-32k")
```

**Command-line usage (bbpe CLI)**:
```bash
# With bbpe CLI
bbpe encode --tokenizer tokenizer-32768.json /path/to/binary
bbpe info tokenizer-32768.json
```

**Complete Python Example**:
```python
from tokenizers import Tokenizer

# Load from HuggingFace or local file
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-32k")
# OR: tokenizer = Tokenizer.from_file("tokenizer-32768.json")

# Read binary file and decode as latin-1 (maps every byte value 0-255
# to a distinct character, so the file round-trips losslessly)
with open("/usr/bin/ls", "rb") as f:
    data = f.read()
data_str = data.decode("latin-1")

# Encode the binary data
encoding = tokenizer.encode(data_str)
print(f"File size: {len(data)} bytes")
print(f"Total tokens: {len(encoding.ids)}")
print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")

# First 10 tokens
for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
    token_bytes = token.encode("latin-1")
    print(f"  Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")

# Decode tokens back to bytes
decoded_str = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_str.encode("latin-1")
assert decoded_bytes == data  # Perfect reconstruction
```

**Example output for `/usr/bin/ls` (142,312 bytes)**:
```
File size: 142312 bytes
Total tokens: 53184
Compression: 2.676 bytes/token

First 10 tokens:
  Token 0: ID=  127 hex=7f                   (1 bytes)
  Token 1: ID= 3732 hex=454c                 (2 bytes)
  Token 2: ID= 4707 hex=4602                 (2 bytes)
  Token 3: ID=  392 hex=0101                 (2 bytes)
  Token 4: ID=  662 hex=000000000000000000   (9 bytes)
  Token 5: ID=  265 hex=0300                 (2 bytes)
  Token 6: ID= 1369 hex=3e00                 (2 bytes)
  Token 7: ID=  279 hex=01000000             (4 bytes)
  Token 8: ID=   48 hex=30                   (1 bytes)
  Token 9: ID=  109 hex=6d                   (1 bytes)

Decoded: 7f454c4602010100000000000000000003003e0001000000306d...
(ELF header: 7f 45 4c 46 = ELF magic bytes)
```

---
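The decoded prefix shown above can be checked byte-by-byte against the ELF header layout (magic, class, and data-encoding fields of `e_ident`, followed by `e_type` and `e_machine`):

```python
# Prefix of /usr/bin/ls as reconstructed in the example output above.
header = bytes.fromhex("7f454c4602010100000000000000000003003e0001000000")

assert header[:4] == b"\x7fELF"                         # EI_MAG: ELF magic bytes
assert header[4] == 2                                   # EI_CLASS: ELFCLASS64 (64-bit)
assert header[5] == 1                                   # EI_DATA: little-endian
assert int.from_bytes(header[16:18], "little") == 3     # e_type: ET_DYN (PIE/shared object)
assert int.from_bytes(header[18:20], "little") == 0x3E  # e_machine: EM_X86_64
print("valid x86-64 ELF header prefix")
```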

## Citation

If you use this tokenizer in your research, please cite:

```bibtex
@article{bommarito2025binarybpe,
  title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint},
  year={2025},
  note={Preprint coming soon}
}
```

**Author**: Michael J. Bommarito II ([michael.bommarito@gmail.com](mailto:michael.bommarito@gmail.com))

---

**Generated**: November 13, 2025
**Training Script**: `train_tokenizers.sh`
**Analysis Script**: `analyze_tokenizer.py`
analysis_results.json ADDED
@@ -0,0 +1,141 @@
{
  "vocab_size": {
    "total": 32761,
    "total_with_special": 32768,
    "base": 256,
    "merges": 32505,
    "special": 7,
    "is_power_of_2": true,
    "power": 15,
    "matches_expected": true
  },
  "reachability": {
    "valid_merges": 32505,
    "invalid_merges": 0,
    "reachable": 32761,
    "unreachable": 0,
    "all_reachable": true
  },
  "length_dist": {
    "distribution": {
      "1": 256,
      "2": 13428,
      "3": 6380,
      "4": 6236,
      "5": 1763,
      "6": 1395,
      "7": 676,
      "8": 963,
      "9": 191,
      "10": 220,
      "11": 109,
      "12": 318,
      "13": 86,
      "14": 102,
      "15": 69,
      "16": 233,
      "17": 26,
      "18": 31,
      "19": 23,
      "20": 58,
      "21": 16,
      "22": 16,
      "23": 19,
      "24": 44,
      "25": 6,
      "26": 7,
      "27": 8,
      "28": 13,
      "29": 7,
      "30": 4,
      "31": 3,
      "32": 54
    },
    "avg_length": 3.812393162393162,
    "min_length": 1,
    "max_length": 32,
    "length_3_count": 6380,
    "length_3_percent": 19.474969474969473
  },
  "byte_content": {
    "null_tokens": 8350,
    "ascii_printable": 6460,
    "ascii_only": 13796,
    "high_byte": 18964,
    "mixed": 10141,
    "byte_distribution": {
      "0": 20462,
      "255": 3502,
      "72": 2883,
      "1": 2622,
      "3": 1967,
      "139": 1934,
      "32": 1901,
      "2": 1856,
      "64": 1609,
      "116": 1546,
      "101": 1482,
      "36": 1435,
      "204": 1366,
      "128": 1212,
      "65": 1186,
      "4": 1150,
      "97": 1109,
      "114": 1088,
      "249": 1069,
      "137": 1059,
      "111": 990,
      "8": 978,
      "105": 964,
      "115": 940,
      "15": 917,
      "110": 917,
      "99": 879,
      "16": 837,
      "192": 814,
      "232": 810,
      "108": 798,
      "131": 788,
      "68": 777,
      "84": 740,
      "224": 737,
      "112": 732,
      "117": 723,
      "48": 701,
      "5": 690,
      "169": 687,
      "76": 684,
      "69": 663,
      "100": 653,
      "95": 650,
      "6": 647,
      "73": 623,
      "141": 614,
      "10": 570,
      "7": 562,
      "66": 546
    }
  },
  "diversity": {
    "1": {
      "learned": 256,
      "possible": 256,
      "coverage": 100.0
    },
    "2": {
      "learned": 13428,
      "possible": 65536,
      "coverage": 20.489501953125
    },
    "3": {
      "learned": 6380,
      "possible": 16777216,
      "coverage": 0.03802776336669922
    },
    "4": {
      "learned": 6236,
      "possible": 4294967296,
      "coverage": 0.0001451931893825531
    }
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff