mjbommar committed
Commit 1a3f0e1 · verified · 1 Parent(s): de2a76f

Upload Glaurung Small 001 - RoBERTa model for binary analysis

README.md ADDED
@@ -0,0 +1,297 @@
# Glaurung Small 001

A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.

## Overview

**Glaurung Small 001** is a transformer model designed specifically for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and Linux distributions (Alpine, Ubuntu, Debian, Rocky).

### Key Features
- **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
- **Binary-Aware**: Trained on actual executable files, not hex strings
- **Multi-Architecture**: Understands patterns from various CPU architectures and file formats
- **Latin-1 Encoding**: Preserves all byte values (0-255) without loss (see the round-trip sketch below)

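A quick way to convince yourself of the Latin-1 claim above (a minimal sketch, independent of the model): every byte value 0-255 maps to exactly one character and back without loss.

```python
# All 256 byte values survive a latin-1 round trip unchanged.
data = bytes(range(256))
text = data.decode('latin-1')
assert text.encode('latin-1') == data
print(len(text))  # 256 characters, one per byte value
```
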
## Model Details

- **Architecture**: RoBERTa for Masked Language Modeling
- **Hidden Size**: 768
- **Layers**: 12
- **Attention Heads**: 12
- **Vocabulary Size**: 65,536 tokens
- **Max Position Embeddings**: 520
- **Special Tokens** (see the quick check after this list):
  - `<|start|>` (0): Beginning of sequence
  - `<|end|>` (1): End token
  - `<|sep|>` (2): Separator/EOS
  - `<|cls|>` (3): Classification token
  - `<|pad|>` (4): Padding
  - `<|mask|>` (5): Mask token for MLM
  - `<|unk|>` (6): Unknown token

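The token-to-ID mapping above can be checked against the released tokenizer files; a small sketch (assuming the model files are in the local `models/glaurung-small-001/` directory used throughout this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')

# Print each special token and the ID it resolves to
for name in ['bos_token', 'eos_token', 'sep_token', 'cls_token',
             'unk_token', 'pad_token', 'mask_token']:
    token = getattr(tokenizer, name)
    print(f"{name}: {token!r} -> id {tokenizer.convert_tokens_to_ids(token)}")
```
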
## Installation & Loading

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel, pipeline

# Note: 'models/glaurung-small-001/' is a local directory path; point it at
# wherever the model files live on your system.

# Method 1: Load with pipeline for fill-mask tasks
fill_mask = pipeline('fill-mask', model='models/glaurung-small-001/', device=-1)

# Method 2: Load model and tokenizer directly for fill-mask
model = AutoModelForMaskedLM.from_pretrained('models/glaurung-small-001/')
tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')

# Method 3: Load base model for feature extraction/embeddings
model_base = AutoModel.from_pretrained('models/glaurung-small-001/')
```

## Usage Guide

### 1. Loading Binary Data (Critical!)

Binary files MUST be read as bytes and decoded with latin-1:

```python
# CORRECT: Read as bytes, decode with latin-1
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read the first 512 bytes, or as much as you need
text = binary_data.decode('latin-1', errors='ignore')

# WRONG: Never use hex strings or other encodings
# hex_string = "7f454c46..."               # ❌ Will not work
# utf8_text = binary_data.decode('utf-8')  # ❌ Will fail or corrupt byte values
```

### 2. Understanding the BPE Tokenizer

The tokenizer creates multi-byte tokens from common binary patterns:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')

# Example: ELF header tokenization
elf_header = b'\x7fELF\x02\x01\x01\x00'
text = elf_header.decode('latin-1')

tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Decode tokens individually to see multi-byte patterns
for token_id in token_ids[1:5]:  # Skip special tokens
    decoded = tokenizer.decode([token_id], skip_special_tokens=True)
    print(f"Token {token_id}: {repr(decoded)}")

# Output:
# Token 45689: '\x7fEL'    # ELF magic compressed to one token!
# Token 3665:  'F\x02'     # Format byte + 64-bit flag
# Token 458:   '\x01\x01'  # Little-endian + version
# Token 600:   '\x00\x00\x00\x00\x00\x00\x00\x00\x00'  # Padding
```

### 3. Fill-Mask Task (Token-Level Prediction)

**Important**: Masking works at the TOKEN level, not the byte level!

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained('models/glaurung-small-001/')
tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')

# Read binary file
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize (truncate to the model's 512-token limit)
tokens = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
token_ids = tokens['input_ids'][0].tolist()

# Mask the second token (first content token after <|start|>)
masked_ids = token_ids.copy()
original_token = masked_ids[1]  # Save original
masked_ids[1] = tokenizer.mask_token_id

# Prepare input
tokens_masked = {
    'input_ids': torch.tensor([masked_ids]),
    'attention_mask': torch.tensor([[1] * len(masked_ids)])
}

# Predict
with torch.no_grad():
    outputs = model(**tokens_masked)
    predictions = outputs.logits[0, 1].softmax(dim=-1)
    top5 = predictions.topk(5)

# Show results
print(f"Original: {repr(tokenizer.decode([original_token]))}")
for score, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode([token_id.item()], skip_special_tokens=True)
    print(f"Predicted: {repr(token_text)} (confidence: {score:.2%})")

# Example output:
# Original: '\x7fEL'
# Predicted: '\x7fEL' (confidence: 79.07%) ✓ Correct!
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 13.62%)
```

### 4. Using Pipeline for Fill-Mask

The pipeline handles tokenization automatically, but the mask still has to land on a multi-byte token boundary:

```python
from transformers import pipeline

# Load pipeline
fill_mask = pipeline('fill-mask', model='models/glaurung-small-001/', device=-1)

# Read binary
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(100)
text = binary_data.decode('latin-1', errors='ignore')

# Create masked input at token boundaries
# First, tokenize to understand token boundaries
tokenizer = fill_mask.tokenizer
tokens = tokenizer(text)
decoded_tokens = [tokenizer.decode([tid], skip_special_tokens=True) for tid in tokens['input_ids']]

# Reconstruct with mask at a token boundary
masked_text = ''.join([
    decoded_tokens[0],               # <|start|> (decodes to '' with skip_special_tokens)
    fill_mask.tokenizer.mask_token,  # Mask the ELF magic (token 1)
    ''.join(decoded_tokens[2:])      # Rest of tokens
])

# Predict
predictions = fill_mask(masked_text, top_k=3)
for pred in predictions:
    print(f"{repr(pred['token_str'])}: {pred['score']:.2%}")
```

### 5. Feature Extraction & Embedding Similarity

Compare binary files by their learned embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from pathlib import Path

# Load for embeddings (not MaskedLM)
tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')
model = AutoModel.from_pretrained('models/glaurung-small-001/')
model.eval()

def get_binary_embedding(file_path, max_bytes=512):
    """Extract embedding for a binary file using mean pooling"""
    with open(file_path, 'rb') as f:
        binary_data = f.read(max_bytes)
    text = binary_data.decode('latin-1', errors='ignore')

    # Tokenize
    tokens = tokenizer(text, return_tensors='pt',
                       padding=True, truncation=True, max_length=512)

    # Get embeddings with mean pooling
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling (better than CLS token for this model)
        attention_mask = tokens['attention_mask']
        hidden_states = outputs.last_hidden_state

        # Mask padding tokens
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
        sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
        embedding = sum_embeddings / sum_mask

    return embedding

# Compare multiple binaries
files = ['/usr/bin/ls', '/usr/bin/cat', '/usr/bin/echo', '/etc/passwd']
embeddings = {}

for file_path in files:
    if Path(file_path).exists():
        name = Path(file_path).name
        embeddings[name] = get_binary_embedding(file_path)

# Calculate similarities
print("Cosine Similarity Matrix:")
names = list(embeddings.keys())
for name1 in names:
    similarities = []
    for name2 in names:
        sim = F.cosine_similarity(embeddings[name1], embeddings[name2], dim=-1).item()
        similarities.append(f"{sim:.3f}")
    print(f"{name1:10s}: {' '.join(similarities)}")

# Expected output:
# ELF executables (ls, cat, echo) will have high similarity (0.85-0.95)
# Text file (passwd) will have low similarity (0.25-0.30) to ELF files
```

## Real-World Example: ELF Header Analysis

```python
# Analyze ELF executable structure (uses the tokenizer loaded earlier)
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(64)

print(f"Raw bytes (hex): {binary_data[:16].hex()}")
# Output: 7f454c46020101000000000000000000

# Convert to latin-1 for the model
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize to see learned patterns
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# The model recognizes these multi-byte patterns:
# Token 45689: '\x7fEL'    - ELF magic number
# Token 3665:  'F\x02'     - 'F' + 64-bit flag
# Token 458:   '\x01\x01'  - Little-endian + ELF version
# Token 600:   '\x00\x00\x00\x00\x00\x00\x00\x00\x00' - Padding bytes

# Test the model's understanding by masking the first 3 content tokens
from transformers import AutoModelForMaskedLM
import torch

mlm_model = AutoModelForMaskedLM.from_pretrained('models/glaurung-small-001/')

for position in [1, 2, 3]:
    masked_ids = token_ids.copy()
    masked_ids[position] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm_model(input_ids=torch.tensor([masked_ids])).logits
    predicted = tokenizer.decode([logits[0, position].argmax().item()])
    print(f"Position {position}: predicted {repr(predicted)}")

# The model correctly predicts with high confidence:
# Position 1: '\x7fEL'   with 79% confidence
# Position 2: 'F\x02'    with 98% confidence
# Position 3: '\x01\x01' with 89% confidence
```

## Training Details

- **MLM Objective**: 20% masking probability (see the collator sketch below)
- **Training Data**: Binary executables from various architectures
- **Optimization**: AdamW with warmup, dropout 0.01
- **Special Design**: Increased position embeddings (520) to handle RoBERTa's position offset

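For reference, a 20% masking objective corresponds to the standard Hugging Face MLM collator with `mlm_probability=0.2`. The sketch below is illustrative, not the original training script (which is not included in this repository):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('models/glaurung-small-001/')

# Randomly mask 20% of tokens, matching the objective described above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)

# Example: build one masked batch from a binary snippet
with open('/usr/bin/ls', 'rb') as f:
    text = f.read(512).decode('latin-1')
encoding = tokenizer(text, truncation=True, max_length=512)
batch = collator([encoding])
print(batch['input_ids'].shape, batch['labels'].shape)
```
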
## Limitations

- Maximum sequence length: 512 tokens (see the chunking sketch below for longer files)
- Optimized for executable files (ELF, PE, Mach-O)
- Mean pooling recommended for embeddings (the pooler layer was not specifically trained)

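Because inputs are truncated at 512 tokens, a single pass only sees the beginning of a large binary. One workaround, sketched below using the same mean-pooling setup as section 5 (this helper is illustrative, not part of the model), is to embed fixed-size byte windows and average them:

```python
import torch

def get_file_embedding_chunked(file_path, window_bytes=512, max_windows=8):
    """Average per-window embeddings to cover more of a large binary."""
    window_embeddings = []
    with open(file_path, 'rb') as f:
        for _ in range(max_windows):
            chunk = f.read(window_bytes)
            if not chunk:
                break
            text = chunk.decode('latin-1')
            tokens = tokenizer(text, return_tensors='pt',
                               truncation=True, max_length=512)
            with torch.no_grad():
                # 'model' is the base AutoModel loaded in section 5
                hidden = model(**tokens).last_hidden_state
            mask = tokens['attention_mask'].unsqueeze(-1).float()
            window_embeddings.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))
    return torch.stack(window_embeddings).mean(dim=0)
```
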
## Citation

If you use this model in research, please cite:

```
@software{glaurung-small-001,
  title = {Glaurung Small 001: Binary Analysis Transformer},
  author = {Glaurung Project},
  year = {2024},
  url = {https://github.com/mjbommar/glaurung-models}
}
```

config.json ADDED
@@ -0,0 +1,26 @@
{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.01,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.01,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 520,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 4,
  "position_embedding_type": "absolute",
  "transformers_version": "4.56.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 65536
}

model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b0cbba5fe50e87b04ba1b6616589ce5ee815a80d3842ad02d6d5a915b953c1ce
size 545805976

special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
{
  "bos_token": "<|start|>",
  "eos_token": "<|sep|>",
  "sep_token": "<|sep|>",
  "cls_token": "<|cls|>",
  "unk_token": "<|unk|>",
  "pad_token": "<|pad|>",
  "mask_token": "<|mask|>"
}

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 512,
  "padding_side": "right",
  "truncation_side": "right",
  "clean_up_tokenization_spaces": false,
  "bos_token": "<|start|>",
  "eos_token": "<|sep|>",
  "sep_token": "<|sep|>",
  "cls_token": "<|cls|>",
  "unk_token": "<|unk|>",
  "pad_token": "<|pad|>",
  "mask_token": "<|mask|>",
  "add_prefix_space": false
}