Glaurung Small 001

A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis. Part of the Glaurung project: a modern reverse engineering framework with first-class AI integration.

Overview

Glaurung Small 001 is a transformer model designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from binary formats across multiple architectures (x86-64, ARM64, etc.) and Linux distributions (Alpine, Ubuntu, Debian, Rocky).

This is the small variant (160M parameters, 12 layers), optimized for faster inference. For a larger-capacity model, see glaurung-large-001 (371M parameters).

Key Features

  • Custom Binary Tokenizer: BPE tokenizer that creates efficient multi-byte tokens from binary data
  • Binary-Aware: Trained on actual executable files, not hex strings
  • Multi-Architecture: Understands patterns from various CPU architectures and file formats
  • Latin-1 Encoding: Preserves all byte values (0-255) without loss
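
The Latin-1 point can be checked without the model. The sketch below uses only the Python standard library and assumes nothing about the tokenizer itself; it shows that all 256 byte values survive a Latin-1 round trip, while a UTF-8 decode with errors ignored drops data.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps each byte to exactly one code point, so the round trip is lossless.
text = all_bytes.decode("latin-1")
assert len(text) == 256
assert text.encode("latin-1") == all_bytes

# UTF-8 rejects many byte sequences; errors="ignore" silently drops them.
lossy = all_bytes.decode("utf-8", errors="ignore")
print(len(text), len(lossy))  # the second number is smaller than the first
```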

Model Details

  • Architecture: RoBERTa for Masked Language Modeling
  • Hidden Size: 768
  • Layers: 12
  • Attention Heads: 12
  • Vocabulary Size: 65,536 tokens
  • Tokenizer: binary-tokenizer-005
  • Max Position Embeddings: 520
  • Special Tokens:
    • <|start|> (0): Beginning of sequence
    • <|end|> (1): End token
    • <|sep|> (2): Separator/EOS
    • <|cls|> (3): Classification token
    • <|pad|> (4): Padding
    • <|mask|> (5): Mask token for MLM
    • <|unk|> (6): Unknown token

Glaurung Ecosystem

This model is part of the Glaurung project ecosystem:

🔧 Main Project

  • Glaurung - A modern reverse engineering framework designed to replace Ghidra with first-class AI integration throughout the analysis pipeline. Built with Rust's performance and Python's accessibility, featuring AI agents integrated at every level from format detection to decompilation.

🤖 Model Family

🔤 Tokenizer

Installation & Loading

pip install transformers torch

from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel, pipeline

# Method 1: Load with pipeline for fill-mask tasks
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-small-001', device=-1)

# Method 2: Load model and tokenizer directly for fill-mask
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')

# Method 3: Load base model for feature extraction/embeddings
model_base = AutoModel.from_pretrained('mjbommar/glaurung-small-001')

Usage Guide

1. Loading Binary Data (Critical!)

Binary files MUST be read as bytes and then decoded to a string with the latin-1 codec:

# CORRECT: Read as bytes, decode with latin-1
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read the first 512 bytes; adjust as needed
    text = binary_data.decode('latin-1', errors='ignore')

# WRONG: Never use hex strings or other encodings
# hex_string = "7f454c46..."  # ❌ Will not work
# utf8_text = binary_data.decode('utf-8')  # ❌ Will lose bytes
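
If the only thing available is a hex dump, it can be turned back into raw bytes before decoding. This is a minimal sketch; `hex_to_model_text` is a hypothetical helper name, not part of the Glaurung tooling, and it relies only on the standard `bytes.fromhex`.

```python
def hex_to_model_text(hex_string: str) -> str:
    """Turn a hex dump into the Latin-1 string form described above.

    Hypothetical helper for illustration only.
    """
    raw = bytes.fromhex(hex_string)  # ASCII whitespace between byte pairs is ignored
    return raw.decode("latin-1")

text = hex_to_model_text("7f 45 4c 46 02 01 01 00")
print(repr(text))
```

The resulting string can be passed to the tokenizer in the same way as text decoded from a file.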

2. Understanding the BPE Tokenizer

The tokenizer creates multi-byte tokens from common binary patterns:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')

# Example: ELF identification bytes (the 16-byte e_ident field)
elf_header = b'\x7fELF\x02\x01\x01' + b'\x00' * 9
text = elf_header.decode('latin-1')

tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Decode tokens individually to see multi-byte patterns
for token_id in token_ids[1:5]:  # Skip the leading <|start|> token
    decoded = tokenizer.decode([token_id], skip_special_tokens=True)
    print(f"Token {token_id}: {repr(decoded)}")

# Output:
# Token 45689: '\x7fEL'    # First three magic bytes in a single token
# Token 3665:  'F\x02'     # Last magic byte 'F' + 64-bit class
# Token 458:   '\x01\x01'  # Little-endian + version
# Token 600:   '\x00\x00\x00\x00\x00\x00\x00\x00\x00'  # OS ABI, ABI version, padding
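
To give a feel for how multi-byte tokens arise, here is a toy byte-pair merge loop in plain Python. It is a sketch of the general BPE idea only: `toy_bpe` is a hypothetical function, and the merges it learns from one small buffer are not the merges in binary-tokenizer-005.

```python
from collections import Counter

def toy_bpe(data: bytes, num_merges: int) -> list:
    """Greedy byte-pair merging on a single buffer (illustration only)."""
    tokens = [bytes([b]) for b in data]  # start from single-byte tokens
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worthwhile
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

sample = b"\x7fELF\x02\x01\x01" + b"\x00" * 8
result = toy_bpe(sample, num_merges=3)
print(result)
```

On this sample the run of null bytes collapses into longer tokens while the non-repeating header bytes stay single. A tokenizer trained on a large corpus would also learn merges such as the ELF magic, because that sequence recurs across many files.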

3. Fill-Mask Task (Token-Level Prediction)

Important: Masking works at the TOKEN level, not byte level!

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')

# Read binary file
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
    text = binary_data.decode('latin-1', errors='ignore')

# Tokenize
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Mask the second token (first content token after <|start|>)
masked_ids = token_ids.copy()
original_token = masked_ids[1]  # Save original
masked_ids[1] = tokenizer.mask_token_id

# Prepare input
tokens_masked = {
    'input_ids': torch.tensor([masked_ids]), 
    'attention_mask': torch.tensor([[1]*len(masked_ids)])
}

# Predict
with torch.no_grad():
    outputs = model(**tokens_masked)
    predictions = outputs.logits[0, 1].softmax(dim=-1)
    top5 = predictions.topk(5)

# Show results
print(f"Original: {repr(tokenizer.decode([original_token]))}")
for score, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode([token_id.item()], skip_special_tokens=True)
    print(f"Predicted: {repr(token_text)} (confidence: {score:.2%})")

# Example output:
# Original: '\x7fEL'
# Predicted: '\x7fEL' (confidence: 79.07%)  ✓ Correct!
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 13.62%)
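
Because masking is per token, it helps to know which bytes a given token position covers. The following sketch assumes that each decoded content token is a Latin-1 string (one character per byte) and that the content tokens concatenate back to the original input; `token_byte_spans` is a hypothetical helper.

```python
def token_byte_spans(decoded_tokens):
    """Return (start, end) byte offsets for each decoded content token."""
    spans, offset = [], 0
    for tok in decoded_tokens:
        length = len(tok.encode("latin-1"))  # one byte per Latin-1 character
        spans.append((offset, offset + length))
        offset += length
    return spans

# Decoded tokens taken from the example output in section 2.
decoded = ["\x7fEL", "F\x02", "\x01\x01", "\x00" * 9]
spans = token_byte_spans(decoded)
for tok, span in zip(decoded, spans):
    print(span, repr(tok))
```

With these tokens, masking the first content token hides bytes 0 through 2 of the file, not a single byte.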

4. Using Pipeline for Fill-Mask

The pipeline handles tokenization automatically but requires understanding multi-byte tokens:

from transformers import pipeline

# Load pipeline
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-small-001', device=-1)

# Read binary
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(100)
    text = binary_data.decode('latin-1', errors='ignore')

# Create masked input at token boundaries
# First, tokenize to understand token boundaries
tokenizer = fill_mask.tokenizer
tokens = tokenizer(text)
decoded_tokens = [tokenizer.decode([tid], skip_special_tokens=True) for tid in tokens['input_ids']]

# Reconstruct with mask at token boundary
masked_text = ''.join([
    decoded_tokens[0],  # <|start|> decodes to '' here because special tokens are skipped
    tokenizer.mask_token,  # Mask the first content token (start of the ELF magic)
    ''.join(decoded_tokens[2:])  # Rest of tokens
])

# Predict
predictions = fill_mask(masked_text, top_k=3)
for pred in predictions:
    print(f"{repr(pred['token_str'])}: {pred['score']:.2%}")
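
The reconstruction step above can be generalized to any token position. This is a minimal sketch with a hypothetical helper, `mask_at`; the default mask string is the `<|mask|>` token from the special-token list. Note that re-tokenizing the joined string is not guaranteed to reproduce the same token boundaries, so treat the result as an approximation.

```python
def mask_at(decoded_tokens, index, mask_token="<|mask|>"):
    """Replace the token at `index` with the mask string and rejoin the text."""
    if not 0 <= index < len(decoded_tokens):
        raise IndexError("index is outside the token list")
    pieces = list(decoded_tokens)  # copy so the caller's list is untouched
    pieces[index] = mask_token
    return "".join(pieces)

content = ["\x7fEL", "F\x02", "\x01\x01"]
masked = mask_at(content, 1)
print(repr(masked))
```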

5. Feature Extraction & Embedding Similarity

Compare binary files by their learned embeddings:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from pathlib import Path

# Load for embeddings (not MaskedLM)
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')
model = AutoModel.from_pretrained('mjbommar/glaurung-small-001')
model.eval()

def get_binary_embedding(file_path, max_bytes=512):
    """Extract embedding for a binary file using mean pooling"""
    with open(file_path, 'rb') as f:
        binary_data = f.read(max_bytes)
        text = binary_data.decode('latin-1', errors='ignore')
    
    # Tokenize
    tokens = tokenizer(text, return_tensors='pt', 
                      padding=True, truncation=True, max_length=512)
    
    # Get embeddings with mean pooling
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling (better than CLS token for this model)
        attention_mask = tokens['attention_mask']
        hidden_states = outputs.last_hidden_state
        
        # Mask padding tokens
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
        sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
        embedding = sum_embeddings / sum_mask
    
    return embedding

# Compare multiple binaries
files = ['/usr/bin/ls', '/usr/bin/cat', '/usr/bin/echo', '/etc/passwd']
embeddings = {}

for file_path in files:
    if Path(file_path).exists():
        name = Path(file_path).name
        embeddings[name] = get_binary_embedding(file_path)
        
# Calculate similarities
print("Cosine Similarity Matrix:")
names = list(embeddings.keys())
for name1 in names:
    similarities = []
    for name2 in names:
        sim = F.cosine_similarity(embeddings[name1], embeddings[name2], dim=-1).item()
        similarities.append(f"{sim:.3f}")
    print(f"{name1:10s}: {' '.join(similarities)}")

# Expected output:
# ELF executables (ls, cat, echo) will have high similarity (0.85-0.95)
# Text file (passwd) will have low similarity (0.25-0.30) to ELF files
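
To see why the padding mask matters in mean pooling, here is a dependency-free illustration on toy vectors. It mirrors the arithmetic of the pooling code above but is not a replacement for it; the function names and the numbers are made up for the example.

```python
import math

def masked_mean(hidden_states, attention_mask):
    """Average token vectors where the mask is 1, skipping padding."""
    dim = len(hidden_states[0])
    total, count = [0.0] * dim, 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two real tokens followed by one padding position with a large value.
states = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(states, mask)
print(pooled)  # [2.0, 1.0]: the padding vector is ignored
```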

Real-World Example: ELF Header Analysis

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-small-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-small-001')

# Analyze ELF executable structure
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read enough for context

print(f"Raw bytes (hex): {binary_data[:16].hex()}")
# Output: 7f454c46020101000000000000000000

# Convert to latin-1 for model
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize to see learned patterns
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Show what tokens the model learned
print("\nTokenized ELF header:")
for i in range(1, min(5, len(token_ids)-1)):  # First few content tokens
    token_text = tokenizer.decode([token_ids[i]], skip_special_tokens=True)
    print(f"Token {i}: {token_ids[i]:5d} = {repr(token_text)}")

# Output:
# Token 1: 45689 = '\x7fEL'  - First three magic bytes in a single token
# Token 2:  3665 = 'F\x02'   - Last magic byte 'F' + 64-bit class
# Token 3:   458 = '\x01\x01' - Little-endian + version
# Token 4:   600 = '\x00\x00\x00\x00\x00\x00\x00\x00\x00' - OS ABI, ABI version, padding

# Test model's understanding by masking each token
print("\nTesting model predictions:")
for position in [1, 2, 3]:  # Test first 3 content tokens
    masked_ids = token_ids.copy()
    original_token = masked_ids[position]
    masked_ids[position] = tokenizer.mask_token_id
    
    # Create input tensors
    tokens_masked = {
        'input_ids': torch.tensor([masked_ids]),
        'attention_mask': torch.tensor([[1]*len(masked_ids)])
    }
    
    # Get prediction
    with torch.no_grad():
        outputs = model(**tokens_masked)
        predictions = outputs.logits[0, position].softmax(dim=-1)
        predicted_token = predictions.argmax().item()
        confidence = predictions.max().item()
    
    # Show results
    original_text = tokenizer.decode([original_token], skip_special_tokens=True)
    predicted_text = tokenizer.decode([predicted_token], skip_special_tokens=True)
    correct = "✓" if predicted_token == original_token else "✗"
    
    print(f"Position {position}: {correct}")
    print(f"  Original:  {repr(original_text)}")
    print(f"  Predicted: {repr(predicted_text)} (confidence: {confidence:.1%})")

# Expected Output:
# Position 1: ✓
#   Original:  '\x7fEL'
#   Predicted: '\x7fEL' (confidence: 79.1%)
# Position 2: ✓
#   Original:  'F\x02'
#   Predicted: 'F\x02' (confidence: 97.9%)
# Position 3: ✓
#   Original:  '\x01\x01'
#   Predicted: '\x01\x01' (confidence: 88.7%)

Training Details

  • MLM Objective: 20% masking probability
  • Training Data: Binary executables from various architectures
  • Optimization: AdamW with warmup, dropout 0.01
  • Special Design: Position embeddings sized at 520 rather than 512, leaving room for RoBERTa's position offset (position numbers start after the padding index)
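
The position offset can be worked through with a few lines of Python. The sketch assumes the model follows the usual RoBERTa convention, where non-padding tokens are numbered from padding_idx + 1, and that the padding index equals the `<|pad|>` id (4) listed under Model Details. `roberta_position_ids` is a hypothetical re-implementation for illustration, not the library function.

```python
def roberta_position_ids(input_ids, padding_idx):
    """Number non-pad tokens from padding_idx + 1; pads stay at padding_idx."""
    positions, seen = [], 0
    for tok in input_ids:
        if tok == padding_idx:
            positions.append(padding_idx)
        else:
            seen += 1
            positions.append(padding_idx + seen)
    return positions

PAD_ID = 4        # <|pad|> id from the special-token list (assumed padding_idx)
MAX_TOKENS = 512  # stated maximum sequence length

# <|start|>, placeholder content ids, <|sep|>
ids = [0] + [100] * (MAX_TOKENS - 2) + [2]
highest = max(roberta_position_ids(ids, PAD_ID))
print(highest, highest + 1)  # 516 517
```

Under these assumptions a full 512-token input needs 517 position rows, which fits within the 520 that the model provides but would not fit within 512.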

Limitations

  • Maximum sequence length: 512 tokens
  • Optimized for executable files (ELF, PE, Mach-O)
  • Mean pooling recommended for embeddings (pooler layer not specifically trained)
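
One way to work within the 512-token limit is to split a long token sequence into overlapping windows and process each window separately. This is a sketch of that approach, not a method the project documents: `split_into_windows` is hypothetical, and the window of 510 assumes two slots are reserved for the start and end special tokens.

```python
def split_into_windows(token_ids, window=510, stride=255):
    """Yield overlapping slices of content token ids."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while True:
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break  # the last slice reached the end of the sequence
        start += stride

content_ids = list(range(1200))  # stand-in for real token ids
windows = list(split_into_windows(content_ids))
print([(w[0], w[-1]) for w in windows])
# [(0, 509), (255, 764), (510, 1019), (765, 1199)]
```

Per-window embeddings could then be averaged, or per-window predictions compared where windows overlap; which aggregation works best for this model is not stated in the source.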

Citation

If using this model in research:

@software{glaurung-small-001,
  title = {Glaurung Small 001: Binary Analysis Transformer},
  author = {Glaurung Project},
  year = {2024},
  url = {https://github.com/mjbommar/glaurung-models}
}