mjbommar committed
Commit 40f6ac5 · verified · 1 Parent(s): 9654e3e

Remove backup file

Files changed (1)
  1. README.md~ +0 -485
README.md~ DELETED
@@ -1,485 +0,0 @@
---
language:
- code
license: apache-2.0
tags:
- binary-analysis
- tokenizer
- bpe
- malware-analysis
- reverse-engineering
- security
- x86-64
- arm64
- elf
- pe
library_name: tokenizers
pipeline_tag: feature-extraction
---

# Glaurung Binary Tokenizer 001

**Production-ready 64K vocabulary BPE tokenizer for binary executables**

🔗 **GitHub**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung)

---

## Overview

**Glaurung Binary Tokenizer 001** is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, ARM64, Windows PE, Linux ELF). This is the production successor to [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005).

### Key Specifications

- **Vocabulary Size**: 65,536 tokens (exactly 2^16 = 64K)
- **Compression**: 2.849 bytes/token average
- **Training Data**: 13 GB corpus, 30,738 binaries
- **Architectures**: x86-64, x86-32, ARM64
- **Platforms**: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)

### Performance Highlights

- **9-10% better compression** than a 32K-vocabulary baseline
- **86% of theoretical maximum** compression efficiency
- **Instruction-aware**: Captures complete x86-64 instructions (REX + opcode + ModR/M)
- **String-rich**: 5.76% of vocabulary contains function names, paths, library references

---

## Installation

```bash
pip install tokenizers transformers
```

---

## Quick Start

### Method 1: Using the tokenizers library (Recommended)

```python
from tokenizers import Tokenizer
from pathlib import Path

# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Process binary data - MUST use latin-1 encoding
binary_path = Path("/usr/bin/ls")
raw_bytes = binary_path.read_bytes()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string

# Tokenize
encoded = tokenizer.encode(text)
tokens = encoded.ids

print(f"File size: {len(raw_bytes):,} bytes")
print(f"Tokens: {len(tokens):,}")
print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

# Decode back to text (note: adds spaces between tokens due to BPE behavior)
decoded = tokenizer.decode(tokens)
```

**Expected Output** (for `/usr/bin/ls`):
```
File size: 142,144 bytes
Tokens: 49,574
Compression: 2.866 bytes/token
```

### Method 2: Using the transformers library

```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("/usr/bin/ls", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')

# Tokenize (returns dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```

---

## Important: Data Format

The tokenizer expects binary data encoded as **latin-1 strings**, NOT hex strings:

```python
# ✅ CORRECT - Use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# ❌ WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # Will not work correctly
```

**Why latin-1?** Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
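
A quick way to convince yourself of this property:

```python
# Every byte value maps to exactly one latin-1 character, and back again
data = bytes(range(256))
text = data.decode('latin-1')
assert len(text) == 256
assert text.encode('latin-1') == data  # lossless round trip
```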

---

## Performance Benchmarks

### Compression on Real-World Binaries

Tested on `/usr/bin` binaries (not in the training set):

| Binary | Size | Tokens | bytes/token |
|--------|------|--------|-------------|
| bash | 1.38 MB | 535,541 | 2.698 |
| python3.12 | 7.65 MB | 2,801,226 | 2.863 |
| gcc-13 | 0.98 MB | 344,201 | 2.986 |
| ls | 0.14 MB | 49,574 | 2.866 |
| grep | 0.18 MB | 67,567 | 2.667 |

**Average**: 2.849 bytes/token

### Information-Theoretic Efficiency

- Binary entropy: ~6.5 bits/byte
- Theoretical optimum: 16 bits/token ÷ 6.5 bits/byte ≈ 2.46 bytes/token
- Measured performance: 2.849 bytes/token
- **Efficiency: 86%** of the theoretical optimum (2.46 / 2.849 ≈ 0.86)

---

## Example: Tokenizing an ELF Header

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# ELF header bytes
elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
text = elf_header.decode('latin-1')

# Tokenize
encoded = tokenizer.encode(text)
print(f"Original bytes: {elf_header.hex()}")
print(f"Tokens: {encoded.ids}")
print(f"Token count: {len(encoded.ids)}")
print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

# Examine individual tokens
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    token_bytes = token_str.encode('latin-1')
    print(f"  Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
```

---

## Token Distribution

| Length | Count | Percentage | Examples |
|--------|-------|------------|----------|
| 2 bytes | 31,528 | 48.3% | `0x48 0x8b` (REX.W + MOV opcode), `0xcc 0xcc` (int3 padding) |
| 3 bytes | 9,261 | 14.2% | `0x48 0x8b 0xc0` (MOV rax, rax) |
| 4 bytes | 11,520 | 17.6% | `0x48 0x89 0x45 0xf8` (MOV [rbp-8], rax) |
| 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals |

**Average token length**: 3.651 bytes

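The distribution above can be recomputed from the published vocabulary. A minimal sketch, assuming every non-special token is a latin-1 string as described earlier in this README:

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Map each vocabulary entry back to bytes via latin-1 and tally lengths
lengths = Counter()
for token in tokenizer.get_vocab():
    if token in ("<|start|>", "<|end|>"):
        continue  # special tokens are markers, not byte sequences
    lengths[len(token.encode('latin-1'))] += 1

for n, count in sorted(lengths.items()):
    print(f"{n}-byte tokens: {count:,}")
```
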
---

## Training Details

### Dataset

**Source**: `/nas4/data/glaurung-data/binaries-small/`

- **Size**: 13 GB
- **Files**: 30,738 binaries
- **Content**: Real-world compiled binaries including system utilities, libraries, and applications

**Platform Distribution**:
- Linux: Alpine, Debian, Ubuntu (ELF format)
- Windows: 8, 10, 11 (PE format)

**Architecture Distribution**:
- x86-64 (primary)
- x86-32
- ARM64

### Training Parameters

```bash
cargo run --release --bin train -- \
    --output glaurung-tokenizer-002.json \
    /nas4/data/glaurung-data/binaries-small/ \
    --vocab-size 65536 \
    --min-frequency 4 \
    --chunk-size 8192
```

**Training Duration**: 8.46 hours on 24 cores
**Peak Memory**: 70 GB

---

## Use Cases

### ✅ Recommended For

- Binary neural language models
- Malware analysis and classification
- Reverse engineering tools
- Binary similarity detection (see the sketch below)
- Code pattern recognition
- Vulnerability research
- Firmware analysis

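As an example of the similarity use case, here is a crude token-set baseline (a sketch; `/usr/bin/dir` is just a stand-in for any second binary on your system):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def token_set(path):
    """Return the set of token IDs appearing in a binary."""
    text = Path(path).read_bytes().decode('latin-1')
    return set(tokenizer.encode(text).ids)

a, b = token_set("/usr/bin/ls"), token_set("/usr/bin/dir")
print(f"Token-set Jaccard similarity: {len(a & b) / len(a | b):.3f}")
```
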
### ❌ Not Recommended For

- Text or source code: use a text tokenizer such as GPT-2's; this tokenizer pays a 100%+ efficiency penalty on text
- Very small binaries (<1 KB): fixed tokenization overhead is proportionally high
- Real-time streaming: tokenizer load time is ~100 ms

---

## Comparison with Predecessor

| Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement |
|--------|---------------------|------------------------------|-------------|
| Vocabulary | 65,536 | 65,536 | Same |
| Training data | ~5 GB mixed | 13 GB binaries-small | 2.6x larger |
| bytes/token | ~2.6 | 2.849 | +9.6% |
| Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse |
| Architecture awareness | Basic | Advanced (instruction-aware) | Significant |
| Documentation | Basic | Comprehensive | Extensive |

**Key Improvements**:
- Larger, more diverse training corpus
- Better cross-platform coverage
- Instruction-boundary awareness
- Production-ready quality

---

## Advanced Usage

### Batch Processing Multiple Files

```python
from tokenizers import Tokenizer
from pathlib import Path
import numpy as np

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def tokenize_binary_file(file_path):
    """Tokenize a single binary file."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    encoded = tokenizer.encode(text)
    return {
        'file': file_path,
        'size_bytes': len(raw_bytes),
        'token_count': len(encoded.ids),
        'compression_ratio': len(raw_bytes) / len(encoded.ids),
        'token_ids': encoded.ids
    }

# Process directory
binary_dir = Path("/usr/bin")
results = []
for binary_path in binary_dir.glob("*"):
    if binary_path.is_file():
        try:
            result = tokenize_binary_file(binary_path)
            results.append(result)
        except Exception as e:
            print(f"Error processing {binary_path}: {e}")

# Analyze compression statistics
compression_ratios = [r['compression_ratio'] for r in results]
print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
print(f"Std deviation: {np.std(compression_ratios):.3f}")
```

### Using with PyTorch/TensorFlow Models

```python
from tokenizers import Tokenizer
import torch
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def prepare_binary_for_model(file_path, max_length=512):
    """Prepare binary data for neural network input."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')

    # Tokenize
    encoded = tokenizer.encode(text)
    token_ids = encoded.ids

    # Truncate or pad to max_length
    if len(token_ids) > max_length:
        token_ids = token_ids[:max_length]
    else:
        # Pad with token ID 0 (or use a dedicated padding token)
        token_ids = token_ids + [0] * (max_length - len(token_ids))

    # Convert to tensor
    return torch.tensor(token_ids, dtype=torch.long)

# Use in a model
binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
print(f"Tensor shape: {binary_tensor.shape}")  # torch.Size([1024])
```

---

## Troubleshooting

### Issue: UnicodeDecodeError when processing a binary

**Solution**: Always use `latin-1` encoding, never `utf-8`:
```python
# ✅ Correct
text = raw_bytes.decode('latin-1')

# ❌ Wrong
text = raw_bytes.decode('utf-8')  # Will fail on non-UTF-8 bytes
```

### Issue: Decoded output doesn't match the original

**Cause**: BPE tokenizers insert spaces between tokens during decoding.

**Solution**: Join the token strings yourself if exact byte recovery is needed:
```python
# Get tokens without spaces
tokens_no_spaces = ''.join(encoded.tokens)
original_bytes = tokens_no_spaces.encode('latin-1')
```
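
To verify exact recovery end to end (this should hold as long as no special tokens are added during encoding):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

raw = Path("/usr/bin/ls").read_bytes()
encoded = tokenizer.encode(raw.decode('latin-1'))
recovered = ''.join(encoded.tokens).encode('latin-1')
assert recovered == raw  # byte-for-byte identical
```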

### Issue: Poor compression on specific binary types

**Cause**: The tokenizer may not be optimized for highly specialized formats (e.g., Python `.pyc` or Java `.class` bytecode).

**Solution**: Consider domain-specific tokenizers for specialized formats, or use this one as a general-purpose baseline.

---

## Related Projects

- **Predecessor**: [mjbommar/binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005) - Earlier 64K binary tokenizer
- **Framework**: [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
- **Training Code**: [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers

---

## Technical Architecture

### Vocabulary Structure

- **Base tokens**: 256 single-byte tokens (0x00 to 0xFF)
- **Merged tokens**: 65,280 learned byte-pair combinations
- **Total**: 65,536 tokens (exactly 2^16)
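
You can check the total directly:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
print(tokenizer.get_vocab_size())  # 65536
```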

### Special Tokens

The tokenizer includes boundary markers for file-level segmentation:
- `<|start|>` (ID: 0)
- `<|end|>` (ID: 1)

These help models distinguish between concatenated files and identify file headers.
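
A minimal sketch of framing one file with these markers, e.g. when concatenating files for pretraining (looking the IDs up by name is more robust than hard-coding 0 and 1):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

start_id = tokenizer.token_to_id("<|start|>")  # 0
end_id = tokenizer.token_to_id("<|end|>")      # 1

text = Path("/usr/bin/ls").read_bytes().decode('latin-1')
ids = [start_id] + tokenizer.encode(text).ids + [end_id]
```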

### Token Properties

**Instruction-aware patterns** (x86-64 examples):
- REX prefixes: `0x48`, `0x4c`, `0x4d`
- Common opcodes: `0x8b` (MOV), `0x89` (MOV), `0xe8` (CALL)
- ModR/M patterns: `0xc0`, `0x45`, `0x5d`

**Common patterns**:
- Padding: `0xcc 0xcc` (int3), `0x90 0x90` (nop)
- Alignment: `0x00 0x00 0x00 0x00`
- String terminators: `0x00` at word boundaries
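
To see such patterns in practice, encode a short instruction sequence; the exact token boundaries depend on the learned merges, so treat the output as illustrative:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# MOV rax, rax (48 8b c0) followed by int3 padding (cc cc)
snippet = b'\x48\x8b\xc0\xcc\xcc'
encoded = tokenizer.encode(snippet.decode('latin-1'))
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    print(f"{token_id:5d}  {token_str.encode('latin-1').hex()}")
```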

---

## Performance Characteristics

### Load Time

- **Tokenizer size**: 2.3 MB on disk
- **Load time**: ~100 ms (cold), ~20 ms (cached)
- **Memory footprint**: ~15 MB in RAM

### Encoding Speed

On a modern CPU (tested on an Intel i9-12900K):

| Operation | Speed |
|-----------|-------|
| Encode 1 MB binary | ~50 ms |
| Encode 10 MB binary | ~450 ms |
| Encode 100 MB binary | ~4.2 s |

**Throughput**: ~20-25 MB/second
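
A simple way to measure throughput on your own hardware (numbers will vary with CPU and file content):

```python
import time
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

data = Path("/usr/bin/ls").read_bytes()  # substitute any large binary
text = data.decode('latin-1')

start = time.perf_counter()
encoded = tokenizer.encode(text)
elapsed = time.perf_counter() - start
print(f"{len(data) / elapsed / 1e6:.1f} MB/s, {len(encoded.ids):,} tokens")
```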

---

## Limitations

1. **Cross-domain penalty**: Using it on text data costs 100-140% in efficiency
2. **Small-file overhead**: Files under 1 KB have proportionally higher tokenization overhead
3. **Decode spacing**: Spaces are inserted between tokens during decode (standard BPE behavior); join `encoded.tokens` for exact byte recovery
4. **Architecture bias**: Trained primarily on x86-64; may be less effective on RISC-V, MIPS, etc.

---

## Citation

If you use this tokenizer in research, please cite:

```
Glaurung Binary Tokenizer 001
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (exactly 2^16)
Training: October 2025
Dataset: 13GB binaries-small (30,738 files)
Performance: 2.849 bytes/token (86% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-001
```

---

## License

Apache License 2.0

This tokenizer is part of the [Glaurung](https://github.com/mjbommar/glaurung) project. See the [glaurung-models repository](https://github.com/mjbommar/glaurung-models) for full license details.

---

## Support & Issues

- **GitHub Issues**: [mjbommar/glaurung-models/issues](https://github.com/mjbommar/glaurung-models/issues)
- **Documentation**: Full training report available in the [glaurung-models repository](https://github.com/mjbommar/glaurung-models/tree/master/tokenizers/tokenizer-002)
- **Email**: Contact the maintainer via GitHub

---

**Production Status**: ✅ Ready for deployment
**Version**: 1.0.0
**Release Date**: October 2025