mjbommar committed · Commit 90028a3 · verified · 1 Parent(s): 8ccf614

Update README with accurate usage examples

Files changed (1): README.md (+63 −3)
README.md CHANGED
@@ -10,13 +10,21 @@ This tokenizer uses Byte Pair Encoding (BPE) trained on latin-1 encoded binary d
 - **Training Data**: System binaries from various OS distributions
 - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
 
+## Installation
+
+```bash
+pip install tokenizers transformers
+```
+
 ## Usage
 
+### Method 1: Using the tokenizers library (Recommended)
+
 ```python
 from tokenizers import Tokenizer
 
-# Load tokenizer
-tokenizer = Tokenizer.from_file("tokenizer.json")
+# Load tokenizer directly from the Hugging Face Hub
+tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
 
 # Process binary data - MUST use latin-1 encoding
 with open("binary_file", "rb") as f:
@@ -24,6 +32,31 @@ with open("binary_file", "rb") as f:
 text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string
 encoded = tokenizer.encode(text)
 tokens = encoded.ids
+
+# Decode back to text
+decoded = tokenizer.decode(tokens)
+```
+
+### Method 2: Using the transformers library
+
+```python
+from transformers import PreTrainedTokenizerFast
+from tokenizers import Tokenizer
+
+# Load the base tokenizer
+base_tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
+
+# Wrap with PreTrainedTokenizerFast for transformers compatibility
+tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)
+
+# Process binary data
+with open("binary_file", "rb") as f:
+    raw_bytes = f.read()
+text = raw_bytes.decode('latin-1')
+
+# Tokenize (returns dict with input_ids, attention_mask, etc.)
+result = tokenizer(text)
+tokens = result["input_ids"]
 ```
 
 ## Important: Data Format
@@ -31,14 +64,41 @@ with open("binary_file", "rb") as f:
 The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:
 
 ```python
-# CORRECT
+# CORRECT - Use latin-1 encoded bytes
 raw_bytes = b'\x7fELF\x01\x01'
 text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
+encoded = tokenizer.encode(text)
 
 # WRONG - Do not use hex strings
 hex_str = "7f 45 4c 46 01 01"  # ❌ Will not work correctly
 ```
 
+## Example: Tokenizing an ELF Header
+
+```python
+from tokenizers import Tokenizer
+
+# Load tokenizer
+tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-005")
+
+# ELF header bytes
+elf_header = b'\x7fELF\x01\x01\x01\x00'
+text = elf_header.decode('latin-1')
+
+# Tokenize
+encoded = tokenizer.encode(text)
+print(f"Tokens: {encoded.ids}")
+# Output: [0, 45689, 205, 22648, 1]
+# Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
+
+# The tokenizer adds special tokens <|start|> (id=0) and <|end|> (id=1)
+# Content tokens are: [45689, 205, 22648]
+
+# Note: Decoding adds spaces between tokens (BPE tokenizer behavior)
+decoded = tokenizer.decode(encoded.ids)
+print(f"Decoded: {repr(decoded)}")  # '\x7fEL F \x01\x01\x01\x00'
+```
+
 ## Related Projects
 
 - [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
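
The README's key premise — that latin-1 maps every byte value 0–255 to exactly one character — is easy to sanity-check with the standard library alone. A minimal sketch (no tokenizer files needed):

```python
# Latin-1 is a bijection between byte values 0-255 and the first 256
# Unicode code points, so decode/encode round-trips binary data losslessly.
data = bytes(range(256))           # every possible byte value
text = data.decode('latin-1')      # one character per byte
assert len(text) == 256
assert all(ord(ch) == b for ch, b in zip(text, data))
assert text.encode('latin-1') == data  # lossless round trip
print("latin-1 round trip OK")
```

This losslessness is exactly why the README insists on latin-1 rather than, say, UTF-8, which would reject many byte sequences found in real binaries.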
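
For data that arrives as a hex dump (the "WRONG" case in the Data Format section), the fix is to convert the hex text to raw bytes first and only then decode as latin-1. A sketch using only the standard library; the final `tokenizer.encode` call is omitted since it requires the model files:

```python
# A hex dump is text ABOUT bytes, not the bytes themselves.
hex_str = "7f 45 4c 46 01 01"        # do NOT feed this to the tokenizer
raw_bytes = bytes.fromhex(hex_str)   # bytes.fromhex ignores ASCII whitespace
assert raw_bytes == b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')   # this is what tokenizer.encode expects
print(repr(text))
```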