---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- binary-analysis
- file-type-detection
- magic-bert
---

# Magic-BERT Tokenizer (8K vocabulary)

A byte-level BPE tokenizer trained on binary file data for the Magic-BERT project.
This tokenizer is designed for binary file classification and analysis tasks.

## Features

- **Vocabulary Size:** 8,192 tokens
- **Type:** Byte-level BPE (Byte Pair Encoding)
- **Training Data:** Binary files from diverse sources (executables, documents, archives, media files, etc.)
- **Encoding:** Latin-1 (ISO-8859-1), which maps every byte value (0x00-0xFF) to exactly one character, so arbitrary binary data round-trips losslessly
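
The latin-1 property above can be checked in plain Python: every byte value decodes to exactly one character and encodes back to the same byte, so no file content is lost before tokenization.

```python
# Verify that latin-1 round-trips every possible byte value losslessly
data = bytes(range(256))               # all 256 byte values 0x00..0xFF
text = data.decode("latin-1")          # exactly one character per byte
assert len(text) == 256
assert text.encode("latin-1") == data  # lossless round-trip
```

This is why latin-1 (rather than UTF-8, which rejects many byte sequences) is the standard trick for feeding raw binary data through text-based tokenizers.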

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Beginning of sequence (BOS) |
| `<\|end\|>` | 1 | End of sequence (EOS) |
| `<\|pad\|>` | 2 | Padding |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification token |
| `<\|sep\|>` | 5 | Separator token |
| `<\|mask\|>` | 6 | Mask token (for MLM) |
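
For convenience, the table above as a Python mapping, together with a sketch of framing a payload for classification-style input (the payload bytes here are an illustrative example, not a requirement of the tokenizer):

```python
# Special-token IDs as documented in the table above
SPECIAL_TOKENS = {
    "<|start|>": 0,  # beginning of sequence (BOS)
    "<|end|>": 1,    # end of sequence (EOS)
    "<|pad|>": 2,    # padding
    "<|unk|>": 3,    # unknown token
    "<|cls|>": 4,    # classification token
    "<|sep|>": 5,    # separator
    "<|mask|>": 6,   # mask token (for MLM)
}

# Example: frame a latin-1-decoded payload for a classification-style input
payload = "MZ\x90\x00"  # first bytes of a DOS/PE executable, decoded as latin-1
framed = "<|cls|>" + payload + "<|sep|>"
```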

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-8k")

# Read the raw bytes, then decode as latin-1 (one character per byte)
with open("some_file.bin", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")

# Encode, truncating long files to 512 tokens
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokens)
```

## Training Details

This tokenizer was trained using the HuggingFace `tokenizers` library with:
- The byte-level BPE algorithm
- A corpus of roughly 100K+ binary file samples
- Diverse file types: executables, documents, archives, media, source code, etc.
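
A minimal sketch of such a setup using `ByteLevelBPETokenizer` from the `tokenizers` library. The vocabulary size and special tokens mirror this card; the corpus and training parameters here are illustrative stand-ins, since the actual training script is not part of this card.

```python
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus: small synthetic "files" decoded as latin-1
# (the real tokenizer was trained on ~100K real binary files)
samples = [bytes(range(i, i + 64)).decode("latin-1") for i in range(0, 192, 8)]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    samples,
    vocab_size=8192,  # matches this card's 8K vocabulary
    special_tokens=[
        "<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
        "<|cls|>", "<|sep|>", "<|mask|>",
    ],
)
print(tokenizer.get_vocab_size())
```

On a tiny corpus like this the trainer exhausts its merges well before 8,192 tokens; with a realistic corpus the vocabulary grows to the configured size.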

## Model

This tokenizer is designed to be used with Magic-BERT models for binary file
MIME type classification. See the main model repository for more details.

## License

Apache 2.0